Atlas Cluster Power Recovery (Graceful Shutdown/Startup)
Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
- Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
Script
-`scripts/cluster_power_recovery.sh`
-`scripts/cluster_power_console.sh`
- Modes:
-`prepare`
-`shutdown`
-`harbor-seed`
-`startup`
-`status`
- Default is dry-run. Add `--execute` to actually perform actions.
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/hecate-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
- Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
- Stage it on the operator host at `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
- Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Hecate uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.