Atlas Cluster Power Recovery (Graceful Shutdown/Startup)
Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
- Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
Script
-`scripts/cluster_power_recovery.sh`
-`scripts/cluster_power_console.sh`
- Modes:
-`prepare`
-`shutdown`
-`harbor-seed`
-`startup`
-`status`
- Default is dry-run. Add `--execute` to actually perform actions.
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
- all non-optional Flux kustomizations must be `Ready=True`
- external service checklist must pass (defaults include Gitea, Grafana, Harbor)