Atlas Cluster Power Recovery (Graceful Shutdown/Startup) Purpose - Provide a safe operator flow for planned power events and cold-boot recovery. - Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path. - Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery. Bootstrapping risk to remember - Flux source is Git over SSH to `scm.bstein.dev` (Gitea). - Gitea itself is a Flux-managed workload and depends on storage + database. - Harbor is also critical, but it is not part of the first recovery stage because Harbor currently serves its own runtime images. - On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall. - Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back. Script - `scripts/cluster_power_recovery.sh` - `scripts/cluster_power_console.sh` - Modes: - `shutdown` - `startup` - Default is dry-run. Add `--execute` to actually perform actions. Dry-run examples - Shutdown preview: - `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain` - Startup preview: - `scripts/cluster_power_recovery.sh startup` Execute examples - Planned shutdown: - `scripts/cluster_power_recovery.sh shutdown --execute` - Planned startup (canonical branch): - `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main` Manual remote console examples - From `titan-24` with a local checkout: - `~/Development/titan-iac/scripts/cluster_power_console.sh shutdown --execute` - `~/Development/titan-iac/scripts/cluster_power_console.sh startup --execute --force-flux-branch main` - From `titan-db`, if the checkout is not present locally, the console wrapper can delegate to `titan-24`: - `~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 shutdown --execute` - `~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 startup --execute --force-flux-branch main` Useful options - `--control-planes titan-0a,titan-0b,titan-0c` - `--workers ` (otherwise the script tries API discovery first, then falls back to the static atlas worker inventory) - `--expected-flux-branch main` - `--force-flux-branch main` - `--skip-local-bootstrap` (not recommended for cold-start recovery) - `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred) - `--min-startup-battery 35` - `--ups-host ups@localhost` - `--require-ups-battery` - `--drain-timeout 180` - `--emergency-drain-timeout 45` - `--recovery-state-file ~/.local/state/cluster_power_recovery.state` Operational notes - The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn. - Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted. - During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first. - Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer. - Harbor is reconciled after the first critical stateful services. Treat Harbor bootstrap as requiring either cached Harbor runtime images on the scheduled node or a separate bootstrap source for those images. - The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap. - In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster. - After bootstrap, Flux resources are resumed and reconciled. - Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.