Atlas Cluster Power Recovery (Graceful Shutdown/Startup) Purpose - Provide a safe operator flow for planned power events and cold-boot recovery. - Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path. - Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle. - Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery. Bootstrapping risk to remember - Flux source is Git over SSH to `scm.bstein.dev` (Gitea). - Gitea itself is a Flux-managed workload and depends on storage + database. - Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images. - On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall. - Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back. Script - `scripts/cluster_power_recovery.sh` - `scripts/cluster_power_console.sh` - Modes: - `prepare` - `shutdown` - `harbor-seed` - `startup` - `status` - Default is dry-run. Add `--execute` to actually perform actions. Dry-run examples - Shutdown preview: - `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain` - Startup preview: - `scripts/cluster_power_recovery.sh startup` - Harbor seed preview: - `scripts/cluster_power_recovery.sh harbor-seed` Execute examples - Prepare helper image on every node: - `scripts/cluster_power_recovery.sh prepare --execute` - Seed Harbor runtime images onto `titan-05` from the control-host bundle: - `scripts/cluster_power_recovery.sh harbor-seed --execute` - Planned shutdown: - `scripts/cluster_power_recovery.sh shutdown --execute` - Planned startup (canonical branch): - `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main` Manual remote console examples - Canonical operator hosts: - `titan-db` - `titan-24` - Both hosts now have: - `~/hecate-tools/cluster_power_recovery.sh` - `~/hecate-tools/cluster_power_console.sh` - `~/hecate-tools/bootstrap/recovery-config.env` - `~/hecate-tools/bootstrap/harbor-bootstrap-images.txt` - `~/hecate-tools/kubeconfig` - `~/hecate-cluster-power` - `~/bin/hecate-cluster-power` - `~/hecate-repo/{infrastructure,services,scripts}` - Both hosts also keep the Harbor bootstrap bundle at: - `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst` - Remote usage: - `ssh titan-db` - `~/hecate-cluster-power status` - `~/hecate-cluster-power prepare --execute` - `~/hecate-cluster-power shutdown --execute` - `~/hecate-cluster-power startup --execute --force-flux-branch main` - `ssh titan-24` - `~/hecate-cluster-power status` - `~/hecate-cluster-power prepare --execute` - `~/hecate-cluster-power shutdown --execute` - `~/hecate-cluster-power startup --execute --force-flux-branch main` Useful options - `--expected-flux-branch main` - `--force-flux-branch main` - `--skip-local-bootstrap` (not recommended for cold-start recovery) - `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred) - `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node) - `--skip-helper-prewarm` - `--min-startup-battery 35` - `--ups-host pyrphoros@localhost` - `--require-ups-battery` - `--drain-timeout 180` - `--emergency-drain-timeout 45` - `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state` - `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst` Controlled drill checklist (recommended) - Operator host: use `titan-db` as canonical control host for the drill. - On-site coordination: - Have on-site operator ready before shutdown starts. - Confirm they will manually power cluster nodes back on after shutdown completes. - Confirm who will announce "all nodes powered on" to resume startup. - Preflight on `titan-db`: - `mkdir -p ~/hecate-logs` - `~/hecate-cluster-power status` and verify: - `ups_host=pyrphoros@localhost` - `ups_battery` is numeric - `flux_source_ready=True` - Warm helper image just before shutdown: - `~/hecate-cluster-power prepare --execute` - Run in a persistent shell and capture logs: - `tmux new -s hecate-drill` - `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log` - Execute controlled shutdown with telemetry enforcement: - `~/hecate-cluster-power shutdown --execute --require-ups-battery` - After on-site power-on confirmation, execute startup: - `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery` - Post-check: - `~/hecate-cluster-power status` - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures. Operational notes - The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn. - Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted. - During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/hecate-repo`. - Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer. - Harbor is reconciled after the first critical stateful services. - Harbor bootstrap is now designed around a control-host bundle: - Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`. - Stage it on the operator host at `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`. - Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`. - The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Hecate uses it across both arm64 and amd64 nodes during prepare/shutdown operations. - Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls. - The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap. - In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster. - Dry-run mode no longer mutates outage recovery state. - `harbor-seed --execute` was validated by: - prewarming the helper image across all nodes - streaming the Harbor bootstrap bundle to `titan-05` - importing Harbor runtime images into host `containerd` - successfully running a Harbor-backed canary pod (`harbor-canary-ok`) - After bootstrap, Flux resources are resumed and reconciled. - Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.