# Atlas Cluster Power Recovery (Graceful Shutdown/Startup)

## Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
## Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
- Recovery path: bring the control plane and workers up, then locally apply the minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage, after storage, Vault, Postgres, and Gitea are back.
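The staged local apply can be pictured as an ordered loop over the stages listed above; a minimal sketch, where `apply_stage` is a hypothetical stand-in for the real per-stage apply-and-wait logic:

```shell
# Recovery stage order from the runbook; each stage must settle before the next.
STAGES="core helm longhorn metallb traefik vault-csi vault-injector vault postgres gitea"

# Hypothetical stand-in: a real run would apply the stage's manifests
# (e.g. via kubectl) and wait for readiness before returning.
apply_stage() {
    echo "applying stage: $1"
}

for stage in $STAGES; do
    apply_stage "$stage" || break   # stop the ladder if a stage fails
done
```

The point of the fixed ordering is that every later stage assumes the earlier ones are healthy (e.g. Gitea assumes Postgres, which assumes Longhorn volumes).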
## Script

- `scripts/cluster_power_recovery.sh`, `scripts/cluster_power_console.sh`
- Modes: `prepare`, `shutdown`, `harbor-seed`, `startup`, `status`
- Default is dry-run. Add `--execute` to actually perform actions.
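The dry-run default can be modeled as a small wrapper that only runs commands when `--execute` was given; a sketch of the pattern (the `run` helper is illustrative, not the script's actual internals):

```shell
EXECUTE=false   # would flip to true only when --execute is passed

# Wrap every mutating command: print a preview in dry-run, run it for real otherwise.
run() {
    if [ "$EXECUTE" = "true" ]; then
        "$@"
    else
        echo "[dry-run] $*"
    fi
}

run echo "stopping k3s"   # dry-run: prints "[dry-run] echo stopping k3s"
```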
## Dry-run examples

- Shutdown preview: `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
- Startup preview: `scripts/cluster_power_recovery.sh startup`
- Harbor seed preview: `scripts/cluster_power_recovery.sh harbor-seed`
## Execute examples

- Prepare the helper image on every node: `scripts/cluster_power_recovery.sh prepare --execute`
- Seed Harbor runtime images onto `titan-05` from the control-host bundle: `scripts/cluster_power_recovery.sh harbor-seed --execute`
- Planned shutdown: `scripts/cluster_power_recovery.sh shutdown --execute`
- Planned startup (canonical branch): `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`
## Manual remote console examples

- Canonical operator hosts: `titan-db` and `tethys` (`titan-24`)
- Both hosts now have:
  - `~/ananke-tools/cluster_power_recovery.sh`
  - `~/ananke-tools/cluster_power_console.sh`
  - `~/ananke-tools/bootstrap/recovery-config.env`
  - `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
  - `~/ananke-tools/kubeconfig`
  - `~/ananke-cluster-power`
  - `~/bin/ananke-cluster-power`
  - `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage (same commands from either host, via `ssh titan-db` or `ssh tethys`):
  - `~/ananke-cluster-power status`
  - `~/ananke-cluster-power prepare --execute`
  - `~/ananke-cluster-power shutdown --execute`
  - `~/ananke-cluster-power startup --execute --force-flux-branch main`
## Useful options

- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
- `--skip-helper-prewarm`
- `--min-startup-battery 35`
- `--ups-host pyrphoros@localhost`
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--flux-ready-timeout 1200`
- `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
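Options like these are usually consumed with a plain while/case loop; a hedged sketch of that pattern (this is not the script's actual parser, just the common shape, shown for two of the flags):

```shell
# Defaults matching the runbook's documented values.
MIN_STARTUP_BATTERY=35
SHUTDOWN_MODE="host-poweroff"

parse_args() {
    while [ $# -gt 0 ]; do
        case "$1" in
            --min-startup-battery) MIN_STARTUP_BATTERY="$2"; shift 2 ;;
            --shutdown-mode)       SHUTDOWN_MODE="$2";       shift 2 ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
}

parse_args --shutdown-mode cluster-only --min-startup-battery 50
```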
## Controlled drill checklist (recommended)

- Operator host: use `titan-db` as the canonical control host for the drill.
- On-site coordination:
  - Have an on-site operator ready before shutdown starts.
  - Confirm they will manually power cluster nodes back on after shutdown completes.
  - Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
  - `mkdir -p ~/ananke-logs`
  - Run `~/ananke-cluster-power status` and verify:
    - `ups_host=pyrphoros@localhost`
    - `ups_battery` is numeric
    - `flux_source_ready=True`
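The three status checks can be scripted as line matches against the `status` output; a sketch assuming `key=value` output lines (the exact output format is an assumption, and the sample text here is hard-coded in place of a live `~/ananke-cluster-power status` call):

```shell
# Hard-coded sample in place of: status_output="$(~/ananke-cluster-power status)"
status_output="ups_host=pyrphoros@localhost
ups_battery=87
flux_source_ready=True"

# All three preflight conditions from the checklist must hold.
preflight_ok() {
    echo "$status_output" | grep -q  '^ups_host=pyrphoros@localhost$' &&
    echo "$status_output" | grep -Eq '^ups_battery=[0-9]+$'           &&
    echo "$status_output" | grep -q  '^flux_source_ready=True$'
}

preflight_ok && echo "preflight ok"
```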
- Warm the helper image just before shutdown: `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs: `tmux new -s ananke-drill`, then `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute the controlled shutdown with telemetry enforcement: `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup: `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check: `~/ananke-cluster-power status`
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and confirm there are no widespread pull/crash failures.
## Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
  - `host-poweroff` schedules host poweroff after service stop.
  - `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from a normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- Startup fails fast if the Flux source URL/branch drifts from the expected values (unless a branch override is explicitly requested with `--force-flux-branch`).
- The Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if the Flux source is not `Ready`, the local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
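The drain escalation reads naturally as a three-rung ladder; a sketch where `attempt_drain` is a dry-run stand-in for the real `kubectl drain` call (in this sketch only the final rung "succeeds", so all three rungs are exercised):

```shell
# Stand-in for: kubectl drain "$1" --ignore-daemonsets $2 --timeout=...
attempt_drain() {
    echo "drain $1 ${2:-normal}"
    [ "$2" = "--disable-eviction" ]   # pretend the earlier rungs time out
}

# Escalation ladder from the runbook: normal -> --force -> --disable-eviction.
drain_node() {
    attempt_drain "$1" ""                 && return 0
    attempt_drain "$1" "--force"          && return 0
    attempt_drain "$1" "--disable-eviction"
}

drain_node titan-03
```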
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
  - Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
  - Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
  - Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is now strict:
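The UPS gating and emergency fallback reduce to a numeric threshold decision; a minimal sketch (the function name is hypothetical):

```shell
MIN_STARTUP_BATTERY=35   # matches the documented --min-startup-battery default

# Decide whether to keep bootstrapping or flip to the emergency shutdown path.
decide_startup_action() {
    if [ "$1" -lt "$MIN_STARTUP_BATTERY" ]; then
        echo "emergency-shutdown"
    else
        echo "continue-startup"
    fi
}
```

So a UPS reading of 20% during a second outage window would abort bootstrap, while a reading of 80% lets startup continue.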
  - All non-optional Flux kustomizations must be `Ready=True`.
  - The external service checklist must pass (defaults include Gitea, Grafana, Harbor).
  - Generated ingress reachability checks must pass (default accepted codes: `200`, `301`, `302`, `307`, `308`, `401`, `403`, `404`).
  - The stability soak must pass with no crashloop/pull-failure churn.
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying reconcile.
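The Job self-heal follows a prune-then-retry shape; a sketch with echo stand-ins for the real `kubectl`/`flux` calls (the selector and command forms in the comment are illustrative, not confirmed from the script):

```shell
# Stand-ins; a real implementation might delete failed Flux-managed Jobs
# and then re-run `flux reconcile kustomization <name> --with-source`.
prune_failed_jobs() {
    echo "deleting failed Flux-managed Jobs"
}
reconcile_kustomization() {
    echo "flux reconcile kustomization $1 --with-source"
}

self_heal_reconcile() {
    prune_failed_jobs
    reconcile_kustomization "$1"
}

self_heal_reconcile apps
```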
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
  - prewarming the helper image across all nodes
  - streaming the Harbor bootstrap bundle to `titan-05`
  - importing Harbor runtime images into host `containerd`
  - successfully running a Harbor-backed canary pod (`harbor-canary-ok`)
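The stream/import step has the shape of a zstd-decompress-over-SSH pipeline; a hedged sketch that only prints the command it would run (the exact flags, sudo use, and containerd namespace are assumptions):

```shell
BUNDLE="$HOME/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst"
TARGET="titan-05"

# Prints the pipeline instead of running it; a live run would drop the echo:
#   zstd -dc "$BUNDLE" | ssh "$TARGET" 'sudo ctr -n k8s.io images import -'
seed_command() {
    echo "zstd -dc $BUNDLE | ssh $TARGET 'sudo ctr -n k8s.io images import -'"
}

seed_command
```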
- After bootstrap, Flux resources are resumed and reconciled.
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.