Atlas Cluster Power Recovery (Graceful Shutdown/Startup)

Purpose

  • Provide a safe operator flow for planned power events and cold-boot recovery.
  • Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
  • Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.

Bootstrapping risk to remember

  • Flux source is Git over SSH to scm.bstein.dev (Gitea).
  • Gitea itself is a Flux-managed workload and depends on storage + database.
  • Harbor is also critical, but it is not part of the first recovery stage: Harbor currently serves its own runtime images, so it cannot pull them until it is already running.
  • On cold boot, Flux cannot fetch its source until Gitea is up, and Gitea cannot come up until Flux reconciles it, so reconciliation can stall.
  • Recovery path: bring the control plane and workers up, then locally apply the minimal platform stack (core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea), then resume and reconcile Flux. Harbor is a later recovery stage, after storage, Vault, Postgres, and Gitea are back.
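
The ordering in the recovery path can be sketched as a simple loop. This is a minimal illustration, not the script's actual implementation: the function name bootstrap_plan, the BOOTSTRAP_ROOT variable, and the one-kustomize-directory-per-component layout are assumptions. Only the component order comes from this runbook. The sketch prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical sketch: print the apply commands for the local bootstrap order.
# BOOTSTRAP_ROOT and the per-component directory layout are assumptions.
BOOTSTRAP_ROOT="${BOOTSTRAP_ROOT:-./bootstrap}"

# Component order taken from the recovery path in this runbook.
BOOTSTRAP_STAGES="core helm longhorn metallb traefik vault-csi vault-injector vault postgres gitea"

bootstrap_plan() {
  for stage in $BOOTSTRAP_STAGES; do
    # `kubectl apply -k <dir>` applies a kustomization directory.
    echo "kubectl apply -k ${BOOTSTRAP_ROOT}/${stage}"
  done
}
```

Calling bootstrap_plan lists ten commands, storage (longhorn) before Vault, Postgres, and Gitea, which is the property the real flow relies on.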

Scripts

  • scripts/cluster_power_recovery.sh (main shutdown/startup flow)
  • scripts/cluster_power_console.sh (console wrapper for manual remote use)
  • Modes:
    • shutdown
    • startup
  • Default is dry-run. Add --execute to actually perform actions.
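
A dry-run default like this is commonly implemented with a small command wrapper. The sketch below shows one way it could work; EXECUTE and run_cmd are illustrative names and are not taken from the script itself.

```shell
#!/bin/sh
# Hypothetical dry-run gate; EXECUTE and run_cmd are illustrative names.
EXECUTE=0   # would be set to 1 when --execute is passed

run_cmd() {
  if [ "$EXECUTE" -eq 1 ]; then
    # Execute mode: run the command exactly as given.
    "$@"
  else
    # Dry-run: show what would run without touching the cluster.
    echo "[dry-run] $*"
  fi
}
```

Routing every mutating command through one wrapper keeps the preview output and the executed actions in step with each other.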

Dry-run examples

  • Shutdown preview:
    • scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain
  • Startup preview:
    • scripts/cluster_power_recovery.sh startup

Execute examples

  • Planned shutdown:
    • scripts/cluster_power_recovery.sh shutdown --execute
  • Planned startup (canonical branch):
    • scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main

Manual remote console examples

  • From titan-24 with a local checkout:
    • ~/Development/titan-iac/scripts/cluster_power_console.sh shutdown --execute
    • ~/Development/titan-iac/scripts/cluster_power_console.sh startup --execute --force-flux-branch main
  • From titan-db, if the checkout is not present locally, the console wrapper can delegate to titan-24:
    • ~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 shutdown --execute
    • ~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 startup --execute --force-flux-branch main

Useful options

  • --control-planes titan-0a,titan-0b,titan-0c
  • --workers <csv> (otherwise the script tries API discovery first, then falls back to the static atlas worker inventory)
  • --expected-flux-branch main
  • --force-flux-branch main
  • --skip-local-bootstrap (not recommended for cold-start recovery)
  • --skip-harbor-bootstrap (skip the Harbor recovery stage if you know Harbor should stay deferred)
  • --min-startup-battery 35
  • --ups-host ups@localhost
  • --require-ups-battery
  • --drain-timeout 180
  • --emergency-drain-timeout 45
  • --recovery-state-file ~/.local/state/cluster_power_recovery.state
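
To illustrate how --min-startup-battery, --ups-host, and --require-ups-battery could interact, here is a minimal sketch of a startup battery gate. It assumes the UPS is exposed through Network UPS Tools (NUT), whose upsc client prints a single variable when given a UPS name and variable name. The function names, the inclusive threshold, and the choice to refuse on an unreadable value are assumptions, not confirmed script behavior.

```shell
#!/bin/sh
# Hypothetical startup battery gate; function names are illustrative.
# Assumes NUT: `upsc <ups> battery.charge` prints the charge percentage.
UPS_HOST="${UPS_HOST:-ups@localhost}"
MIN_STARTUP_BATTERY="${MIN_STARTUP_BATTERY:-35}"

read_battery_charge() {
  upsc "$UPS_HOST" battery.charge 2>/dev/null
}

# battery_allows_startup <charge> <minimum>
# Succeeds only when the reading is numeric and at or above the minimum.
battery_allows_startup() {
  case "$1" in
    ''|.*|*[!0-9.]*) return 1 ;;   # unreadable value: refuse (assumed policy)
  esac
  # Drop any fractional part so the integer comparison is valid.
  [ "${1%%.*}" -ge "$2" ]
}
```

A caller could then write: battery_allows_startup "$(read_battery_charge)" "$MIN_STARTUP_BATTERY" || echo "refusing bootstrap".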

Operational notes

  • The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
  • Worker drain is no longer best-effort only. The script now escalates from a normal drain to --force, and then to --disable-eviction, once the configured timeout is exhausted.
  • During startup, if the Flux source is not Ready, the local bootstrap fallback is applied first.
  • Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
  • Harbor is reconciled after the first critical stateful services. Treat Harbor bootstrap as requiring either cached Harbor runtime images on the scheduled node or a separate bootstrap source for those images.
  • The script persists outage state in ~/.local/state/cluster_power_recovery.state by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge switches to the emergency shutdown path instead of continuing to bootstrap.
  • In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
  • After bootstrap, Flux resources are resumed and reconciled.
  • Keep this runbook aligned with clusters/atlas/flux-system/gotk-sync.yaml.
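
The worker drain escalation described in the operational notes can be sketched as follows. This is an illustration under stated assumptions, not the script's internals: KUBECTL, DRAIN_TIMEOUT, and drain_with_escalation are hypothetical names, and the choice to accumulate flags (keeping --force when adding --disable-eviction) is a guess. The kubectl drain flags themselves are standard.

```shell
#!/bin/sh
# Hypothetical drain escalation; KUBECTL, DRAIN_TIMEOUT and
# drain_with_escalation are illustrative names, not the script's internals.
KUBECTL="${KUBECTL:-kubectl}"
DRAIN_TIMEOUT="${DRAIN_TIMEOUT:-180}"

drain_with_escalation() {
  node="$1"
  extra=""
  # Each attempt adds one more forceful flag to the previous attempt.
  for flag in "" "--force" "--disable-eviction"; do
    extra="$extra $flag"
    # $extra is intentionally unquoted so it splits into separate flags.
    if $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data \
        --timeout="${DRAIN_TIMEOUT}s" $extra; then
      return 0
    fi
    echo "drain attempt failed for $node (flags:$extra)" >&2
  done
  # All three attempts failed; the caller decides what to do next.
  return 1
}
```

Note that --disable-eviction deletes pods directly and bypasses PodDisruptionBudgets, which is why it makes sense only as the last step.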