2026-04-06 00:22:54 -03:00
Atlas Cluster Power Recovery (Graceful Shutdown/Startup)
Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch its source before Gitea is up, reconciliation can stall.
- Recovery path: bring the control plane and workers up, then locally apply the minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume and reconcile Flux. Harbor is a later recovery stage, after storage, Vault, Postgres, and Gitea are back.
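The stage order above can be written down as a small planning helper. This is a minimal sketch, not the recovery script's actual logic: the function names, the default root, and the one-directory-per-stage layout are assumptions made for illustration, and the helper only prints commands.

```shell
#!/bin/sh
# Sketch only: prints the local bootstrap order described above.
# The default root and the one-directory-per-stage layout are assumptions.

BOOTSTRAP_STAGES="core helm longhorn metallb traefik vault-csi vault-injector vault postgres gitea"

# Print one planned command per stage, in order. Nothing is applied.
plan_local_bootstrap() (
  root="${1:-$HOME/ananke-repo/infrastructure}"
  for stage in $BOOTSTRAP_STAGES; do
    printf 'kubectl apply -k %s/%s\n' "$root" "$stage"
  done
)

# Succeed if stage $1 is ordered before stage $2.
stage_before() (
  seen_first=no
  for stage in $BOOTSTRAP_STAGES; do
    if [ "$stage" = "$2" ]; then
      # Exit status reflects whether $1 was already seen.
      [ "$seen_first" = yes ]
      exit
    fi
    [ "$stage" = "$1" ] && seen_first=yes
  done
  exit 1
)

plan_local_bootstrap
```

A check such as `stage_before longhorn vault` makes the ordering constraint explicit: storage has to be present before any stateful service that depends on it.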
Script
- `scripts/cluster_power_recovery.sh`
- `scripts/cluster_power_console.sh`
- Modes:
  - `prepare`
  - `shutdown`
  - `harbor-seed`
  - `startup`
  - `status`
- Default is dry-run. Add `--execute` to actually perform actions.
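The dry-run default can be pictured as a thin wrapper around every side-effecting command. The sketch below is illustrative only: `EXECUTE`, `parse_execute_flag`, and `run` are hypothetical names, not the script's internals.

```shell
#!/bin/sh
# Sketch only: a dry-run-by-default command wrapper.
# EXECUTE, parse_execute_flag, and run are illustrative names.

EXECUTE=0

# Scan arguments for --execute; every other argument is ignored here.
parse_execute_flag() {
  for arg in "$@"; do
    [ "$arg" = "--execute" ] && EXECUTE=1
  done
  return 0
}

# Print the command in dry-run mode; run it only when EXECUTE=1.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    "$@"
  else
    printf '[dry-run] %s\n' "$*"
  fi
}

parse_execute_flag "$@"
run echo "stopping k3s-agent"
```

Routing every mutating step through one function like `run` keeps the preview and the real run on the same code path, so a preview shows exactly the commands that `--execute` would issue.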
Dry-run examples
- Shutdown preview:
  - `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
- Startup preview:
  - `scripts/cluster_power_recovery.sh startup`
- Harbor seed preview:
  - `scripts/cluster_power_recovery.sh harbor-seed`
Execute examples
- Prepare helper image on every node:
  - `scripts/cluster_power_recovery.sh prepare --execute`
- Seed Harbor runtime images onto `titan-05` from the control-host bundle:
  - `scripts/cluster_power_recovery.sh harbor-seed --execute`
- Planned shutdown:
  - `scripts/cluster_power_recovery.sh shutdown --execute`
- Planned startup (canonical branch):
  - `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`
Manual remote console examples
- Canonical operator hosts:
  - `titan-db`
  - `tethys` (`titan-24`)
- Both hosts now have:
  - `~/ananke-tools/cluster_power_recovery.sh`
  - `~/ananke-tools/cluster_power_console.sh`
  - `~/ananke-tools/bootstrap/recovery-config.env`
  - `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
  - `~/ananke-tools/kubeconfig`
  - `~/ananke-cluster-power`
  - `~/bin/ananke-cluster-power`
  - `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at:
  - `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage:
  - `ssh titan-db`
    - `~/ananke-cluster-power status`
    - `~/ananke-cluster-power prepare --execute`
    - `~/ananke-cluster-power shutdown --execute`
    - `~/ananke-cluster-power startup --execute --force-flux-branch main`
  - `ssh tethys`
    - `~/ananke-cluster-power status`
    - `~/ananke-cluster-power prepare --execute`
    - `~/ananke-cluster-power shutdown --execute`
    - `~/ananke-cluster-power startup --execute --force-flux-branch main`
Useful options
- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
- `--skip-helper-prewarm`
- `--min-startup-battery 35`
- `--ups-host pyrphoros@localhost`
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--flux-ready-timeout 1200`
- `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
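To make the interaction between `--min-startup-battery`, `--require-ups-battery`, and the persisted outage state easier to reason about, here is one way the startup gate could be expressed. It is a sketch under stated assumptions: `startup_decision` and its three outcome names are hypothetical, the handling of a missing reading is a guess, and only whole-number percentages are treated as valid readings.

```shell
#!/bin/sh
# Sketch only: how a startup gate might combine the battery options.
# startup_decision and its outcome names are hypothetical.

# Usage: startup_decision CHARGE MIN_CHARGE REQUIRE(0|1) OUTAGE_RECORDED(0|1)
startup_decision() (
  charge="$1"; min="$2"; require="$3"; outage="$4"
  case "$charge" in
    ''|*[!0-9]*)
      # No usable integer reading: only block when telemetry is mandatory.
      if [ "$require" -eq 1 ]; then echo refuse; else echo proceed; fi
      exit 0 ;;
  esac
  if [ "$charge" -lt "$min" ]; then
    # Low charge during a recorded outage: stop bootstrapping and shut down.
    if [ "$outage" -eq 1 ]; then echo emergency-shutdown; else echo refuse; fi
  else
    echo proceed
  fi
)

startup_decision 80 35 1 0
```

With the default threshold of 35, `startup_decision 20 35 1 1` yields `emergency-shutdown`, which mirrors the second-outage behaviour this runbook describes.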
Controlled drill checklist (recommended)
- Operator host: use `titan-db` as the canonical control host for the drill.
- On-site coordination:
  - Have the on-site operator ready before shutdown starts.
  - Confirm they will manually power cluster nodes back on after shutdown completes.
  - Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
  - `mkdir -p ~/ananke-logs`
  - `~/ananke-cluster-power status` and verify:
    - `ups_host=pyrphoros@localhost`
    - `ups_battery` is numeric
    - `flux_source_ready=True`
- Warm helper image just before shutdown:
  - `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs:
  - `tmux new -s ananke-drill`
  - `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement:
  - `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup:
  - `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check:
  - `~/ananke-cluster-power status`
  - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and confirm there are no widespread pull/crash failures.
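The three preflight conditions in the checklist can be checked mechanically instead of by eye. The helper below is a sketch that assumes `status` prints plain `key=value` lines; the real output format may differ, and `preflight_ok` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: validate the three preflight fields from `status` output.
# Assumes key=value lines on stdin; the real output format may differ.

preflight_ok() (
  ups_host='' ups_battery='' flux_ready=''
  while IFS='=' read -r key value; do
    case "$key" in
      ups_host) ups_host="$value" ;;
      ups_battery) ups_battery="$value" ;;
      flux_source_ready) flux_ready="$value" ;;
    esac
  done
  [ "$ups_host" = "pyrphoros@localhost" ] || { echo "unexpected ups_host: $ups_host" >&2; exit 1; }
  case "$ups_battery" in
    ''|*[!0-9]*) echo "ups_battery is not numeric: $ups_battery" >&2; exit 1 ;;
  esac
  [ "$flux_ready" = "True" ] || { echo "flux_source_ready=$flux_ready" >&2; exit 1; }
  echo "preflight ok"
)

# Demo with canned input; in a drill this would read the real status output.
printf 'ups_host=pyrphoros@localhost\nups_battery=97\nflux_source_ready=True\n' | preflight_ok
```

During a drill the same function could be fed from the wrapper, for example `~/ananke-cluster-power status | preflight_ok`, so that a failed precondition stops the operator before shutdown begins.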
Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
  - `host-poweroff` schedules host poweroff after service stop.
  - `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from a normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- Startup fails fast if the Flux source URL or branch drifts from the expected values (unless a branch override is explicitly requested with `--force-flux-branch`).
- Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
  - Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
  - Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
  - Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
  - all non-optional Flux kustomizations must be `Ready=True`
  - external service checklist must pass (defaults include Gitea, Grafana, Harbor)
  - generated ingress reachability checks must pass (default accepted codes: `200,301,302,307,308,401,403,404`)
  - stability soak must pass with no crashloop/pull-failure churn
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying reconcile.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
  - prewarming the helper image across all nodes
  - streaming the Harbor bootstrap bundle to `titan-05`
  - importing Harbor runtime images into host `containerd`
  - successfully running a Harbor-backed canary pod (`harbor-canary-ok`)
- After bootstrap, Flux resources are resumed and reconciled.
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml` .
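The drain escalation mentioned in the notes above (normal drain, then `--force`, then `--disable-eviction`) might look roughly like the following. This is a sketch, not the script's code: `drain_with_escalation` and the `KUBECTL` override are hypothetical, applying the flags cumulatively is an assumption, and so are the extra `--ignore-daemonsets` and `--delete-emptydir-data` flags.

```shell
#!/bin/sh
# Sketch only: three-step drain escalation as described in the notes above.
# drain_with_escalation and the KUBECTL override are illustrative; the
# cumulative flags and the two extra kubectl flags are assumptions.

KUBECTL="${KUBECTL:-kubectl}"

# Usage: drain_with_escalation NODE TIMEOUT_SECONDS
drain_with_escalation() (
  node="$1"; timeout="$2"
  for extra in "" "--force" "--force --disable-eviction"; do
    # $extra is intentionally unquoted so it splits into separate flags.
    if $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data \
        --timeout="${timeout}s" $extra; then
      exit 0
    fi
    echo "drain attempt on $node failed (extra flags: ${extra:-none})" >&2
  done
  # Every attempt failed, including the most aggressive one.
  exit 1
)
```

A call such as `drain_with_escalation some-worker-node 180` would correspond to the `--drain-timeout 180` default. Setting `KUBECTL=echo` prints the first command instead of running it, which is a convenient way to inspect the flags.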