hecate: add controlled drill checklist to runbook
parent d880fac673
commit e269829dc6
@@ -3,21 +3,25 @@ Atlas Cluster Power Recovery (Graceful Shutdown/Startup)
|
|||||||
Purpose
|
Purpose
|
||||||
- Provide a safe operator flow for planned power events and cold-boot recovery.
|
- Provide a safe operator flow for planned power events and cold-boot recovery.
|
||||||
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
|
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
|
||||||
|
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
|
||||||
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
|
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
|
||||||
|
|
||||||
Bootstrapping risk to remember
|
Bootstrapping risk to remember
|
||||||
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
|
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
|
||||||
- Gitea itself is a Flux-managed workload and depends on storage + database.
|
- Gitea itself is a Flux-managed workload and depends on storage + database.
|
||||||
- Harbor is also critical, but it is not part of the first recovery stage because Harbor currently serves its own runtime images.
|
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
|
||||||
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
|
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
|
||||||
- Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
|
- Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
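
The stage ordering in the recovery path can be sketched as a simple ordered loop. This is only an illustration of the ordering, not the logic inside `scripts/cluster_power_recovery.sh`: `STAGES`, `apply_stage`, and `bootstrap_plan` are hypothetical names, and `apply_stage` just prints what it would do.

```shell
# Hypothetical sketch: walk the minimal platform stack in dependency order.
# The stage list is copied from the runbook; the function names are assumptions.
STAGES="core helm longhorn metallb traefik vault-csi vault-injector vault postgres gitea"

apply_stage() {
  # Placeholder only: a real implementation would apply the stage's manifests here.
  printf 'would apply: %s\n' "$1"
}

bootstrap_plan() {
  # Stop at the first failing stage so later stages never race a missing dependency.
  for stage in $STAGES; do
    apply_stage "$stage" || return 1
  done
}
```

Calling `bootstrap_plan` prints one line per stage, starting with `core` and ending with `gitea`. Harbor seeding and the Flux resume would follow only after the loop returns successfully.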
|
||||||
|
|
||||||
Script
|
Script
|
||||||
- `scripts/cluster_power_recovery.sh`
|
- `scripts/cluster_power_recovery.sh`
|
||||||
- `scripts/cluster_power_console.sh`
|
- `scripts/cluster_power_console.sh`
|
||||||
- Modes:
|
- Modes:
|
||||||
|
- `prepare`
|
||||||
- `shutdown`
|
- `shutdown`
|
||||||
|
- `harbor-seed`
|
||||||
- `startup`
|
- `startup`
|
||||||
|
- `status`
|
||||||
- Default is dry-run. Add `--execute` to actually perform actions.
|
- Default is dry-run. Add `--execute` to actually perform actions.
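
A dry-run-by-default wrapper is commonly built along the following lines. This is a minimal sketch of the pattern, assuming nothing about how the real script is written: `EXECUTE`, `parse_args`, and `run` are hypothetical names.

```shell
# Hypothetical sketch of a dry-run default: commands are only printed
# unless --execute was passed on the command line.
EXECUTE=0

parse_args() {
  for arg in "$@"; do
    if [ "$arg" = "--execute" ]; then
      EXECUTE=1
    fi
  done
}

run() {
  if [ "$EXECUTE" -eq 1 ]; then
    "$@"                              # really perform the action
  else
    printf 'DRY-RUN: %s\n' "$*"       # preview only, nothing is changed
  fi
}
```

With this shape, `run echo hi` prints `DRY-RUN: echo hi` until `parse_args` has seen `--execute`, after which the same call prints `hi`.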
|
||||||
|
|
||||||
Dry-run examples
|
Dry-run examples
|
||||||
@ -25,42 +29,105 @@ Dry-run examples
|
|||||||
- `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
|
- `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
|
||||||
- Startup preview:
|
- Startup preview:
|
||||||
- `scripts/cluster_power_recovery.sh startup`
|
- `scripts/cluster_power_recovery.sh startup`
|
||||||
|
- Harbor seed preview:
|
||||||
|
- `scripts/cluster_power_recovery.sh harbor-seed`
|
||||||
|
|
||||||
Execute examples
|
Execute examples
|
||||||
|
- Prepare helper image on every node:
|
||||||
|
- `scripts/cluster_power_recovery.sh prepare --execute`
|
||||||
|
- Seed Harbor runtime images onto `titan-05` from the control-host bundle:
|
||||||
|
- `scripts/cluster_power_recovery.sh harbor-seed --execute`
|
||||||
- Planned shutdown:
|
- Planned shutdown:
|
||||||
- `scripts/cluster_power_recovery.sh shutdown --execute`
|
- `scripts/cluster_power_recovery.sh shutdown --execute`
|
||||||
- Planned startup (canonical branch):
|
- Planned startup (canonical branch):
|
||||||
- `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`
|
- `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`
|
||||||
|
|
||||||
Manual remote console examples
|
Manual remote console examples
|
||||||
- From `titan-24` with a local checkout:
|
- Canonical operator hosts:
|
||||||
- `~/Development/titan-iac/scripts/cluster_power_console.sh shutdown --execute`
|
- `titan-db`
|
||||||
- `~/Development/titan-iac/scripts/cluster_power_console.sh startup --execute --force-flux-branch main`
|
- `titan-24`
|
||||||
- From `titan-db`, if the checkout is not present locally, the console wrapper can delegate to `titan-24`:
|
- Both hosts now have:
|
||||||
- `~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 shutdown --execute`
|
- `~/hecate-tools/cluster_power_recovery.sh`
|
||||||
- `~/Development/titan-iac/scripts/cluster_power_console.sh --delegate-host titan-24 startup --execute --force-flux-branch main`
|
- `~/hecate-tools/cluster_power_console.sh`
|
||||||
|
- `~/hecate-tools/bootstrap/recovery-config.env`
|
||||||
|
- `~/hecate-tools/bootstrap/harbor-bootstrap-images.txt`
|
||||||
|
- `~/hecate-tools/kubeconfig`
|
||||||
|
- `~/hecate-cluster-power`
|
||||||
|
- `~/bin/hecate-cluster-power`
|
||||||
|
- `~/hecate-repo/{infrastructure,services,scripts}`
|
||||||
|
- Both hosts also keep the Harbor bootstrap bundle at:
|
||||||
|
- `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
|
||||||
|
- Remote usage:
|
||||||
|
- `ssh titan-db`
|
||||||
|
- `~/hecate-cluster-power status`
|
||||||
|
- `~/hecate-cluster-power prepare --execute`
|
||||||
|
- `~/hecate-cluster-power shutdown --execute`
|
||||||
|
- `~/hecate-cluster-power startup --execute --force-flux-branch main`
|
||||||
|
- `ssh titan-24`
|
||||||
|
- `~/hecate-cluster-power status`
|
||||||
|
- `~/hecate-cluster-power prepare --execute`
|
||||||
|
- `~/hecate-cluster-power shutdown --execute`
|
||||||
|
- `~/hecate-cluster-power startup --execute --force-flux-branch main`
|
||||||
|
|
||||||
Useful options
|
Useful options
|
||||||
- `--control-planes titan-0a,titan-0b,titan-0c`
|
|
||||||
- `--workers <csv>` (otherwise the script tries API discovery first, then falls back to the static atlas worker inventory)
|
|
||||||
- `--expected-flux-branch main`
|
- `--expected-flux-branch main`
|
||||||
- `--force-flux-branch main`
|
- `--force-flux-branch main`
|
||||||
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
|
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
|
||||||
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
|
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
|
||||||
|
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
|
||||||
|
- `--skip-helper-prewarm`
|
||||||
- `--min-startup-battery 35`
|
- `--min-startup-battery 35`
|
||||||
- `--ups-host ups@localhost`
|
- `--ups-host pyrphoros@localhost`
|
||||||
- `--require-ups-battery`
|
- `--require-ups-battery`
|
||||||
- `--drain-timeout 180`
|
- `--drain-timeout 180`
|
||||||
- `--emergency-drain-timeout 45`
|
- `--emergency-drain-timeout 45`
|
||||||
- `--recovery-state-file ~/.local/state/cluster_power_recovery.state`
|
- `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`
|
||||||
|
- `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
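
How `--min-startup-battery` and `--require-ups-battery` might combine can be sketched as a small gate function. The names `MIN_STARTUP_BATTERY`, `REQUIRE_UPS_BATTERY`, and `startup_allowed` are hypothetical, and the behaviour when telemetry is missing but not required is an assumption, not something this runbook states.

```shell
# Hypothetical sketch of the startup battery gate.
MIN_STARTUP_BATTERY=35    # mirrors --min-startup-battery 35
REQUIRE_UPS_BATTERY=0     # set to 1 to mirror --require-ups-battery

# startup_allowed <charge>: returns 0 to proceed, 1 to refuse bootstrap.
startup_allowed() {
  charge="$1"
  case "$charge" in
    ''|*[!0-9]*)
      # No usable reading. Assumption: refuse only when telemetry is required.
      if [ "$REQUIRE_UPS_BATTERY" -eq 1 ]; then
        return 1
      fi
      return 0
      ;;
  esac
  [ "$charge" -ge "$MIN_STARTUP_BATTERY" ]
}
```

If the UPS is monitored with Network UPS Tools (an assumption based on the `name@host` form of `--ups-host`), the reading could come from something like `upsc pyrphoros@localhost battery.charge`.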
|
||||||
|
|
||||||
|
Controlled drill checklist (recommended)
|
||||||
|
- Operator host: use `titan-db` as canonical control host for the drill.
|
||||||
|
- On-site coordination:
|
||||||
|
- Have on-site operator ready before shutdown starts.
|
||||||
|
- Confirm they will manually power cluster nodes back on after shutdown completes.
|
||||||
|
- Confirm who will announce "all nodes powered on" to resume startup.
|
||||||
|
- Preflight on `titan-db`:
|
||||||
|
- `mkdir -p ~/hecate-logs`
|
||||||
|
- `~/hecate-cluster-power status` and verify:
|
||||||
|
- `ups_host=pyrphoros@localhost`
|
||||||
|
- `ups_battery` is numeric
|
||||||
|
- `flux_source_ready=True`
|
||||||
|
- Warm helper image just before shutdown:
|
||||||
|
- `~/hecate-cluster-power prepare --execute`
|
||||||
|
- Run in a persistent shell and capture logs:
|
||||||
|
- `tmux new -s hecate-drill`
|
||||||
|
- `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log`
|
||||||
|
- Execute controlled shutdown with telemetry enforcement:
|
||||||
|
- `~/hecate-cluster-power shutdown --execute --require-ups-battery`
|
||||||
|
- After on-site power-on confirmation, execute startup:
|
||||||
|
- `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
|
||||||
|
- Post-check:
|
||||||
|
- `~/hecate-cluster-power status`
|
||||||
|
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
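
The preflight verification in the checklist can be scripted roughly as follows. This sketch assumes that `status` prints one `key=value` pair per line on stdout, which this runbook does not confirm; `status_field` and `preflight_ok` are hypothetical helper names.

```shell
# Hypothetical sketch: validate the three preflight fields from status output.
status_field() {
  # $1 = key, $2 = full status text; prints the first matching value.
  printf '%s\n' "$2" | sed -n "s/^$1=//p" | head -n 1
}

preflight_ok() {
  status_text="$(cat)"    # status output is read from stdin

  if [ "$(status_field ups_host "$status_text")" != "pyrphoros@localhost" ]; then
    echo "preflight: unexpected ups_host" >&2
    return 1
  fi

  battery="$(status_field ups_battery "$status_text")"
  if ! printf '%s\n' "$battery" | grep -Eq '^[0-9]+(\.[0-9]+)?$'; then
    echo "preflight: ups_battery is not numeric" >&2
    return 1
  fi

  if [ "$(status_field flux_source_ready "$status_text")" != "True" ]; then
    echo "preflight: flux source is not ready" >&2
    return 1
  fi
}
```

Under that assumption, the drill preflight would be `~/hecate-cluster-power status | preflight_ok`, stopping the drill if it returns non-zero.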
|
||||||
|
|
||||||
Operational notes
|
Operational notes
|
||||||
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
|
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
|
||||||
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
|
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
|
||||||
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first.
|
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/hecate-repo`.
|
||||||
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
|
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
|
||||||
- Harbor is reconciled after the first critical stateful services. Treat Harbor bootstrap as requiring either cached Harbor runtime images on the scheduled node or a separate bootstrap source for those images.
|
- Harbor is reconciled after the first critical stateful services.
|
||||||
|
- Harbor bootstrap is now designed around a control-host bundle:
|
||||||
|
- Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
|
||||||
|
- Stage it on the operator host at `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
|
||||||
|
- Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
|
||||||
|
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Hecate uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
|
||||||
|
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
|
||||||
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
|
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
|
||||||
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
|
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
|
||||||
|
- Dry-run mode no longer mutates outage recovery state.
|
||||||
|
- `harbor-seed --execute` was validated by:
|
||||||
|
- prewarming the helper image across all nodes
|
||||||
|
- streaming the Harbor bootstrap bundle to `titan-05`
|
||||||
|
- importing Harbor runtime images into host `containerd`
|
||||||
|
- successfully running a Harbor-backed canary pod (`harbor-canary-ok`)
|
||||||
- After bootstrap, Flux resources are resumed and reconciled.
|
- After bootstrap, Flux resources are resumed and reconciled.
|
||||||
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.
|
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.
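
The drain escalation described in the notes above (normal drain, then `--force`, then `--disable-eviction`) could look roughly like this. It is a sketch, not the script's actual code: `drain_with_escalation`, `KUBECTL`, and `DRAIN_TIMEOUT` are hypothetical names, and applying the timeout to each attempt separately is an assumption.

```shell
# Hypothetical sketch of escalating worker drain.
KUBECTL="${KUBECTL:-kubectl}"
DRAIN_TIMEOUT="${DRAIN_TIMEOUT:-180}"   # seconds, mirrors --drain-timeout 180

drain_with_escalation() {
  node="$1"
  for extra in "" "--force" "--force --disable-eviction"; do
    # $extra is deliberately unquoted so it splits into separate flags.
    if $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data \
         --timeout="${DRAIN_TIMEOUT}s" $extra; then
      return 0
    fi
    echo "drain of $node failed with extra flags: ${extra:-<none>}; escalating" >&2
  done
  return 1   # every escalation level failed
}
```

Note that `--disable-eviction` deletes pods directly and bypasses PodDisruptionBudgets, which is why it sits at the last step of the ladder.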
|
||||||
|
|||||||