hecate: add controlled drill checklist to runbook

This commit is contained in:
Brad Stein 2026-04-06 04:59:37 -03:00
parent 65de56b2ac
commit c1dc50cace

View File

@ -77,13 +77,38 @@ Useful options
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
- `--skip-helper-prewarm`
- `--min-startup-battery 35`
- `--ups-host ups@localhost`
- `--ups-host pyrphoros@localhost`
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
Controlled drill checklist (recommended)
- Operator host: use `titan-db` as canonical control host for the drill.
- On-site coordination:
- Have on-site operator ready before shutdown starts.
- Confirm they will manually power cluster nodes back on after shutdown completes.
- Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
- `mkdir -p ~/hecate-logs`
- `~/hecate-cluster-power status` and verify:
- `ups_host=pyrphoros@localhost`
- `ups_battery` is numeric
- `flux_source_ready=True`
- Warm helper image just before shutdown:
- `~/hecate-cluster-power prepare --execute`
- Run in a persistent shell and capture logs:
- `tmux new -s hecate-drill`
- `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement:
- `~/hecate-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup:
- `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check:
- `~/hecate-cluster-power status`
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
@ -98,6 +123,7 @@ Operational notes
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
- prewarming the helper image across all nodes
- streaming the Harbor bootstrap bundle to `titan-05`