From c1dc50caceb843a5df27cc95d310e86c62623667 Mon Sep 17 00:00:00 2001 From: Brad Stein Date: Mon, 6 Apr 2026 04:59:37 -0300 Subject: [PATCH] hecate: add controlled drill checklist to runbook --- knowledge/runbooks/cluster-power-recovery.md | 28 +++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/knowledge/runbooks/cluster-power-recovery.md b/knowledge/runbooks/cluster-power-recovery.md index ddcf3888..eb4ab395 100644 --- a/knowledge/runbooks/cluster-power-recovery.md +++ b/knowledge/runbooks/cluster-power-recovery.md 
@@ -77,13 +77,38 @@ Useful options - `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node) - `--skip-helper-prewarm` - `--min-startup-battery 35` -- `--ups-host ups@localhost` +- `--ups-host pyrphoros@localhost` - `--require-ups-battery` - `--drain-timeout 180` - `--emergency-drain-timeout 45` - `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state` - `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst` +Controlled drill checklist (recommended) +- Operator host: use `titan-db` as canonical control host for the drill. +- On-site coordination: + - Have on-site operator ready before shutdown starts. + - Confirm they will manually power cluster nodes back on after shutdown completes. + - Confirm who will announce "all nodes powered on" to resume startup. +- Preflight on `titan-db`: + - `mkdir -p ~/hecate-logs` + - `~/hecate-cluster-power status` and verify: + - `ups_host=pyrphoros@localhost` + - `ups_battery` is numeric + - `flux_source_ready=True` +- Warm helper image just before shutdown: + - `~/hecate-cluster-power prepare --execute` +- Run in a persistent shell and capture logs: + - `tmux new -s hecate-drill` + - `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log` +- Execute controlled shutdown with telemetry enforcement: + - `~/hecate-cluster-power shutdown --execute --require-ups-battery` +- After on-site power-on confirmation, execute startup: + - `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery` +- Post-check: + - `~/hecate-cluster-power status` + - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures. + Operational notes - The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn. - Worker drain is no longer best-effort only. 
The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted. @@ -98,6 +123,7 @@ Operational notes - Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls. - The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap. - In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster. +- Dry-run mode no longer mutates outage recovery state. - `harbor-seed --execute` was validated by: - prewarming the helper image across all nodes - streaming the Harbor bootstrap bundle to `titan-05`
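Review bot: in the `@@ -77,13 +77,38 @@` hunk, the startup step `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery` relies on `--force-flux-branch`, which is missing from the "Useful options" list that this same hunk edits; add a bullet for it and say why a drill pins Flux to `main` when the preflight only checks `flux_source_ready=True`. Two smaller points in the same hunk: the `--ups-host ups@localhost` to `--ups-host pyrphoros@localhost` rename is not covered by the commit subject, and the preflight expects `status` to print `ups_host=pyrphoros@localhost` without the checklist ever passing `--ups-host`, so confirm the script default changed too and say so; and `~/hecate-cluster-power prepare --execute` runs before `tmux new -s hecate-drill` and `script -q -a ...`, so the helper prewarm is the one step with no log and no session protection, which moving the tmux and script lines ahead of the prewarm step fixes. The post-check "Verify critical services (...)" names workloads but no command; state what the operator actually runs so the drill is repeatable.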
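Review bot: in the `@@ -98,6 +123,7 @@` hunk, the new bullet `Dry-run mode no longer mutates outage recovery state.` sits beside the existing note that state defaults to `~/.local/state/cluster_power_recovery.state`, while the options list earlier in the patch points `--recovery-state-file` at `~/.local/share/hecate/cluster_power_recovery.state`; one of the two paths is not the default, and since the drill checklist never passes `--recovery-state-file`, the operator needs to know which file the drill writes and should wipe afterwards. The bullet should also say what a dry run used to change: the surrounding note explains that rerunning startup with insufficient UPS charge flips into the emergency shutdown path based on that state, so if a preview run could record an outage window it could have steered a later live run; spelling that out makes clear this is a behavioural fix rather than a cosmetic one.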