From c1dc50caceb843a5df27cc95d310e86c62623667 Mon Sep 17 00:00:00 2001
From: Brad Stein <Brad.Stein@gmail.com>
Date: Mon, 6 Apr 2026 04:59:37 -0300
Subject: [PATCH] hecate: add controlled drill checklist to runbook

---
 knowledge/runbooks/cluster-power-recovery.md | 28 +++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/knowledge/runbooks/cluster-power-recovery.md b/knowledge/runbooks/cluster-power-recovery.md
index ddcf3888..eb4ab395 100644
--- a/knowledge/runbooks/cluster-power-recovery.md
+++ b/knowledge/runbooks/cluster-power-recovery.md
@@ -77,13 +77,38 @@ Useful options
 - `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
 - `--skip-helper-prewarm`
 - `--min-startup-battery 35`
-- `--ups-host ups@localhost`
+- `--ups-host pyrphoros@localhost`
 - `--require-ups-battery`
 - `--drain-timeout 180`
 - `--emergency-drain-timeout 45`
 - `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`
 - `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
 
+Controlled drill checklist (recommended)
+- Operator host: use `titan-db` as canonical control host for the drill.
+- On-site coordination:
+  - Have on-site operator ready before shutdown starts.
+  - Confirm they will manually power cluster nodes back on after shutdown completes.
+  - Confirm who will announce "all nodes powered on" to resume startup.
+- Preflight on `titan-db`:
+  - `mkdir -p ~/hecate-logs`
+  - `~/hecate-cluster-power status` and verify:
+    - `ups_host=pyrphoros@localhost`
+    - `ups_battery` is numeric
+    - `flux_source_ready=True`
+- Warm helper image just before shutdown:
+  - `~/hecate-cluster-power prepare --execute`
+- Run in a persistent shell and capture logs:
+  - `tmux new -s hecate-drill`
+  - `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log`
+- Execute controlled shutdown with telemetry enforcement:
+  - `~/hecate-cluster-power shutdown --execute --require-ups-battery`
+- After on-site power-on confirmation, execute startup:
+  - `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
+- Post-check:
+  - `~/hecate-cluster-power status`
+  - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
+
 Operational notes
 - The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
 - Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
@@ -98,6 +123,7 @@ Operational notes
 - Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
 - The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
 - In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
+- Dry-run mode no longer mutates outage recovery state.
 - `harbor-seed --execute` was validated by:
   - prewarming the helper image across all nodes
   - streaming the Harbor bootstrap bundle to `titan-05`