hecate: add controlled drill checklist to runbook
parent 65de56b2ac
commit c1dc50cace
@ -77,13 +77,38 @@ Useful options
|
||||
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
|
||||
- `--skip-helper-prewarm`
|
||||
- `--min-startup-battery 35`
|
||||
- `--ups-host ups@localhost`
|
||||
- `--ups-host pyrphoros@localhost`
|
||||
- `--require-ups-battery`
|
||||
- `--drain-timeout 180`
|
||||
- `--emergency-drain-timeout 45`
|
||||
- `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`
|
||||
- `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
|
||||
|
||||
Controlled drill checklist (recommended)
|
||||
- Operator host: use `titan-db` as canonical control host for the drill.
|
||||
- On-site coordination:
|
||||
- Have on-site operator ready before shutdown starts.
|
||||
- Confirm they will manually power cluster nodes back on after shutdown completes.
|
||||
- Confirm who will announce "all nodes powered on" to resume startup.
|
||||
- Preflight on `titan-db`:
|
||||
- `mkdir -p ~/hecate-logs`
|
||||
- `~/hecate-cluster-power status` and verify:
|
||||
- `ups_host=pyrphoros@localhost`
|
||||
- `ups_battery` is numeric
|
||||
- `flux_source_ready=True`
|
||||
- Warm helper image just before shutdown:
|
||||
- `~/hecate-cluster-power prepare --execute`
|
||||
- Run in a persistent shell and capture logs:
|
||||
- `tmux new -s hecate-drill`
|
||||
- `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log`
|
||||
- Execute controlled shutdown with telemetry enforcement:
|
||||
- `~/hecate-cluster-power shutdown --execute --require-ups-battery`
|
||||
- After on-site power-on confirmation, execute startup:
|
||||
- `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
|
||||
- Post-check:
|
||||
- `~/hecate-cluster-power status`
|
||||
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
|
||||
|
||||
Operational notes
|
||||
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
|
||||
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
|
||||
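Review bot: In the drill checklist, the `tmux new -s hecate-drill` and `script -q -a ...` steps come after the preflight `status` check and `prepare --execute`, so the preflight values (`ups_host`, `ups_battery`, `flux_source_ready`) and the helper prewarm output are never captured in the drill log. Suggest moving "Run in a persistent shell and capture logs" to directly after `mkdir -p ~/hecate-logs`. Because `script` records the whole terminal session on `titan-db`, the checklist should also create `~/hecate-logs` with owner-only permissions. The post-check item names the critical services but gives no command or pass criterion for "no widespread pull/crash failures", which leaves the drill's exit condition to operator judgment.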
@ -98,6 +123,7 @@ Operational notes
|
||||
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
|
||||
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
|
||||
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
|
||||
- Dry-run mode no longer mutates outage recovery state.
|
||||
- `harbor-seed --execute` was validated by:
|
||||
- prewarming the helper image across all nodes
|
||||
- streaming the Harbor bootstrap bundle to `titan-05`
|
||||
|
||||
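Review bot: The state-file note gives the default as `~/.local/state/cluster_power_recovery.state`, while the options list earlier in this file shows `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`, and the drill checklist never passes `--recovery-state-file`. Please state which path a drill run actually writes to, so an operator inspecting or clearing outage state after the drill looks in the right place. The two dry-run bullets also do not say how dry-run mode is selected; if it is simply the absence of `--execute`, saying so here would let the checklist include a preview run before the real shutdown.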