docs: replace legacy hecate README with ananke runbook
parent 26d934b675
commit d19862285a
190
README.md
@@ -1,151 +1,97 @@
|
||||
# Hecate
|
||||
# ananke
|
||||
|
||||
Hecate is the host-level bootstrap and power-protection service for Titan.
|
||||
`ananke` is the host-side power + bootstrap orchestrator for Titan.
|
||||
|
||||
It runs on `titan-db` and handles:
|
||||
- Staged **startup** (including Flux/Gitea bootstrap deadlock fallback)
|
||||
- Graceful **shutdown**
|
||||
- UPS-driven automatic shutdown decisions based on discharge/runtime
|
||||
- Multi-UPS operation via multiple Hecate instances (for example `titan-db` + `tethys`)
|
||||
- Full hardware poweroff sequencing after graceful Kubernetes shutdown
|
||||
It runs outside Kubernetes (systemd on host), so it can:
|
||||
- shut the cluster down gracefully before battery/runtime redlines
|
||||
- bring the cluster back after power returns
|
||||
- recover common Flux/Kustomize startup deadlocks
|
||||
- validate service health from the outside before declaring startup done
|
||||
|
||||
## Why host-level
|
||||
## Why `ananke`
|
||||
|
||||
A service inside Kubernetes cannot start a cluster that is fully down.
|
||||
Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.
|
||||
I wanted a name that fits Titan/mythology, but also describes what this service actually does.
|
||||
|
||||
## Commands
|
||||
In Greek myth, **Ananke** is inevitability/necessity. That matches this tool: when power events happen, graceful sequencing is not optional.
|
||||
|
||||
- `hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main`
|
||||
- `hecate shutdown --config /etc/hecate/hecate.yaml --execute`
|
||||
- `hecate daemon --config /etc/hecate/hecate.yaml`
|
||||
- `hecate status --config /etc/hecate/hecate.yaml`
|
||||
UPS names in this cluster are also part of the story:
|
||||
- `Statera`: powers `titan-23`, `titan-24`, `titan-jh`
|
||||
- `Pyrphoros`: powers all other nodes
|
||||
|
||||
Key startup guards:
|
||||
- Startup is blocked on hosts configured as `coordination.role: peer` (unless `--allow-peer-startup` is used intentionally).
|
||||
- `--auto-peer-failover` makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
|
||||
- Startup is blocked while UPS is on battery by default (unless `--allow-on-battery` or `coordination.allow_startup_on_battery: true` is set).
|
||||
- Startup is blocked when a shutdown intent is active (`/var/lib/hecate/intent.json`).
|
||||
- Stale shutdown intents are auto-cleared after `coordination.startup_guard_max_age_seconds`, so old outage residue cannot permanently deadlock startup.
|
||||
- Startup checks configured `coordination.peer_hosts` intents to avoid peer/coordinator split-brain startup races.
|
||||
- Startup waits for time sync in `strict` or `quorum` mode (`startup.time_sync_mode`, `startup.time_sync_quorum`).
|
||||
- Startup can block until storage is healthy (`startup.require_storage_ready` + critical PVC checks).
|
||||
- Startup can block until external probes pass (`startup.require_post_start_probes` + `startup.post_start_probes`).
|
||||
- Startup refreshes and can use a cached bootstrap manifest set under `/var/lib/hecate/bootstrap-cache` when local fallback paths fail.
|
||||
- Vault unseal now falls back to a local cached key file (`startup.vault_unseal_key_file`) if `vault-init` cannot be read yet.
|
||||
- Optional off-site break-glass retrieval can be configured with `startup.vault_unseal_breakglass_command` (for example, an SSH `cat` command to a remote key escrow host).
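The intent guard and its stale-intent escape hatch can be sketched roughly as follows. This is a minimal illustration, not the service's actual logic: the function name `startup_guard` is hypothetical, it assumes `shutting_down` is the state that counts as an active shutdown intent, and it takes the intent age in seconds as an argument because the layout of `intent.json` is not described here.

```bash
# Hypothetical guard: decide whether startup may proceed given the recorded intent.
#   $1 = intent state (for example normal, shutting_down)
#   $2 = age of the intent in seconds
#   $3 = max age, standing in for coordination.startup_guard_max_age_seconds
startup_guard() {
  local state="$1" age="$2" max_age="$3"
  case "$state" in
    shutting_down)
      if [ "$age" -gt "$max_age" ]; then
        # Old outage residue: treat the intent as stale and let startup continue.
        echo "stale-cleared"
      else
        echo "blocked"
        return 1
      fi
      ;;
    *)
      echo "ok"
      ;;
  esac
}
```

For example, `startup_guard shutting_down 30 900` prints `blocked` and returns non-zero, while the same state with an age of `5000` prints `stale-cleared`. The `900` is an illustrative value, not a documented default.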
|
||||
## Breakglass reminder
|
||||
|
||||
## Manual install on titan-db
|
||||
Vault unseal breakglass is wired for remote retrieval (magic mirror host). If local key retrieval fails, Ananke can use the configured breakglass command.
|
||||
|
||||
## What “startup complete” means now
|
||||
|
||||
Ananke does **not** stop at “Flux says Ready”. Startup only completes when all configured gates pass:
|
||||
- Flux source drift guard passes (`expected_flux_source_url` + branch expectation)
|
||||
- Flux kustomizations are healthy
|
||||
- controller convergence is healthy (deployments/statefulsets/daemonsets)
|
||||
- external service checklist passes (for example Gitea + Grafana health endpoints)
|
||||
- stability soak window passes (no regressions, no CrashLoop/ImagePull failures)
|
||||
|
||||
If any gate fails, startup is blocked with a concrete reason.
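The "all gates must pass, and a failure names its gate" behavior described above can be sketched as a small runner. This is only an illustration under assumptions: the `gate_*` function names and `run_startup_gates` are hypothetical, and the gate bodies are placeholders rather than Ananke's real checks.

```bash
# Hypothetical gates: each one is a function that returns 0 on pass.
# The bodies are placeholders; real checks would query Flux, controllers, and so on.
gate_flux_source_drift()   { return 0; }
gate_flux_kustomizations() { return 0; }
gate_controllers()         { return 0; }
gate_service_checklist()   { return 0; }
gate_stability_soak()      { return 0; }

# Run the given gates in order and stop at the first failure,
# printing which gate blocked startup.
run_startup_gates() {
  local gate
  for gate in "$@"; do
    if ! "$gate"; then
      echo "startup blocked: ${gate} failed"
      return 1
    fi
  done
  echo "startup complete"
}
```

Calling `run_startup_gates gate_flux_source_drift gate_flux_kustomizations gate_controllers gate_service_checklist gate_stability_soak` prints `startup complete` only if every gate returns 0.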
|
||||
|
||||
## Command quick sheet
|
||||
|
||||
From `titan-db` (coordinator):
|
||||
|
||||
```bash
|
||||
git clone git@gitea-admin:bstein/hecate.git
|
||||
cd hecate
|
||||
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
|
||||
sudoedit /etc/hecate/hecate.yaml
|
||||
sudo systemctl restart hecate.service
|
||||
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
|
||||
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
|
||||
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
|
||||
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason emergency-power --mode poweroff --skip-drain --skip-etcd-snapshot
|
||||
```
|
||||
|
||||
The installer is idempotent:
|
||||
- Re-runs safely on every update
|
||||
- Preserves existing `/etc/hecate/hecate.yaml`
|
||||
- Automatically migrates legacy defaults (for example `default_budget_seconds: 300` and `runtime_safety_factor: 1.10`)
|
||||
- Ensures required dependencies are installed (`kubectl`, `nut-*`, `ssh`, `go`, etc.)
|
||||
- Installs/refreshes systemd units and enables boot-time self-update
|
||||
- Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)
|
||||
|
||||
Installer knobs (optional):
|
||||
- `HECATE_ENABLE_BOOTSTRAP=1` enables `hecate-bootstrap.service` on this host.
|
||||
- `HECATE_ENABLE_BOOTSTRAP=0` disables it; the default, `auto`, enables bootstrap.
|
||||
- `HECATE_MANAGE_NUT=0` skips writing NUT/udev files.
|
||||
- `HECATE_FORCE_CONFIG_TEMPLATE=coordinator|peer|example` overwrites `/etc/hecate/hecate.yaml` from a known template during install.
|
||||
- `HECATE_NUT_UPS_NAME` (default inferred from `/etc/hecate/hecate.yaml` target, fallback `pyrphoros`)
|
||||
- `HECATE_NUT_VENDOR_ID` / `HECATE_NUT_PRODUCT_ID` (defaults `0764` / `0601`)
|
||||
- `HECATE_NUT_MONITOR_USER` / `HECATE_NUT_MONITOR_PASSWORD` (defaults `monuser` / `hecateupsmon`)
|
||||
|
||||
Bootstrap now (without reboot):
|
||||
From `titan-24` (`tethys` peer):
|
||||
|
||||
```bash
|
||||
sudo systemctl start hecate-bootstrap.service
|
||||
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
|
||||
```
|
||||
|
||||
## Preconditions on titan-db
|
||||
Systemd:
|
||||
|
||||
- `kubectl` installed and configured (`kubeconfig` path in config)
|
||||
- SSH reachability to all cluster nodes
|
||||
- Remote sudo rights to run:
|
||||
- `systemctl start/stop k3s`
|
||||
- `systemctl start/stop k3s-agent`
|
||||
- UPS telemetry available via NUT (`upsc`)
|
||||
```bash
|
||||
sudo systemctl status ananke.service
|
||||
sudo systemctl start ananke-bootstrap.service
|
||||
sudo systemctl start ananke-update.service
|
||||
```
|
||||
|
||||
Optional SSH jump/bastion:
|
||||
- Set `ssh_jump_host` (and optional `ssh_jump_user`) to route node SSH through a jump host like `titan-jh`; Hecate now falls back to direct SSH automatically if jump routing is unavailable.
|
||||
- Set `ssh_port`, `ssh_config_file`, `ssh_identity_file`, and `ssh_node_hosts` so root-run systemd actions can actually reach node SSH daemons during cold-start recovery.
|
||||
- Use `ssh_node_users` for per-node username overrides (for example `titan-24: tethys`).
|
||||
- Use `ssh_managed_nodes` to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.
|
||||
## Shutdown modes (explicit)
|
||||
|
||||
## Multi-UPS topology
|
||||
`ananke shutdown` now supports explicit mode selection:
|
||||
- `--mode config`: use config default (`shutdown.poweroff_enabled`)
|
||||
- `--mode cluster-only`: stop cluster services only (no host poweroff)
|
||||
- `--mode poweroff`: include host poweroff path
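One way to picture how the three modes resolve to an effective shutdown path is the sketch below. It is an assumption-laden illustration: `resolve_shutdown_mode` is a hypothetical helper, and it assumes that `--mode config` maps `shutdown.poweroff_enabled: true` to the poweroff path and anything else to cluster-only.

```bash
# Hypothetical helper: resolve the effective shutdown path.
#   $1 = value passed to --mode (config, cluster-only, poweroff)
#   $2 = config default, standing in for shutdown.poweroff_enabled (true/false)
resolve_shutdown_mode() {
  local mode="$1" poweroff_enabled="$2"
  case "$mode" in
    cluster-only) echo "cluster-only" ;;
    poweroff)     echo "poweroff" ;;
    config)
      # Assumed mapping: defer entirely to the config default.
      if [ "$poweroff_enabled" = "true" ]; then
        echo "poweroff"
      else
        echo "cluster-only"
      fi
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2
      ;;
  esac
}
```

An explicit `cluster-only` or `poweroff` always wins over the config value, which is the point of passing the mode explicitly during drills.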
|
||||
|
||||
Recommended:
|
||||
- `titan-db` runs Hecate as the shutdown coordinator with UPS `Pyrphoros` (`pyrphoros@localhost`).
|
||||
- `tethys` runs Hecate as a peer with UPS `Statera` (`statera@localhost`) and forwards shutdown triggers to `titan-db`.
|
||||
- The bootstrap unit now runs on both roles; peer role uses auto-failover handoff to coordinator before local fallback startup.
|
||||
- If forwarding fails, fallback local shutdown can remain enabled.
|
||||
- Use `coordination.role: coordinator` on `titan-db` and `coordination.role: peer` on `tethys`.
|
||||
This removes ambiguity during drills.
|
||||
|
||||
## Config
|
||||
## Config file
|
||||
|
||||
See `configs/hecate.example.yaml`.
|
||||
Primary path:
|
||||
- `/etc/ananke/ananke.yaml`
|
||||
|
||||
Break-glass unseal fallback knobs:
|
||||
- `startup.vault_unseal_breakglass_command`: optional shell command that prints the unseal key to stdout.
|
||||
- `startup.vault_unseal_breakglass_timeout_seconds`: timeout for the command (default `15`).
|
||||
- `startup.shutdown_cooldown_seconds`: cooldown window after shutdown completion before startup proceeds (default `45`).
|
||||
Core settings to keep accurate:
|
||||
- `expected_flux_branch`
|
||||
- `expected_flux_source_url`
|
||||
- `startup.service_checklist`
|
||||
- `startup.service_checklist_stability_seconds`
|
||||
- `startup.ignore_unavailable_nodes` (for planned temporary node outages)
|
||||
- `coordination.role`, `coordination.peer_hosts`
|
||||
|
||||
UPS auto-shutdown trigger uses:
|
||||
- runtime threshold = `runtime_safety_factor * estimated_shutdown_budget`
|
||||
- default safety factor `1.25`
|
||||
- debounce across multiple polls to avoid noise
|
||||
- emergency trigger budget defaults to `shutdown.emergency_budget_seconds` and is learned from historical UPS-triggered shutdown runs once enough samples exist
|
||||
- UPS-triggered shutdown executes the emergency fast path by default (`shutdown.emergency_skip_drain: true`, `shutdown.emergency_skip_etcd_snapshot: true`)
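The threshold formula and the debounce idea can be sketched as two small shell functions. This is a rough illustration, not the service's code: both function names are hypothetical, the result is truncated to whole seconds, and the debounce rule (N consecutive polls below the threshold) is an assumed interpretation of "debounce across multiple polls".

```bash
# runtime threshold = runtime_safety_factor * estimated_shutdown_budget
#   $1 = estimated shutdown budget in seconds
#   $2 = safety factor (optional, defaults to 1.25)
runtime_threshold_seconds() {
  awk -v budget="$1" -v factor="${2:-1.25}" \
    'BEGIN { printf "%d\n", budget * factor }'
}

# Assumed debounce rule: trigger only when the most recent `needed`
# polls all reported a runtime below the threshold.
#   $1 = threshold seconds, $2 = needed consecutive polls, rest = runtime readings
should_trigger_shutdown() {
  local threshold="$1" needed="$2" streak=0 runtime
  shift 2
  for runtime in "$@"; do
    if [ "$runtime" -lt "$threshold" ]; then
      streak=$((streak + 1))
    else
      streak=0   # one healthy reading resets the streak
    fi
  done
  [ "$streak" -ge "$needed" ]
}
```

With a 300 second budget and the default factor, `runtime_threshold_seconds 300` yields 375. A single noisy reading does not trigger: `should_trigger_shutdown 375 3 370 360 900 350` returns non-zero because the streak was reset by the `900` reading.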
|
||||
## Install / update
|
||||
|
||||
Estimated shutdown budgets are derived from historical successful shutdown runs (`/var/lib/hecate/runs.json`) with config fallbacks:
|
||||
- `estimated_shutdown_budget_seconds`: full/manual shutdown path
|
||||
- `estimated_emergency_shutdown_budget_seconds`: UPS/emergency path
|
||||
```bash
|
||||
sudo ./scripts/install.sh
|
||||
```
|
||||
|
||||
Power metrics:
|
||||
- Hecate exposes Prometheus metrics on `:9560/metrics` by default.
|
||||
- This is intended for a dedicated Grafana power dashboard and a high-level overview row.
|
||||
Installer behavior:
|
||||
- builds and installs `/usr/local/bin/ananke`
|
||||
- installs `ananke*.service` units
|
||||
- migrates and enforces current `ananke` config/state paths
|
||||
|
||||
## Notes
|
||||
|
||||
- Default behavior for `startup` and `shutdown` is dry-run unless `--execute` is set.
|
||||
- Hecate tracks intent in `/var/lib/hecate/intent.json` (`normal`, `startup_in_progress`, `shutting_down`, `shutdown_complete`) to avoid startup/shutdown fighting each other.
|
||||
- Startup now waits out the recent-shutdown cooldown window instead of failing immediately when shutdown completed moments ago.
|
||||
- In multi-instance setups, set `coordination.peer_hosts` on each host (for example `titan-db` <-> `titan-24`) so startup guards account for remote intent too.
|
||||
- `hecate-bootstrap.service` is enabled to run at host boot and perform staged startup automatically.
|
||||
- `HECATE_ENABLE_BOOTSTRAP=1` forces bootstrap on, `HECATE_ENABLE_BOOTSTRAP=0` forces it off, and `auto` enables by default.
|
||||
- `hecate-update.timer` runs on boot and periodically to pull latest `main` and reinstall Hecate declaratively.
|
||||
- Peer startup fallback now checks coordinator intent/bootstrap activity before allowing local startup.
|
||||
- Automatic etcd recovery can run during startup if API never becomes reachable (`startup.auto_etcd_restore_on_api_failure`).
|
||||
|
||||
## Etcd Recovery
|
||||
|
||||
- Manual: `hecate etcd-restore --config /etc/hecate/hecate.yaml --execute`
|
||||
- Optional snapshot override: `--snapshot /var/lib/rancher/k3s/server/db/snapshots/<name>`
|
||||
- Startup can automatically invoke the same restore path after API timeout using:
|
||||
- `startup.auto_etcd_restore_on_api_failure: true`
|
||||
- `startup.etcd_restore_control_plane: <control-plane-node>`
|
||||
- If control planes are configured with `--datastore-endpoint` (external DB), Hecate will skip etcd restore and retry control-plane startup instead.
|
||||
- Etcd restore now verifies snapshot existence, minimum size, listing presence, and SHA-256 before reset starts.
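A pre-restore check along these lines could look like the sketch below. It covers only existence, minimum size, and SHA-256; the snapshot listing check is omitted. The function name `verify_snapshot`, its arguments, and its messages are hypothetical, and it assumes GNU `sha256sum` is available on the host.

```bash
# Hypothetical pre-restore check for a snapshot file.
#   $1 = snapshot path, $2 = minimum size in bytes, $3 = expected SHA-256 (hex)
verify_snapshot() {
  local path="$1" min_bytes="$2" expected_sha="$3" size actual_sha

  # 1. The snapshot must exist as a regular file.
  [ -f "$path" ] || { echo "missing snapshot: $path"; return 1; }

  # 2. It must be at least min_bytes long (arithmetic expansion trims whitespace).
  size=$(( $(wc -c < "$path") ))
  [ "$size" -ge "$min_bytes" ] || { echo "snapshot too small: $size bytes"; return 1; }

  # 3. Its SHA-256 must match the expected digest.
  actual_sha=$(sha256sum "$path" | awk '{print $1}')
  [ "$actual_sha" = "$expected_sha" ] || { echo "sha256 mismatch"; return 1; }

  echo "snapshot ok"
}
```

Running every check before any reset starts means a bad or truncated snapshot fails fast, while the existing datastore is still untouched.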
|
||||
|
||||
## Disruptive startup drills
|
||||
|
||||
Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:
|
||||
|
||||
- `scripts/hecate-drills.sh list`
|
||||
- `scripts/hecate-drills.sh run flux-gitea-deadlock --execute`
|
||||
- `scripts/hecate-drills.sh run foundation-recovery --execute`
|
||||
- `scripts/hecate-drills.sh run reconciliation-resume --execute`
|
||||
- `scripts/hecate-drills.sh run controlled-cycle --execute` (uses `HECATE_DRILL_SHUTDOWN_CONFIG`, defaults to `/tmp/hecate-drill-no-poweroff.yaml`)
|
||||
|
||||
These drills are intentionally **not** part of regular `go test ./...`.
|
||||
- Apply changes through Git/Flux manifests; avoid manual in-cluster edits for durable changes.
|
||||
- For controlled shutdown/startup drills, treat any manual intervention as a bug and fold the logic back into Ananke.
|
||||
|
||||