# Hecate

Hecate is the host-level bootstrap and power-protection service for Titan. It runs on `titan-db` and handles:

- Staged **startup** (including Flux/Gitea bootstrap deadlock fallback)
- Graceful **shutdown**
- UPS-driven automatic shutdown decisions based on discharge/runtime
- Multi-UPS operation via multiple Hecate instances (for example `titan-db` + `tethys`)
- Full hardware poweroff sequencing after graceful Kubernetes shutdown

## Why host-level

A service inside Kubernetes cannot start a cluster that is fully down. Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.

## Commands

- `hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main`
- `hecate shutdown --config /etc/hecate/hecate.yaml --execute`
- `hecate daemon --config /etc/hecate/hecate.yaml`
- `hecate status --config /etc/hecate/hecate.yaml`

Key startup guards:

- Startup is blocked on hosts configured as `coordination.role: peer` (unless `--allow-peer-startup` is used intentionally).
- `--auto-peer-failover` makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
- Startup is blocked while UPS is on battery by default (unless `--allow-on-battery` or `coordination.allow_startup_on_battery: true` is set).
- Startup is blocked when a shutdown intent is active (`/var/lib/hecate/intent.json`).

## Manual install on titan-db

```bash
git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service
```

The installer is idempotent:

- Re-runs safely on every update
- Preserves existing `/etc/hecate/hecate.yaml`
- Automatically migrates legacy defaults (for example `default_budget_seconds: 300` and `runtime_safety_factor: 1.10`)
- Ensures required dependencies are installed (`kubectl`, `nut-*`, `ssh`, `go`, etc.)
- Installs/refreshes systemd units and enables boot-time self-update
- Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)

Installer knobs (optional):

- `HECATE_ENABLE_BOOTSTRAP=1` enables `hecate-bootstrap.service` on this host.
- `HECATE_ENABLE_BOOTSTRAP=0` disables it; the default, `auto`, enables bootstrap.
- `HECATE_MANAGE_NUT=0` skips writing NUT/udev files.
- `HECATE_FORCE_CONFIG_TEMPLATE=coordinator|peer|example` overwrites `/etc/hecate/hecate.yaml` from a known template during install.
- `HECATE_NUT_UPS_NAME` (default inferred from `/etc/hecate/hecate.yaml` target, fallback `pyrphoros`)
- `HECATE_NUT_VENDOR_ID` / `HECATE_NUT_PRODUCT_ID` (defaults `0764` / `0601`)
- `HECATE_NUT_MONITOR_USER` / `HECATE_NUT_MONITOR_PASSWORD` (defaults `monuser` / `hecateupsmon`)

Bootstrap now (without reboot):

```bash
sudo systemctl start hecate-bootstrap.service
```

## Preconditions on titan-db

- `kubectl` installed and configured (`kubeconfig` path in config)
- SSH reachability to all cluster nodes
- Remote sudo rights to run:
  - `systemctl start/stop k3s`
  - `systemctl start/stop k3s-agent`
- UPS telemetry available via NUT (`upsc`)

Optional SSH jump/bastion:

- Set `ssh_jump_host` (and optional `ssh_jump_user`) to route node SSH through a jump host like `titan-jh`; Hecate now falls back to direct SSH automatically if jump routing is unavailable.
- Set `ssh_port`, `ssh_config_file`, `ssh_identity_file`, and `ssh_node_hosts` so root-run systemd actions can actually reach node SSH daemons during cold-start recovery.
- Use `ssh_node_users` for per-node username overrides (for example `titan-24: tethys`).
- Use `ssh_managed_nodes` to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.

## Multi-UPS topology

Recommended:

- `titan-db` runs Hecate as the shutdown coordinator with UPS `Pyrphoros` (`pyrphoros@localhost`).
- `tethys` runs Hecate as a peer with UPS `Statera` (`statera@localhost`) and forwards shutdown triggers to `titan-db`.
- The bootstrap unit now runs on both roles; peer role uses auto-failover handoff to coordinator before local fallback startup.
- If forwarding fails, fallback local shutdown can remain enabled.
- Use `coordination.role: coordinator` on `titan-db` and `coordination.role: peer` on `tethys`.

## Config

See `configs/hecate.example.yaml`.

UPS auto-shutdown trigger uses:

- runtime threshold = `runtime_safety_factor * estimated_shutdown_budget`
- default safety factor `1.25`
- debounce across multiple polls to avoid noise

Estimated shutdown budget is derived from historical successful shutdown runs (`/var/lib/hecate/runs.json`) with default fallback from config.

Power metrics:

- Hecate exposes Prometheus metrics on `:9560/metrics` by default.
- This is intended for a dedicated Grafana power dashboard and a high-level overview row.

## Notes

- Default behavior for `startup` and `shutdown` is dry-run unless `--execute` is set.
- Hecate tracks intent in `/var/lib/hecate/intent.json` (`normal`, `startup_in_progress`, `shutting_down`, `shutdown_complete`) to avoid startup/shutdown fighting each other.
- `hecate-bootstrap.service` is enabled to run at host boot and perform staged startup automatically.
- `HECATE_ENABLE_BOOTSTRAP=1` forces bootstrap on, `HECATE_ENABLE_BOOTSTRAP=0` forces it off, and `auto` enables by default.
- `hecate-update.timer` runs on boot and periodically to pull latest `main` and reinstall Hecate declaratively.
- Peer startup fallback now checks coordinator intent/bootstrap activity before allowing local startup.
- Automatic etcd recovery can run during startup if API never becomes reachable (`startup.auto_etcd_restore_on_api_failure`).
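
The UPS auto-shutdown rule from the Config section (runtime threshold plus debounce) can be illustrated with a small shell sketch. This is not Hecate's actual implementation: the function names `runtime_threshold` and `should_shutdown`, the 240-second budget, the three-poll debounce count, and the "reading below threshold" comparison are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; names and numbers below are hypothetical.

# runtime_threshold SAFETY_FACTOR BUDGET_SECONDS
# Prints runtime_safety_factor * estimated_shutdown_budget, rounded to seconds.
runtime_threshold() {
  awk -v f="$1" -v b="$2" 'BEGIN { printf "%.0f\n", f * b }'
}

# should_shutdown THRESHOLD REQUIRED_POLLS READING...
# Succeeds only if the most recent REQUIRED_POLLS runtime readings (integer
# seconds, oldest first) are all below THRESHOLD. A reading at or above the
# threshold resets the counter, so one noisy poll cannot trigger a shutdown.
should_shutdown() {
  local threshold=$1 required=$2
  shift 2
  local consecutive=0 reading
  for reading in "$@"; do
    if (( reading < threshold )); then
      consecutive=$(( consecutive + 1 ))
    else
      consecutive=0
    fi
  done
  (( consecutive >= required ))
}

# Example: safety factor 1.25 (the documented default) and an assumed
# 240-second shutdown budget give a 300-second threshold.
threshold=$(runtime_threshold 1.25 240)
if should_shutdown "$threshold" 3 900 290 280 270; then
  echo "runtime below ${threshold}s for 3 polls: trigger shutdown"
fi
```

In this sketch the first reading (900 s) is healthy and resets the counter; the next three are all under 300 s, so the debounce condition is met. With real telemetry the readings would come from repeated NUT polls rather than fixed arguments.
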
## Etcd Recovery

- Manual: `hecate etcd-restore --config /etc/hecate/hecate.yaml --execute`
- Optional snapshot override: `--snapshot /var/lib/rancher/k3s/server/db/snapshots/`
- Startup can automatically invoke the same restore path after API timeout using:
  - `startup.auto_etcd_restore_on_api_failure: true`
  - `startup.etcd_restore_control_plane: `

## Disruptive startup drills

Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:

- `scripts/hecate-drills.sh list`
- `scripts/hecate-drills.sh run flux-gitea-deadlock --execute`
- `scripts/hecate-drills.sh run foundation-recovery --execute`
- `scripts/hecate-drills.sh run reconciliation-resume --execute`

These drills are intentionally **not** part of regular `go test ./...`.