2026-04-03 01:43:16 -03:00
# Hecate
Hecate is the host-level bootstrap and power-protection service for Titan.
It runs on `titan-db` and handles:
- Staged **startup** (including Flux/Gitea bootstrap deadlock fallback)
- Graceful **shutdown**
- UPS-driven automatic shutdown decisions based on discharge/runtime
- Multi-UPS operation via multiple Hecate instances (for example `titan-db` + `tethys`)
- Full hardware poweroff sequencing after graceful Kubernetes shutdown
## Why host-level
A service inside Kubernetes cannot start a cluster that is fully down.
Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.
## Commands
- `hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main`
- `hecate shutdown --config /etc/hecate/hecate.yaml --execute`
- `hecate daemon --config /etc/hecate/hecate.yaml`
- `hecate status --config /etc/hecate/hecate.yaml`
Key startup guards:
- Startup is blocked on hosts configured as `coordination.role: peer` (unless `--allow-peer-startup` is used intentionally).
- `--auto-peer-failover` makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
- Startup is blocked while UPS is on battery by default (unless `--allow-on-battery` or `coordination.allow_startup_on_battery: true` is set).
- Startup is blocked when a shutdown intent is active (`/var/lib/hecate/intent.json`).
- Startup waits for time sync in `strict` or `quorum` mode (`startup.time_sync_mode`, `startup.time_sync_quorum`).
- Startup can block until storage is healthy (`startup.require_storage_ready` + critical PVC checks).
- Startup can block until external probes pass (`startup.require_post_start_probes` + `startup.post_start_probes`).
- Startup refreshes a cached bootstrap manifest set under `/var/lib/hecate/bootstrap-cache` and can use it when local fallback paths fail.
- Vault unseal now falls back to a local cached key file (`startup.vault_unseal_key_file`) if `vault-init` cannot be read yet.
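
Three of the guards above (peer role, UPS on battery, active shutdown intent) can be sketched as a single check. This is an illustration, not Hecate's code: `startup_block_reason` is a hypothetical helper, the order of the checks is an assumption, the override flags are ignored, and only the `shutting_down` state is treated as an active shutdown intent.

```bash
# Hypothetical helper, not part of Hecate. Prints why startup would be
# blocked, or prints nothing if none of the three guards applies.
# Override flags such as --allow-peer-startup are deliberately ignored.
startup_block_reason() {
  local role="$1" on_battery="$2" intent="$3"
  if [ "$role" = "peer" ]; then
    echo "blocked: peer role"
  elif [ "$on_battery" = "yes" ]; then
    echo "blocked: UPS on battery"
  elif [ "$intent" = "shutting_down" ]; then
    # Assumption: only this state counts as an active shutdown intent.
    echo "blocked: shutdown intent active"
  fi
}

# Walk through one example per guard, plus the unblocked case.
startup_block_reason coordinator no normal
startup_block_reason peer no normal
startup_block_reason coordinator yes normal
startup_block_reason coordinator no shutting_down
```

In practice the three inputs would come from `coordination.role`, the UPS status, and the intent file. The JSON layout of that file is not described here, so the sketch takes the state as a plain string.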
## Manual install on titan-db
```bash
git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service
```
The installer is idempotent:
- Re-runs safely on every update
- Preserves existing `/etc/hecate/hecate.yaml`
- Automatically migrates legacy defaults (for example `default_budget_seconds: 300` and `runtime_safety_factor: 1.10`)
- Ensures required dependencies are installed (`kubectl`, `nut-*`, `ssh`, `go`, etc.)
- Installs/refreshes systemd units and enables boot-time self-update
- Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)
Installer knobs (optional):
- `HECATE_ENABLE_BOOTSTRAP=1` enables `hecate-bootstrap.service` on this host.
- `HECATE_ENABLE_BOOTSTRAP=0` disables it; the default value, `auto`, enables bootstrap.
- `HECATE_MANAGE_NUT=0` skips writing NUT/udev files.
- `HECATE_FORCE_CONFIG_TEMPLATE=coordinator|peer|example` overwrites `/etc/hecate/hecate.yaml` from a known template during install.
- `HECATE_NUT_UPS_NAME` (default inferred from the `/etc/hecate/hecate.yaml` target, fallback `pyrphoros`)
- `HECATE_NUT_VENDOR_ID` / `HECATE_NUT_PRODUCT_ID` (defaults `0764` / `0601`)
- `HECATE_NUT_MONITOR_USER` / `HECATE_NUT_MONITOR_PASSWORD` (defaults `monuser` / `hecateupsmon`)
Bootstrap now (without reboot):
```bash
sudo systemctl start hecate-bootstrap.service
```
## Preconditions on titan-db
- `kubectl` installed and configured (`kubeconfig` path in config)
- SSH reachability to all cluster nodes
- Remote sudo rights to run:
  - `systemctl start/stop k3s`
  - `systemctl start/stop k3s-agent`
- UPS telemetry available via NUT (`upsc`)
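
To confirm that UPS telemetry is readable, `upsc` can be queried directly. The sketch below assumes the UPS exposes the standard NUT variables `ups.status` and `battery.runtime`. `is_on_battery` is a hypothetical helper that looks for the `OB` (on battery) flag in a status string; it is not a Hecate command.

```bash
# Hypothetical helper: succeeds if a NUT ups.status string (for example
# "OL CHRG" or "OB DISCHRG") contains the OB (on battery) flag.
is_on_battery() {
  case " $1 " in
    *" OB "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live UPS (needs NUT running; UPS name as used on titan-db):
#   status="$(upsc pyrphoros@localhost ups.status)"
#   runtime="$(upsc pyrphoros@localhost battery.runtime)"
#   if is_on_battery "$status"; then
#     echo "on battery, ${runtime}s of runtime reported"
#   fi
```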
Optional SSH jump/bastion:
- Set `ssh_jump_host` (and optionally `ssh_jump_user`) to route node SSH through a jump host such as `titan-jh`. Hecate falls back to direct SSH automatically if jump routing is unavailable.
- Set `ssh_port`, `ssh_config_file`, `ssh_identity_file`, and `ssh_node_hosts` so that systemd actions running as root can reach the node SSH daemons during cold-start recovery.
- Use `ssh_node_users` for per-node username overrides (for example `titan-24: tethys`).
- Use `ssh_managed_nodes` to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.
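
Before relying on these settings it can help to check reachability by hand, first through the jump host and then directly. The sketch below is a manual check, not Hecate's implementation: `build_ssh_target` is a hypothetical helper, and it only mirrors the jump-then-direct order described above using OpenSSH's `-J` option.

```bash
# Hypothetical helper: prints the ssh arguments for reaching a node,
# optionally through a jump host (OpenSSH -J / ProxyJump).
build_ssh_target() {
  local node="$1" jump_host="${2:-}" jump_user="${3:-}"
  if [ -z "$jump_host" ]; then
    echo "$node"
  elif [ -n "$jump_user" ]; then
    echo "-J ${jump_user}@${jump_host} ${node}"
  else
    echo "-J ${jump_host} ${node}"
  fi
}

# Manual reachability check (word splitting of the helper output is intended):
#   ssh -o BatchMode=yes -o ConnectTimeout=5 $(build_ssh_target titan-24 titan-jh) true \
#     || ssh -o BatchMode=yes -o ConnectTimeout=5 $(build_ssh_target titan-24) true
```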
## Multi-UPS topology
Recommended:
- `titan-db` runs Hecate as the shutdown coordinator with UPS `Pyrphoros` (`pyrphoros@localhost`).
- `tethys` runs Hecate as a peer with UPS `Statera` (`statera@localhost`) and forwards shutdown triggers to `titan-db`.
- The bootstrap unit now runs on both roles; the peer role uses auto-failover to hand off startup to the coordinator before falling back to local startup.
- Local fallback shutdown can remain enabled on the peer in case forwarding fails.
- Use `coordination.role: coordinator` on `titan-db` and `coordination.role: peer` on `tethys`.
## Config
See `configs/hecate.example.yaml`.
UPS auto-shutdown trigger uses:
- runtime threshold = `runtime_safety_factor * estimated_shutdown_budget`
- default safety factor `1.25`
- debounce across multiple polls to avoid noise
- emergency trigger budget defaults to `shutdown.emergency_budget_seconds` and is learned from historical UPS-triggered shutdown runs once enough samples exist
- UPS-triggered shutdown executes the emergency fast path by default (`shutdown.emergency_skip_drain: true`, `shutdown.emergency_skip_etcd_snapshot: true`)
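
The threshold formula and the debounce can be worked through with small numbers. The sketch below is illustrative rather than Hecate's actual logic: the 240-second budget, the three-poll debounce window, and both function names are assumptions, and only the `1.25` safety factor comes from the defaults above.

```bash
# threshold (seconds) = runtime_safety_factor * estimated_shutdown_budget
runtime_threshold() {
  awk -v f="$1" -v b="$2" 'BEGIN { printf "%.0f\n", f * b }'
}

# Prints "trigger" once `need` consecutive runtime samples (whole seconds)
# fall below the threshold, otherwise "hold". One above-threshold sample
# resets the streak, which is what filters out noisy readings.
debounced_decision() {
  local threshold="$1" need="$2"
  shift 2
  local streak=0 sample
  for sample in "$@"; do
    if [ "$sample" -lt "$threshold" ]; then
      streak=$((streak + 1))
    else
      streak=0
    fi
    if [ "$streak" -ge "$need" ]; then
      echo "trigger"
      return 0
    fi
  done
  echo "hold"
}

threshold="$(runtime_threshold 1.25 240)"   # 1.25 * 240 = 300 seconds
debounced_decision "$threshold" 3 900 290 310 280 270 260
```

Here the single reading of 290 seconds is ignored because the next poll recovers to 310; the shutdown decision only fires after 280, 270, and 260 arrive back to back.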
Estimated shutdown budgets are derived from historical successful shutdown runs (`/var/lib/hecate/runs.json`) with config fallbacks:
- `estimated_shutdown_budget_seconds`: full/manual shutdown path
- `estimated_emergency_shutdown_budget_seconds`: UPS/emergency path
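
How the estimate is derived from run history is not specified here. One conservative possibility, shown purely as an assumption, is to take the slowest recorded run once a minimum number of samples exists and otherwise use the configured fallback. `estimate_budget`, the minimum of three samples, and the durations are all hypothetical.

```bash
# Hypothetical sketch: choose a shutdown budget in seconds.
#   $1 = configured fallback, $2 = minimum samples required,
#   remaining args = durations of past successful runs (whole seconds).
estimate_budget() {
  local fallback="$1" min_samples="$2"
  shift 2
  if [ "$#" -lt "$min_samples" ]; then
    # Not enough history yet: use the configured value.
    echo "$fallback"
    return 0
  fi
  local max=0 duration
  for duration in "$@"; do
    if [ "$duration" -gt "$max" ]; then
      max="$duration"
    fi
  done
  # Assumption: the slowest observed run is the safest budget.
  echo "$max"
}

estimate_budget 240 3 200 210        # too few samples, falls back to 240
estimate_budget 240 3 200 260 210    # enough samples, uses the slowest run
```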
Power metrics:
- Hecate exposes Prometheus metrics on `:9560/metrics` by default.
- This is intended for a dedicated Grafana power dashboard and a high-level overview row.
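
A quick way to see what the endpoint exports is to list the metric names it returns. The sketch assumes the standard Prometheus text exposition format and that the service is reachable on `localhost`; `metric_names` is a hypothetical helper, and no particular Hecate metric names are assumed.

```bash
# Hypothetical helper: reads Prometheus text exposition from stdin and
# prints the unique metric names, skipping comments and blank lines.
metric_names() {
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | sort -u
}

# Usage against a running instance (default port from above):
#   curl -s http://localhost:9560/metrics | metric_names
```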
## Notes
- Default behavior for `startup` and `shutdown` is dry-run unless `--execute` is set.
- Hecate tracks intent in `/var/lib/hecate/intent.json` (`normal`, `startup_in_progress`, `shutting_down`, `shutdown_complete`) so that startup and shutdown do not fight each other.
- `hecate-bootstrap.service` is enabled to run at host boot and perform staged startup automatically.
- `HECATE_ENABLE_BOOTSTRAP=1` forces bootstrap on, `HECATE_ENABLE_BOOTSTRAP=0` forces it off, and the default `auto` enables it.
- `hecate-update.timer` runs on boot and periodically to pull the latest `main` and reinstall Hecate declaratively.
- Peer startup fallback now checks coordinator intent/bootstrap activity before allowing local startup.
- Automatic etcd recovery can run during startup if the API never becomes reachable (`startup.auto_etcd_restore_on_api_failure`).
## Etcd recovery
- Manual: `hecate etcd-restore --config /etc/hecate/hecate.yaml --execute`
- Optional snapshot override: `--snapshot /var/lib/rancher/k3s/server/db/snapshots/<name>`
- Startup can automatically invoke the same restore path after API timeout using:
  - `startup.auto_etcd_restore_on_api_failure: true`
  - `startup.etcd_restore_control_plane: <control-plane-node>`
- If control planes are configured with `--datastore-endpoint` (external DB), Hecate will skip etcd restore and retry control-plane startup instead.
## Disruptive startup drills
Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:
- `scripts/hecate-drills.sh list`
- `scripts/hecate-drills.sh run flux-gitea-deadlock --execute`
- `scripts/hecate-drills.sh run foundation-recovery --execute`
- `scripts/hecate-drills.sh run reconciliation-resume --execute`
These drills are intentionally **not** part of regular `go test ./...`.