Hecate
Hecate is the host-level bootstrap and power-protection service for Titan.
It runs on titan-db and handles:
- Staged startup (including Flux/Gitea bootstrap deadlock fallback)
- Graceful shutdown
- UPS-driven automatic shutdown decisions based on discharge/runtime
- Multi-UPS operation via multiple Hecate instances (for example titan-db + tethys)
- Full hardware poweroff sequencing after graceful Kubernetes shutdown
Why host-level
A service inside Kubernetes cannot start a cluster that is fully down. Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.
Commands
hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main
hecate shutdown --config /etc/hecate/hecate.yaml --execute
hecate daemon --config /etc/hecate/hecate.yaml
hecate status --config /etc/hecate/hecate.yaml
Key startup guards:
- Startup is blocked on hosts configured as coordination.role: peer (unless --allow-peer-startup is used intentionally).
- --auto-peer-failover makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
- Startup is blocked while UPS is on battery by default (unless --allow-on-battery or coordination.allow_startup_on_battery: true is set).
- Startup is blocked when a shutdown intent is active (/var/lib/hecate/intent.json).
- Startup waits for time sync in strict or quorum mode (startup.time_sync_mode, startup.time_sync_quorum).
- Startup can block until storage is healthy (startup.require_storage_ready + critical PVC checks).
- Startup can block until external probes pass (startup.require_post_start_probes + startup.post_start_probes).
- Startup refreshes and can use a cached bootstrap manifest set under /var/lib/hecate/bootstrap-cache when local fallback paths fail.
- Vault unseal now falls back to a local cached key file (startup.vault_unseal_key_file) if vault-init cannot be read yet.
- Optional off-site break-glass retrieval can be configured with startup.vault_unseal_breakglass_command (for example, an SSH cat command to a remote key escrow host).
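The first three guards above can be read as an ordered series of checks, each with its own override. The following is a minimal sketch of that reading, not Hecate's actual code: the GuardInput type, its field names, the check order, and the error messages are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// GuardInput collects the facts the startup guards look at.
// The type and its field names are hypothetical.
type GuardInput struct {
	Role                 string // coordination.role: "coordinator" or "peer"
	AllowPeerStartup     bool   // mirrors --allow-peer-startup
	OnBattery            bool   // UPS is currently discharging
	AllowOnBattery       bool   // mirrors --allow-on-battery or coordination.allow_startup_on_battery
	ShutdownIntentActive bool   // a shutdown intent is recorded in intent.json
}

// CheckStartupGuards returns the first guard that blocks startup, or nil
// when startup may proceed. The order of checks is an assumption.
func CheckStartupGuards(in GuardInput) error {
	if in.Role == "peer" && !in.AllowPeerStartup {
		return errors.New("startup blocked: host is configured as a peer")
	}
	if in.OnBattery && !in.AllowOnBattery {
		return errors.New("startup blocked: UPS is on battery")
	}
	if in.ShutdownIntentActive {
		return errors.New("startup blocked: shutdown intent is active")
	}
	return nil
}

func main() {
	// A coordinator on battery without an override is refused.
	fmt.Println(CheckStartupGuards(GuardInput{Role: "coordinator", OnBattery: true}))
	// The same host with the override set is allowed (prints <nil>).
	fmt.Println(CheckStartupGuards(GuardInput{Role: "coordinator", OnBattery: true, AllowOnBattery: true}))
}
```

Returning the first blocking reason keeps the dry-run output short; a real implementation might instead collect every failing guard.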
Manual install on titan-db
git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service
The installer is idempotent:
- Re-runs safely on every update
- Preserves existing /etc/hecate/hecate.yaml
- Automatically migrates legacy defaults (for example default_budget_seconds: 300 and runtime_safety_factor: 1.10)
- Ensures required dependencies are installed (kubectl, nut-*, ssh, go, etc.)
- Installs/refreshes systemd units and enables boot-time self-update
- Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)
Installer knobs (optional):
- HECATE_ENABLE_BOOTSTRAP=1 enables hecate-bootstrap.service on this host. HECATE_ENABLE_BOOTSTRAP=0 disables it; the default, auto, enables bootstrap.
- HECATE_MANAGE_NUT=0 skips writing NUT/udev files.
- HECATE_FORCE_CONFIG_TEMPLATE=coordinator|peer|example overwrites /etc/hecate/hecate.yaml from a known template during install.
- HECATE_NUT_UPS_NAME (default inferred from /etc/hecate/hecate.yaml target, fallback pyrphoros)
- HECATE_NUT_VENDOR_ID / HECATE_NUT_PRODUCT_ID (defaults 0764 / 0601)
- HECATE_NUT_MONITOR_USER / HECATE_NUT_MONITOR_PASSWORD (defaults monuser / hecateupsmon)
Bootstrap now (without reboot):
sudo systemctl start hecate-bootstrap.service
Preconditions on titan-db
- kubectl installed and configured (kubeconfig path in config)
- SSH reachability to all cluster nodes
- Remote sudo rights to run:
  - systemctl start/stop k3s
  - systemctl start/stop k3s-agent
- UPS telemetry available via NUT (upsc)
Optional SSH jump/bastion:
- Set ssh_jump_host (and optional ssh_jump_user) to route node SSH through a jump host like titan-jh; Hecate now falls back to direct SSH automatically if jump routing is unavailable.
- Set ssh_port, ssh_config_file, ssh_identity_file, and ssh_node_hosts so root-run systemd actions can actually reach node SSH daemons during cold-start recovery.
- Use ssh_node_users for per-node username overrides (for example titan-24: tethys).
- Use ssh_managed_nodes to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.
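To show how these settings might combine into one ssh invocation, here is a rough sketch that builds the argument list for a node. The SSHConfig struct, the BuildSSHArgs function, and the DefaultUser field are hypothetical; only the OpenSSH flags themselves (-F, -i, -p, -J) are real. How Hecate actually assembles its SSH calls is not stated in this document.

```go
package main

import (
	"fmt"
	"strconv"
)

// SSHConfig loosely mirrors the ssh_* config keys. The struct is hypothetical.
type SSHConfig struct {
	JumpHost     string            // ssh_jump_host
	JumpUser     string            // ssh_jump_user
	Port         int               // ssh_port
	ConfigFile   string            // ssh_config_file
	IdentityFile string            // ssh_identity_file
	NodeHosts    map[string]string // ssh_node_hosts: node name -> address
	NodeUsers    map[string]string // ssh_node_users: node name -> username
	DefaultUser  string            // assumed fallback user, not a documented key
}

// BuildSSHArgs returns the arguments for `ssh` to reach node.
// When useJump is false the -J option is omitted (direct SSH).
func BuildSSHArgs(cfg SSHConfig, node string, useJump bool) []string {
	var args []string
	if cfg.ConfigFile != "" {
		args = append(args, "-F", cfg.ConfigFile)
	}
	if cfg.IdentityFile != "" {
		args = append(args, "-i", cfg.IdentityFile)
	}
	if cfg.Port != 0 {
		args = append(args, "-p", strconv.Itoa(cfg.Port))
	}
	if useJump && cfg.JumpHost != "" {
		jump := cfg.JumpHost
		if cfg.JumpUser != "" {
			jump = cfg.JumpUser + "@" + jump
		}
		args = append(args, "-J", jump)
	}
	host := node
	if h, ok := cfg.NodeHosts[node]; ok {
		host = h
	}
	user := cfg.DefaultUser
	if u, ok := cfg.NodeUsers[node]; ok {
		user = u
	}
	if user != "" {
		host = user + "@" + host
	}
	return append(args, host)
}

func main() {
	cfg := SSHConfig{
		JumpHost:  "titan-jh",
		Port:      22,
		NodeUsers: map[string]string{"titan-24": "tethys"},
	}
	fmt.Println(BuildSSHArgs(cfg, "titan-24", true))  // via jump host
	fmt.Println(BuildSSHArgs(cfg, "titan-24", false)) // direct fallback
}
```

A caller could try the jump-host form first and retry with useJump set to false if the connection fails, which matches the fallback behavior described above.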
Multi-UPS topology
Recommended:
- titan-db runs Hecate as the shutdown coordinator with UPS Pyrphoros (pyrphoros@localhost).
- tethys runs Hecate as a peer with UPS Statera (statera@localhost) and forwards shutdown triggers to titan-db.
- The bootstrap unit now runs on both roles; peer role uses auto-failover handoff to coordinator before local fallback startup.
- If forwarding fails, fallback local shutdown can remain enabled.
- Use coordination.role: coordinator on titan-db and coordination.role: peer on tethys.
Config
See configs/hecate.example.yaml.
Break-glass unseal fallback knobs:
- startup.vault_unseal_breakglass_command: optional shell command that prints the unseal key to stdout.
- startup.vault_unseal_breakglass_timeout_seconds: timeout for the command (default 15).
UPS auto-shutdown trigger uses:
- runtime threshold = runtime_safety_factor * estimated_shutdown_budget
- default safety factor 1.25
- debounce across multiple polls to avoid noise
- emergency trigger budget defaults to shutdown.emergency_budget_seconds and is learned from historical UPS-triggered shutdown runs once enough samples exist
- UPS-triggered shutdown executes the emergency fast path by default (shutdown.emergency_skip_drain: true, shutdown.emergency_skip_etcd_snapshot: true)
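The threshold and debounce rules above can be sketched as a small state holder that sees one UPS poll at a time. This is an illustration under assumptions: the Trigger type, the DebouncePolls field, and the "consecutive low polls" interpretation of debouncing are not taken from Hecate's source.

```go
package main

import "fmt"

// Trigger decides when remaining UPS runtime is low enough to start shutdown.
// The type is hypothetical; only the threshold formula comes from the text.
type Trigger struct {
	SafetyFactor  float64 // runtime_safety_factor (1.25 by default)
	BudgetSeconds float64 // estimated shutdown budget in seconds
	DebouncePolls int     // assumed: consecutive low polls required to fire
	lowCount      int     // consecutive low polls seen so far
}

// Threshold returns runtime_safety_factor * estimated_shutdown_budget.
func (t *Trigger) Threshold() float64 {
	return t.SafetyFactor * t.BudgetSeconds
}

// Observe records one UPS poll and reports whether shutdown should fire.
// A poll on mains power, or with runtime above the threshold, resets the count.
func (t *Trigger) Observe(onBattery bool, runtimeSeconds float64) bool {
	if !onBattery || runtimeSeconds > t.Threshold() {
		t.lowCount = 0
		return false
	}
	t.lowCount++
	return t.lowCount >= t.DebouncePolls
}

func main() {
	tr := &Trigger{SafetyFactor: 1.25, BudgetSeconds: 240, DebouncePolls: 3}
	fmt.Println("threshold seconds:", tr.Threshold())
	// One noisy reading (310) resets the counter; three low polls in a row fire.
	for _, runtime := range []float64{290, 310, 295, 280, 270} {
		fmt.Println(runtime, tr.Observe(true, runtime))
	}
}
```

With a 240-second budget and the default factor of 1.25, the threshold is 300 seconds, so a single reading of 310 seconds between low readings delays the trigger rather than cancelling protection altogether.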
Estimated shutdown budgets are derived from historical successful shutdown runs (/var/lib/hecate/runs.json) with config fallbacks:
- estimated_shutdown_budget_seconds: full/manual shutdown path
- estimated_emergency_shutdown_budget_seconds: UPS/emergency path
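One way to picture "historical runs with config fallbacks" is a function that uses the configured value until enough samples exist. The sketch below assumes the estimate is the slowest recorded run and that a minimum sample count gates the switch; the document does not say which statistic Hecate uses, and EstimateBudget and its parameters are hypothetical.

```go
package main

import "fmt"

// EstimateBudget returns a shutdown budget in seconds.
// samples holds durations of past successful runs, fallback is the configured
// value, and minSamples is an assumed threshold for trusting history.
// Taking the maximum is a conservative assumption, not Hecate's documented method.
func EstimateBudget(samples []float64, minSamples int, fallback float64) float64 {
	if len(samples) == 0 || len(samples) < minSamples {
		return fallback
	}
	worst := samples[0]
	for _, s := range samples[1:] {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	// Too few samples: the configured fallback is used.
	fmt.Println(EstimateBudget([]float64{180}, 3, 300))
	// Enough samples: the slowest historical run is used.
	fmt.Println(EstimateBudget([]float64{180, 210, 195}, 3, 300))
}
```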
Power metrics:
- Hecate exposes Prometheus metrics on :9560/metrics by default.
- This is intended for a dedicated Grafana power dashboard and a high-level overview row.
Notes
- Default behavior for startup and shutdown is dry-run unless --execute is set.
- Hecate tracks intent in /var/lib/hecate/intent.json (normal, startup_in_progress, shutting_down, shutdown_complete) to avoid startup/shutdown fighting each other.
- hecate-bootstrap.service is enabled to run at host boot and perform staged startup automatically.
- HECATE_ENABLE_BOOTSTRAP=1 forces bootstrap on, HECATE_ENABLE_BOOTSTRAP=0 forces it off, and the default, auto, enables it.
- hecate-update.timer runs on boot and periodically to pull latest main and reinstall Hecate declaratively.
- Peer startup fallback now checks coordinator intent/bootstrap activity before allowing local startup.
- Automatic etcd recovery can run during startup if API never becomes reachable (startup.auto_etcd_restore_on_api_failure).
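As an illustration of how the intent file can keep startup and shutdown from fighting, the sketch below parses an intent document and reports whether startup should wait. The JSON layout is an assumption: the "state" field name is hypothetical, and treating only shutting_down as blocking is a guess, since this document lists the four states but not the file schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Intent models the content of intent.json.
// The "state" field name is an assumption; the real schema is not shown here.
type Intent struct {
	State string `json:"state"`
}

// StartupBlocked reports whether a recorded intent should stop startup.
// Only shutting_down is treated as blocking, which is an assumption.
func StartupBlocked(raw []byte) (bool, error) {
	var in Intent
	if err := json.Unmarshal(raw, &in); err != nil {
		return false, fmt.Errorf("parse intent: %w", err)
	}
	switch in.State {
	case "shutting_down":
		return true, nil
	case "normal", "startup_in_progress", "shutdown_complete":
		return false, nil
	default:
		return false, fmt.Errorf("unknown intent state %q", in.State)
	}
}

func main() {
	blocked, err := StartupBlocked([]byte(`{"state":"shutting_down"}`))
	fmt.Println(blocked, err)
	blocked, err = StartupBlocked([]byte(`{"state":"normal"}`))
	fmt.Println(blocked, err)
}
```

Rejecting unknown states with an error, rather than silently allowing startup, keeps a corrupted or newer-format file from being mistaken for a clear signal.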
Etcd Recovery
- Manual: hecate etcd-restore --config /etc/hecate/hecate.yaml --execute
- Optional snapshot override: --snapshot /var/lib/rancher/k3s/server/db/snapshots/<name>
- Startup can automatically invoke the same restore path after API timeout using:
  - startup.auto_etcd_restore_on_api_failure: true
  - startup.etcd_restore_control_plane: <control-plane-node>
- If control planes are configured with --datastore-endpoint (external DB), Hecate will skip etcd restore and retry control-plane startup instead.
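The recovery rules above amount to a small decision table. The sketch below encodes one reading of it; the RecoveryAction type, the action names, and the choice to leave recovery manual when automatic restore is disabled are assumptions for illustration.

```go
package main

import "fmt"

// RecoveryAction names what startup does once the API wait has ended.
// The type and its values are hypothetical labels.
type RecoveryAction string

const (
	ActionNone              RecoveryAction = "none"
	ActionManual            RecoveryAction = "manual-intervention"
	ActionRetryControlPlane RecoveryAction = "retry-control-plane"
	ActionEtcdRestore       RecoveryAction = "etcd-restore"
)

// DecideRecovery picks an action from three facts:
// whether the API became reachable, whether
// startup.auto_etcd_restore_on_api_failure is true, and whether the control
// planes use --datastore-endpoint (an external database).
func DecideRecovery(apiReachable, autoRestore, externalDatastore bool) RecoveryAction {
	switch {
	case apiReachable:
		return ActionNone
	case !autoRestore:
		// Assumed: without the opt-in, recovery is left to the operator.
		return ActionManual
	case externalDatastore:
		// No embedded etcd to restore, so retry control-plane startup instead.
		return ActionRetryControlPlane
	default:
		return ActionEtcdRestore
	}
}

func main() {
	fmt.Println(DecideRecovery(true, true, false))
	fmt.Println(DecideRecovery(false, true, true))
	fmt.Println(DecideRecovery(false, true, false))
	fmt.Println(DecideRecovery(false, false, false))
}
```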
Disruptive startup drills
Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:
- scripts/hecate-drills.sh list
- scripts/hecate-drills.sh run flux-gitea-deadlock --execute
- scripts/hecate-drills.sh run foundation-recovery --execute
- scripts/hecate-drills.sh run reconciliation-resume --execute
These drills are intentionally excluded from the regular go test ./... run.