Hecate

Hecate is the host-level bootstrap and power-protection service for Titan.

It runs on titan-db and handles:

  • Staged startup (including Flux/Gitea bootstrap deadlock fallback)
  • Graceful shutdown
  • UPS-driven automatic shutdown decisions based on discharge/runtime
  • Multi-UPS operation via multiple Hecate instances (for example titan-db + tethys)
  • Full hardware poweroff sequencing after graceful Kubernetes shutdown

Why host-level

A service inside Kubernetes cannot start a cluster that is fully down. Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.

Commands

  • hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main
  • hecate shutdown --config /etc/hecate/hecate.yaml --execute
  • hecate daemon --config /etc/hecate/hecate.yaml
  • hecate status --config /etc/hecate/hecate.yaml

Key startup guards:

  • Startup is blocked on hosts configured as coordination.role: peer (unless --allow-peer-startup is used intentionally).
  • --auto-peer-failover makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
  • Startup is blocked while UPS is on battery by default (unless --allow-on-battery or coordination.allow_startup_on_battery: true is set).
  • Startup is blocked when a shutdown intent is active (/var/lib/hecate/intent.json).
  • Startup waits for time sync in strict or quorum mode (startup.time_sync_mode, startup.time_sync_quorum).
  • Startup can block until storage is healthy (startup.require_storage_ready + critical PVC checks).
  • Startup can block until external probes pass (startup.require_post_start_probes + startup.post_start_probes).
  • Startup refreshes a cached bootstrap manifest set under /var/lib/hecate/bootstrap-cache and can use it when local fallback paths fail.
  • Vault unseal now falls back to a local cached key file (startup.vault_unseal_key_file) if vault-init cannot be read yet.
  • Optional off-site break-glass retrieval can be configured with startup.vault_unseal_breakglass_command (for example, an SSH cat command to a remote key escrow host).
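
The peer-related guards above can be read as a small decision function. The Go sketch below is illustrative only: the type and function names are hypothetical and not taken from Hecate's source, and the precedence between --auto-peer-failover and --allow-peer-startup is an assumption. Only the flag semantics come from the list above.

```go
package main

import "fmt"

// StartupAction is a hypothetical result type used only in this sketch.
type StartupAction string

const (
	ActionBlock        StartupAction = "block"
	ActionHandOff      StartupAction = "hand-off-to-coordinator"
	ActionLocalStartup StartupAction = "local-startup"
)

// PeerStartupInput collects the inputs the guards describe.
type PeerStartupInput struct {
	Role                 string // coordination.role: "coordinator" or "peer"
	AllowPeerStartup     bool   // --allow-peer-startup
	AutoPeerFailover     bool   // --auto-peer-failover
	CoordinatorReachable bool   // assumed to be probed by the caller
}

// DecidePeerStartup applies the role guard and the failover rule.
// Assumption: auto-failover is evaluated before the explicit override.
func DecidePeerStartup(in PeerStartupInput) StartupAction {
	if in.Role != "peer" {
		return ActionLocalStartup // the role guard only restricts peers
	}
	if in.AutoPeerFailover {
		if in.CoordinatorReachable {
			return ActionHandOff // coordinator gets the first attempt
		}
		return ActionLocalStartup // coordinator unreachable: start locally
	}
	if in.AllowPeerStartup {
		return ActionLocalStartup // intentional operator override
	}
	return ActionBlock
}

func main() {
	fmt.Println(DecidePeerStartup(PeerStartupInput{Role: "peer"}))
	fmt.Println(DecidePeerStartup(PeerStartupInput{Role: "peer", AutoPeerFailover: true, CoordinatorReachable: true}))
	fmt.Println(DecidePeerStartup(PeerStartupInput{Role: "peer", AutoPeerFailover: true}))
}
```

The other guards (on-battery, shutdown intent, time sync, storage, probes) would be checked independently of this decision and are left out to keep the sketch short.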

Manual install on titan-db

git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service

The installer is idempotent:

  • Re-runs safely on every update
  • Preserves existing /etc/hecate/hecate.yaml
  • Automatically migrates legacy defaults (for example default_budget_seconds: 300 and runtime_safety_factor: 1.10)
  • Ensures required dependencies are installed (kubectl, nut-*, ssh, go, etc.)
  • Installs/refreshes systemd units and enables boot-time self-update
  • Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)

Installer knobs (optional):

  • HECATE_ENABLE_BOOTSTRAP=1 enables hecate-bootstrap.service on this host.
  • HECATE_ENABLE_BOOTSTRAP=0 disables it; the default value, auto, enables bootstrap.
  • HECATE_MANAGE_NUT=0 skips writing NUT/udev files.
  • HECATE_FORCE_CONFIG_TEMPLATE=coordinator|peer|example overwrites /etc/hecate/hecate.yaml from a known template during install.
  • HECATE_NUT_UPS_NAME (default inferred from /etc/hecate/hecate.yaml target, fallback pyrphoros)
  • HECATE_NUT_VENDOR_ID / HECATE_NUT_PRODUCT_ID (defaults 0764 / 0601)
  • HECATE_NUT_MONITOR_USER / HECATE_NUT_MONITOR_PASSWORD (defaults monuser / hecateupsmon)

Bootstrap now (without reboot):

sudo systemctl start hecate-bootstrap.service

Preconditions on titan-db

  • kubectl installed and configured (kubeconfig path in config)
  • SSH reachability to all cluster nodes
  • Remote sudo rights to run:
    • systemctl start/stop k3s
    • systemctl start/stop k3s-agent
  • UPS telemetry available via NUT (upsc)

Optional SSH jump/bastion:

  • Set ssh_jump_host (and optionally ssh_jump_user) to route node SSH through a jump host such as titan-jh. If jump routing is unavailable, Hecate falls back to direct SSH automatically.
  • Set ssh_port, ssh_config_file, ssh_identity_file, and ssh_node_hosts so root-run systemd actions can actually reach node SSH daemons during cold-start recovery.
  • Use ssh_node_users for per-node username overrides (for example titan-24: tethys).
  • Use ssh_managed_nodes to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.

Multi-UPS topology

Recommended:

  • titan-db runs Hecate as the shutdown coordinator with UPS Pyrphoros (pyrphoros@localhost).
  • tethys runs Hecate as a peer with UPS Statera (statera@localhost) and forwards shutdown triggers to titan-db.
  • The bootstrap unit runs on both roles; on a peer it first hands startup off to the coordinator (auto-failover) and starts locally only as a fallback.
  • Local shutdown can remain enabled on the peer as a fallback for when forwarding to the coordinator fails.
  • Use coordination.role: coordinator on titan-db and coordination.role: peer on tethys.

Config

See configs/hecate.example.yaml.

Break-glass unseal fallback knobs:

  • startup.vault_unseal_breakglass_command: optional shell command that prints the unseal key to stdout.
  • startup.vault_unseal_breakglass_timeout_seconds: timeout for the command (default 15).

UPS auto-shutdown trigger uses:

  • runtime threshold = runtime_safety_factor * estimated_shutdown_budget
  • default safety factor 1.25
  • debounce across multiple polls to avoid noise
  • emergency trigger budget defaults to shutdown.emergency_budget_seconds and is learned from historical UPS-triggered shutdown runs once enough samples exist
  • UPS-triggered shutdown executes the emergency fast path by default (shutdown.emergency_skip_drain: true, shutdown.emergency_skip_etcd_snapshot: true)
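
As a worked illustration of the threshold and debounce rules above, here is a minimal Go sketch. It is not Hecate's implementation: the Debouncer type, the "N consecutive polls" rule, the poll count of 3, and the 240-second budget are all assumptions chosen for the example. Only the formula and the 1.25 default factor come from this README.

```go
package main

import "fmt"

// RuntimeThreshold returns runtime_safety_factor * estimated shutdown budget, in seconds.
func RuntimeThreshold(safetyFactor, budgetSeconds float64) float64 {
	return safetyFactor * budgetSeconds
}

// Debouncer triggers only after `required` consecutive qualifying polls.
// The consecutive-poll rule is an assumption about how debounce could work.
type Debouncer struct {
	required int
	streak   int
}

func NewDebouncer(required int) *Debouncer {
	return &Debouncer{required: required}
}

// Observe records one UPS poll and reports whether shutdown should trigger.
// A poll qualifies when the UPS is on battery and its remaining runtime
// is at or below the threshold; any other poll resets the streak.
func (d *Debouncer) Observe(onBattery bool, runtimeSeconds, threshold float64) bool {
	if onBattery && runtimeSeconds <= threshold {
		d.streak++
	} else {
		d.streak = 0
	}
	return d.streak >= d.required
}

func main() {
	// Hypothetical budget of 240 s with the default factor 1.25 gives 300 s.
	threshold := RuntimeThreshold(1.25, 240)
	d := NewDebouncer(3)
	for _, runtime := range []float64{900, 290, 310, 280, 270, 260} {
		fmt.Printf("runtime=%.0fs trigger=%v\n", runtime, d.Observe(true, runtime, threshold))
	}
}
```

In this run the single reading of 310 s resets the streak, so the trigger fires only on the last poll, after three consecutive readings at or below 300 s.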

Estimated shutdown budgets are derived from historical successful shutdown runs (/var/lib/hecate/runs.json) with config fallbacks:

  • estimated_shutdown_budget_seconds: full/manual shutdown path
  • estimated_emergency_shutdown_budget_seconds: UPS/emergency path
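
The "history first, config fallback" idea can be sketched in Go as follows. The layout of runs.json is not described here, so the function works on a plain slice of durations. Taking the slowest observed run as the budget and requiring a minimum sample count are both assumptions, not documented behavior.

```go
package main

import "fmt"

// EstimateBudget returns a shutdown budget in seconds.
// It uses the slowest successful run from history when at least
// minSamples runs exist (hypothetical rule), else the config fallback.
func EstimateBudget(history []float64, fallbackSeconds float64, minSamples int) float64 {
	if len(history) == 0 || len(history) < minSamples {
		return fallbackSeconds
	}
	worst := history[0]
	for _, d := range history[1:] {
		if d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	// Illustrative numbers only; the fallback stands in for a value such as
	// estimated_shutdown_budget_seconds from the config.
	fmt.Println(EstimateBudget([]float64{180, 212, 195}, 300, 3))
	fmt.Println(EstimateBudget([]float64{180}, 300, 3))
}
```

Using the worst case rather than a mean errs toward shutting down earlier, which is the safer direction when the UPS is discharging.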

Power metrics:

  • Hecate exposes Prometheus metrics on :9560/metrics by default.
  • This is intended for a dedicated Grafana power dashboard and a high-level overview row.

Notes

  • Default behavior for startup and shutdown is dry-run unless --execute is set.
  • Hecate tracks intent in /var/lib/hecate/intent.json (normal, startup_in_progress, shutting_down, shutdown_complete) to avoid startup/shutdown fighting each other.
  • hecate-bootstrap.service is enabled to run at host boot and perform staged startup automatically.
  • HECATE_ENABLE_BOOTSTRAP=1 forces bootstrap on and HECATE_ENABLE_BOOTSTRAP=0 forces it off; the default, auto, enables it.
  • hecate-update.timer runs on boot and periodically to pull latest main and reinstall Hecate declaratively.
  • Peer startup fallback now checks coordinator intent/bootstrap activity before allowing local startup.
  • Automatic etcd recovery can run during startup if API never becomes reachable (startup.auto_etcd_restore_on_api_failure).
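
To illustrate how the intent file can keep startup and shutdown from fighting each other, here is a small Go sketch. The JSON field name "state" is hypothetical, since the file's schema is not described here. Treating only shutting_down as an active shutdown intent, and failing closed on unknown values, are also assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Intent mirrors a hypothetical layout of /var/lib/hecate/intent.json.
type Intent struct {
	State string `json:"state"` // assumed field name
}

// knownStates lists the four intent values named in this README.
var knownStates = map[string]bool{
	"normal":              true,
	"startup_in_progress": true,
	"shutting_down":       true,
	"shutdown_complete":   true,
}

// StartupBlocked parses raw intent JSON and reports whether startup
// should be refused. Parse errors and unknown states block startup.
func StartupBlocked(raw []byte) (bool, error) {
	var in Intent
	if err := json.Unmarshal(raw, &in); err != nil {
		return true, fmt.Errorf("parse intent: %w", err)
	}
	if !knownStates[in.State] {
		return true, fmt.Errorf("unknown intent state %q", in.State)
	}
	return in.State == "shutting_down", nil
}

func main() {
	for _, raw := range []string{`{"state":"normal"}`, `{"state":"shutting_down"}`, `{"state":"bogus"}`} {
		blocked, err := StartupBlocked([]byte(raw))
		fmt.Printf("%s blocked=%v err=%v\n", raw, blocked, err)
	}
}
```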

Etcd Recovery

  • Manual: hecate etcd-restore --config /etc/hecate/hecate.yaml --execute
  • Optional snapshot override: --snapshot /var/lib/rancher/k3s/server/db/snapshots/<name>
  • Startup can automatically invoke the same restore path after API timeout using:
    • startup.auto_etcd_restore_on_api_failure: true
    • startup.etcd_restore_control_plane: <control-plane-node>
  • If control planes are configured with --datastore-endpoint (external DB), Hecate will skip etcd restore and retry control-plane startup instead.
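
The automatic recovery rules above reduce to a three-way decision, sketched below in Go. The function and constant names are hypothetical. How Hecate detects --datastore-endpoint on the control planes is not covered here, so that is passed in as a plain boolean.

```go
package main

import "fmt"

// RecoveryAction is a hypothetical result type used only in this sketch.
type RecoveryAction string

const (
	RecoveryNone    RecoveryAction = "no-automatic-action"
	RecoveryRetry   RecoveryAction = "retry-control-plane-startup"
	RecoveryRestore RecoveryAction = "etcd-restore"
)

// DecideRecovery chooses what startup does after waiting for the API.
//   apiReachable:      the Kubernetes API answered before the timeout
//   autoRestore:       startup.auto_etcd_restore_on_api_failure
//   externalDatastore: control planes run with --datastore-endpoint
func DecideRecovery(apiReachable, autoRestore, externalDatastore bool) RecoveryAction {
	if apiReachable || !autoRestore {
		return RecoveryNone
	}
	if externalDatastore {
		// There is no embedded etcd to restore, so retry startup instead.
		return RecoveryRetry
	}
	return RecoveryRestore
}

func main() {
	fmt.Println(DecideRecovery(false, true, false))
	fmt.Println(DecideRecovery(false, true, true))
	fmt.Println(DecideRecovery(true, true, false))
}
```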

Disruptive startup drills

Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:

  • scripts/hecate-drills.sh list
  • scripts/hecate-drills.sh run flux-gitea-deadlock --execute
  • scripts/hecate-drills.sh run foundation-recovery --execute
  • scripts/hecate-drills.sh run reconciliation-resume --execute

These drills are intentionally excluded from the regular go test ./... run.
