Hecate

Hecate is the host-level bootstrap and power-protection service for Titan.

It runs on titan-db and handles:

  • Staged startup (including Flux/Gitea bootstrap deadlock fallback)
  • Graceful shutdown
  • UPS-driven automatic shutdown decisions based on discharge/runtime
  • Multi-UPS operation via multiple Hecate instances (for example titan-db + tethys)
  • Full hardware poweroff sequencing after graceful Kubernetes shutdown

Why host-level

A service inside Kubernetes cannot start a cluster that is fully down. Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.

Commands

  • hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main
  • hecate shutdown --config /etc/hecate/hecate.yaml --execute
  • hecate daemon --config /etc/hecate/hecate.yaml
  • hecate status --config /etc/hecate/hecate.yaml

Key startup guards:

  • Startup is blocked on hosts configured as coordination.role: peer (unless --allow-peer-startup is used intentionally).
  • --auto-peer-failover makes peer hosts hand off startup to the coordinator first, then run local startup only if the coordinator is unreachable.
  • Startup is blocked while UPS is on battery by default (unless --allow-on-battery or coordination.allow_startup_on_battery: true is set).
  • Startup is blocked when a shutdown intent is active (/var/lib/hecate/intent.json).
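
The guards above can be read as a small decision function. The following Python sketch is illustrative only and is not Hecate's actual implementation: the function name startup_blockers and its parameters are hypothetical, and it assumes that only the shutting_down intent counts as an active shutdown intent. The --auto-peer-failover handoff is not modeled here.

```python
from typing import List


def startup_blockers(
    role: str,
    on_battery: bool,
    intent: str,
    allow_peer_startup: bool = False,
    allow_on_battery: bool = False,
) -> List[str]:
    """Return the reasons startup would be refused; an empty list means proceed."""
    reasons = []
    # Peers do not start the cluster unless explicitly allowed.
    if role == "peer" and not allow_peer_startup:
        reasons.append("host is a peer (pass --allow-peer-startup to override)")
    # Startup on battery is refused by default.
    if on_battery and not allow_on_battery:
        reasons.append("UPS is on battery")
    # Assumption: only "shutting_down" is treated as an active shutdown intent.
    if intent == "shutting_down":
        reasons.append("shutdown intent is active")
    return reasons
```

For example, startup_blockers("peer", True, "normal") would report two blockers, while a coordinator on mains power with a normal intent gets an empty list.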

Manual install on titan-db

git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service

The installer is idempotent:

  • Re-runs safely on every update
  • Preserves existing /etc/hecate/hecate.yaml
  • Ensures required dependencies are installed (kubectl, nut-*, ssh, go, etc.)
  • Installs/refreshes systemd units and enables boot-time self-update
  • Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)

Installer knobs (optional):

  • HECATE_ENABLE_BOOTSTRAP=1 enables hecate-bootstrap.service on this host.
  • HECATE_ENABLE_BOOTSTRAP=0 disables it; the default value, auto, enables bootstrap.
  • HECATE_MANAGE_NUT=0 skips writing NUT/udev files.
  • HECATE_NUT_UPS_NAME (default inferred from /etc/hecate/hecate.yaml target, fallback pyrphoros)
  • HECATE_NUT_VENDOR_ID / HECATE_NUT_PRODUCT_ID (defaults 0764 / 0601)
  • HECATE_NUT_MONITOR_USER / HECATE_NUT_MONITOR_PASSWORD (defaults monuser / hecateupsmon)

Bootstrap now (without reboot):

sudo systemctl start hecate-bootstrap.service

Preconditions on titan-db

  • kubectl installed and configured (kubeconfig path in config)
  • SSH reachability to all cluster nodes
  • Remote sudo rights to run:
    • systemctl start/stop k3s
    • systemctl start/stop k3s-agent
  • UPS telemetry available via NUT (upsc)

Optional SSH jump/bastion:

  • Set ssh_jump_host (and optionally ssh_jump_user) to route node SSH through a jump host such as titan-jh. If jump routing is unavailable, Hecate falls back to direct SSH automatically.
  • Set ssh_port, ssh_config_file, ssh_identity_file, and ssh_node_hosts so root-run systemd actions can actually reach node SSH daemons during cold-start recovery.
  • Use ssh_node_users for per-node username overrides (for example titan-24: tethys).
  • Use ssh_managed_nodes to limit host-level SSH start/stop actions to nodes Hecate can actually authenticate to.
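
As a rough illustration of how these settings could combine, the Python sketch below builds the ssh command lines to try for one node: first through the jump host, then directly. It is a sketch under assumptions, not Hecate's code. The function name ssh_command_candidates is hypothetical, and the value shapes are assumed (ssh_node_hosts and ssh_node_users as node-name mappings, ssh_managed_nodes as a list). The ssh options used (-F, -i, -p, -J) are standard OpenSSH flags.

```python
from typing import Any, Dict, List


def ssh_command_candidates(
    node: str,
    remote_cmd: str,
    cfg: Dict[str, Any],
) -> List[List[str]]:
    """Build ssh argv lists to try in order: via the jump host first, then direct."""
    managed = cfg.get("ssh_managed_nodes")
    if managed and node not in managed:
        return []  # node is outside the set Hecate may act on over SSH

    # Per-node overrides (assumed to be mappings keyed by node name).
    host = (cfg.get("ssh_node_hosts") or {}).get(node, node)
    user = (cfg.get("ssh_node_users") or {}).get(node)
    target = f"{user}@{host}" if user else host

    base = ["ssh"]
    if cfg.get("ssh_config_file"):
        base += ["-F", str(cfg["ssh_config_file"])]
    if cfg.get("ssh_identity_file"):
        base += ["-i", str(cfg["ssh_identity_file"])]
    if cfg.get("ssh_port"):
        base += ["-p", str(cfg["ssh_port"])]

    candidates = []
    jump_host = cfg.get("ssh_jump_host")
    if jump_host:
        jump_user = cfg.get("ssh_jump_user")
        jump = f"{jump_user}@{jump_host}" if jump_user else str(jump_host)
        candidates.append(base + ["-J", jump, target, remote_cmd])
    # Direct SSH is always the last candidate, mirroring the automatic fallback.
    candidates.append(base + [target, remote_cmd])
    return candidates
```

A caller would run the candidates in order and stop at the first one that succeeds.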

Multi-UPS topology

Recommended:

  • titan-db runs Hecate as the shutdown coordinator with UPS Pyrphoros (pyrphoros@localhost).
  • tethys runs Hecate as a peer with UPS Statera (statera@localhost) and forwards shutdown triggers to titan-db.
  • The bootstrap unit runs on both roles; a peer first hands startup off to the coordinator (auto-failover) and runs local startup only as a fallback.
  • Local shutdown can remain enabled on the peer as a fallback in case forwarding fails.
  • Use coordination.role: coordinator on titan-db and coordination.role: peer on tethys.
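
The forwarding behavior described above might look roughly like the following Python sketch. It is not taken from Hecate: the function handle_shutdown_trigger, its callback parameters, and the returned labels are all hypothetical, and forwarding is assumed to report failure either by returning False or by raising OSError.

```python
from typing import Callable


def handle_shutdown_trigger(
    role: str,
    forward_to_coordinator: Callable[[], bool],
    run_local_shutdown: Callable[[], None],
    local_fallback_enabled: bool = True,
) -> str:
    """Route a UPS shutdown trigger according to the coordination role."""
    if role == "coordinator":
        run_local_shutdown()
        return "local"
    # Peer: ask the coordinator to drive the shutdown.
    try:
        forwarded = forward_to_coordinator()
    except OSError:
        forwarded = False  # a network-level failure counts as "forwarding failed"
    if forwarded:
        return "forwarded"
    if local_fallback_enabled:
        run_local_shutdown()
        return "local-fallback"
    return "dropped"
```

Passing the two actions as callables keeps the routing decision separate from how forwarding and shutdown are actually performed.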

Config

See configs/hecate.example.yaml.

UPS auto-shutdown trigger uses:

  • runtime threshold = runtime_safety_factor * estimated_shutdown_budget
  • default safety factor 1.10
  • debounce across multiple polls to avoid noise
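
A minimal Python sketch of this trigger follows. It is an illustration under assumptions rather than Hecate's implementation: the class and field names are hypothetical, values are assumed to be in seconds, the debounce is modeled as a fixed number of consecutive low-runtime polls (3 here, an invented default), and a poll only counts while the UPS is on battery.

```python
from dataclasses import dataclass


@dataclass
class ShutdownTrigger:
    estimated_shutdown_budget_s: float
    runtime_safety_factor: float = 1.10
    required_consecutive_polls: int = 3  # hypothetical debounce setting
    _low_polls: int = 0

    @property
    def threshold_s(self) -> float:
        # runtime threshold = runtime_safety_factor * estimated_shutdown_budget
        return self.runtime_safety_factor * self.estimated_shutdown_budget_s

    def observe(self, on_battery: bool, runtime_remaining_s: float) -> bool:
        """Feed one UPS poll; return True once shutdown should begin."""
        if on_battery and runtime_remaining_s <= self.threshold_s:
            self._low_polls += 1
        else:
            self._low_polls = 0  # any healthy poll resets the debounce counter
        return self._low_polls >= self.required_consecutive_polls
```

With a 300-second budget and the default factor, the threshold is about 330 seconds, and a single noisy low reading between healthy polls does not trigger a shutdown.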

The estimated shutdown budget is derived from historical successful shutdown runs (/var/lib/hecate/runs.json), falling back to a default from the config when no usable history is available.
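
The README does not say how the run history is aggregated or what runs.json looks like, so the Python sketch below is only one plausible approach. The record fields (action, success, duration_seconds), the function name, the window size, and the choice of taking the slowest recent successful run are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Union


def estimate_shutdown_budget(
    runs_path: Union[str, Path],
    default_budget_s: float,
    window: int = 5,
) -> float:
    """Estimate the shutdown budget in seconds from recent successful runs."""
    try:
        runs = json.loads(Path(runs_path).read_text())
    except (OSError, ValueError):
        return default_budget_s  # missing or unreadable history
    if not isinstance(runs, list):
        return default_budget_s

    # Hypothetical schema:
    # [{"action": "shutdown", "success": true, "duration_seconds": 240}, ...]
    durations = [
        float(r["duration_seconds"])
        for r in runs
        if isinstance(r, dict)
        and r.get("action") == "shutdown"
        and r.get("success")
        and "duration_seconds" in r
    ]
    if not durations:
        return default_budget_s
    # Be conservative: use the slowest of the most recent successful runs.
    return max(durations[-window:])
```

Taking the maximum rather than the mean errs toward shutting down earlier, which is the safer direction when the estimate feeds a UPS runtime threshold.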

Power metrics:

  • Hecate exposes Prometheus metrics on :9560/metrics by default.
  • This is intended for a dedicated Grafana power dashboard and a high-level overview row.

Notes

  • By default, startup and shutdown perform a dry run; pass --execute to make changes.
  • Hecate tracks intent in /var/lib/hecate/intent.json (normal, startup_in_progress, shutting_down, shutdown_complete) to avoid startup/shutdown fighting each other.
  • hecate-bootstrap.service is enabled to run at host boot and perform staged startup automatically.
  • HECATE_ENABLE_BOOTSTRAP=1 forces bootstrap on, HECATE_ENABLE_BOOTSTRAP=0 forces it off, and the default (auto) enables it.
  • hecate-update.timer runs on boot and periodically to pull the latest main and reinstall Hecate declaratively.

Disruptive startup drills

Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:

  • scripts/hecate-drills.sh list
  • scripts/hecate-drills.sh run flux-gitea-deadlock --execute
  • scripts/hecate-drills.sh run foundation-recovery --execute
  • scripts/hecate-drills.sh run reconciliation-resume --execute

These drills are intentionally excluded from regular go test ./... runs.