Hecate

Hecate is the host-level bootstrap and power-protection service for Titan.

It runs on titan-db and handles:

  • Staged startup (including Flux/Gitea bootstrap deadlock fallback)
  • Graceful shutdown
  • UPS-driven automatic shutdown decisions based on discharge/runtime
  • Multi-UPS operation via multiple Hecate instances (for example, titan-db and tethys)
  • Full hardware poweroff sequencing after graceful Kubernetes shutdown

Why host-level

A service inside Kubernetes cannot start a cluster that is fully down. Hecate runs outside the cluster under systemd, so it can always orchestrate bring-up.

Commands

  • hecate startup --config /etc/hecate/hecate.yaml --execute --force-flux-branch main
  • hecate shutdown --config /etc/hecate/hecate.yaml --execute
  • hecate daemon --config /etc/hecate/hecate.yaml
  • hecate status --config /etc/hecate/hecate.yaml

Manual install on titan-db

git clone git@gitea-admin:bstein/hecate.git
cd hecate
sudo HECATE_ENABLE_BOOTSTRAP=1 ./scripts/install.sh
sudoedit /etc/hecate/hecate.yaml
sudo systemctl restart hecate.service

The installer is idempotent:

  • Re-runs safely on every update
  • Preserves existing /etc/hecate/hecate.yaml
  • Ensures required dependencies are installed (kubectl, nut-*, ssh, go, etc.)
  • Installs/refreshes systemd units and enables boot-time self-update
  • Applies declarative NUT + udev UPS configuration by default (can be tuned via env vars)
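The "preserves existing config" behavior can be illustrated with a small sketch. This is not the code in scripts/install.sh; the function name install_config_if_absent is hypothetical and only shows the general copy-if-absent pattern that makes a re-run safe.

```sh
# Hypothetical helper illustrating an idempotent config install.
# $1 = example config shipped with the repo
# $2 = destination path (for example /etc/hecate/hecate.yaml)
install_config_if_absent() {
    if [ -e "$2" ]; then
        # An operator-edited config already exists: leave it untouched.
        echo "keeping existing $2"
    else
        cp "$1" "$2"
        echo "installed default $2"
    fi
}
```

Calling it twice with the same arguments copies the file on the first run and leaves any later edits in place on the second.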

Installer knobs (optional):

  • HECATE_ENABLE_BOOTSTRAP=1 enables hecate-bootstrap.service on this host.
  • HECATE_ENABLE_BOOTSTRAP=0 disables it. The default, auto, preserves the current bootstrap enablement state.
  • HECATE_MANAGE_NUT=0 skips writing NUT/udev files.
  • HECATE_NUT_UPS_NAME sets the UPS name (default: inferred from the target in /etc/hecate/hecate.yaml, falling back to pyrphoros).
  • HECATE_NUT_VENDOR_ID / HECATE_NUT_PRODUCT_ID set the vendor and product IDs used to match the UPS (defaults: 0764 / 0601).
  • HECATE_NUT_MONITOR_USER / HECATE_NUT_MONITOR_PASSWORD set the NUT monitor credentials (defaults: monuser / hecateupsmon).

Bootstrap now (without reboot):

sudo systemctl start hecate-bootstrap.service

Preconditions on titan-db

  • kubectl installed and configured (kubeconfig path in config)
  • SSH reachability to all cluster nodes
  • Remote sudo rights to run:
    • systemctl start/stop k3s
    • systemctl start/stop k3s-agent
  • UPS telemetry available via NUT (upsc)
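The preconditions above can be spot-checked before the first run. The following is a minimal sketch, not part of Hecate; the function names missing_tools and node_reachable are hypothetical, and the node name passed to node_reachable is whatever your cluster uses.

```sh
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check non-interactive SSH reachability for one node.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# This only proves reachability, not the remote sudo rights.
node_reachable() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

Running `missing_tools kubectl ssh upsc` prints nothing when all three tools are installed.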

Multi-UPS topology

Recommended:

  • titan-db runs Hecate as the shutdown coordinator with UPS Pyrphoros (pyrphoros@localhost).
  • tethys runs Hecate as a peer with UPS Statera (statera@localhost) and forwards shutdown triggers to titan-db.
  • Local shutdown can remain enabled on the peer as a fallback in case forwarding fails.
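To confirm that each host can see its own UPS, the NUT client upsc can be queried directly. The sketch below assumes the UPS names listed above. The helper names read_ups_var and on_battery are hypothetical, and how Hecate itself interprets these variables is not specified here.

```sh
# Read one NUT variable from a UPS, for example:
#   read_ups_var pyrphoros@localhost ups.status
#   read_ups_var statera@localhost battery.runtime
read_ups_var() {
    upsc "$1" "$2" 2>/dev/null
}

# Succeed when a NUT ups.status string contains the OB (on battery) flag.
# ups.status is a space-separated list of flags such as "OL CHRG" or "OB DISCHRG".
on_battery() {
    case " $1 " in
        *" OB "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `on_battery "$(read_ups_var pyrphoros@localhost ups.status)"` succeeds only while the UPS reports that it is running on battery.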

Config

See configs/hecate.example.yaml.

UPS auto-shutdown trigger uses:

  • runtime threshold = runtime_safety_factor * estimated_shutdown_budget
  • default safety factor 1.10
  • debouncing across multiple polls, so that a single noisy reading does not trigger a shutdown

The estimated shutdown budget is derived from historical successful shutdown runs (/var/lib/hecate/runs.json), falling back to a default from the config.
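The trigger rule can be sketched as follows. This is an illustration under stated assumptions, not Hecate's implementation: the function names are hypothetical, runtimes are whole seconds, "below threshold" is taken as a strict less-than, and debounce is modeled as a required number of consecutive low readings.

```sh
# Runtime threshold in seconds.
# $1 = runtime_safety_factor, $2 = estimated shutdown budget in seconds
runtime_threshold() {
    awk -v f="$1" -v b="$2" 'BEGIN { printf "%.0f\n", f * b }'
}

# Succeed when the most recent polls are all below the threshold.
# $1 = threshold (seconds), $2 = required consecutive low polls,
# remaining arguments = runtime samples in seconds, oldest first
should_shutdown() {
    threshold=$1
    required=$2
    shift 2
    streak=0
    for runtime in "$@"; do
        if [ "$runtime" -lt "$threshold" ]; then
            streak=$((streak + 1))
        else
            # Any healthy reading resets the debounce counter.
            streak=0
        fi
    done
    [ "$streak" -ge "$required" ]
}
```

With the default safety factor of 1.10 and a budget of 300 seconds, the threshold is 330 seconds. A run of samples such as 400, 320, 310, 300 would trigger with a three-poll debounce, while 320, 310, 400, 300 would not, because the healthy reading resets the streak.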

Power metrics:

  • Hecate exposes Prometheus metrics on :9560/metrics by default.
  • This is intended for a dedicated Grafana power dashboard and a high-level overview row.
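A quick way to check the endpoint from the host is to fetch it with curl. The sketch below assumes the default port and a local listener. The helper names are hypothetical, and no specific metric names are assumed.

```sh
# Fetch the raw metrics text. The URL follows the documented default;
# pass a different URL as $1 if your deployment listens elsewhere.
fetch_metrics() {
    curl -fsS "${1:-http://localhost:9560/metrics}"
}

# Drop "# HELP" / "# TYPE" comment lines and blank lines, leaving only samples.
metric_samples() {
    grep -v -e '^#' -e '^[[:space:]]*$'
}
```

Running `fetch_metrics | metric_samples` prints one line per exported sample, which is usually enough to confirm that Prometheus has something to scrape.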

Notes

  • By default, startup and shutdown run as a dry run; they make changes only when --execute is set.
  • When enabled, hecate-bootstrap.service runs at host boot and performs the staged startup automatically.
  • HECATE_ENABLE_BOOTSTRAP=1 enables hecate-bootstrap.service (recommended on titan-db; keep disabled on non-coordinator hosts).
  • hecate-update.timer runs on boot and periodically to pull the latest main and reinstall Hecate declaratively.

Disruptive startup drills

Hecate includes scripted disruptive drills that intentionally break critical services and verify startup recovery paths:

  • scripts/hecate-drills.sh list
  • scripts/hecate-drills.sh run flux-gitea-deadlock --execute
  • scripts/hecate-drills.sh run foundation-recovery --execute
  • scripts/hecate-drills.sh run reconciliation-resume --execute

These drills are intentionally excluded from the regular go test ./... run.