Ananke runs on the host, outside Kubernetes, because some failures start before the cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core workloads, ingresses, and service checks back into a known-good state.

It is deliberately boring software: do the checks, repair the known deadlocks, and stop loudly when a human needs to touch hardware.

How it works

Ananke reads /etc/ananke/ananke.yaml, then walks the cluster through startup or shutdown gates:

confirm the expected nodes and SSH access
check that Flux is looking at the right repo and branch
wait for required Flux kustomizations and namespaces
repair known startup traps, including Harbor/Gitea/Flux coupling
run ingress, service, endpoint, and soak checks before calling startup done
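
The gate walk above can be pictured as an ordered list of checks that stops at the first failure. This is only a sketch of the idea, not Ananke's actual code: the gate names, the `Gate` type, and `run_gates` are all hypothetical, and the lambdas stand in for real probes.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical gate type: a name plus a check that returns True on success.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for name, check in gates:
        if not check():
            # Stop here so later gates never run on top of a bad base.
            return name
    return None


# Placeholder checks standing in for real node, Flux, and ingress probes.
startup_gates: List[Gate] = [
    ("nodes-and-ssh", lambda: True),
    ("flux-source", lambda: True),
    ("flux-kustomizations", lambda: False),  # simulate a stuck kustomization
    ("startup-repairs", lambda: True),
    ("ingress-and-soak", lambda: True),
]

failed = run_gates(startup_gates)
print(failed or "startup complete")  # prints "flux-kustomizations"
```

Stopping at the first failed gate keeps the report pointed at one concrete problem instead of a cascade of downstream errors.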

Recovery cordons are now treated as short leases. If Ananke cordons a node to repair something, it must either clear the cordon within the configured window or mark the node for manual action. The default window is one hour.
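
The lease rule can be sketched as a small decision function. The function names and the three action strings are made up for illustration, and treating a cordon as expired at exactly the one-hour mark is an assumption; only the one-hour default comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# One hour matches the default window described above.
DEFAULT_LEASE = timedelta(hours=1)


def cordon_lease_expired(cordoned_at: datetime, now: datetime,
                         lease: timedelta = DEFAULT_LEASE) -> bool:
    """Return True once a recovery cordon has outlived its lease."""
    # Assumption: the boundary itself counts as expired.
    return now - cordoned_at >= lease


def next_action(cordoned_at: datetime, now: datetime, repair_done: bool,
                lease: timedelta = DEFAULT_LEASE) -> str:
    """Pick what to do with a node that was cordoned for a repair."""
    if repair_done:
        return "uncordon"        # repair finished, clear the cordon
    if cordon_lease_expired(cordoned_at, now, lease):
        return "manual-action"   # lease ran out, hand the node to a human
    return "keep-waiting"        # still inside the lease window


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_action(start, start + timedelta(minutes=59), repair_done=False))
print(next_action(start, start + timedelta(hours=1), repair_done=False))
```

The point of the lease is that a forgotten cordon cannot linger silently: every cordon ends in either an uncordon or an explicit request for manual action.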

Daily commands

Run these on titan-db unless you are deliberately working from the tethys peer:

sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

Useful files:

/var/lib/ananke/startup-progress.json
/var/lib/ananke/last-startup-report.json
/var/lib/ananke/last-shutdown-report.json
/var/log/ananke/update.log
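
For a quick look at the three JSON files, a small reader like the one below may help. It is a sketch that assumes nothing about the report schema, which is not documented here, so it only lists top-level keys. `load_report` and `summarize` are hypothetical helpers, and reading under /var/lib/ananke may need root.

```python
import json
from pathlib import Path
from typing import Any, Optional

REPORTS = [
    "/var/lib/ananke/startup-progress.json",
    "/var/lib/ananke/last-startup-report.json",
    "/var/lib/ananke/last-shutdown-report.json",
]


def load_report(path: str) -> Optional[Any]:
    """Return parsed JSON, or None if the file is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers missing files and permission problems;
        # ValueError covers invalid JSON and bad encodings.
        return None


def summarize(path: str) -> str:
    """Describe a report without assuming anything about its fields."""
    data = load_report(path)
    if data is None:
        return f"{path}: missing or unreadable"
    if isinstance(data, dict):
        return f"{path}: top-level keys {sorted(data)}"
    return f"{path}: JSON {type(data).__name__}"


if __name__ == "__main__":
    for report in REPORTS:
        print(summarize(report))
```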

Development

Run the full local check before installing:

./scripts/quality_gate.sh

Emergency installs can bypass the gate by setting ANANKE_ENFORCE_QUALITY_GATE=0, but that should stay rare. If a recovery drill needed manual work, the follow-up fix belongs in Ananke so the next drill needs less of it.
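
The bypass switch might be honored by an install step roughly as follows. This is a sketch under assumptions, not the real installer: `gate_enforced` and `preinstall_check` are hypothetical, and it assumes the gate is enforced unless the variable is set to exactly `0`.

```python
import os
import subprocess
import sys
from typing import Mapping, Sequence


def gate_enforced(env: Mapping[str, str]) -> bool:
    """Assumption: the gate is on unless the variable is exactly "0"."""
    return env.get("ANANKE_ENFORCE_QUALITY_GATE", "1") != "0"


def preinstall_check(
    env: Mapping[str, str] = os.environ,
    gate_cmd: Sequence[str] = ("./scripts/quality_gate.sh",),
) -> int:
    """Run the quality gate unless bypassed; return its exit code."""
    if not gate_enforced(env):
        # Say so loudly, so a bypass is never silent.
        print("quality gate bypassed via ANANKE_ENFORCE_QUALITY_GATE=0",
              file=sys.stderr)
        return 0
    return subprocess.run(list(gate_cmd), check=False).returncode
```

A caller would refuse to install when `preinstall_check()` returns a non-zero code.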