# ananke
Ananke is the host-side recovery orchestrator for Titan power events.
It runs outside Kubernetes (systemd on host), so it can:
- shut the cluster down gracefully before remaining UPS runtime gets dangerous
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete
The goal is not clever automation. The goal is boring, repeatable recovery.
## Why ananke
In Greek myth, Ananke is inevitability and necessity. That is the exact constraint we operate under during outages and drills.
Power-domain names in this lab align with that naming:
- `StateraUPS`: titan-23, titan-24, titan-jh
- `PyrphorosUPS`: all other nodes
## Operating model (non-negotiable)
- Ananke does cluster orchestration, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.
Flux source of truth remains `titan-iac.git`.
Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
## Breakglass reminder
Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
What "startup complete" means
Startup is complete only after all required gates pass:
- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions
If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
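The gate sequence above is fail-fast: the first gate that does not pass aborts the run. A minimal sketch, assuming that ordering; the `check` function is a stand-in, not Ananke's real probe logic:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the fail-fast startup gate sequence.
# Gate names mirror the checklist above; the probe is a placeholder,
# and `set -e` makes the first failing gate abort the whole run.
set -eu

check() {
  # a real gate would run its probe here; 'true' stands in for it
  true && echo "gate passed: $1"
}

check inventory-mapping
check ssh-reachability
check flux-source-drift
check flux-kustomizations
check workload-convergence
check ingress-checklist
check service-checklist
check critical-endpoints
check stability-soak
echo "startup complete"
```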
## Status and reports
Live status:
```sh
ananke status --config /etc/ananke/ananke.yaml
ananke status --config /etc/ananke/ananke.yaml --json
```
Artifacts:
- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
- `/var/lib/ananke/update-last.env` (latest self-update result)
- `/var/log/ananke/update.log` (self-update execution log)
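Because the self-update result lives in an env-format file, it can be sourced directly from a shell. A minimal sketch; the variable names (`UPDATE_RESULT`, `UPDATE_COMMIT`) and the temp path are assumptions, not the real file contents:

```shell
#!/usr/bin/env sh
# Hypothetical sketch: write a sample env file (standing in for
# /var/lib/ananke/update-last.env), source it, and read the result.
# The variable names inside it are assumptions for illustration.
cat > /tmp/update-last.env <<'EOF'
UPDATE_RESULT=success
UPDATE_COMMIT=abc1234
EOF

. /tmp/update-last.env
echo "last update: $UPDATE_RESULT at $UPDATE_COMMIT"
```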
## Quick commands
From titan-db:
```sh
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
From titan-24 (tethys peer):
```sh
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
Systemd control:
```sh
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log
```
## Config
Primary config path:
`/etc/ananke/ananke.yaml`
Keep these fields accurate:
- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`
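A hedged sketch of how those fields might sit in `/etc/ananke/ananke.yaml`. Only the key names come from the list above; the nesting of the dotted keys, the host in the URL, and every value are placeholder assumptions:

```yaml
# Hypothetical sketch of /etc/ananke/ananke.yaml -- values are placeholders.
expected_flux_source_url: ssh://git@gitea.example.internal/titan/titan-iac.git
expected_flux_branch: main
startup:
  service_checklist: []            # user-facing/critical services to verify
  critical_service_endpoints: []   # endpoints that must answer before "complete"
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  ignore_unavailable_nodes: []     # explicit, temporary ignores only
coordination:
  role: primary                    # placeholder; peer hosts mirror this config
  peer_hosts: []
```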
## Quality gate
Top-level quality/testing module:
`testing/`
Deployment gate script:
`scripts/quality_gate.sh`
Gate order:
- docs contract checks
- naming + LOC hygiene checks
- pedantic lint
- per-file coverage gate (95% minimum)
Installer behavior:
- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
## Growing with the lab
When adding nodes or services:
- Update inventory and node mapping in config.
- Add/adjust service checklist entries for anything user-facing or critical.
- Add/adjust ingress expectations for exposed services.
- Use temporary ignores only when truly intentional, then remove them.
- Run `scripts/quality_gate.sh` before host deployment.
Recovery quality should improve over time: every drill should reduce manual work in the next drill.