ananke/README.md

ananke

Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (as a systemd service on the host), so it can:

  • shut the cluster down gracefully before remaining UPS runtime gets dangerous
  • bootstrap the cluster after power is restored
  • break known startup deadlocks (including Flux + in-cluster Gitea coupling)
  • verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

Why ananke

In Greek myth, Ananke is inevitability and necessity. That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

  • Statera UPS: titan-23, titan-24, titan-jh
  • Pyrphoros UPS: all other nodes

Operating model (non-negotiable)

  • Ananke does cluster orchestration, not host power control.
  • Shutdown defaults to cluster-only and should remain that way for normal drills.
  • Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains titan-iac.git. Ananke's own repo (ananke.git) is software only; it is not the desired-state cluster config repo.

Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path. If standard unseal retrieval fails, use startup.vault_unseal_breakglass_command.

What "startup complete" means

Startup is complete only after all required gates pass:

  • inventory mapping is valid
  • expected SSH nodes are reachable/authenticated (minus explicit ignores)
  • Flux source drift guard passes (expected URL + branch)
  • required Flux kustomizations are healthy
  • workload convergence is healthy
  • ingress checklist passes
  • service checklist passes (internal + externally exposed)
  • critical endpoint checks pass
  • stability soak passes with no regressions
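The gate sequence above is fail-fast: the first failing gate aborts the run, and startup is only declared complete when every gate has passed. A minimal sketch of that control flow (gate names mirror the list; the bodies are placeholders, not real Ananke checks):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of fail-fast gate sequencing; not Ananke's real code.
set -euo pipefail

gate() {  # run one gate command; abort the whole startup on first failure
  local name=$1; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (startup is not complete)" >&2
    exit 1
  fi
}

# Placeholder commands (`true`) stand in for the real checks.
gate "inventory mapping"       true
gate "ssh reachability"        true
gate "flux source drift guard" true
gate "flux kustomizations"     true
gate "workload convergence"    true
gate "ingress checklist"       true
gate "service checklist"       true
gate "critical endpoints"      true
gate "stability soak"          true
echo "startup complete"
```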

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.

Status and reports

Live status:

  • ananke status --config /etc/ananke/ananke.yaml
  • ananke status --config /etc/ananke/ananke.yaml --json

Artifacts:

  • /var/lib/ananke/startup-progress.json (live run progress)
  • /var/lib/ananke/last-startup-report.json
  • /var/lib/ananke/last-shutdown-report.json
  • /var/lib/ananke/reports/*.json (historical per-run reports)
  • /var/lib/ananke/runs.json (timing history)
  • /var/lib/ananke/update-last.env (latest self-update result)
  • /var/log/ananke/update.log (self-update execution log)
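The run artifacts above are plain JSON files. A schema-agnostic way to inspect one is to pretty-print it; the temp file in this sketch stands in for a real artifact so the snippet is self-contained:

```shell
# Hypothetical sketch: pretty-print a run artifact without assuming its
# internal schema. On the host you would point this at a real file, e.g.
#   sudo python3 -m json.tool /var/lib/ananke/last-startup-report.json
artifact=$(mktemp)
printf '{"example": true}' > "$artifact"   # stand-in content only
python3 -m json.tool "$artifact"
```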

Quick commands

From titan-db:

sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

From titan-24 (tethys peer):

sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

Systemd control:

sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log

Config

Primary config path:

  • /etc/ananke/ananke.yaml

Keep these fields accurate:

  • expected_flux_source_url
  • expected_flux_branch
  • startup.service_checklist
  • startup.critical_service_endpoints
  • startup.require_ingress_checklist
  • startup.require_node_inventory_reachability
  • startup.ignore_unavailable_nodes
  • coordination.role
  • coordination.peer_hosts
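A minimal sketch of how those fields might sit in /etc/ananke/ananke.yaml. Field names come from the list above; every value is invented for illustration, and the entry shapes for the checklists are deployment-specific:

```yaml
# Illustrative values only; field names are from the list above.
expected_flux_source_url: https://git.example.internal/titan/titan-iac.git  # invented URL
expected_flux_branch: main

startup:
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  ignore_unavailable_nodes: []        # keep empty unless a node is deliberately ignored
  service_checklist: []               # entry shape is deployment-specific
  critical_service_endpoints: []      # entry shape is deployment-specific

coordination:
  role: primary                       # assumed role name
  peer_hosts:
    - titan-24                        # the tethys peer noted under Quick commands
```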

Quality gate

Top-level quality/testing module:

  • testing/

Deployment gate script:

  • scripts/quality_gate.sh

Gate order:

  1. docs contract checks
  2. naming + LOC hygiene checks
  3. pedantic lint
  4. per-file coverage gate (95% minimum)

Installer behavior:

  • scripts/install.sh runs the quality gate by default
  • override only for emergency break/fix: ANANKE_ENFORCE_QUALITY_GATE=0

Growing with the lab

When adding nodes or services:

  1. Update inventory and node mapping in config.
  2. Add/adjust service checklist entries for anything user-facing or critical.
  3. Add/adjust ingress expectations for exposed services.
  4. Use temporary ignores only when truly intentional, then remove them.
  5. Run scripts/quality_gate.sh before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.