
ananke

Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (as a systemd service on the host), so it can:

  • shut the cluster down gracefully before remaining UPS runtime gets dangerously low
  • bootstrap the cluster after power is restored
  • break known startup deadlocks (including Flux + in-cluster Gitea coupling)
  • verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

Why ananke

In Greek myth, Ananke is inevitability and necessity. That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

  • Statera UPS: titan-23, titan-24, titan-jh
  • Pyrphoros UPS: all other nodes

Operating model (non-negotiable)

  • Ananke does cluster orchestration, not host power control.
  • Shutdown defaults to cluster-only and should remain that way for normal drills.
  • Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains titan-iac.git. Ananke's own repo (ananke.git) is software only; it is not the desired-state cluster config repo.

Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path. If standard unseal retrieval fails, use startup.vault_unseal_breakglass_command.
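To see what the breakglass would run before you need it, the command can be pulled out of the config. This is a sketch only: the path value is a made-up placeholder, and the flat key layout is a simplification (the real key is nested under startup:).

```shell
# Sketch: extract the breakglass command from a config file. The value and
# the flat key layout are illustrative assumptions, not the real config.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
vault_unseal_breakglass_command: /usr/local/bin/vault-breakglass
EOF
cmd=$(sed -n 's/^vault_unseal_breakglass_command: //p' "$cfg")
echo "breakglass: $cmd"
rm -f "$cfg"
```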

What "startup complete" means

Startup is complete only after all required gates pass:

  • inventory mapping is valid
  • expected SSH nodes are reachable/authenticated (minus explicit ignores)
  • Flux source drift guard passes (expected URL + branch)
  • required Flux kustomizations are healthy
  • workload convergence is healthy
  • ingress checklist passes
  • service checklist passes (internal + externally exposed)
  • critical endpoint checks pass
  • stability soak passes with no regressions

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
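A quick way to audit the gates after a run is to scan the last startup report for failures. The JSON shape below is an assumption for illustration; the real schema is whatever /var/lib/ananke/last-startup-report.json actually emits.

```shell
# Sketch: count failing gates in a startup report.
# The JSON shape here is an assumed example, not the real report schema.
report=$(mktemp)
cat > "$report" <<'EOF'
{"gates": {"inventory": "pass", "ssh_reachability": "pass", "flux_drift_guard": "fail"}}
EOF
# naive grep count; prefer jq for real parsing
fails=$(grep -c '"fail"' "$report")
echo "failing gates: $fails"
rm -f "$report"
```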

Status and reports

Live status:

  • ananke status --config /etc/ananke/ananke.yaml
  • ananke status --config /etc/ananke/ananke.yaml --json

Artifacts:

  • /var/lib/ananke/startup-progress.json (live run progress)
  • /var/lib/ananke/last-startup-report.json
  • /var/lib/ananke/last-shutdown-report.json
  • /var/lib/ananke/reports/*.json (historical per-run reports)
  • /var/lib/ananke/runs.json (timing history)
  • /var/lib/ananke/update-last.env (latest self-update result)
  • /var/log/ananke/update.log (self-update execution log)
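The update-last.env artifact is a plain env file, so it can be sourced directly in scripts. The variable name UPDATE_STATUS below is an assumption about its contents, shown only to illustrate the pattern.

```shell
# Sketch: read the latest self-update result from an env-style file.
# UPDATE_STATUS is an assumed variable name, not a documented field.
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
UPDATE_STATUS=ok
EOF
# key=value files in this shape can be sourced directly
. "$envfile"
echo "last self-update: $UPDATE_STATUS"
rm -f "$envfile"
```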

Quick commands

From titan-db:

sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

From titan-24 (tethys peer):

sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

Systemd control:

sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log

Config

Primary config path:

  • /etc/ananke/ananke.yaml

Keep these fields accurate:

  • expected_flux_source_url
  • expected_flux_branch
  • startup.service_checklist_explicit_only
  • startup.service_checklist
  • startup.critical_service_endpoints
  • startup.require_ingress_checklist
  • startup.require_node_inventory_reachability
  • startup.node_inventory_reachability_required_nodes
  • startup.node_ssh_auth_required_nodes
  • startup.flux_health_required_kustomizations
  • startup.workload_convergence_required_namespaces
  • startup.ignore_unavailable_nodes
  • coordination.role
  • coordination.peer_hosts
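The fields above slot into the config roughly like this. The nesting is inferred from the dotted field names, and every value is an illustrative placeholder, not the lab's real settings:

```yaml
# /etc/ananke/ananke.yaml (sketch; structure inferred, values are placeholders)
expected_flux_source_url: ssh://git@gitea.example.internal/titan/titan-iac.git
expected_flux_branch: main
startup:
  service_checklist_explicit_only: true
  service_checklist:
    - vault
    - gitea
  critical_service_endpoints:
    - https://vault.example.internal/v1/sys/health
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  node_inventory_reachability_required_nodes:
    - titan-23
    - titan-24
  node_ssh_auth_required_nodes:
    - titan-23
  flux_health_required_kustomizations:
    - flux-system
  workload_convergence_required_namespaces:
    - flux-system
  ignore_unavailable_nodes: []
coordination:
  role: primary
  peer_hosts:
    - titan-24
```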

Quality gate

Top-level quality/testing module:

  • testing/

Deployment gate script:

  • scripts/quality_gate.sh

Gate order:

  1. docs contract checks
  2. split test-module contract (cmd/ + internal/ cannot grow new in-tree _test.go files)
  3. naming + LOC hygiene checks
  4. pedantic lint
  5. per-file coverage gate (95% minimum)
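The ordering above is fail-fast: each stage must pass before the next runs. A minimal sketch of that control flow, with paraphrased stage names (the real commands live in scripts/quality_gate.sh):

```shell
# Sketch: fail-fast ordering of the five gates. Stage names paraphrase the
# list above; they are not the script's real internals.
set -e
passed=0
for gate in docs-contract test-module-contract hygiene lint coverage; do
    echo "gate: $gate"
    # a real stage exits non-zero on failure, which aborts here via set -e
    passed=$((passed + 1))
done
echo "gates passed: $passed/5"
```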

Current migration rule:

  • keep new tests in the top-level testing/ module
  • legacy in-tree _test.go files are temporarily grandfathered through testing/hygiene/in_tree_test_allowlist.txt until they are migrated safely

Installer behavior:

  • scripts/install.sh runs the quality gate by default
  • override only for emergency break/fix: ANANKE_ENFORCE_QUALITY_GATE=0
  • host quality runs keep writing local ananke_quality_gate_* metrics and also publish platform_quality_gate_runs_total{suite="ananke",status=*} to Pushgateway for shared Grafana panels
  • override the Pushgateway target when running outside cluster DNS: ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh
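Put together, a host run publishes a counter shaped like the metric named above. The exposition line below follows that bullet; the Pushgateway job path in the commented push is an assumption.

```shell
# Sketch: build the counter line a host run publishes. The metric name and
# labels come from this README; the job path below is an assumption.
status=pass
metric="platform_quality_gate_runs_total{suite=\"ananke\",status=\"$status\"} 1"
echo "$metric"
# push (commented out; needs a reachable Pushgateway):
# curl --data-binary "$metric" \
#   "${ANANKE_QUALITY_PUSHGATEWAY_URL:?set me}/metrics/job/ananke_quality_gate"
```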

Growing with the lab

When adding nodes or services:

  1. Update inventory and node mapping in config.
  2. Keep the explicit service checklist focused on the core services that must come back during an outage.
  3. Keep *_required_* startup scopes aligned with the same core set so optional stacks do not block bootstrap.
  4. Add/adjust ingress expectations for exposed services.
  5. Use temporary ignores only when truly intentional, then remove them.
  6. Run scripts/quality_gate.sh before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.