ananke/README.md

ananke

Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (as a systemd service on the host), so it can:

  • shut the cluster down gracefully before remaining UPS runtime gets dangerous
  • bootstrap the cluster after power is restored
  • break known startup deadlocks (including Flux + in-cluster Gitea coupling)
  • verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

Why ananke

In Greek myth, Ananke is inevitability and necessity. That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

  • Statera UPS: titan-23, titan-24, titan-jh
  • Pyrphoros UPS: all other nodes

Operating model (non-negotiable)

  • Ananke does cluster orchestration, not host power control.
  • Shutdown defaults to cluster-only and should remain that way for normal drills.
  • Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains titan-iac.git. Ananke's own repo (ananke.git) is software only; it is not the desired-state cluster config repo.

Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path. If standard unseal retrieval fails, use startup.vault_unseal_breakglass_command.

What "startup complete" means

Startup is complete only after all required gates pass:

  • inventory mapping is valid
  • expected SSH nodes are reachable/authenticated (minus explicit ignores)
  • Flux source drift guard passes (expected URL + branch)
  • required Flux kustomizations are healthy
  • workload convergence is healthy
  • ingress checklist passes
  • service checklist passes (internal + externally exposed)
  • critical endpoint checks pass
  • stability soak passes with no regressions
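The gate sequence above is fail-fast: the first failing gate aborts the run, and startup is only declared complete when every gate has passed. A minimal sketch of that control flow (gate names mirror the list; the bodies are placeholders, not real Ananke checks):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of fail-fast gate sequencing; not Ananke's real code.
set -euo pipefail

gate() {  # run one gate command; abort the whole startup on first failure
  local name=$1; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (startup is not complete)" >&2
    exit 1
  fi
}

# Placeholder commands (`true`) stand in for the real checks.
gate "inventory mapping"       true
gate "ssh reachability"        true
gate "flux source drift guard" true
gate "flux kustomizations"     true
gate "workload convergence"    true
gate "ingress checklist"       true
gate "service checklist"       true
gate "critical endpoints"      true
gate "stability soak"          true
echo "startup complete"
```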

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.

Status and reports

Live status:

  • ananke status --config /etc/ananke/ananke.yaml
  • ananke status --config /etc/ananke/ananke.yaml --json

Artifacts:

  • /var/lib/ananke/startup-progress.json (live run progress)
  • /var/lib/ananke/last-startup-report.json
  • /var/lib/ananke/last-shutdown-report.json
  • /var/lib/ananke/reports/*.json (historical per-run reports)
  • /var/lib/ananke/runs.json (timing history)
  • /var/lib/ananke/update-last.env (latest self-update result)
  • /var/log/ananke/update.log (self-update execution log)
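The run artifacts above are plain JSON files. A schema-agnostic way to inspect one is to pretty-print it; the temp file in this sketch stands in for a real artifact so the snippet is self-contained:

```shell
# Hypothetical sketch: pretty-print a run artifact without assuming its
# internal schema. On the host you would point this at a real file, e.g.
#   sudo python3 -m json.tool /var/lib/ananke/last-startup-report.json
artifact=$(mktemp)
printf '{"example": true}' > "$artifact"   # stand-in content only
python3 -m json.tool "$artifact"
```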

Quick commands

From titan-db:

sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

From titan-24 (tethys peer):

sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

Systemd control:

sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log

Config

Primary config path:

  • /etc/ananke/ananke.yaml

Keep these fields accurate:

  • expected_flux_source_url
  • expected_flux_branch
  • startup.service_checklist
  • startup.critical_service_endpoints
  • startup.require_ingress_checklist
  • startup.require_node_inventory_reachability
  • startup.ignore_unavailable_nodes
  • coordination.role
  • coordination.peer_hosts
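A minimal sketch of how those fields might sit in /etc/ananke/ananke.yaml. Field names come from the list above; every value is invented for illustration, and the entry shapes for the checklists are deployment-specific:

```yaml
# Illustrative values only; field names are from the list above.
expected_flux_source_url: https://git.example.internal/titan/titan-iac.git  # invented URL
expected_flux_branch: main

startup:
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  ignore_unavailable_nodes: []        # keep empty unless a node is deliberately ignored
  service_checklist: []               # entry shape is deployment-specific
  critical_service_endpoints: []      # entry shape is deployment-specific

coordination:
  role: primary                       # assumed role name
  peer_hosts:
    - titan-24                        # the tethys peer noted under Quick commands
```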

Quality gate

Top-level quality/testing module:

  • testing/

Deployment gate script:

  • scripts/quality_gate.sh

Gate order:

  1. docs contract checks
  2. naming + LOC hygiene checks
  3. pedantic lint
  4. per-file coverage gate (95% minimum)

Installer behavior:

  • scripts/install.sh runs the quality gate by default
  • override only for emergency break/fix: ANANKE_ENFORCE_QUALITY_GATE=0

Growing with the lab

When adding nodes or services:

  1. Update inventory and node mapping in config.
  2. Add/adjust service checklist entries for anything user-facing or critical.
  3. Add/adjust ingress expectations for exposed services.
  4. Use temporary ignores only when truly intentional, then remove them.
  5. Run scripts/quality_gate.sh before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.