
ananke

Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (as a systemd service on the host), so it can:

  • shut the cluster down gracefully before remaining UPS runtime gets dangerously low
  • bootstrap the cluster after power is restored
  • break known startup deadlocks (including Flux + in-cluster Gitea coupling)
  • verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

Why ananke

In Greek myth, Ananke is inevitability and necessity. That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

  • Statera UPS: titan-23, titan-24, titan-jh
  • Pyrphoros UPS: all other nodes

Operating model (non-negotiable)

  • Ananke does cluster orchestration, not host power control.
  • Shutdown defaults to cluster-only and should remain that way for normal drills.
  • Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains titan-iac.git. Ananke's own repo (ananke.git) is software only; it is not the desired-state cluster config repo.

Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path. If standard unseal retrieval fails, use startup.vault_unseal_breakglass_command.
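To see what the breakglass would run before you need it, the command can be pulled out of the config. This is a sketch only: the path value is a made-up placeholder, and the flat key layout is a simplification (the real key is nested under startup:).

```shell
# Sketch: extract the breakglass command from a config file. The value and
# the flat key layout are illustrative assumptions, not the real config.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
vault_unseal_breakglass_command: /usr/local/bin/vault-breakglass
EOF
cmd=$(sed -n 's/^vault_unseal_breakglass_command: //p' "$cfg")
echo "breakglass: $cmd"
rm -f "$cfg"
```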

What "startup complete" means

Startup is complete only after all required gates pass:

  • inventory mapping is valid
  • expected SSH nodes are reachable/authenticated (minus explicit ignores)
  • Flux source drift guard passes (expected URL + branch)
  • required Flux kustomizations are healthy
  • workload convergence is healthy
  • ingress checklist passes
  • service checklist passes (internal + externally exposed)
  • critical endpoint checks pass
  • stability soak passes with no regressions

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
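A quick way to audit the gates after a run is to scan the last startup report for failures. The JSON shape below is an assumption for illustration; the real schema is whatever /var/lib/ananke/last-startup-report.json actually emits.

```shell
# Sketch: count failing gates in a startup report.
# The JSON shape here is an assumed example, not the real report schema.
report=$(mktemp)
cat > "$report" <<'EOF'
{"gates": {"inventory": "pass", "ssh_reachability": "pass", "flux_drift_guard": "fail"}}
EOF
# naive grep count; prefer jq for real parsing
fails=$(grep -c '"fail"' "$report")
echo "failing gates: $fails"
rm -f "$report"
```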

Status and reports

Live status:

  • ananke status --config /etc/ananke/ananke.yaml
  • ananke status --config /etc/ananke/ananke.yaml --json

Artifacts:

  • /var/lib/ananke/startup-progress.json (live run progress)
  • /var/lib/ananke/last-startup-report.json
  • /var/lib/ananke/last-shutdown-report.json
  • /var/lib/ananke/reports/*.json (historical per-run reports)
  • /var/lib/ananke/runs.json (timing history)
  • /var/lib/ananke/update-last.env (latest self-update result)
  • /var/log/ananke/update.log (self-update execution log)
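The update-last.env artifact is a plain env file, so it can be sourced directly in scripts. The variable name UPDATE_STATUS below is an assumption about its contents, shown only to illustrate the pattern.

```shell
# Sketch: read the latest self-update result from an env-style file.
# UPDATE_STATUS is an assumed variable name, not a documented field.
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
UPDATE_STATUS=ok
EOF
# key=value files in this shape can be sourced directly
. "$envfile"
echo "last self-update: $UPDATE_STATUS"
rm -f "$envfile"
```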

Quick commands

From titan-db:

sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

From titan-24 (tethys peer):

sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only

Systemd control:

sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log

Config

Primary config path:

  • /etc/ananke/ananke.yaml

Keep these fields accurate:

  • expected_flux_source_url
  • expected_flux_branch
  • startup.service_checklist_explicit_only
  • startup.service_checklist
  • startup.critical_service_endpoints
  • startup.require_ingress_checklist
  • startup.require_node_inventory_reachability
  • startup.node_inventory_reachability_required_nodes
  • startup.node_ssh_auth_required_nodes
  • startup.flux_health_required_kustomizations
  • startup.workload_convergence_required_namespaces
  • startup.ignore_unavailable_nodes
  • coordination.role
  • coordination.peer_hosts
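The fields above slot into the config roughly like this. The nesting is inferred from the dotted field names, and every value is an illustrative placeholder, not the lab's real settings:

```yaml
# /etc/ananke/ananke.yaml (sketch; structure inferred, values are placeholders)
expected_flux_source_url: ssh://git@gitea.example.internal/titan/titan-iac.git
expected_flux_branch: main
startup:
  service_checklist_explicit_only: true
  service_checklist:
    - vault
    - gitea
  critical_service_endpoints:
    - https://vault.example.internal/v1/sys/health
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  node_inventory_reachability_required_nodes:
    - titan-23
    - titan-24
  node_ssh_auth_required_nodes:
    - titan-23
  flux_health_required_kustomizations:
    - flux-system
  workload_convergence_required_namespaces:
    - flux-system
  ignore_unavailable_nodes: []
coordination:
  role: primary
  peer_hosts:
    - titan-24
```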

Quality gate

Top-level quality/testing module:

  • testing/

Deployment gate script:

  • scripts/quality_gate.sh

Gate order:

  1. docs contract checks
  2. split test-module contract (cmd/ + internal/ cannot grow new in-tree _test.go files)
  3. naming + LOC hygiene checks
  4. pedantic lint
  5. per-file coverage gate (95% minimum)
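The ordering above is fail-fast: each stage must pass before the next runs. A minimal sketch of that control flow, with paraphrased stage names (the real commands live in scripts/quality_gate.sh):

```shell
# Sketch: fail-fast ordering of the five gates. Stage names paraphrase the
# list above; they are not the script's real internals.
set -e
passed=0
for gate in docs-contract test-module-contract hygiene lint coverage; do
    echo "gate: $gate"
    # a real stage exits non-zero on failure, which aborts here via set -e
    passed=$((passed + 1))
done
echo "gates passed: $passed/5"
```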

Current migration rule:

  • keep new tests in the top-level testing/ module
  • legacy in-tree _test.go files are temporarily grandfathered through testing/hygiene/in_tree_test_allowlist.txt until they are migrated safely

Installer behavior:

  • scripts/install.sh runs the quality gate by default
  • override only for emergency break/fix: ANANKE_ENFORCE_QUALITY_GATE=0
  • host quality runs keep writing local ananke_quality_gate_* metrics and also publish platform_quality_gate_runs_total{suite="ananke",status=*} to Pushgateway for shared Grafana panels
  • override the Pushgateway target when running outside cluster DNS: ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh
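Put together, a host run publishes a counter shaped like the metric named above. The exposition line below follows that bullet; the Pushgateway job path in the commented push is an assumption.

```shell
# Sketch: build the counter line a host run publishes. The metric name and
# labels come from this README; the job path below is an assumption.
status=pass
metric="platform_quality_gate_runs_total{suite=\"ananke\",status=\"$status\"} 1"
echo "$metric"
# push (commented out; needs a reachable Pushgateway):
# curl --data-binary "$metric" \
#   "${ANANKE_QUALITY_PUSHGATEWAY_URL:?set me}/metrics/job/ananke_quality_gate"
```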

Growing with the lab

When adding nodes or services:

  1. Update inventory and node mapping in config.
  2. Keep the explicit service checklist focused on the core services that must come back during an outage.
  3. Keep *_required_* startup scopes aligned with the same core set so optional stacks do not block bootstrap.
  4. Add/adjust ingress expectations for exposed services.
  5. Use temporary ignores only when truly intentional, then remove them.
  6. Run scripts/quality_gate.sh before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.