# ananke

Ananke is the host-side recovery orchestrator for Titan power events. It runs outside Kubernetes (systemd on the host), so it can:

- shut the cluster down gracefully before UPS runtime gets dangerously low
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

## Why `ananke`

In Greek myth, **Ananke** is inevitability and necessity. That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros` UPS: all other nodes

## Operating model (non-negotiable)

- Ananke does **cluster orchestration**, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.

Flux source of truth remains `titan-iac.git`. Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.

## Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path. If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.

## What "startup complete" means

Startup is complete only after all required gates pass:

- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
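During a drill it can help to watch the gates converge rather than rerunning `status` by hand. The sketch below polls the JSON status until everything passes. It assumes hypothetical `.complete` and `.gates[]` fields in the `--json` output (those names are not confirmed by this document), so check the real schema from a live run before relying on it:

```bash
#!/usr/bin/env bash
# Poll Ananke until startup reports complete, printing any gates still failing.
# NOTE: the .complete / .gates[] field names are assumptions for illustration;
# verify the actual schema with `ananke status --json` first.
set -euo pipefail

CONFIG=/etc/ananke/ananke.yaml

until sudo /usr/local/bin/ananke status --config "$CONFIG" --json \
        | jq -e '.complete == true' >/dev/null; do
    sudo /usr/local/bin/ananke status --config "$CONFIG" --json \
        | jq -r '.gates[] | select(.passed | not) | "waiting: \(.name)"'
    sleep 30
done
echo "startup complete: all gates passed"
```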
## Status and reports

Live status:

- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`

Artifacts:

- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
- `/var/lib/ananke/update-last.env` (latest self-update result)
- `/var/log/ananke/update.log` (self-update execution log)

## Quick commands

From `titan-db`:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

From `titan-24` (`tethys` peer):

```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Systemd control:

```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log
```

## Config

Primary config path:

- `/etc/ananke/ananke.yaml`

Keep these fields accurate:

- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`

## Quality gate

Top-level quality/testing module:

- `testing/`

Deployment gate script:

- `scripts/quality_gate.sh`

Gate order:

1. docs contract checks
2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
3. naming + LOC hygiene checks
4. pedantic lint
5. per-file coverage gate (95% minimum)

Current migration rule:

- keep new tests in the top-level `testing/` module
- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely

Installer behavior:

- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`

## Growing with the lab

When adding nodes or services:

1. Update inventory and node mapping in config.
2. Add/adjust service checklist entries for anything user-facing or critical.
3. Add/adjust ingress expectations for exposed services.
4. Use temporary ignores only when truly intentional, then remove them.
5. Run `scripts/quality_gate.sh` before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.
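One way to keep that honest is to scan the historical reports for checks that keep failing across drills. The sketch below assumes hypothetical `.checks[]` entries with `.name`/`.ok` fields in the per-run reports (field names not confirmed by this document); inspect a real report under `/var/lib/ananke/reports/` and adjust the names first:

```bash
#!/usr/bin/env bash
# List failing checks across historical Ananke run reports, oldest first.
# NOTE: the .checks[] / .ok / .name fields are assumptions for illustration;
# check one real report's JSON structure before using this.
set -euo pipefail

for report in /var/lib/ananke/reports/*.json; do
    fails=$(jq -r '[.checks[]? | select(.ok | not) | .name] | join(", ")' "$report")
    if [ -n "$fails" ]; then
        printf '%s: %s\n' "$(basename "$report")" "$fails"
    fi
done
```

A check name that shows up in report after report is a standing Ananke gap, which is exactly what the drill loop above is meant to surface.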