From 57610c623a1b2af0266394f186266d2e1c805dd2 Mon Sep 17 00:00:00 2001
From: codex
Date: Fri, 19 Jun 2026 15:43:49 -0300
Subject: [PATCH] docs: shorten ananke README

---
 README.md | 160 +++++++++++------------------------------------------
 1 file changed, 33 insertions(+), 127 deletions(-)

diff --git a/README.md b/README.md
index 0b9673c..afcb57f 100644
--- a/README.md
+++ b/README.md
@@ -1,71 +1,32 @@
 # ananke
 
-Ananke is the host-side recovery orchestrator for Titan power events.
+Ananke is the thing that gets Atlas back on its feet after power trouble.
 
-It runs outside Kubernetes (systemd on host), so it can:
-- shut the cluster down gracefully before runtime gets dangerous
-- bootstrap the cluster after power is restored
-- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
-- verify real service availability before declaring startup complete
+It runs on the host, outside Kubernetes, because some failures start before the
+cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
+workloads, ingresses, and service checks back into a known-good state.
 
-The goal is not clever automation. The goal is boring, repeatable recovery.
+It is deliberately boring software: do the checks, repair the known deadlocks,
+and stop loudly when a human needs to touch hardware.
 
-## Why `ananke`
+## How it works
 
-In Greek myth, **Ananke** is inevitability and necessity.
-That is the exact constraint we operate under during outages and drills.
+Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
+shutdown gates:
 
-Power-domain names in this lab align with that naming:
-- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
-- `Pyrphoros` UPS: all other nodes
+- confirm the expected nodes and SSH access
+- check that Flux is looking at the right repo and branch
+- wait for required Flux kustomizations and namespaces
+- repair known startup traps, including Harbor/Gitea/Flux coupling
+- run ingress, service, endpoint, and soak checks before calling startup done
 
-## Operating model (non-negotiable)
+Recovery cordons are now treated as short leases. If Ananke cordons a node to
+repair something, it must either clear the cordon within the configured window
+or mark the node for manual action. The default window is one hour.
 
-- Ananke does **cluster orchestration**, not host power control.
-- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
-- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.
+## Daily commands
 
-Flux source of truth remains `titan-iac.git`.
-Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
-
-## Breakglass reminder
-
-Vault breakglass is available through a remote Magic Mirror path.
-If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
-
-## What "startup complete" means
-
-Startup is complete only after all required gates pass:
-- inventory mapping is valid
-- expected SSH nodes are reachable/authenticated (minus explicit ignores)
-- Flux source drift guard passes (expected URL + branch)
-- required Flux kustomizations are healthy
-- workload convergence is healthy
-- ingress checklist passes
-- service checklist passes (internal + externally exposed)
-- critical endpoint checks pass
-- stability soak passes with no regressions
-
-If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
-
-## Status and reports
-
-Live status:
-- `ananke status --config /etc/ananke/ananke.yaml`
-- `ananke status --config /etc/ananke/ananke.yaml --json`
-
-Artifacts:
-- `/var/lib/ananke/startup-progress.json` (live run progress)
-- `/var/lib/ananke/last-startup-report.json`
-- `/var/lib/ananke/last-shutdown-report.json`
-- `/var/lib/ananke/reports/*.json` (historical per-run reports)
-- `/var/lib/ananke/runs.json` (timing history)
-- `/var/lib/ananke/update-last.env` (latest self-update result)
-- `/var/log/ananke/update.log` (self-update execution log)
-
-## Quick commands
-
-From `titan-db`:
+Run these on `titan-db` unless you know you are using the `tethys` peer:
 
 ```bash
 sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
@@ -73,76 +34,21 @@ sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --
 sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
 ```
 
-From `titan-24` (`tethys` peer):
+Useful files:
+
+- `/var/lib/ananke/startup-progress.json`
+- `/var/lib/ananke/last-startup-report.json`
+- `/var/lib/ananke/last-shutdown-report.json`
+- `/var/log/ananke/update.log`
+
+## Development
+
+Run the full local check before installing:
 
 ```bash
-sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
+./scripts/quality_gate.sh
 ```
 
-Systemd control:
-
-```bash
-sudo systemctl status ananke.service
-sudo systemctl start ananke-bootstrap.service
-sudo systemctl start ananke-update.service
-sudo cat /var/lib/ananke/update-last.env
-sudo tail -n 200 /var/log/ananke/update.log
-```
-
-## Config
-
-Primary config path:
-- `/etc/ananke/ananke.yaml`
-
-Keep these fields accurate:
-- `expected_flux_source_url`
-- `expected_flux_branch`
-- `startup.service_checklist_explicit_only`
-- `startup.service_checklist`
-- `startup.critical_service_endpoints`
-- `startup.require_ingress_checklist`
-- `startup.require_node_inventory_reachability`
-- `startup.node_inventory_reachability_required_nodes`
-- `startup.node_ssh_auth_required_nodes`
-- `startup.flux_health_required_kustomizations`
-- `startup.workload_convergence_required_namespaces`
-- `startup.ignore_unavailable_nodes`
-- `coordination.role`
-- `coordination.peer_hosts`
-
-## Quality gate
-
-Top-level quality/testing module:
-- `testing/`
-
-Deployment gate script:
-- `scripts/quality_gate.sh`
-
-Gate order:
-1. docs contract checks
-2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
-3. naming + LOC hygiene checks
-4. pedantic lint
-5. per-file coverage gate (95% minimum)
-
-Current migration rule:
-- keep new tests in the top-level `testing/` module
-- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely
-
-Installer behavior:
-- `scripts/install.sh` runs the quality gate by default
-- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
-- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
-- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`
-
-## Growing with the lab
-
-When adding nodes or services:
-1. Update inventory and node mapping in config.
-2. Keep the explicit service checklist focused on the core services that must come back during an outage.
-3. Keep `*_required_*` startup scopes aligned with the same core set so optional stacks do not block bootstrap.
-4. Add/adjust ingress expectations for exposed services.
-5. Use temporary ignores only when truly intentional, then remove them.
-6. Run `scripts/quality_gate.sh` before host deployment.
-
-Recovery quality should improve over time: every drill should reduce manual work in the next drill.
+Emergency installs can bypass the gate with
+`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
+needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
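
Note on the cordon lease wording added in this patch: the README text says a recovery cordon must either be cleared within the configured window or the node must be marked for manual action, with a default window of one hour. The rule can be sketched roughly as below. This is only an illustration of the rule as written, not Ananke's actual code: the function name `cordon_lease_action`, its arguments, and the three result strings are all hypothetical, and the choice to clear a cordon whenever the repair has finished (even past the window) is an assumption.

```bash
#!/bin/sh
# Hypothetical helper, NOT part of Ananke: decide what to do with a
# recovery cordon that is being treated as a short lease.
#   $1 cordoned_at  epoch seconds when the cordon was applied
#   $2 now          epoch seconds at decision time
#   $3 repaired     "yes" if the repair finished, anything else otherwise
#   $4 window       lease length in seconds (default 3600, i.e. one hour)
cordon_lease_action() {
    cordoned_at=$1
    now=$2
    repaired=$3
    window=${4:-3600}
    age=$((now - cordoned_at))

    if [ "$repaired" = "yes" ]; then
        echo "clear"            # repair done: the cordon can be removed
    elif [ "$age" -ge "$window" ]; then
        echo "manual-action"    # lease ran out before the repair finished
    else
        echo "hold"             # still inside the lease window
    fi
}

# Example calls with made-up timestamps:
cordon_lease_action 1000 1900 no     # prints "hold" (900s into a 3600s lease)
cordon_lease_action 1000 4600 no     # prints "manual-action" (lease expired)
cordon_lease_action 1000 1200 yes    # prints "clear"
```

Keeping the decision in one pure function like this makes the lease rule easy to cover in a unit test, independent of how the cordon itself is applied or removed.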