From 57610c623a1b2af0266394f186266d2e1c805dd2 Mon Sep 17 00:00:00 2001
From: codex <codex@bstein.dev>
Date: Fri, 19 Jun 2026 15:43:49 -0300
Subject: [PATCH] docs: shorten ananke README

---
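Note for reviewers (not part of the commit): the cordon-lease paragraph added
under "How it works" can be read as a small decision rule. The sketch below is
one possible reading in Python, under stated assumptions. `CordonLease`,
`LeaseAction`, `decide`, and the node name are hypothetical and are not taken
from Ananke's code or config. Only the one-hour default comes from the README
text. It also assumes that a finished repair always clears the cordon, even if
the window has already passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

# Assumed default, based on the README wording ("one hour").
DEFAULT_LEASE = timedelta(hours=1)


class LeaseAction(Enum):
    KEEP = "keep"                    # repair is still inside its window
    UNCORDON = "uncordon"            # repair finished, release the node
    MANUAL_ACTION = "manual-action"  # window expired, a human must look


@dataclass(frozen=True)
class CordonLease:
    node: str
    cordoned_at: datetime
    window: timedelta = DEFAULT_LEASE

    def expires_at(self) -> datetime:
        return self.cordoned_at + self.window


def decide(lease: CordonLease, repaired: bool, now: datetime) -> LeaseAction:
    """Return what to do with a recovery cordon at time `now`."""
    if repaired:
        return LeaseAction.UNCORDON
    if now >= lease.expires_at():
        return LeaseAction.MANUAL_ACTION
    return LeaseAction.KEEP


if __name__ == "__main__":
    start = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
    lease = CordonLease(node="node-a", cordoned_at=start)  # hypothetical node
    for minutes, repaired in [(10, False), (30, True), (61, False)]:
        action = decide(lease, repaired, start + timedelta(minutes=minutes))
        print(f"t+{minutes}m repaired={repaired}: {action.value}")
```

The point of the lease framing is that a cordon can never be left behind
silently: every cordon ends in either an uncordon or an explicit manual-action
marker.
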
 README.md | 160 +++++++++++-------------------------------------------
 1 file changed, 33 insertions(+), 127 deletions(-)

diff --git a/README.md b/README.md
index 0b9673c..afcb57f 100644
--- a/README.md
+++ b/README.md
@@ -1,71 +1,32 @@
 # ananke
 
-Ananke is the host-side recovery orchestrator for Titan power events.
+Ananke is the recovery orchestrator that gets Atlas back on its feet after power trouble.
 
-It runs outside Kubernetes (systemd on host), so it can:
-- shut the cluster down gracefully before runtime gets dangerous
-- bootstrap the cluster after power is restored
-- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
-- verify real service availability before declaring startup complete
+It runs on the host, outside Kubernetes, because some failures start before the
+cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
+workloads, ingresses, and service checks back into a known-good state.
 
-The goal is not clever automation. The goal is boring, repeatable recovery.
+It is deliberately boring software: do the checks, repair the known deadlocks,
+and stop loudly when a human needs to touch hardware.
 
-## Why `ananke`
+## How it works
 
-In Greek myth, **Ananke** is inevitability and necessity.
-That is the exact constraint we operate under during outages and drills.
+Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
+shutdown gates:
 
-Power-domain names in this lab align with that naming:
-- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
-- `Pyrphoros` UPS: all other nodes
+- confirm the expected nodes and SSH access
+- check that Flux is looking at the right repo and branch
+- wait for required Flux kustomizations and namespaces
+- repair known startup traps, including Harbor/Gitea/Flux coupling
+- run ingress, service, endpoint, and soak checks before calling startup done
 
-## Operating model (non-negotiable)
+Recovery cordons are treated as short leases. If Ananke cordons a node to
+repair something, it must either clear the cordon within the configured window
+or mark the node for manual action. The default window is one hour.
 
-- Ananke does **cluster orchestration**, not host power control.
-- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
-- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.
+## Daily commands
 
-Flux source of truth remains `titan-iac.git`.
-Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
-
-## Breakglass reminder
-
-Vault breakglass is available through a remote Magic Mirror path.
-If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
-
-## What "startup complete" means
-
-Startup is complete only after all required gates pass:
-- inventory mapping is valid
-- expected SSH nodes are reachable/authenticated (minus explicit ignores)
-- Flux source drift guard passes (expected URL + branch)
-- required Flux kustomizations are healthy
-- workload convergence is healthy
-- ingress checklist passes
-- service checklist passes (internal + externally exposed)
-- critical endpoint checks pass
-- stability soak passes with no regressions
-
-If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
-
-## Status and reports
-
-Live status:
-- `ananke status --config /etc/ananke/ananke.yaml`
-- `ananke status --config /etc/ananke/ananke.yaml --json`
-
-Artifacts:
-- `/var/lib/ananke/startup-progress.json` (live run progress)
-- `/var/lib/ananke/last-startup-report.json`
-- `/var/lib/ananke/last-shutdown-report.json`
-- `/var/lib/ananke/reports/*.json` (historical per-run reports)
-- `/var/lib/ananke/runs.json` (timing history)
-- `/var/lib/ananke/update-last.env` (latest self-update result)
-- `/var/log/ananke/update.log` (self-update execution log)
-
-## Quick commands
-
-From `titan-db`:
+Run these on `titan-db` unless you know you are using the `tethys` peer:
 
 ```bash
 sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
@@ -73,76 +34,21 @@ sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --
 sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
 ```
 
-From `titan-24` (`tethys` peer):
+Useful files:
+
+- `/var/lib/ananke/startup-progress.json`
+- `/var/lib/ananke/last-startup-report.json`
+- `/var/lib/ananke/last-shutdown-report.json`
+- `/var/log/ananke/update.log`
+
+## Development
+
+Run the full local check before installing:
 
 ```bash
-sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
+./scripts/quality_gate.sh
 ```
 
-Systemd control:
-
-```bash
-sudo systemctl status ananke.service
-sudo systemctl start ananke-bootstrap.service
-sudo systemctl start ananke-update.service
-sudo cat /var/lib/ananke/update-last.env
-sudo tail -n 200 /var/log/ananke/update.log
-```
-
-## Config
-
-Primary config path:
-- `/etc/ananke/ananke.yaml`
-
-Keep these fields accurate:
-- `expected_flux_source_url`
-- `expected_flux_branch`
-- `startup.service_checklist_explicit_only`
-- `startup.service_checklist`
-- `startup.critical_service_endpoints`
-- `startup.require_ingress_checklist`
-- `startup.require_node_inventory_reachability`
-- `startup.node_inventory_reachability_required_nodes`
-- `startup.node_ssh_auth_required_nodes`
-- `startup.flux_health_required_kustomizations`
-- `startup.workload_convergence_required_namespaces`
-- `startup.ignore_unavailable_nodes`
-- `coordination.role`
-- `coordination.peer_hosts`
-
-## Quality gate
-
-Top-level quality/testing module:
-- `testing/`
-
-Deployment gate script:
-- `scripts/quality_gate.sh`
-
-Gate order:
-1. docs contract checks
-2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
-3. naming + LOC hygiene checks
-4. pedantic lint
-5. per-file coverage gate (95% minimum)
-
-Current migration rule:
-- keep new tests in the top-level `testing/` module
-- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely
-
-Installer behavior:
-- `scripts/install.sh` runs the quality gate by default
-- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
-- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
-- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`
-
-## Growing with the lab
-
-When adding nodes or services:
-1. Update inventory and node mapping in config.
-2. Keep the explicit service checklist focused on the core services that must come back during an outage.
-3. Keep `*_required_*` startup scopes aligned with the same core set so optional stacks do not block bootstrap.
-4. Add/adjust ingress expectations for exposed services.
-5. Use temporary ignores only when truly intentional, then remove them.
-6. Run `scripts/quality_gate.sh` before host deployment.
-
-Recovery quality should improve over time: every drill should reduce manual work in the next drill.
+Emergency installs can bypass the gate with
+`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
+needed manual work, the follow-up belongs in Ananke so the next drill is cleaner.