From 25a9d05e26c9b460a4fae64c850a955dc58b733d Mon Sep 17 00:00:00 2001
From: Brad Stein
Date: Wed, 8 Apr 2026 19:02:49 -0300
Subject: [PATCH] docs: refresh ananke README and clarify flux source ownership

---
 README.md | 139 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 82 insertions(+), 57 deletions(-)

diff --git a/README.md b/README.md
index 89cf237..6ccc09c 100644
--- a/README.md
+++ b/README.md
@@ -1,53 +1,74 @@
# ananke

-`ananke` is the host-side power + bootstrap orchestrator for Titan.
+Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (systemd on host), so it can:

-- shut the cluster down gracefully before battery/runtime redlines
-- bring the cluster back after power returns
-- recover common Flux/Kustomize startup deadlocks
-- validate service health from the outside before declaring startup done
+- shut the cluster down gracefully before UPS runtime runs dangerously low
+- bootstrap the cluster after power is restored
+- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
+- verify real service availability before declaring startup complete
+
+The goal is not clever automation. The goal is boring, repeatable recovery.

## Why `ananke`

-I wanted a name that fits Titan/mythology, but also describes what this service actually does.
+In Greek myth, **Ananke** is inevitability and necessity.
+That is the exact constraint we operate under during outages and drills.

-In Greek myth, **Ananke** is inevitability/necessity. That matches this tool: when power events happen, graceful sequencing is not optional.
-UPS names in this cluster are also part of the story:
-- `Statera`: powers `titan-23`, `titan-24`, `titan-jh`
-- `Pyrphoros`: powers all other nodes
+Power-domain names in this lab align with that naming:
+- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
+- `Pyrphoros` UPS: all other nodes

+## Operating model (non-negotiable)
+
+- Ananke does **cluster orchestration**, not host power control.
+- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
+- A physical outage can cut host power on its own; Ananke's job is clean state transitions.
+
+The Flux source of truth remains `titan-iac.git`.
+Ananke's own repo (`ananke.git`) contains software only; it is not the desired-state cluster config repo.

## Breakglass reminder

-Vault unseal breakglass is wired for remote retrieval (magic mirror host). If local key retrieval fails, Ananke can use the configured breakglass command.
+Vault breakglass is available through a remote Magic Mirror path.
+If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.

-## What "startup complete" means now
+## What "startup complete" means

-Ananke does **not** stop at "Flux says Ready".
-Startup only completes when all configured gates pass:
-- node inventory preflight passes (host mapping + ssh user + port sanity)
-- node SSH auth gate passes (real command execution, not just TCP)
-- Flux source drift guard passes (`expected_flux_source_url` + branch expectation)
-- Flux kustomizations are healthy
-- controller convergence is healthy (deployments/statefulsets/daemonsets)
-- ingress checklist passes (all discovered ingress hosts reachable with accepted status)
-- external service checklist passes (Gitea, Grafana, Keycloak OIDC, Harbor registry auth challenge, Longhorn auth redirect)
-- stability soak window passes (no regressions, no CrashLoop/ImagePull failures)
+Startup is complete only after all required gates pass:
+- inventory mapping is valid
+- expected SSH nodes are reachable/authenticated (minus explicit ignores)
+- Flux source drift guard passes (expected URL + branch)
+- required Flux kustomizations are healthy
+- workload convergence is healthy
+- ingress checklist passes
+- service checklist passes (internal + externally exposed)
+- critical endpoint checks pass
+- stability soak passes with no regressions

-During startup, Ananke also auto-heals known failure patterns (stuck controller pods, immutable Flux Jobs, critical workloads scaled to zero) and writes a report:
+If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
+
+## Status and reports
+
+Live status:
+- `ananke status --config /etc/ananke/ananke.yaml`
+- `ananke status --config /etc/ananke/ananke.yaml --json`
+
+Artifacts:
+- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
+- `/var/lib/ananke/last-shutdown-report.json`
+- `/var/lib/ananke/reports/*.json` (historical per-run reports)
+- `/var/lib/ananke/runs.json` (timing history)

-If any gate fails, startup is blocked with a concrete reason.
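When a gate fails, the report artifacts are where the concrete reason lands. A minimal triage sketch: the real report schema is not documented here, so the `gates` / `passed` / `reason` field names and the generated sample file below are illustrative assumptions standing in for `/var/lib/ananke/last-startup-report.json`:

```bash
#!/usr/bin/env bash
# Illustrative triage of a startup report. The JSON layout ("gates",
# "passed", "reason") is an assumed example schema, NOT Ananke's real one.
set -euo pipefail

report=$(mktemp)
# Stand-in sample for /var/lib/ananke/last-startup-report.json:
cat > "$report" <<'EOF'
{"gates": [
  {"name": "flux-source-drift", "passed": true,  "reason": ""},
  {"name": "ingress-checklist", "passed": false, "reason": "host unreachable"}
]}
EOF

# Print only the gates that blocked startup, with their reasons.
failed_gates=$(python3 - "$report" <<'PY'
import json, sys

with open(sys.argv[1]) as fh:
    report = json.load(fh)
for gate in report["gates"]:
    if not gate["passed"]:
        print(f"{gate['name']}: {gate['reason']}")
PY
)
echo "$failed_gates"  # -> ingress-checklist: host unreachable

rm -f "$report"
```

The same pattern would apply to `last-shutdown-report.json` once the real field names are substituted.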
+## Quick commands

-## Command quick sheet
-
-From `titan-db` (coordinator):
+From `titan-db`:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
-sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason emergency-power --mode poweroff --skip-drain --skip-etcd-snapshot
```

From `titan-24` (`tethys` peer):

@@ -56,7 +77,7 @@ From `titan-24` (`tethys` peer):

```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

-Systemd:
+Systemd control:

```bash
sudo systemctl status ananke.service
@@ -64,43 +85,47 @@ sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```

-## Shutdown modes (explicit)
+## Config

-`ananke shutdown` now supports explicit mode selection:
-- default behavior is `cluster-only` (host poweroff is not performed)
-- `--mode config`: use config default (`shutdown.poweroff_enabled`)
-- `--mode cluster-only`: stop cluster services only (no host poweroff)
-- `--mode poweroff`: include host poweroff path (explicit only)
-
-This removes ambiguity during drills.
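The `cluster-only` default for drills can also be enforced mechanically. A hypothetical wrapper, not part of Ananke itself (the `build_drill_shutdown` helper is invented for this sketch), that assembles the shutdown command so a routine drill can never reach the host-poweroff path:

```bash
#!/usr/bin/env bash
# Hypothetical drill helper (not shipped with Ananke): build the shutdown
# command line with --mode cluster-only pinned, so host poweroff stays a
# deliberate, manual decision outside this helper.
set -euo pipefail

build_drill_shutdown() {
  local reason="$1"
  echo "sudo /usr/local/bin/ananke shutdown" \
    "--config /etc/ananke/ananke.yaml" \
    "--execute --reason ${reason} --mode cluster-only"
}

cmd=$(build_drill_shutdown graceful-maintenance)
echo "$cmd"
```

Printing the command instead of running it keeps the sketch side-effect free; a real wrapper would `eval` or exec the result.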
-## Config file
-
-Primary path:
+Primary config path:
- `/etc/ananke/ananke.yaml`

-Core settings to keep accurate:
-- `expected_flux_branch`
+Keep these fields accurate:
- `expected_flux_source_url`
-- `startup.require_node_ssh_auth`
-- `startup.require_ingress_checklist`
+- `expected_flux_branch`
- `startup.service_checklist`
-- `startup.service_checklist_stability_seconds`
-- `startup.ignore_unavailable_nodes` (for planned temporary node outages)
-- `coordination.role`, `coordination.peer_hosts`
+- `startup.critical_service_endpoints`
+- `startup.require_ingress_checklist`
+- `startup.require_node_inventory_reachability`
+- `startup.ignore_unavailable_nodes`
+- `coordination.role`
+- `coordination.peer_hosts`

-## Install / update
-
-```bash
-sudo ./scripts/install.sh
-```
+## Quality gate
+
+Top-level quality/testing module:
+- `testing/`
+
+Deployment gate script:
+- `scripts/quality_gate.sh`
+
+Gate order:
+1. docs contract checks
+2. naming + LOC hygiene checks
+3. pedantic lint
+4. per-file coverage gate (95% minimum)

Installer behavior:
-- builds and installs `/usr/local/bin/ananke`
-- installs `ananke*.service` units
-- migrates and enforces current `ananke` config/state paths
+- `scripts/install.sh` runs the quality gate by default
+- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`

-## Notes
+## Growing with the lab

-- Apply changes through Git/Flux manifests; avoid manual in-cluster edits for durable changes.
-- For controlled shutdown/startup drills, treat any manual intervention as a bug and fold the logic back into Ananke.
+When adding nodes or services:
+1. Update inventory and node mapping in config.
+2. Add/adjust service checklist entries for anything user-facing or critical.
+3. Add/adjust ingress expectations for exposed services.
+4. Use temporary ignores only when truly intentional, then remove them.
+5. Run `scripts/quality_gate.sh` before host deployment.
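Taken together, the fields called out under Config sketch a single file. A minimal illustrative shape for `/etc/ananke/ananke.yaml`: only the key names come from the list above; the nesting and every value here are assumptions.

```yaml
# Illustrative only: key names are from the Config list in this README,
# but the structure and all values below are assumed placeholders.
expected_flux_source_url: "https://git.example/titan-iac.git"  # assumed URL form
expected_flux_branch: "main"

startup:
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  service_checklist: []             # user-facing/critical services to probe
  critical_service_endpoints: []    # endpoints that must answer before "done"
  ignore_unavailable_nodes: []      # temporary, intentional ignores only

coordination:
  role: "coordinator"               # assumed value; README names a coordinator and a peer
  peer_hosts: []
```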
+
+Recovery quality should improve over time: every drill should reduce manual work in the next drill.