From 25a9d05e26c9b460a4fae64c850a955dc58b733d Mon Sep 17 00:00:00 2001
From: Brad Stein
Date: Wed, 8 Apr 2026 19:02:49 -0300
Subject: [PATCH] docs: refresh ananke README and clarify flux source ownership

---
 README.md | 139 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 82 insertions(+), 57 deletions(-)

diff --git a/README.md b/README.md
index 89cf237..6ccc09c 100644
--- a/README.md
+++ b/README.md
@@ -1,53 +1,74 @@
# ananke

-`ananke` is the host-side power + bootstrap orchestrator for Titan.
+Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (systemd on host), so it can:

-- shut the cluster down gracefully before battery/runtime redlines
-- bring the cluster back after power returns
-- recover common Flux/Kustomize startup deadlocks
-- validate service health from the outside before declaring startup done
+- shut the cluster down gracefully before UPS runtime runs dangerously low
+- bootstrap the cluster after power is restored
+- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
+- verify real service availability before declaring startup complete
+
+The goal is not clever automation. The goal is boring, repeatable recovery.

## Why `ananke`

-I wanted a name that fits Titan/mythology, but also describes what this service actually does.
+In Greek myth, **Ananke** is inevitability and necessity.
+That is the exact constraint we operate under during outages and drills.

-In Greek myth, **Ananke** is inevitability/necessity. That matches this tool: when power events happen, graceful sequencing is not optional.
-UPS names in this cluster are also part of the story:
-- `Statera`: powers `titan-23`, `titan-24`, `titan-jh`
-- `Pyrphoros`: powers all other nodes
+Power-domain names in this lab align with that naming:
+- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
+- `Pyrphoros` UPS: all other nodes

+## Operating model (non-negotiable)
+
+- Ananke does **cluster orchestration**, not host power control.
+- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
+- A physical outage can cut host power on its own; Ananke's job is clean state transitions.
+
+The Flux source of truth remains `titan-iac.git`.
+Ananke's own repo (`ananke.git`) contains software only; it is not the desired-state cluster config repo.

## Breakglass reminder

-Vault unseal breakglass is wired for remote retrieval (magic mirror host). If local key retrieval fails, Ananke can use the configured breakglass command.
+Vault breakglass is available through a remote Magic Mirror path.
+If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.

-## What "startup complete" means now
+## What "startup complete" means

-Ananke does **not** stop at "Flux says Ready".
-Startup only completes when all configured gates pass:
-- node inventory preflight passes (host mapping + ssh user + port sanity)
-- node SSH auth gate passes (real command execution, not just TCP)
-- Flux source drift guard passes (`expected_flux_source_url` + branch expectation)
-- Flux kustomizations are healthy
-- controller convergence is healthy (deployments/statefulsets/daemonsets)
-- ingress checklist passes (all discovered ingress hosts reachable with accepted status)
-- external service checklist passes (Gitea, Grafana, Keycloak OIDC, Harbor registry auth challenge, Longhorn auth redirect)
-- stability soak window passes (no regressions, no CrashLoop/ImagePull failures)
+Startup is complete only after all required gates pass:
+- inventory mapping is valid
+- expected SSH nodes are reachable/authenticated (minus explicit ignores)
+- Flux source drift guard passes (expected URL + branch)
+- required Flux kustomizations are healthy
+- workload convergence is healthy
+- ingress checklist passes
+- service checklist passes (internal + externally exposed)
+- critical endpoint checks pass
+- stability soak passes with no regressions

-During startup, Ananke also auto-heals known failure patterns (stuck controller pods, immutable Flux Jobs, critical workloads scaled to zero) and writes a report:
+If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
+
+## Status and reports
+
+Live status:
+- `ananke status --config /etc/ananke/ananke.yaml`
+- `ananke status --config /etc/ananke/ananke.yaml --json`
+
+Artifacts:
+- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
+- `/var/lib/ananke/last-shutdown-report.json`
+- `/var/lib/ananke/reports/*.json` (historical per-run reports)
+- `/var/lib/ananke/runs.json` (timing history)

-If any gate fails, startup is blocked with a concrete reason.
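When a gate fails, the report artifacts are where the concrete reason lands. A minimal triage sketch: the real report schema is not documented here, so the `gates` / `passed` / `reason` field names and the generated sample file below are illustrative assumptions standing in for `/var/lib/ananke/last-startup-report.json`:

```bash
#!/usr/bin/env bash
# Illustrative triage of a startup report. The JSON layout ("gates",
# "passed", "reason") is an assumed example schema, NOT Ananke's real one.
set -euo pipefail

report=$(mktemp)
# Stand-in sample for /var/lib/ananke/last-startup-report.json:
cat > "$report" <<'EOF'
{"gates": [
  {"name": "flux-source-drift", "passed": true,  "reason": ""},
  {"name": "ingress-checklist", "passed": false, "reason": "host unreachable"}
]}
EOF

# Print only the gates that blocked startup, with their reasons.
failed_gates=$(python3 - "$report" <<'PY'
import json, sys

with open(sys.argv[1]) as fh:
    report = json.load(fh)
for gate in report["gates"]:
    if not gate["passed"]:
        print(f"{gate['name']}: {gate['reason']}")
PY
)
echo "$failed_gates"  # -> ingress-checklist: host unreachable

rm -f "$report"
```

The same pattern would apply to `last-shutdown-report.json` once the real field names are substituted.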
+## Quick commands

-## Command quick sheet
-
-From `titan-db` (coordinator):
+From `titan-db`:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
-sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason emergency-power --mode poweroff --skip-drain --skip-etcd-snapshot
```

From `titan-24` (`tethys` peer):

@@ -56,7 +77,7 @@ From `titan-24` (`tethys` peer):

```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

-Systemd:
+Systemd control:

```bash
sudo systemctl status ananke.service
@@ -64,43 +85,47 @@ sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```

-## Shutdown modes (explicit)
+## Config

-`ananke shutdown` now supports explicit mode selection:
-- default behavior is `cluster-only` (host poweroff is not performed)
-- `--mode config`: use config default (`shutdown.poweroff_enabled`)
-- `--mode cluster-only`: stop cluster services only (no host poweroff)
-- `--mode poweroff`: include host poweroff path (explicit only)
-
-This removes ambiguity during drills.
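The `cluster-only` default for drills can also be enforced mechanically. A hypothetical wrapper, not part of Ananke itself (the `build_drill_shutdown` helper is invented for this sketch), that assembles the shutdown command so a routine drill can never reach the host-poweroff path:

```bash
#!/usr/bin/env bash
# Hypothetical drill helper (not shipped with Ananke): build the shutdown
# command line with --mode cluster-only pinned, so host poweroff stays a
# deliberate, manual decision outside this helper.
set -euo pipefail

build_drill_shutdown() {
  local reason="$1"
  echo "sudo /usr/local/bin/ananke shutdown" \
    "--config /etc/ananke/ananke.yaml" \
    "--execute --reason ${reason} --mode cluster-only"
}

cmd=$(build_drill_shutdown graceful-maintenance)
echo "$cmd"
```

Printing the command instead of running it keeps the sketch side-effect free; a real wrapper would `eval` or exec the result.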
-## Config file
-
-Primary path:
+Primary config path:
- `/etc/ananke/ananke.yaml`

-Core settings to keep accurate:
-- `expected_flux_branch`
+Keep these fields accurate:
- `expected_flux_source_url`
-- `startup.require_node_ssh_auth`
-- `startup.require_ingress_checklist`
+- `expected_flux_branch`
- `startup.service_checklist`
-- `startup.service_checklist_stability_seconds`
-- `startup.ignore_unavailable_nodes` (for planned temporary node outages)
-- `coordination.role`, `coordination.peer_hosts`
+- `startup.critical_service_endpoints`
+- `startup.require_ingress_checklist`
+- `startup.require_node_inventory_reachability`
+- `startup.ignore_unavailable_nodes`
+- `coordination.role`
+- `coordination.peer_hosts`

-## Install / update
-
-```bash
-sudo ./scripts/install.sh
-```
+## Quality gate
+
+Top-level quality/testing module:
+- `testing/`
+
+Deployment gate script:
+- `scripts/quality_gate.sh`
+
+Gate order:
+1. docs contract checks
+2. naming + LOC hygiene checks
+3. pedantic lint
+4. per-file coverage gate (95% minimum)

Installer behavior:
-- builds and installs `/usr/local/bin/ananke`
-- installs `ananke*.service` units
-- migrates and enforces current `ananke` config/state paths
+- `scripts/install.sh` runs the quality gate by default
+- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`

-## Notes
+## Growing with the lab

-- Apply changes through Git/Flux manifests; avoid manual in-cluster edits for durable changes.
-- For controlled shutdown/startup drills, treat any manual intervention as a bug and fold the logic back into Ananke.
+When adding nodes or services:
+1. Update inventory and node mapping in config.
+2. Add/adjust service checklist entries for anything user-facing or critical.
+3. Add/adjust ingress expectations for exposed services.
+4. Use temporary ignores only when truly intentional, then remove them.
+5. Run `scripts/quality_gate.sh` before host deployment.
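Taken together, the fields called out under Config sketch a single file. A minimal illustrative shape for `/etc/ananke/ananke.yaml`: only the key names come from the list above; the nesting and every value here are assumptions.

```yaml
# Illustrative only: key names are from the Config list in this README,
# but the structure and all values below are assumed placeholders.
expected_flux_source_url: "https://git.example/titan-iac.git"  # assumed URL form
expected_flux_branch: "main"

startup:
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  service_checklist: []             # user-facing/critical services to probe
  critical_service_endpoints: []    # endpoints that must answer before "done"
  ignore_unavailable_nodes: []      # temporary, intentional ignores only

coordination:
  role: "coordinator"               # assumed value; README names a coordinator and a peer
  peer_hosts: []
```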
+
+Recovery quality should improve over time: every drill should reduce manual work in the next drill.