docs: refresh ananke README and clarify flux source ownership

This commit is contained in:
Brad Stein 2026-04-08 19:02:49 -03:00
parent 2268e8915a
commit 25a9d05e26

README.md
# ananke

Ananke is the host-side recovery orchestrator for Titan power events.
It runs outside Kubernetes (systemd on host), so it can:

- shut the cluster down gracefully before runtime gets dangerous
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.
## Why `ananke`

In Greek myth, **Ananke** is inevitability and necessity.
That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:

- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros` UPS: all other nodes
## Operating model (non-negotiable)

- Ananke does **cluster orchestration**, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains `titan-iac.git`.
Ananke's own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
## Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
## What "startup complete" means

Startup is complete only after all required gates pass:

- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions
If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
## Status and reports

Live status:

- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`

Artifacts:

- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
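One handy pattern is pulling failed gates out of a report with a short script (a sketch: the JSON schema shown is a hypothetical example, not Ananke's documented report format — in practice you would point it at `/var/lib/ananke/last-startup-report.json`):

```shell
#!/usr/bin/env sh
# Sketch: list failing gates from a startup report.
# The schema below is hypothetical, for illustration only.
report=$(mktemp)
cat > "$report" <<'EOF'
{"gates": [{"name": "inventory", "passed": true},
           {"name": "ingress-checklist", "passed": false}]}
EOF

failed=$(python3 -c '
import json, sys
data = json.load(open(sys.argv[1]))
for gate in data["gates"]:
    if not gate["passed"]:
        print("failed:", gate["name"])
' "$report")
echo "$failed"
rm -f "$report"
```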
## Quick commands

From `titan-db`:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

From `titan-24` (`tethys` peer):
```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
Systemd control:

```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```
## Config

Primary config path:

- `/etc/ananke/ananke.yaml`

Keep these fields accurate:

- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`
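A minimal shape for these fields might look like the following. The field names come from the list above, but the nesting, example hosts, URLs, and values are all assumptions for illustration, not the authoritative schema:

```yaml
# /etc/ananke/ananke.yaml — illustrative values only.
expected_flux_source_url: "https://git.example.internal/titan/titan-iac.git"  # assumed URL
expected_flux_branch: "main"

startup:
  require_ingress_checklist: true
  require_node_inventory_reachability: true
  ignore_unavailable_nodes: []          # temporary, intentional ignores only
  service_checklist:                    # entry shape is an assumption
    - name: gitea
      url: "https://gitea.example.internal/"
  critical_service_endpoints:
    - "https://grafana.example.internal/api/health"

coordination:
  role: coordinator                     # role values here are assumed
  peer_hosts:
    - titan-24
```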
## Quality gate

Top-level quality/testing module:

- `testing/`

Deployment gate script:

- `scripts/quality_gate.sh`

Gate order:

1. docs contract checks
2. naming + LOC hygiene checks
3. pedantic lint
4. per-file coverage gate (95% minimum)
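The per-file coverage gate can be sketched like this (illustrative only; the real `quality_gate.sh` and its coverage report format are not shown here, so the `file percent` input format is an assumption):

```shell
#!/usr/bin/env sh
# Sketch: enforce a 95% per-file coverage floor.
# Input lines of "file percent" are a hypothetical example.
min=95
fail=0
while read -r file pct; do
  # Compare the integer part; real reports may need decimal handling.
  if [ "${pct%.*}" -lt "$min" ]; then
    echo "coverage gate failed: $file at ${pct}% (< ${min}%)"
    fail=1
  fi
done <<'EOF'
orchestrator.py 97.2
gates.py 91.0
EOF
[ "$fail" -eq 0 ] || echo "quality gate: FAIL"
```

Because the gate is per-file, one well-covered module cannot mask an under-tested one.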
Installer behavior:

- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
## Growing with the lab

When adding nodes or services:

1. Update inventory and node mapping in config.
2. Add/adjust service checklist entries for anything user-facing or critical.
3. Add/adjust ingress expectations for exposed services.
4. Use temporary ignores only when truly intentional, then remove them.
5. Run `scripts/quality_gate.sh` before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.