docs: shorten ananke README

codex 2026-06-19 15:43:49 -03:00
parent 85c0741b3e
commit 57610c623a

160
README.md

@@ -1,71 +1,32 @@
# ananke # ananke
Ananke is the host-side recovery orchestrator for Titan power events. Ananke is the thing that gets Atlas back on its feet after power trouble.
It runs outside Kubernetes (systemd on host), so it can: It runs on the host, outside Kubernetes, because some failures start before the
- shut the cluster down gracefully before runtime gets dangerous cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
- bootstrap the cluster after power is restored workloads, ingresses, and service checks back into a known-good state.
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete
The goal is not clever automation. The goal is boring, repeatable recovery. It is deliberately boring software: do the checks, repair the known deadlocks,
and stop loudly when a human needs to touch hardware.
## Why `ananke` ## How it works
In Greek myth, **Ananke** is inevitability and necessity. Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
That is the exact constraint we operate under during outages and drills. shutdown gates:
Power-domain names in this lab align with that naming: - confirm the expected nodes and SSH access
- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh` - check that Flux is looking at the right repo and branch
- `Pyrphoros` UPS: all other nodes - wait for required Flux kustomizations and namespaces
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done
## Operating model (non-negotiable) Recovery cordons are now treated as short leases. If Ananke cordons a node to
repair something, it must either clear the cordon within the configured window
or mark the node for manual action. The default window is one hour.
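
The cordon lease rule can be sketched as a small decision function. This is only an illustration under stated assumptions: the function names, the `uncordon` / `mark-manual-action` / `keep-waiting` labels, and the epoch-second arguments are hypothetical and not taken from Ananke's code. Only the one-hour default comes from the text.

```bash
# Hypothetical sketch of the cordon lease rule; names and labels are assumptions.

# cordon_lease_state CORDONED_AT NOW [WINDOW_SECONDS]
# Prints "active" while the lease is inside its window, "expired" afterwards.
# The window defaults to 3600 seconds (the one-hour default from the README).
cordon_lease_state() (
  cordoned_at="$1"
  now="$2"
  window="${3:-3600}"
  if [ "$((now - cordoned_at))" -lt "$window" ]; then
    echo "active"
  else
    echo "expired"
  fi
)

# cordon_next_step REPAIR_DONE CORDONED_AT NOW [WINDOW_SECONDS]
# REPAIR_DONE is "yes" or "no". A finished repair always clears the cordon
# (an assumption); an unfinished repair past the window escalates to a human.
cordon_next_step() (
  repair_done="$1"
  state="$(cordon_lease_state "$2" "$3" "${4:-3600}")"
  if [ "$repair_done" = "yes" ]; then
    echo "uncordon"
  elif [ "$state" = "expired" ]; then
    echo "mark-manual-action"
  else
    echo "keep-waiting"
  fi
)
```

For example, `cordon_next_step no "$cordoned_at" "$(date +%s)"` would report whether an unfinished repair is still inside its lease, given a recorded `cordoned_at` timestamp in epoch seconds.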
- Ananke does **cluster orchestration**, not host power control. ## Daily commands
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke's job is clean state transitions.
Flux source of truth remains `titan-iac.git`. Run these on `titan-db` unless you know you are using the `tethys` peer:
Ananke's own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
## Breakglass reminder
Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
## What "startup complete" means
Startup is complete only after all required gates pass:
- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions
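
The "all required gates pass" rule above amounts to running the checks in order and stopping at the first failure. A minimal sketch follows, assuming each gate is a shell command or function that exits non-zero on failure. The `run_gates` name and the output format are hypothetical, not Ananke's actual implementation.

```bash
# Hypothetical sketch: run gates in order, stop loudly at the first failure.

# run_gates GATE...
# Each GATE is a command or function name. Prints "PASS <gate>" per passing
# gate and "startup complete" only if every gate passed.
run_gates() {
  for gate in "$@"; do
    if "$gate"; then
      echo "PASS $gate"
    else
      echo "FAIL $gate: startup is not complete" >&2
      return 1
    fi
  done
  echo "startup complete"
}
```

Stopping at the first failed gate keeps the report honest: later gates are never shown as passing on top of a broken earlier one.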
If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
## Status and reports
Live status:
- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`
Artifacts:
- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
- `/var/lib/ananke/update-last.env` (latest self-update result)
- `/var/log/ananke/update.log` (self-update execution log)
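
A quick way to see which of these artifacts exist on a host is a small helper like the one below. It is a sketch, not part of Ananke: the `report_artifacts` name and its output format are assumptions, and it only checks for file presence because the report schemas are not described here. The glob entry `reports/*.json` is left out on purpose.

```bash
# Hypothetical helper: report which of the listed artifact files exist.

# report_artifacts [STATE_DIR] [LOG_DIR]
# Prints "present <path>" or "missing <path>" for each known artifact.
report_artifacts() (
  state_dir="${1:-/var/lib/ananke}"
  log_dir="${2:-/var/log/ananke}"
  for path in \
    "$state_dir/startup-progress.json" \
    "$state_dir/last-startup-report.json" \
    "$state_dir/last-shutdown-report.json" \
    "$state_dir/runs.json" \
    "$state_dir/update-last.env" \
    "$log_dir/update.log"
  do
    if [ -e "$path" ]; then
      echo "present $path"
    else
      echo "missing $path"
    fi
  done
)
```

Called with no arguments, it checks the default paths from the list; the two optional arguments exist only so the helper can be pointed at a scratch directory.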
## Quick commands
From `titan-db`:
```bash ```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
@@ -73,76 +34,21 @@ sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
``` ```
From `titan-24` (`tethys` peer): Useful files:
- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/log/ananke/update.log`
## Development
Run the full local check before installing:
```bash ```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only ./scripts/quality_gate.sh
``` ```
Systemd control: Emergency installs can bypass the gate with
`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
```bash needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log
```
## Config
Primary config path:
- `/etc/ananke/ananke.yaml`
Keep these fields accurate:
- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist_explicit_only`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.node_inventory_reachability_required_nodes`
- `startup.node_ssh_auth_required_nodes`
- `startup.flux_health_required_kustomizations`
- `startup.workload_convergence_required_namespaces`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`
## Quality gate
Top-level quality/testing module:
- `testing/`
Deployment gate script:
- `scripts/quality_gate.sh`
Gate order:
1. docs contract checks
2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
3. naming + LOC hygiene checks
4. pedantic lint
5. per-file coverage gate (95% minimum)
Current migration rule:
- keep new tests in the top-level `testing/` module
- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely
Installer behavior:
- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`
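
The default-on gate with an emergency override could look roughly like this. It is a sketch under assumptions: the function names and messages are hypothetical, and the real `scripts/install.sh` may handle `ANANKE_ENFORCE_QUALITY_GATE` differently (for instance, it may accept values other than `0`).

```bash
# Hypothetical sketch of a default-on quality gate with an emergency override.

# quality_gate_enforced
# Succeeds unless ANANKE_ENFORCE_QUALITY_GATE is exactly "0".
quality_gate_enforced() {
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

# install_with_gate GATE_COMMAND...
# Runs the gate command first (for example ./scripts/quality_gate.sh) and
# refuses to continue if it fails, unless the override is set.
install_with_gate() {
  if quality_gate_enforced; then
    if ! "$@"; then
      echo "quality gate failed; aborting install" >&2
      return 1
    fi
  else
    echo "quality gate skipped (ANANKE_ENFORCE_QUALITY_GATE=0)" >&2
  fi
  echo "proceeding with install"
}
```

Treating an unset variable as "enforce" keeps the safe path the default, so skipping the gate always requires an explicit choice.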
## Growing with the lab
When adding nodes or services:
1. Update inventory and node mapping in config.
2. Keep the explicit service checklist focused on the core services that must come back during an outage.
3. Keep `*_required_*` startup scopes aligned with the same core set so optional stacks do not block bootstrap.
4. Add/adjust ingress expectations for exposed services.
5. Use temporary ignores only when truly intentional, then remove them.
6. Run `scripts/quality_gate.sh` before host deployment.
Recovery quality should improve over time: every drill should reduce manual work in the next drill.