# ananke
Ananke is the host-side recovery orchestrator for Titan power events.
It runs outside Kubernetes (systemd on host), so it can:
- shut the cluster down gracefully before remaining UPS runtime gets dangerously low
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.
## Why `ananke`
In Greek myth, **Ananke** is inevitability and necessity.
That is the exact constraint we operate under during outages and drills.
Power-domain names in this lab align with that naming:
- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros` UPS: all other nodes
## Operating model (non-negotiable)
- Ananke does **cluster orchestration**, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke's job is clean state transitions.

Flux source of truth remains `titan-iac.git`.
Ananke's own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
## Breakglass reminder
Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
## What "startup complete" means
Startup is complete only after all required gates pass:
- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions

If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
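
The gate list above can be read as a sequence of checks that must all succeed. The following is a minimal sketch of that idea, not Ananke's actual implementation: the function names (`run_gates`, `gate_*`), the environment variables, and the fail-fast ordering are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: gate names and fail-fast ordering are assumptions, not Ananke internals.

# run_gates NAME...: call each named gate function in order.
# Prints one line per gate and returns non-zero at the first failure.
run_gates() {
  local gate
  for gate in "$@"; do
    if "$gate"; then
      echo "PASS $gate"
    else
      echo "FAIL $gate"
      return 1
    fi
  done
  echo "startup complete"
}

# Hypothetical stand-in gates; real checks would probe SSH, Flux, ingress, and so on.
gate_inventory_mapping() { true; }
gate_flux_source_drift() { [ "${ACTUAL_FLUX_BRANCH:-main}" = "${EXPECTED_FLUX_BRANCH:-main}" ]; }
gate_service_checklist() { true; }
```

Calling `run_gates gate_inventory_mapping gate_flux_source_drift gate_service_checklist` prints a `PASS` line per gate and then `startup complete`, or stops at the first `FAIL`. Stopping early keeps the failing gate as the last line of output, which is convenient during a drill.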
## Status and reports
Live status:
- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`

Artifacts:
- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
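
For quick inspection of the historical reports, a small helper like the one below may be handy. It is a sketch under assumptions: `latest_report` is a hypothetical name, "latest" is taken to mean most recently modified, and the report schema is not documented here, so nothing is assumed about the fields inside the JSON.

```bash
# latest_report [DIR]: print the path of the most recently modified *.json file in DIR.
# DIR defaults to the historical report directory listed above.
latest_report() {
  local dir="${1:-/var/lib/ananke/reports}"
  # ls -t sorts newest first; errors (for example, no matches) are silenced
  # and simply yield empty output.
  ls -t "$dir"/*.json 2>/dev/null | head -n 1
}
```

If `jq` is installed, `jq . "$(latest_report)"` pretty-prints the newest report without relying on any particular field names.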
## Quick commands
From `titan-db`:
```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
From `titan-24` (`tethys` peer):
```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
Systemd control:
```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```
## Config
Primary config path:
- `/etc/ananke/ananke.yaml`

Keep these fields accurate:
- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`
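
A crude way to spot a field that has gone missing from the config is a text scan like the sketch below. It assumes the dotted names above map to nested YAML keys whose leaf name appears on its own line (for example `service_checklist:` under `startup:`); that layout is an assumption, and `missing_config_keys` is a hypothetical helper, not part of Ananke. It does not parse YAML or validate values.

```bash
# missing_config_keys FILE: print each expected key name that has no "key:" line in FILE.
# Text match only; nesting and values are not checked.
missing_config_keys() {
  local file="$1" key
  for key in expected_flux_source_url expected_flux_branch \
             service_checklist critical_service_endpoints \
             require_ingress_checklist require_node_inventory_reachability \
             ignore_unavailable_nodes role peer_hosts; do
    grep -Eq "^[[:space:]]*${key}:" "$file" || echo "$key"
  done
}
```

Running `missing_config_keys /etc/ananke/ananke.yaml` prints nothing when every listed key is present, and one key name per line otherwise.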
## Quality gate
Top-level quality/testing module:
- `testing/`

Deployment gate script:
- `scripts/quality_gate.sh`

Gate order:
1. docs contract checks
2. naming + LOC hygiene checks
3. pedantic lint
4. per-file coverage gate (95% minimum)

Installer behavior:
- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
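
The default-on behavior with an explicit opt-out can be sketched as follows. This is illustrative only: how `scripts/install.sh` actually reads the variable is not shown in this README, so the rule "unset or anything other than `0` means enforce" is an assumption, and `gate_enforced` and `maybe_run_gate` are hypothetical names.

```bash
# gate_enforced: succeed unless ANANKE_ENFORCE_QUALITY_GATE is exactly "0".
# Assumption: an unset variable means the gate is enforced.
gate_enforced() {
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

# maybe_run_gate [SCRIPT]: run the gate script when enforced, otherwise log the skip.
maybe_run_gate() {
  local script="${1:-scripts/quality_gate.sh}"
  if gate_enforced; then
    "$script"
  else
    echo "quality gate skipped (ANANKE_ENFORCE_QUALITY_GATE=0)" >&2
  fi
}
```

Logging the skip to stderr leaves a visible trace whenever the emergency override is used, so a bypass does not go unnoticed in install output.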
## Growing with the lab
When adding nodes or services:
1. Update inventory and node mapping in config.
2. Add/adjust service checklist entries for anything user-facing or critical.
3. Add/adjust ingress expectations for exposed services.
4. Use temporary ignores only when truly intentional, then remove them.
5. Run `scripts/quality_gate.sh` before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.