# ananke

Ananke is the thing that gets Atlas back on its feet after power trouble.

It runs on the host, outside Kubernetes, because some failures start before the
cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
workloads, ingresses, and service checks back into a known-good state.

It is deliberately boring software: do the checks, repair the known deadlocks,
and stop loudly when a human needs to touch hardware.

## How it works

Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
shutdown gates:

- confirm the expected nodes and SSH access
- check that Flux is looking at the right repo and branch
- wait for required Flux kustomizations and namespaces
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done

Recovery cordons are now treated as short leases. If Ananke cordons a node to
repair something, it must either clear the cordon within the configured window
or mark the node for manual action. The default window is one hour.
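
The lease rule can be pictured with a small sketch. This is an illustration only, not Ananke's code: the function name `lease_state`, the variable `LEASE_WINDOW_SECONDS`, and the state names `active` and `manual-action` are all hypothetical. The only value taken from this README is the one-hour default.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the cordon-lease rule; names are illustrative only.

# Default lease window from this README: one hour, expressed in seconds.
LEASE_WINDOW_SECONDS=3600

# lease_state CORDONED_AT NOW [WINDOW]
# Takes Unix timestamps in seconds. Prints "active" while the cordon is still
# inside its window, or "manual-action" once the window has elapsed.
lease_state() {
  local cordoned_at=$1 now=$2 window=${3:-$LEASE_WINDOW_SECONDS}
  if (( now - cordoned_at < window )); then
    echo "active"
  else
    echo "manual-action"
  fi
}

# Example: a node cordoned 30 minutes ago is still inside a one-hour lease.
now=$(date +%s)
lease_state "$(( now - 1800 ))" "$now"
```

Treating the cordon as a lease means a crashed or interrupted repair cannot leave a node cordoned indefinitely: once the window passes, the node is flagged for a human instead of being retried.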
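
The gate list at the top of this section amounts to "run each check in order, stop at the first one that fails". A minimal sketch of that shape follows. The function names (`run_gates`, `check_nodes`, and the rest) are placeholders invented for illustration and the gate bodies do nothing; Ananke's real gates are not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of sequential gates; not Ananke's implementation.

# run_gates GATE...
# Runs each named gate function in order and stops at the first failure.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "gate: $gate"
    if ! "$gate"; then
      echo "gate failed: $gate; stopping for a human" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# Placeholder gates that always succeed. Real checks would go in their place.
check_nodes() { true; }
check_flux_source() { true; }
wait_for_kustomizations() { true; }
repair_known_traps() { true; }
run_service_checks() { true; }

run_gates check_nodes check_flux_source wait_for_kustomizations \
  repair_known_traps run_service_checks
```

Stopping at the first failed gate keeps later checks from running against a cluster that is already known to be unhealthy.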
## Daily commands

Run these on `titan-db` unless you know you are using the `tethys` peer:
```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Useful files:

- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/log/ananke/update.log`
## Development

Run the full local check before installing:
```bash
./scripts/quality_gate.sh
```

Emergency installs can bypass the gate with
`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
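
As a rough picture of how such a toggle usually behaves, the sketch below treats the gate as enforced unless the variable is explicitly `0`. That default, and the helper name `gate_enforced`, are assumptions made for illustration; the install tooling's real logic is not reproduced here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the bypass toggle; only the variable name comes from
# this README. Assumes the gate is enforced when the variable is unset.

# gate_enforced: succeeds unless ANANKE_ENFORCE_QUALITY_GATE is exactly "0".
gate_enforced() {
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

if gate_enforced; then
  echo "quality gate enforced"
  # An installer would run ./scripts/quality_gate.sh at this point.
else
  echo "quality gate bypassed" >&2
fi
```

Making the bypass an explicit opt-out, and reporting it on stderr, keeps emergency installs visible in logs.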