# ananke

Ananke is the thing that gets Atlas back on its feet after power trouble.

It runs on the host, outside Kubernetes, because some failures start before the
cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
workloads, ingresses, and service checks back into a known-good state.

It is deliberately boring software: do the checks, repair the known deadlocks,
and stop loudly when a human needs to touch hardware.

## How it works

Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
shutdown gates:

- confirm the expected nodes and SSH access
- check that Flux is looking at the right repo and branch
- wait for required Flux kustomizations and namespaces
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done
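
The gate walk above can be pictured as an ordered list of checks that stops at the first failure. This is a minimal sketch, not Ananke's actual code: `run_gates` and every `gate_*` function are hypothetical names, and the gate bodies are placeholders that always succeed. It only shows the "run in order, stop loudly" shape.

```bash
#!/usr/bin/env bash
# Sketch only: gate names and bodies are hypothetical stand-ins, not Ananke's code.

# Each gate returns 0 on success and non-zero when a human needs to look.
gate_nodes_and_ssh()       { return 0; }  # placeholder: expected nodes reachable over SSH
gate_flux_source()         { return 0; }  # placeholder: Flux points at the right repo and branch
gate_flux_kustomizations() { return 0; }  # placeholder: required kustomizations and namespaces exist
gate_startup_repairs()     { return 0; }  # placeholder: known startup traps repaired
gate_service_checks()      { return 0; }  # placeholder: ingress, service, endpoint, soak checks

# Run the named gates in order; stop at the first failure and report which one it was.
run_gates() {
  local gate
  for gate in "$@"; do
    if ! "$gate"; then
      echo "gate failed: ${gate}" >&2
      return 1
    fi
    echo "gate passed: ${gate}"
  done
}

run_gates gate_nodes_and_ssh gate_flux_source gate_flux_kustomizations \
  gate_startup_repairs gate_service_checks
```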

Recovery cordons are now treated as short leases. If Ananke cordons a node to
repair something, it must either clear the cordon within the configured window
or mark the node for manual action. The default window is one hour.
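
The lease rule comes down to comparing elapsed time against the window. The sketch below is illustrative only: `cordon_lease_state`, its epoch-second arguments, and the two state names are assumptions, not Ananke's interface. It also assumes a cordon that has lasted exactly the full window counts as expired.

```bash
#!/usr/bin/env bash
# Sketch only: the function name, arguments, and state names are assumptions.

LEASE_SECONDS=3600  # one hour, the default window described above

# Print "within-lease" while the cordon is still inside its window,
# or "manual-action" once the window has passed.
# Arguments: cordon time and current time in epoch seconds, optional window in seconds.
cordon_lease_state() {
  local cordoned_at="$1" now="$2" window="${3:-$LEASE_SECONDS}"
  if (( now - cordoned_at < window )); then
    echo "within-lease"
  else
    echo "manual-action"
  fi
}

cordon_lease_state 1000 2000   # 1000 s elapsed, shorter than the one-hour window
cordon_lease_state 1000 5000   # 4000 s elapsed, longer than the one-hour window
```

A shorter window can be passed as the third argument, for example `cordon_lease_state "$start" "$(date +%s)" 600` for a ten-minute lease.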

## Daily commands

Run these on `titan-db` unless you know you are using the `tethys` peer:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Useful files:

- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/log/ananke/update.log`

## Development

Run the full local check before installing:

```bash
./scripts/quality_gate.sh
```

Emergency installs can bypass the gate with
`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
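
As a rough illustration of the bypass switch, an install script could test the variable before running the gate. This is a sketch under assumptions: `should_run_quality_gate` is a hypothetical helper, and treating an unset variable (or any value other than `0`) as "enforce" is a guess at the intended default, not confirmed behaviour.

```bash
#!/usr/bin/env bash
# Sketch only: should_run_quality_gate is a hypothetical helper, and the
# "unset means enforce" default is an assumption.

# Succeeds unless ANANKE_ENFORCE_QUALITY_GATE is explicitly set to 0.
should_run_quality_gate() {
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

if should_run_quality_gate; then
  echo "quality gate enforced: would run ./scripts/quality_gate.sh"
else
  # Make the bypass visible so it is hard to leave on by accident.
  echo "quality gate bypassed: ANANKE_ENFORCE_QUALITY_GATE=0" >&2
fi
```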