# ananke

Ananke gets Atlas back on its feet after power trouble.

It runs on both tethys (in cluster - titan-24) and titan-db (out of cluster), outside Kubernetes at the host level, because some failures start before the cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core workloads, ingresses, and service checks back into a known-good state.

It aspires to be boring software: do the checks, repair the known deadlocks, and stomp loudly when it exhausts its remedy library.

## How it works

Ananke walks the cluster through startup or shutdown gates:

- confirm the expected nodes and SSH access
- check that Flux is looking at the right repo and branch
- wait for required Flux kustomizations and namespaces
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done
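
The gate sequence above can be pictured as a fail-fast loop. The sketch below is illustrative only and is not Ananke's actual implementation: `run_gates` and the `gate_*` functions are hypothetical names, and each gate body is a stub standing in for a real check.

```bash
# Hypothetical sketch: run named gates in order and stop at the first failure.
# None of these function names come from Ananke itself.

gate_nodes()   { return 0; }  # stub: confirm expected nodes and SSH access
gate_flux()    { return 0; }  # stub: check Flux repo and branch
gate_ingress() { return 0; }  # stub: ingress, service, endpoint, and soak checks

run_gates() {
  local gate
  for gate in "$@"; do
    if "$gate"; then
      echo "PASS $gate"
    else
      echo "FAIL $gate"
      return 1   # later gates are not attempted
    fi
  done
}

run_gates gate_nodes gate_flux gate_ingress
```

Stopping at the first failed gate keeps the output short and points at the one thing that needs attention, which fits the "boring software" goal.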

Recovery cordons are given short, one-hour leases. If Ananke cordons a node to repair something, it must either clear the cordon within the configured window or mark the node for manual action.
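
As a rough illustration of the lease rule, the sketch below decides what to do with a recovery cordon based on its age. It is an assumption-laden sketch, not Ananke's code: `cordon_lease_action`, the epoch-second inputs, and the `within-lease` / `manual-action` strings are all hypothetical, and the 3600-second default simply mirrors the one-hour lease mentioned above.

```bash
# Hypothetical sketch of the cordon lease decision.
# Usage: cordon_lease_action CORDONED_AT NOW [LEASE_SECONDS]
# CORDONED_AT and NOW are Unix epoch seconds.

cordon_lease_action() {
  local cordoned_at="$1" now="$2" lease="${3:-3600}"
  local age=$(( now - cordoned_at ))
  if [ "$age" -lt "$lease" ]; then
    echo "within-lease"    # repair may continue; the cordon must be cleared before expiry
  else
    echo "manual-action"   # lease expired: flag the node for a human
  fi
}

cordon_lease_action 1000 2000   # age 1000s, prints within-lease
cordon_lease_action 1000 5000   # age 4000s, prints manual-action
```

The point of the lease is that a cordon placed by automation never lingers silently: it is either cleared in time or escalated.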

The following are notes for future Brad.

## Daily commands
```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Host files:

- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/log/ananke/update.log`
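
A quick way to see which of these files exist on a host is a small loop. This is a convenience sketch rather than part of Ananke: `report_host_files` is a hypothetical helper, and it only checks for presence because the JSON layout of the reports is not described here.

```bash
# Hypothetical helper: report which of the given files are present.

report_host_files() {
  local f
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "present $f"
    else
      echo "missing $f"
    fi
  done
}

report_host_files \
  /var/lib/ananke/startup-progress.json \
  /var/lib/ananke/last-startup-report.json \
  /var/lib/ananke/last-shutdown-report.json \
  /var/log/ananke/update.log
```

If the directories are not traversable by your user, existing files will show as missing; running the check as root avoids that.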
## Development

Run the quality gate locally before installing:
```bash
./scripts/quality_gate.sh
```

Emergency installs can bypass the gate with `ANANKE_ENFORCE_QUALITY_GATE=0`, but try to avoid this. Treat gate failures as an instructive opportunity to improve Ananke.
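
One way an install script could honor that switch is sketched below. This is an assumption about the mechanism, not a copy of Ananke's installer: `quality_gate_enforced` is a hypothetical helper, and treating an unset variable as "enforce" is a guess at a sensible default.

```bash
# Hypothetical sketch: enforce the quality gate unless it is explicitly disabled.

quality_gate_enforced() {
  # Anything other than the literal value 0 (including unset) means "enforce".
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

if quality_gate_enforced; then
  echo "quality gate enforced: would run ./scripts/quality_gate.sh"
else
  echo "quality gate bypassed (ANANKE_ENFORCE_QUALITY_GATE=0)" >&2
fi
```

Printing the bypass notice to stderr makes an emergency install leave a visible trace in whatever captures the installer's output.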