Update README.md
parent 83d987f43a
commit 09409660a2
28 README.md
@@ -1,18 +1,14 @@
# ananke
Ananke is the thing that gets Atlas back on its feet after power trouble.
Ananke gets Atlas back on its feet after power trouble.
It runs on the host, outside Kubernetes, because some failures start before the
cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
workloads, ingresses, and service checks back into a known-good state.
It runs on both `tethys` (in cluster, titan-24) and `titan-db` (out of cluster), outside Kubernetes at the host level, because some failures start before the cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core workloads, ingresses, and service checks back into a known-good state.
It is deliberately boring software: do the checks, repair the known deadlocks,
and stop loudly when a human needs to touch hardware.
It aspires to be boring software: do the checks, repair the known deadlocks, and stop loudly when it exhausts its remedy library.
## How it works
Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
shutdown gates:
Ananke walks the cluster through startup or shutdown gates:
- confirm the expected nodes and SSH access
- check that Flux is looking at the right repo and branch
@@ -20,21 +16,19 @@ shutdown gates:
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done
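
The gate walk above can be pictured as a loop that runs each check in order and halts at the first failure. This is a minimal sketch only: the names `gate_nodes`, `gate_flux`, `gate_services`, and `run_gates` are hypothetical stand-ins, not Ananke's actual internals.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: gate names and bodies are hypothetical stand-ins,
# not Ananke's real implementation.

gate_nodes()    { true; }   # stand-in: confirm expected nodes and SSH access
gate_flux()     { true; }   # stand-in: check the Flux repo and branch
gate_services() { true; }   # stand-in: ingress, service, endpoint, soak checks

# Run each gate in order; stop loudly at the first one that fails.
run_gates() {
  local gate
  for gate in "$@"; do
    if "$gate"; then
      echo "ok: $gate"
    else
      echo "FAILED: $gate (halting, remaining gates not run)" >&2
      return 1
    fi
  done
}

run_gates gate_nodes gate_flux gate_services
```

Ordering matters in a sketch like this: later gates only make sense once the earlier ones have passed, so a failure returns immediately instead of continuing.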
Recovery cordons are now treated as short leases. If Ananke cordons a node to
repair something, it must either clear the cordon within the configured window
or mark the node for manual action. The default window is one hour.
Recovery cordons are given short leases, one hour by default. If Ananke cordons a node to repair something, it must either clear the cordon within the configured window or mark the node for manual action.
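
The lease rule above comes down to a timestamp comparison. A minimal sketch, assuming epoch-second timestamps and a 3600-second default window; the function name `cordon_lease_action` and its output strings are hypothetical, not part of Ananke:

```bash
#!/usr/bin/env bash
# Illustrative sketch only: the function name, output strings, and the
# 3600-second default are assumptions based on the one-hour window above.

# Usage: cordon_lease_action <cordoned_at_epoch> <now_epoch> [window_seconds]
# Prints "within-lease" while the repair window is still open,
# and "manual-action" once the lease has lapsed.
cordon_lease_action() {
  local cordoned_at=$1 now=$2 window=${3:-3600}
  if (( now - cordoned_at <= window )); then
    echo "within-lease"
  else
    echo "manual-action"
  fi
}

# A cordon placed 30 minutes ago is still inside the default window.
cordon_lease_action 1000 "$(( 1000 + 1800 ))"   # prints: within-lease
```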
The following are notes for future Brad.
## Daily commands
Run these on `titan-db` unless you know you are using the `tethys` peer:
```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
Useful files:
Host files:
- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
@@ -43,12 +37,10 @@ Useful files:
## Development
Run the full local check before installing:
Run the local test check before installing:
```bash
./scripts/quality_gate.sh
```
Emergency installs can bypass the gate with
`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
Emergency installs can bypass the gate with `ANANKE_ENFORCE_QUALITY_GATE=0`, but try to avoid this. Treat failures as an opportunity to improve Ananke.
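
One way an install step could honor the bypass is sketched below. This is an assumption about the behavior, not the project's actual install script: `run_quality_gate_unless_bypassed` is a hypothetical helper, and treating an unset variable as "enforce" is a guess.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: this helper is hypothetical, and defaulting an
# unset ANANKE_ENFORCE_QUALITY_GATE to "enforce" is an assumption.

# Usage: run_quality_gate_unless_bypassed <gate-command> [args...]
# Runs the gate command and returns its exit status, unless the bypass is set.
run_quality_gate_unless_bypassed() {
  if [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" = "0" ]; then
    echo "quality gate bypassed (ANANKE_ENFORCE_QUALITY_GATE=0)" >&2
    return 0
  fi
  "$@"
}
```

In this sketch an installer would call `run_quality_gate_unless_bypassed ./scripts/quality_gate.sh` and abort on a non-zero exit. The bypass prints to stderr so that a skipped gate is never silent.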