# ananke

Ananke is the host-side recovery orchestrator for Titan power events.

It runs outside Kubernetes (systemd on host), so it can:
- shut the cluster down gracefully before runtime gets dangerous
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete

The goal is not clever automation. The goal is boring, repeatable recovery.

## Why `ananke`

In Greek myth, **Ananke** is inevitability and necessity.
That is the exact constraint we operate under during outages and drills.

Power-domain names in this lab align with that naming:
- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros` UPS: all other nodes

## Operating model (non-negotiable)

- Ananke does **cluster orchestration**, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.

Flux source of truth remains `titan-iac.git`.
Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.

## Breakglass reminder

Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.

## What "startup complete" means

Startup is complete only after all required gates pass:
- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions

If manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.

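The "all gates must pass" rule can be pictured as a fail-fast loop. The following is a minimal sketch only: the `run_gates` helper and the `gate_*` placeholder functions are hypothetical names for illustration and do not reflect how Ananke implements its gates.

```bash
#!/usr/bin/env bash
# Sketch only: gate names and gate_* functions are hypothetical placeholders,
# not Ananke's real implementation.

# Run each named gate in order and stop at the first failure.
# A gate is a shell function named gate_<name> that returns 0 on success.
run_gates() {
  local gate
  for gate in "$@"; do
    if "gate_${gate}"; then
      printf 'PASS %s\n' "$gate"
    else
      printf 'FAIL %s\n' "$gate"
      return 1
    fi
  done
  printf 'startup complete\n'
}

# Placeholder gates that always succeed, for illustration.
gate_inventory() { return 0; }
gate_ssh_reachability() { return 0; }
gate_flux_source() { return 0; }

# Example: run_gates inventory ssh_reachability flux_source
```

The point of the fail-fast shape is that "startup complete" is only ever printed after every gate has returned success; a single failing gate ends the run with a non-zero status.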
## Status and reports

Live status:
- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`

Artifacts:
- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)

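When a run looks stuck, a quick first check is whether the artifact files above exist at all. A small sketch, assuming only the file names listed above; `check_artifacts` is a hypothetical helper, not part of the `ananke` CLI.

```bash
# Sketch only: check_artifacts is a hypothetical helper, not an ananke command.
# Reports, for each fixed-name artifact, whether it exists and is non-empty.
# The state directory is a parameter so the check can be pointed at a copy.
check_artifacts() {
  local dir="${1:-/var/lib/ananke}" name
  for name in startup-progress.json last-startup-report.json \
              last-shutdown-report.json runs.json; do
    if [ -s "$dir/$name" ]; then
      printf 'present %s\n' "$name"
    else
      printf 'missing %s\n' "$name"
    fi
  done
}

# Example: check_artifacts            # inspects /var/lib/ananke
# Example: check_artifacts /tmp/copy  # inspects a copied state directory
```

If `jq` is installed, `jq . /var/lib/ananke/last-startup-report.json` pretty-prints a report. The report fields are not documented here, so the sketch deliberately assumes no specific keys.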
## Quick commands

From `titan-db`:

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

From `titan-24` (`tethys` peer):

```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Systemd control:

```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```

## Config

Primary config path:
- `/etc/ananke/ananke.yaml`

Keep these fields accurate:
- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`

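The first two fields matter because the Flux source drift guard compares them against what the cluster actually reports. A minimal sketch of such a comparison follows; `flux_source_matches` is a hypothetical helper, and the commented `kubectl` lines assume a Flux `GitRepository` named `flux-system` in the `flux-system` namespace, which may not match this lab.

```bash
# Sketch only: flux_source_matches is a hypothetical helper, not Ananke code.
# Arguments: expected_url expected_branch actual_url actual_branch
# Returns 0 when both match; prints the drift to stderr and returns 1 otherwise.
flux_source_matches() {
  local expected_url="$1" expected_branch="$2" actual_url="$3" actual_branch="$4"
  if [ "$expected_url" != "$actual_url" ]; then
    printf 'url drift: expected %s, got %s\n' "$expected_url" "$actual_url" >&2
    return 1
  fi
  if [ "$expected_branch" != "$actual_branch" ]; then
    printf 'branch drift: expected %s, got %s\n' "$expected_branch" "$actual_branch" >&2
    return 1
  fi
  return 0
}

# Assumed way to read the live values (object name and namespace are guesses):
# actual_url="$(kubectl -n flux-system get gitrepository flux-system -o jsonpath='{.spec.url}')"
# actual_branch="$(kubectl -n flux-system get gitrepository flux-system -o jsonpath='{.spec.ref.branch}')"
# flux_source_matches "$expected_url" "$expected_branch" "$actual_url" "$actual_branch"
```

Keeping the comparison separate from the lookup makes the guard easy to exercise without a running cluster.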
## Quality gate

Top-level quality/testing module:
- `testing/`

Deployment gate script:
- `scripts/quality_gate.sh`

Gate order:
1. docs contract checks
2. naming + LOC hygiene checks
3. pedantic lint
4. per-file coverage gate (95% minimum)

Installer behavior:
- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`

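One way an installer could honor that override is sketched below. This is an illustration under assumptions, not the contents of `scripts/install.sh`: the helper names are hypothetical, and it assumes the variable is read from the environment and that only the literal value `0` disables the gate.

```bash
# Sketch only: helper names are hypothetical and the real install.sh may differ.
# Assumption: the gate stays on unless the variable is exactly "0".
quality_gate_enabled() {
  [ "${ANANKE_ENFORCE_QUALITY_GATE:-1}" != "0" ]
}

# Run the gate command (default: scripts/quality_gate.sh) when enforcement is on.
# When enforcement is off, say so on stderr so the skip is visible in logs.
maybe_run_quality_gate() {
  if quality_gate_enabled; then
    "${1:-scripts/quality_gate.sh}"
  else
    printf 'quality gate skipped (ANANKE_ENFORCE_QUALITY_GATE=0)\n' >&2
  fi
}
```

Defaulting to "on" when the variable is unset keeps the safe path the easy one; skipping requires an explicit, greppable setting.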
## Growing with the lab

When adding nodes or services:
1. Update inventory and node mapping in config.
2. Add/adjust service checklist entries for anything user-facing or critical.
3. Add/adjust ingress expectations for exposed services.
4. Use temporary ignores only when truly intentional, then remove them.
5. Run `scripts/quality_gate.sh` before host deployment.

Recovery quality should improve over time: every drill should reduce manual work in the next drill.