docs: shorten ananke README
parent 85c0741b3e
commit 57610c623a
160
README.md
@@ -1,71 +1,32 @@
|
|||||||
# ananke
|
# ananke
|
||||||
|
|
||||||
Ananke is the host-side recovery orchestrator for Titan power events.
|
Ananke is the thing that gets Atlas back on its feet after power trouble.
|
||||||
|
|
||||||
It runs outside Kubernetes (systemd on host), so it can:
|
It runs on the host, outside Kubernetes, because some failures start before the
|
||||||
- shut the cluster down gracefully before runtime gets dangerous
|
cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core
|
||||||
- bootstrap the cluster after power is restored
|
workloads, ingresses, and service checks back into a known-good state.
|
||||||
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
|
|
||||||
- verify real service availability before declaring startup complete
|
|
||||||
|
|
||||||
The goal is not clever automation. The goal is boring, repeatable recovery.
|
It is deliberately boring software: do the checks, repair the known deadlocks,
|
||||||
|
and stop loudly when a human needs to touch hardware.
|
||||||
|
|
||||||
## Why `ananke`
|
## How it works
|
||||||
|
|
||||||
In Greek myth, **Ananke** is inevitability and necessity.
|
Ananke reads `/etc/ananke/ananke.yaml`, then walks the cluster through startup or
|
||||||
That is the exact constraint we operate under during outages and drills.
|
shutdown gates:
|
||||||
|
|
||||||
Power-domain names in this lab align with that naming:
|
- confirm the expected nodes and SSH access
|
||||||
- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
|
- check that Flux is looking at the right repo and branch
|
||||||
- `Pyrphoros` UPS: all other nodes
|
- wait for required Flux kustomizations and namespaces
|
||||||
|
- repair known startup traps, including Harbor/Gitea/Flux coupling
|
||||||
|
- run ingress, service, endpoint, and soak checks before calling startup done
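As a rough illustration of the gate sequence listed above, the following sketch runs a series of checks in order and stops at the first failure. The `gate_*` function names and the `run_gates` helper are hypothetical placeholders for this example only; they are not Ananke's actual implementation, which this README does not show.

```bash
# Hypothetical sketch: run startup gates in order, stop loudly on the first failure.
# Each gate_* function is a stand-in that always succeeds; real checks would go here.
gate_nodes_and_ssh()  { return 0; }
gate_flux_source()    { return 0; }
gate_flux_health()    { return 0; }
gate_repair_traps()   { return 0; }
gate_service_checks() { return 0; }

run_gates() {
  for gate in "$@"; do
    if "$gate"; then
      echo "PASS ${gate}"
    else
      # A failed gate ends the run so a human can step in.
      echo "FAIL ${gate}: stopping, manual action needed" >&2
      return 1
    fi
  done
  echo "startup complete"
}

run_gates gate_nodes_and_ssh gate_flux_source gate_flux_health \
  gate_repair_traps gate_service_checks
```

Keeping each gate as a separate function that returns an exit status means a later gate never runs against a cluster that failed an earlier one.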
|
||||||
|
|
||||||
## Operating model (non-negotiable)
|
Recovery cordons are now treated as short leases. If Ananke cordons a node to
|
||||||
|
repair something, it must either clear the cordon within the configured window
|
||||||
|
or mark the node for manual action. The default window is one hour.
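The lease rule can be sketched as a small decision function. This is a minimal illustration, assuming timestamps are Unix epoch seconds; the function name `cordon_lease_action`, its output strings, and the choice to treat an age exactly equal to the window as expired are all assumptions made for the example, not Ananke's real behavior.

```bash
# Hypothetical sketch of the cordon lease rule.
# Arguments: time the cordon was placed, current time, optional window in seconds.
# The 3600-second default mirrors the one-hour default window described above.
cordon_lease_action() {
  cordoned_at="$1"
  now="$2"
  window="${3:-3600}"
  age=$(( now - cordoned_at ))
  if [ "$age" -lt "$window" ]; then
    echo "within-lease"        # repair may continue; the cordon still has to be cleared
  else
    echo "mark-manual-action"  # lease ran out without the cordon being cleared
  fi
}

cordon_lease_action 1000 2000      # cordon is 1000s old, default window
cordon_lease_action 1000 5000      # cordon is 4000s old, default window
cordon_lease_action 1000 1500 300  # cordon is 500s old, 5-minute window
```

With these inputs the three calls print `within-lease`, `mark-manual-action`, and `mark-manual-action` in that order.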
|
||||||
|
|
||||||
- Ananke does **cluster orchestration**, not host power control.
|
## Daily commands
|
||||||
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
|
|
||||||
- Physical outages can cut host power themselves; Ananke’s job is clean state transitions.
|
|
||||||
|
|
||||||
Flux source of truth remains `titan-iac.git`.
|
Run these on `titan-db` unless you know you are using the `tethys` peer:
|
||||||
Ananke’s own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
|
|
||||||
|
|
||||||
## Breakglass reminder
|
|
||||||
|
|
||||||
Vault breakglass is available through a remote Magic Mirror path.
|
|
||||||
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
|
|
||||||
|
|
||||||
## What "startup complete" means
|
|
||||||
|
|
||||||
Startup is complete only after all required gates pass:
|
|
||||||
- inventory mapping is valid
|
|
||||||
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
|
|
||||||
- Flux source drift guard passes (expected URL + branch)
|
|
||||||
- required Flux kustomizations are healthy
|
|
||||||
- workload convergence is healthy
|
|
||||||
- ingress checklist passes
|
|
||||||
- service checklist passes (internal + externally exposed)
|
|
||||||
- critical endpoint checks pass
|
|
||||||
- stability soak passes with no regressions
|
|
||||||
|
|
||||||
If a manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
|
|
||||||
|
|
||||||
## Status and reports
|
|
||||||
|
|
||||||
Live status:
|
|
||||||
- `ananke status --config /etc/ananke/ananke.yaml`
|
|
||||||
- `ananke status --config /etc/ananke/ananke.yaml --json`
|
|
||||||
|
|
||||||
Artifacts:
|
|
||||||
- `/var/lib/ananke/startup-progress.json` (live run progress)
|
|
||||||
- `/var/lib/ananke/last-startup-report.json`
|
|
||||||
- `/var/lib/ananke/last-shutdown-report.json`
|
|
||||||
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
|
|
||||||
- `/var/lib/ananke/runs.json` (timing history)
|
|
||||||
- `/var/lib/ananke/update-last.env` (latest self-update result)
|
|
||||||
- `/var/log/ananke/update.log` (self-update execution log)
|
|
||||||
|
|
||||||
## Quick commands
|
|
||||||
|
|
||||||
From `titan-db`:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
|
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
|
||||||
@@ -73,76 +34,21 @@ sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --
|
|||||||
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
|
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
|
||||||
```
|
```
|
||||||
|
|
||||||
From `titan-24` (`tethys` peer):
|
Useful files:
|
||||||
|
|
||||||
|
- `/var/lib/ananke/startup-progress.json`
|
||||||
|
- `/var/lib/ananke/last-startup-report.json`
|
||||||
|
- `/var/lib/ananke/last-shutdown-report.json`
|
||||||
|
- `/var/log/ananke/update.log`
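A quick way to see which of the files listed above exist on a host is a small loop like the one below. The `report_files` helper is a hypothetical name used for this sketch; the paths are the ones listed under "Useful files", and reading them may need `sudo` depending on host permissions.

```bash
# Hypothetical helper: report whether each given file exists and is non-empty.
report_files() {
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "present ${f}"
    else
      echo "missing ${f}"
    fi
  done
}

report_files \
  /var/lib/ananke/startup-progress.json \
  /var/lib/ananke/last-startup-report.json \
  /var/lib/ananke/last-shutdown-report.json \
  /var/log/ananke/update.log
```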
|
||||||
|
|
||||||
|
## Development
|
||||||
|
|
||||||
|
Run the full local check before installing:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
|
./scripts/quality_gate.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
Systemd control:
|
Emergency installs can bypass the gate with
|
||||||
|
`ANANKE_ENFORCE_QUALITY_GATE=0`, but that should stay rare. If a recovery drill
|
||||||
```bash
|
needed manual work, the follow-up belongs in Ananke so the next one is cleaner.
|
||||||
sudo systemctl status ananke.service
|
|
||||||
sudo systemctl start ananke-bootstrap.service
|
|
||||||
sudo systemctl start ananke-update.service
|
|
||||||
sudo cat /var/lib/ananke/update-last.env
|
|
||||||
sudo tail -n 200 /var/log/ananke/update.log
|
|
||||||
```
|
|
||||||
|
|
||||||
## Config
|
|
||||||
|
|
||||||
Primary config path:
|
|
||||||
- `/etc/ananke/ananke.yaml`
|
|
||||||
|
|
||||||
Keep these fields accurate:
|
|
||||||
- `expected_flux_source_url`
|
|
||||||
- `expected_flux_branch`
|
|
||||||
- `startup.service_checklist_explicit_only`
|
|
||||||
- `startup.service_checklist`
|
|
||||||
- `startup.critical_service_endpoints`
|
|
||||||
- `startup.require_ingress_checklist`
|
|
||||||
- `startup.require_node_inventory_reachability`
|
|
||||||
- `startup.node_inventory_reachability_required_nodes`
|
|
||||||
- `startup.node_ssh_auth_required_nodes`
|
|
||||||
- `startup.flux_health_required_kustomizations`
|
|
||||||
- `startup.workload_convergence_required_namespaces`
|
|
||||||
- `startup.ignore_unavailable_nodes`
|
|
||||||
- `coordination.role`
|
|
||||||
- `coordination.peer_hosts`
|
|
||||||
|
|
||||||
## Quality gate
|
|
||||||
|
|
||||||
Top-level quality/testing module:
|
|
||||||
- `testing/`
|
|
||||||
|
|
||||||
Deployment gate script:
|
|
||||||
- `scripts/quality_gate.sh`
|
|
||||||
|
|
||||||
Gate order:
|
|
||||||
1. docs contract checks
|
|
||||||
2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
|
|
||||||
3. naming + LOC hygiene checks
|
|
||||||
4. pedantic lint
|
|
||||||
5. per-file coverage gate (95% minimum)
|
|
||||||
|
|
||||||
Current migration rule:
|
|
||||||
- keep new tests in the top-level `testing/` module
|
|
||||||
- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely
|
|
||||||
|
|
||||||
Installer behavior:
|
|
||||||
- `scripts/install.sh` runs the quality gate by default
|
|
||||||
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
|
|
||||||
- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
|
|
||||||
- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`
|
|
||||||
|
|
||||||
## Growing with the lab
|
|
||||||
|
|
||||||
When adding nodes or services:
|
|
||||||
1. Update inventory and node mapping in config.
|
|
||||||
2. Keep the explicit service checklist focused on the core services that must come back during an outage.
|
|
||||||
3. Keep `*_required_*` startup scopes aligned with the same core set so optional stacks do not block bootstrap.
|
|
||||||
4. Add/adjust ingress expectations for exposed services.
|
|
||||||
5. Use temporary ignores only when truly intentional, then remove them.
|
|
||||||
6. Run `scripts/quality_gate.sh` before host deployment.
|
|
||||||
|
|
||||||
Recovery quality should improve over time: every drill should reduce manual work in the next drill.
|
|
||||||
|
|||||||