# ananke
Ananke is the host-side recovery orchestrator for Titan power events.
It runs outside Kubernetes (as a systemd unit on the host), so it can:
- shut the cluster down gracefully before runtime gets dangerous
- bootstrap the cluster after power is restored
- break known startup deadlocks (including Flux + in-cluster Gitea coupling)
- verify real service availability before declaring startup complete
The goal is not clever automation. The goal is boring, repeatable recovery.
## Why `ananke`
In Greek myth, **Ananke** is the personification of necessity and inevitability.
That is the exact constraint we operate under during outages and drills.
Power-domain names in this lab follow the same mythological theme:
- `Statera` UPS: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros` UPS: all other nodes
## Operating model (non-negotiable)
- Ananke does **cluster orchestration**, not host power control.
- Shutdown defaults to `cluster-only` and should remain that way for normal drills.
- Physical outages can cut host power on their own; Ananke's job is clean state transitions.
Flux source of truth remains `titan-iac.git`.
Ananke's own repo (`ananke.git`) is software only; it is not the desired-state cluster config repo.
## Breakglass reminder
Vault breakglass is available through a remote Magic Mirror path.
If standard unseal retrieval fails, use `startup.vault_unseal_breakglass_command`.
## What "startup complete" means
Startup is complete only after all required gates pass:
- inventory mapping is valid
- expected SSH nodes are reachable/authenticated (minus explicit ignores)
- Flux source drift guard passes (expected URL + branch)
- required Flux kustomizations are healthy
- workload convergence is healthy
- ingress checklist passes
- service checklist passes (internal + externally exposed)
- critical endpoint checks pass
- stability soak passes with no regressions
If manual intervention is needed during a drill, that is treated as an Ananke gap and must be encoded back into Ananke logic.
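During a drill it can help to watch these gates converge instead of rerunning status by hand. A minimal sketch, assuming the `--json` output carries a per-gate list (the `.gates[]` shape below is hypothetical; inspect the real schema first with `ananke status --json | jq .`):
```bash
# Poll gate results every 10s. ".gates[]", ".name", ".status" are
# illustrative field names, not a documented schema.
watch -n 10 \
  'sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml --json \
     | jq -r ".gates[] | [.name, .status] | @tsv"'
```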
## Status and reports
Live status:
- `ananke status --config /etc/ananke/ananke.yaml`
- `ananke status --config /etc/ananke/ananke.yaml --json`
Artifacts:
- `/var/lib/ananke/startup-progress.json` (live run progress)
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/lib/ananke/reports/*.json` (historical per-run reports)
- `/var/lib/ananke/runs.json` (timing history)
- `/var/lib/ananke/update-last.env` (latest self-update result)
- `/var/log/ananke/update.log` (self-update execution log)
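These artifacts are plain JSON, so one-off questions are easy to answer with `jq`. A sketch against the timing history, assuming `runs.json` holds an array of run records (the field names below are hypothetical):
```bash
# Print the last five runs. Field names are illustrative; check the real
# shape first with: sudo jq '.[-1]' /var/lib/ananke/runs.json
sudo jq -r '.[-5:][] | [.started_at, .kind, .duration_seconds] | @tsv' \
  /var/lib/ananke/runs.json
```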
## Quick commands
From `titan-db`:
```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
From `titan-24` (`tethys` peer):
```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```
Systemd control:
```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
sudo cat /var/lib/ananke/update-last.env
sudo tail -n 200 /var/log/ananke/update.log
```
## Config
Primary config path:
- `/etc/ananke/ananke.yaml`
Keep these fields accurate:
- `expected_flux_source_url`
- `expected_flux_branch`
- `startup.service_checklist_explicit_only`
- `startup.service_checklist`
- `startup.critical_service_endpoints`
- `startup.require_ingress_checklist`
- `startup.require_node_inventory_reachability`
- `startup.node_inventory_reachability_required_nodes`
- `startup.node_ssh_auth_required_nodes`
- `startup.flux_health_required_kustomizations`
- `startup.workload_convergence_required_namespaces`
- `startup.ignore_unavailable_nodes`
- `coordination.role`
- `coordination.peer_hosts`
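After editing the file, a quick read-back catches typos before the next drill. A sketch using `yq` (v4 syntax assumed; the keys are the ones listed above):
```bash
# Spot-check the drift-guard and coordination fields.
yq '.expected_flux_source_url, .expected_flux_branch,
    .coordination.role, .coordination.peer_hosts' /etc/ananke/ananke.yaml
```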
## Quality gate
Top-level quality/testing module:
- `testing/`
Deployment gate script:
- `scripts/quality_gate.sh`
Gate order:
1. docs contract checks
2. split test-module contract (`cmd/` + `internal/` cannot grow new in-tree `_test.go` files)
3. naming + LOC hygiene checks
4. pedantic lint
5. per-file coverage gate (95% minimum)
Current migration rule:
- keep new tests in the top-level `testing/` module
- legacy in-tree `_test.go` files are temporarily grandfathered through `testing/hygiene/in_tree_test_allowlist.txt` until they are migrated safely
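To see what still needs migrating, diff the live tree against the allowlist. A sketch, assuming the allowlist stores repo-relative paths, one per line:
```bash
# In-tree _test.go files not yet grandfathered; anything printed here
# will fail the split test-module contract.
comm -23 \
  <(find cmd internal -name '*_test.go' | sort) \
  <(sort testing/hygiene/in_tree_test_allowlist.txt)
```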
Installer behavior:
- `scripts/install.sh` runs the quality gate by default
- override only for emergency break/fix: `ANANKE_ENFORCE_QUALITY_GATE=0`
- host quality runs keep writing local `ananke_quality_gate_*` metrics and also publish `platform_quality_gate_runs_total{suite="ananke",status=*}` to Pushgateway for shared Grafana panels
- override the Pushgateway target when running outside cluster DNS: `ANANKE_QUALITY_PUSHGATEWAY_URL=http://... ./scripts/quality_gate.sh`
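For reference, the push itself is just the standard Pushgateway text API. A minimal sketch of the wire format (the job name and label values here are illustrative; the authoritative push lives in `scripts/quality_gate.sh`):
```bash
# POST one counter sample under an illustrative job name.
cat <<'EOF' | curl --silent --data-binary @- \
  "${ANANKE_QUALITY_PUSHGATEWAY_URL}/metrics/job/ananke_quality_gate"
# TYPE platform_quality_gate_runs_total counter
platform_quality_gate_runs_total{suite="ananke",status="pass"} 1
EOF
```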
## Growing with the lab
When adding nodes or services:
1. Update inventory and node mapping in config.
2. Keep the explicit service checklist focused on the core services that must come back during an outage.
3. Keep `*_required_*` startup scopes aligned with the same core set so optional stacks do not block bootstrap.
4. Add/adjust ingress expectations for exposed services.
5. Use temporary ignores only when truly intentional, then remove them.
6. Run `scripts/quality_gate.sh` before host deployment (see the sketch after this list).
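A hypothetical pre-deploy sequence tying the steps together, reusing commands shown earlier in this README:
```bash
# After updating /etc/ananke/ananke.yaml for the new node or service:
./scripts/quality_gate.sh                                           # gate before deploy
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml  # confirm inventory and gates
```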
Recovery quality should improve over time: every drill should reduce manual work in the next drill.