# ananke

Ananke gets Atlas back on its feet after power trouble.

It runs on both `tethys` (in cluster, as `titan-24`) and `titan-db` (out of cluster). On both hosts it runs outside Kubernetes, at the host level, because some failures start before the cluster is healthy enough to fix itself. Its job is to bring nodes, Flux, core workloads, ingresses, and service checks back into a known-good state.

It aspires to be boring software: do the checks, repair the known deadlocks, and stomp loudly when it exhausts its remedy library.

## How it works

Ananke walks the cluster through startup or shutdown gates:

- confirm the expected nodes and SSH access
- check that Flux is looking at the right repo and branch
- wait for required Flux kustomizations and namespaces
- repair known startup traps, including Harbor/Gitea/Flux coupling
- run ingress, service, endpoint, and soak checks before calling startup done
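The gate walk above can be pictured as an ordered list of checks that stops at the first failure. The following is a minimal sketch only: `run_gates` and the `gate_*` functions are hypothetical placeholders, not Ananke's real gate names or implementation.

```bash
# Sketch only: gate names and bodies are hypothetical placeholders.

# Run each gate in order and stop at the first one that fails.
run_gates() {
  for gate in "$@"; do
    echo "gate: $gate"
    if ! "$gate"; then
      echo "gate failed: $gate" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# Placeholder gates. Real checks would talk to SSH, Flux, and the Kubernetes API.
gate_nodes_reachable() { true; }
gate_flux_source()     { true; }
gate_flux_ready()      { true; }
gate_soak_checks()     { true; }

run_gates gate_nodes_reachable gate_flux_source gate_flux_ready gate_soak_checks
```

Stopping at the first failed gate keeps later checks from reporting noise caused by an earlier, unresolved problem.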

Recovery cordons get short, one-hour leases. If Ananke cordons a node to repair something, it must either clear the cordon within the configured window or mark the node for manual action.
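As a rough illustration of the lease rule, the check reduces to comparing the cordon's age against the lease length. This is a sketch under assumptions: `cordon_lease_expired` is a hypothetical helper, and the 3600-second default simply mirrors the one-hour lease mentioned above; the real window comes from configuration.

```bash
# Sketch only: hypothetical helper, not Ananke's implementation.
# $1 = time the cordon was placed (epoch seconds)
# $2 = current time (epoch seconds)
# $3 = lease length in seconds (assumed default: 3600)
# Returns 0 when the lease has expired, 1 while it is still active.
cordon_lease_expired() {
  [ $(( $2 - $1 )) -ge "${3:-3600}" ]
}

now="$(date +%s)"
cordoned_at=$(( now - 4000 ))   # pretend the cordon is a little over an hour old

if cordon_lease_expired "$cordoned_at" "$now"; then
  echo "lease expired: clear the cordon or mark the node for manual action"
else
  echo "lease still active"
fi
```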

The following are notes for future Brad.

## Bring-up dependencies

Ananke should be one of the first things working. It does not need Harbor, Gitea, Longhorn, Grafana, or the apps to be healthy before it starts; those are often the mess it is there to sort out.

It does need:

- an Ananke host that came up on its own: usually `titan-db`, with the `tethys`/`titan-24` peer path as the backup
- `/etc/ananke/ananke.yaml`, the Ananke SSH key, and enough host config to reach nodes on the Atlas SSH port
- Kubernetes API access once the control plane is answering; before that it can only do host-side checks
- Flux CRDs/controllers and the `titan-iac` source once the API is up, because most startup gates are Flux-shaped
- basic node hygiene that Ananke cannot fake forever: SSH, sudo for managed repairs, sane clocks, and Longhorn host packages like `cryptsetup`, `open-iscsi`, `dmsetup`, and `nfs-common`
- NUT/UPS access if this is making real shutdown decisions instead of just doing startup recovery
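A quick host-side preflight for some of the items above might look like the sketch below. `missing_commands` is a hypothetical helper, not part of Ananke. The check assumes Debian-family hosts, where `iscsiadm` ships with `open-iscsi` and `mount.nfs` ships with `nfs-common`; run it as root (or with `/sbin` and `/usr/sbin` on `PATH`) so system binaries are found.

```bash
# Sketch only: hypothetical helper, not part of Ananke.
# Print every command from the argument list that is not found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# One representative binary per Longhorn host package listed above.
missing="$(missing_commands cryptsetup dmsetup iscsiadm mount.nfs)"
if [ -n "$missing" ]; then
  echo "missing host tools:" $missing
else
  echo "host tools present"
fi

# The config file must be readable by whoever runs Ananke.
[ -r /etc/ananke/ananke.yaml ] || echo "config not readable: /etc/ananke/ananke.yaml"
```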

If this is a total bring-up, start Ananke after the host boots and before waiting on applications. If Ananke is not running, Atlas is missing the thing that knows the order of operations.

## Daily commands

```bash
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Host files:

- `/var/lib/ananke/startup-progress.json`
- `/var/lib/ananke/last-startup-report.json`
- `/var/lib/ananke/last-shutdown-report.json`
- `/var/log/ananke/update.log`
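To see at a glance which of these files exist on a host, a small loop is enough. This is a sketch only; `file_state` is a hypothetical helper, and it makes no assumption about the contents of the reports.

```bash
# Sketch only: hypothetical helper, not part of Ananke.
# Report whether a file is present with content, present but empty, or missing.
file_state() {
  if [ -s "$1" ]; then
    echo "present  $1"
  elif [ -e "$1" ]; then
    echo "empty    $1"
  else
    echo "missing  $1"
  fi
}

for f in \
  /var/lib/ananke/startup-progress.json \
  /var/lib/ananke/last-startup-report.json \
  /var/lib/ananke/last-shutdown-report.json \
  /var/log/ananke/update.log
do
  file_state "$f"
done
```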

## Development

Run the local quality gate before installing:

```bash
./scripts/quality_gate.sh
```

Emergency installs can bypass the gate with `ANANKE_ENFORCE_QUALITY_GATE=0`, but try to avoid this. Treat gate failures as a chance to improve Ananke.