# ananke

`ananke` is the host-side power + bootstrap orchestrator for Titan.

It runs outside Kubernetes (systemd on host), so it can:
- shut the cluster down gracefully before battery/runtime redlines
- bring the cluster back after power returns
- recover common Flux/Kustomize startup deadlocks
- validate service health from the outside before declaring startup done

## Why `ananke`

I wanted a name that fits Titan/mythology, but also describes what this service actually does.

In Greek myth, **Ananke** is inevitability/necessity. That matches this tool: when power events happen, graceful sequencing is not optional.

UPS names in this cluster are also part of the story:
- `Statera`: powers `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros`: powers all other nodes

## Breakglass reminder

Vault unseal breakglass is wired for remote retrieval (magic mirror host). If local key retrieval fails, Ananke can use the configured breakglass command.

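The local-then-breakglass fallback can be pictured roughly as follows. This is a minimal sketch, not Ananke's actual code: `fetch_unseal_key` and both command strings are hypothetical placeholders, and the real retrieval commands live in the config.

```bash
#!/usr/bin/env bash
# Sketch only: function name and both commands are hypothetical placeholders.

# fetch_unseal_key LOCAL_CMD BREAKGLASS_CMD
# Prints the key on stdout and a one-line source note on stderr.
fetch_unseal_key() {
  local local_cmd=$1 breakglass_cmd=$2 key

  # Try the local source first; require exit 0 and non-empty output.
  if key=$(bash -c "$local_cmd" 2>/dev/null) && [ -n "$key" ]; then
    echo "source=local" >&2
    printf '%s\n' "$key"
    return 0
  fi

  # Local retrieval failed: fall back to the breakglass command.
  if key=$(bash -c "$breakglass_cmd" 2>/dev/null) && [ -n "$key" ]; then
    echo "source=breakglass" >&2
    printf '%s\n' "$key"
    return 0
  fi

  echo "unseal key unavailable from local and breakglass sources" >&2
  return 1
}
```

For example, `fetch_unseal_key 'false' 'echo demo-key'` prints `demo-key`, because the local command fails and the fallback succeeds.
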
## What “startup complete” means now

Ananke does **not** stop at “Flux says Ready”. Startup only completes when all configured gates pass:
- Flux source drift guard passes (`expected_flux_source_url` + branch expectation)
- Flux kustomizations are healthy
- controller convergence is healthy (deployments/statefulsets/daemonsets)
- external service checklist passes (Gitea, Grafana, Keycloak OIDC, Harbor registry auth challenge, Longhorn auth redirect)
- stability soak window passes (no regressions, no CrashLoop/ImagePull failures)

If any gate fails, startup is blocked with a concrete reason.

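One plausible shape for the gate logic above is a list of checks that stops at the first failure and names it. The sketch below is illustrative only: the `gate_*` functions and `run_gates` are hypothetical names, every gate body is a stub, and evaluating the gates strictly in order is an assumption rather than documented Ananke behavior.

```bash
#!/usr/bin/env bash
# Sketch only: gate names mirror the list above; every body is a stub.

gate_flux_source()    { return 0; }  # source URL + branch match expectations
gate_kustomizations() { return 0; }  # Flux kustomizations healthy
gate_controllers()    { return 0; }  # deployments/statefulsets/daemonsets converged
gate_services()       { return 0; }  # external service checklist
gate_stability_soak() { return 0; }  # no regressions during the soak window

# run_gates GATE...
# Runs the gates in the order given; stops at the first failure and names it.
run_gates() {
  local gate
  for gate in "$@"; do
    if ! "$gate"; then
      echo "startup blocked: ${gate} failed"
      return 1
    fi
  done
  echo "startup complete"
}
```

Calling `run_gates gate_flux_source gate_kustomizations gate_controllers gate_services gate_stability_soak` prints `startup complete` only when every gate returns 0.
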
## Command quick sheet

From `titan-db` (coordinator):

```bash
# Report status
sudo /usr/local/bin/ananke status --config /etc/ananke/ananke.yaml
# Startup with the Flux branch forced to main
sudo /usr/local/bin/ananke startup --config /etc/ananke/ananke.yaml --execute --force-flux-branch main
# Graceful maintenance shutdown: cluster services only, no host poweroff
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
# Emergency power shutdown: host poweroff path, skipping drain and etcd snapshot
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason emergency-power --mode poweroff --skip-drain --skip-etcd-snapshot
```

From `titan-24` (`tethys` peer):

```bash
sudo /usr/local/bin/ananke shutdown --config /etc/ananke/ananke.yaml --execute --reason graceful-maintenance --mode cluster-only
```

Systemd:

```bash
sudo systemctl status ananke.service
sudo systemctl start ananke-bootstrap.service
sudo systemctl start ananke-update.service
```

## Shutdown modes (explicit)

`ananke shutdown` now supports explicit mode selection:
- `--mode config`: use config default (`shutdown.poweroff_enabled`)
- `--mode cluster-only`: stop cluster services only (no host poweroff)
- `--mode poweroff`: include host poweroff path

This removes ambiguity during drills.

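The three modes reduce to a small decision table. A minimal sketch, assuming `shutdown.poweroff_enabled` is a boolean where `true` selects the poweroff path; `resolve_shutdown_mode` is a hypothetical helper for illustration, not part of the CLI.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not Ananke's implementation.

# resolve_shutdown_mode MODE POWEROFF_ENABLED
# Prints the effective mode: "cluster-only" or "poweroff".
resolve_shutdown_mode() {
  local mode=$1 poweroff_enabled=$2
  case "$mode" in
    cluster-only|poweroff)
      # An explicit mode wins regardless of the config default.
      echo "$mode"
      ;;
    config)
      # Defer to the config default (assumed boolean).
      if [ "$poweroff_enabled" = "true" ]; then
        echo "poweroff"
      else
        echo "cluster-only"
      fi
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2
      ;;
  esac
}
```

Passing `--mode cluster-only` or `--mode poweroff` during a drill means the outcome does not depend on whatever the config default is on that host.
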
## Config file

Primary path:
- `/etc/ananke/ananke.yaml`

Core settings to keep accurate:
- `expected_flux_branch`
- `expected_flux_source_url`
- `startup.service_checklist`
- `startup.service_checklist_stability_seconds`
- `startup.ignore_unavailable_nodes` (for planned temporary node outages)
- `coordination.role`, `coordination.peer_hosts`

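The two `expected_flux_*` settings matter because a stale value makes the source drift guard fail. A rough sketch of such a comparison follows, under two assumptions: the Flux source is a `GitRepository` named `flux-system` in the `flux-system` namespace (the Flux bootstrap default), and the function names are hypothetical. Ananke's own check may work differently.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not Ananke's implementation.

# check_flux_source EXPECTED_URL EXPECTED_BRANCH ACTUAL_URL ACTUAL_BRANCH
# Pure comparison; reports the first mismatch it finds.
check_flux_source() {
  local want_url=$1 want_branch=$2 got_url=$3 got_branch=$4
  if [ "$want_url" != "$got_url" ]; then
    echo "drift: source url is '${got_url}', expected '${want_url}'"
    return 1
  fi
  if [ "$want_branch" != "$got_branch" ]; then
    echo "drift: branch is '${got_branch}', expected '${want_branch}'"
    return 1
  fi
  echo "flux source matches"
}

# live_flux_field JSONPATH
# Reads one field from the live GitRepository (resource name is assumed).
live_flux_field() {
  kubectl -n flux-system get gitrepository flux-system -o "jsonpath={$1}"
}

# Example on a host with cluster access (not run here):
#   check_flux_source "$EXPECTED_URL" "$EXPECTED_BRANCH" \
#     "$(live_flux_field .spec.url)" "$(live_flux_field .spec.ref.branch)"
```

Keeping the comparison separate from the `kubectl` lookup makes the logic easy to exercise without a running cluster.
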
## Install / update

```bash
sudo ./scripts/install.sh
```

Installer behavior:
- builds and installs `/usr/local/bin/ananke`
- installs `ananke*.service` units
- migrates and enforces current `ananke` config/state paths

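After an install, a quick sanity check of the results listed above might look like this. `check_install` is a hypothetical helper, not something the installer ships.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical post-install check.

# check_install BIN_PATH
# Confirms the binary exists and is executable.
check_install() {
  local bin=$1
  if [ ! -x "$bin" ]; then
    echo "missing or not executable: $bin"
    return 1
  fi
  echo "binary ok: $bin"
}

# Example on a host with systemd (not run here):
#   check_install /usr/local/bin/ananke
#   systemctl list-unit-files 'ananke*.service'
```
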
## Notes

- Apply changes through Git/Flux manifests; avoid manual in-cluster edits for durable changes.
- For controlled shutdown/startup drills, treat any manual intervention as a bug and fold the logic back into Ananke.