titan-iac

Flux-managed Kubernetes cluster config for bstein.dev.

Canonical repo URL:

  • ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git

Why Ananke

Ananke is inevitability and constraint. That is exactly what this tooling is for:

  • power events happen
  • recovery windows are finite
  • bootstrap has to be deterministic

The point is not clever automation. The point is boring, repeatable recovery.

Power Domains

Two UPS domains matter during shutdown/startup drills:

  • Statera: titan-23, titan-24, titan-jh
  • Pyrphoros: all other nodes

Default UPS checks in Ananke read from Pyrphoros (pyrphoros@localhost) unless overridden.
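The pyrphoros@localhost target has the shape of a Network UPS Tools (NUT) ups@host identifier. The README does not name the UPS backend, so treat the following as a minimal sketch under that assumption. It shows how an operator might read and classify the UPS state by hand. UPS_TARGET, classify_ups_status, and ups_state are hypothetical names, not part of Ananke.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the UPS is served by NUT and that `upsc` is installed.
# UPS_TARGET and both function names are illustrative, not Ananke's own.
UPS_TARGET="${UPS_TARGET:-pyrphoros@localhost}"

# Classify a NUT ups.status string such as "OL CHRG", "OB DISCHRG", or "OB LB".
# LB (low battery) is checked first because it is the most urgent flag.
classify_ups_status() {
  local status=" $1 "
  case "$status" in
    *" LB "*) echo "low-battery" ;;
    *" OB "*) echo "on-battery" ;;
    *" OL "*) echo "online" ;;
    *)        echo "unknown" ;;
  esac
}

# Query the live UPS and classify it; returns non-zero if upsc fails.
ups_state() {
  local raw
  raw="$(upsc "$UPS_TARGET" ups.status 2>/dev/null)" || return 1
  classify_ups_status "$raw"
}
```

To point the sketch at the other domain, override the target, for example UPS_TARGET=statera@localhost ups_state. That UPS name is a guess based on the domain name above, not something the README confirms.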

Breakglass

If primary operator access is lost, breakglass is on the remote Magic Mirror.

Ananke Commands

Ananke is the recovery orchestrator. The Flux desired-state source remains titan-iac.git.

Use titan-db as the canonical control host. tethys (titan-24) is the backup operator host.

From titan-db:

~/ananke-cluster-power status
~/ananke-cluster-power prepare --execute
~/ananke-cluster-power shutdown --execute --require-ups-battery
~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery

From tethys (titan-24), delegating to titan-db:

~/ananke-tools/cluster_power_console.sh --delegate-host titan-db status
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db prepare --execute
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db shutdown --execute --require-ups-battery
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db startup --execute --force-flux-branch main --require-ups-battery

Shutdown Modes

cluster_power_recovery.sh supports two shutdown behaviors:

  • --shutdown-mode host-poweroff (default): graceful cluster shutdown plus scheduled host poweroff.
  • --shutdown-mode cluster-only: graceful cluster shutdown without host poweroff (stops k3s / k3s-agent only).
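As a rough illustration of how the two modes differ on a single node, the dry-run planner below prints the commands each mode would imply. It is not taken from cluster_power_recovery.sh. The function name, the role argument, and the one-minute poweroff delay are all assumptions. The only details it relies on are that k3s servers run the k3s systemd unit and agents run k3s-agent.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run planner: prints commands, never executes them.
# Not the real script's logic; the "+1" minute delay is an arbitrary placeholder.
plan_node_shutdown() {
  local mode="${1:-host-poweroff}" role="${2:-agent}" unit
  case "$role" in
    server) unit="k3s" ;;        # k3s server nodes
    agent)  unit="k3s-agent" ;;  # k3s agent nodes
    *) echo "unknown role: $role" >&2; return 2 ;;
  esac
  case "$mode" in
    cluster-only)
      echo "systemctl stop $unit"
      ;;
    host-poweroff)
      echo "systemctl stop $unit"
      echo "shutdown -h +1"      # scheduled host poweroff
      ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

Running plan_node_shutdown cluster-only server prints only the k3s stop step, which matches the "without host poweroff" description above.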

Startup Completion Rules

Ananke startup is not “done” just because Flux says green once.

Startup completes only after all of the following hold:

  • Flux source drift checks pass (expected URL and branch)
  • all non-optional Flux kustomizations report Ready=True
  • external service checklist passes (default includes Gitea, Grafana, Harbor)
  • generated ingress reachability checks pass (default accepted statuses: 200,301,302,307,308,401,403,404)
  • a stability soak window passes with no CrashLoopBackOff or image-pull failures, and the external service checklist is still healthy at the end of it
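Two of the gates above are easy to reproduce by hand when debugging a stalled startup. The sketch below is an approximation, not Ananke's implementation. ACCEPTED_STATUSES mirrors the default list quoted above, and every function name is hypothetical. It assumes curl and kubectl are available on the control host.

```shell
#!/usr/bin/env bash
# Sketch only: hand-rolled versions of two startup gates. Names are illustrative.
ACCEPTED_STATUSES="${ACCEPTED_STATUSES:-200,301,302,307,308,401,403,404}"

# True if the given HTTP status code is in the accepted list.
status_accepted() {
  case ",$ACCEPTED_STATUSES," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Probe one ingress URL; an unreachable host yields code 000 and fails.
check_ingress() {
  local url="$1" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || code="000"
  status_accepted "$code"
}

# Read "namespace/name status" lines and print those that are not Ready=True.
filter_not_ready() {
  awk '$2 != "True" { print $1 }'
}

# List Flux kustomizations whose Ready condition is anything but True.
not_ready_kustomizations() {
  kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | filter_not_ready
}
```

The accepted list includes 401, 403, and 404 because the check is about reachability: any of those responses shows that the ingress path answers, even if the application behind it requires authentication or has no root page.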

If you intentionally need to correct Flux source during recovery, use:

  • --force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git
  • --force-flux-branch main

--force-flux-url is breakglass-only and requires --allow-flux-source-mutation.
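Before reaching for either flag, it helps to see what Flux is tracking at the moment. The following sketch compares the live GitRepository against the expected URL and branch. The flux-system namespace and object name are Flux bootstrap defaults and are assumptions here, as are the function names. The expected values are taken from the flags above.

```shell
#!/usr/bin/env bash
# Sketch only: manual Flux source drift check. Not Ananke's own implementation.
EXPECTED_URL="${EXPECTED_URL:-ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git}"
EXPECTED_BRANCH="${EXPECTED_BRANCH:-main}"

# True only if both the URL and the branch match the expected values.
flux_source_matches() {
  [ "$1" = "$EXPECTED_URL" ] && [ "$2" = "$EXPECTED_BRANCH" ]
}

# Read the live GitRepository spec and compare it.
# Returns 2 if kubectl fails, 1 on drift, 0 when the source matches.
check_flux_source() {
  local ns="${1:-flux-system}" name="${2:-flux-system}" url branch
  url="$(kubectl -n "$ns" get gitrepository "$name" -o jsonpath='{.spec.url}')" || return 2
  branch="$(kubectl -n "$ns" get gitrepository "$name" -o jsonpath='{.spec.ref.branch}')" || return 2
  flux_source_matches "$url" "$branch"
}
```

A non-zero result from check_flux_source is a cue to investigate why the source moved, not a prompt to mutate it automatically.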

The defaults live in:

  • scripts/bootstrap/recovery-config.env

Detailed runbook:

  • knowledge/runbooks/cluster-power-recovery.md