ananke: harden recovery checks and finalize naming migration

Brad Stein 2026-04-07 12:30:28 -03:00
parent c1dc50cace
commit cc316c472b
8 changed files with 1002 additions and 83 deletions


@ -1,3 +1,80 @@
# titan-iac # titan-iac
Flux-managed Kubernetes cluster for bstein.dev services. Flux-managed Kubernetes cluster config for bstein.dev.
Canonical repo URL:
- `ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
## Why `ananke`
`Ananke` is inevitability and constraint. That is exactly what this tooling is for:
- power events happen
- recovery windows are finite
- bootstrap has to be deterministic
The point is not clever automation. The point is boring, repeatable recovery.
## Power Domains
Two UPS domains matter during shutdown/startup drills:
- `Statera`: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros`: all other nodes
Default UPS checks in Ananke read from `Pyrphoros` (`pyrphoros@localhost`) unless overridden.
## Breakglass
If primary operator access is lost, breakglass is on the remote Magic Mirror.
## Ananke Commands
Ananke is the recovery orchestrator. Flux desired-state source remains `titan-iac.git`.
Use `titan-db` as the canonical control host. `tethys` (`titan-24`) is the backup operator host.
From `titan-db`:
```bash
~/ananke-cluster-power status
~/ananke-cluster-power prepare --execute
~/ananke-cluster-power shutdown --execute --require-ups-battery
~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery
```
From `tethys` / `titan-24` (delegating to `titan-db`):
```bash
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db status
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db prepare --execute
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db shutdown --execute --require-ups-battery
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db startup --execute --force-flux-branch main --require-ups-battery
```
## Shutdown Modes
`cluster_power_recovery.sh` supports two shutdown behaviors:
- `--shutdown-mode host-poweroff` (default): graceful cluster shutdown plus scheduled host poweroff.
- `--shutdown-mode cluster-only`: graceful cluster shutdown without host poweroff (stops `k3s` / `k3s-agent` only).
## Startup Completion Rules
Ananke startup is not “done” just because Flux says green once.
Startup now completes only after:
- Flux source drift checks pass (expected URL and branch)
- all non-optional Flux kustomizations report `Ready=True`
- external service checklist passes (default includes Gitea, Grafana, Harbor)
- generated ingress reachability checks pass (default accepted statuses: `200,301,302,307,308,401,403,404`)
- a stability soak window passes with no `CrashLoopBackOff` / image-pull failures and checklist still healthy
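The ingress rule above treats a fixed set of HTTP status codes as "reachable". A minimal sketch of that allow-list check follows. `status_allowed` and `probe_status` are hypothetical helper names, not functions taken from `cluster_power_recovery.sh`; only the default status list comes from this README.

```bash
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative, not the script's real code.
# Default list copied from the accepted ingress statuses above.
ALLOWED_STATUSES="200,301,302,307,308,401,403,404"

# Return 0 when the HTTP status code appears in the comma-separated allow-list.
status_allowed() {
  local code="$1" allowed="${2:-$ALLOWED_STATUSES}"
  case ",${allowed}," in
    *",${code},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the HTTP status code for a URL (curl reports 000 when the request fails).
probe_status() {
  local url="$1" timeout="${2:-10}"
  curl -sS -o /dev/null --max-time "${timeout}" -w '%{http_code}' "${url}" 2>/dev/null || true
}
```

Wrapping the list in commas before matching keeps a partial code such as `40` from matching `404`.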
If you intentionally need to correct Flux source during recovery, use:
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
`--force-flux-url` is breakglass-only and requires `--allow-flux-source-mutation`.
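The flag dependency above can be expressed as a small pre-flight check. This is a sketch under assumptions: `validate_flux_force_flags` is a hypothetical function, and the real script may parse and enforce its options differently.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical validator for the rule stated above.
# --force-flux-url is rejected unless --allow-flux-source-mutation is also given.
validate_flux_force_flags() {
  local force_url="" allow_mutation=0
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --force-flux-url)
        force_url="${2:-}"
        shift
        # Consume the URL value only if one was supplied.
        if [[ $# -gt 0 ]]; then shift; fi
        ;;
      --allow-flux-source-mutation)
        allow_mutation=1
        shift
        ;;
      *)
        shift
        ;;
    esac
  done
  if [[ -n "${force_url}" && "${allow_mutation}" -ne 1 ]]; then
    echo "--force-flux-url requires --allow-flux-source-mutation" >&2
    return 1
  fi
  return 0
}
```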
The defaults live in:
- `scripts/bootstrap/recovery-config.env`
Detailed runbook:
- `knowledge/runbooks/cluster-power-recovery.md`
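The stability soak listed under Startup Completion Rules can be pictured as a loop that demands an unbroken healthy window before a deadline. The sketch below is illustrative only: `soak_until_stable` is a hypothetical name, and the health check is whatever command the caller passes in, not Ananke's actual pod and checklist probes.

```bash
#!/usr/bin/env bash
# Sketch only: generic soak loop, not the implementation in cluster_power_recovery.sh.
# Usage: soak_until_stable WINDOW_SECONDS TIMEOUT_SECONDS POLL_SECONDS CHECK_CMD [ARGS...]
# Succeeds once CHECK_CMD has passed continuously for WINDOW_SECONDS.
# Fails if that has not happened by the time TIMEOUT_SECONDS have elapsed.
soak_until_stable() {
  local window="$1" timeout="$2" poll="$3"
  shift 3
  local start now healthy_since=""
  start="$(date +%s)"
  while :; do
    now="$(date +%s)"
    if "$@"; then
      # Remember when the current healthy streak began.
      if [[ -z "${healthy_since}" ]]; then healthy_since="${now}"; fi
      if (( now - healthy_since >= window )); then return 0; fi
    else
      # Any failure resets the streak.
      healthy_since=""
    fi
    if (( now - start >= timeout )); then return 1; fi
    sleep "${poll}"
  done
}
```

With the defaults from `recovery-config.env`, a call would look like `soak_until_stable 180 900 10 some_health_check`, where `some_health_check` is a placeholder.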


@ -9,7 +9,7 @@ metadata:
spec: spec:
interval: 1m0s interval: 1m0s
ref: ref:
branch: feature/atlasbot branch: main
secretRef: secretRef:
name: flux-system-gitea name: flux-system-gitea
url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git


@ -45,33 +45,37 @@ Execute examples
Manual remote console examples Manual remote console examples
- Canonical operator hosts: - Canonical operator hosts:
- `titan-db` - `titan-db`
- `titan-24` - `tethys` (`titan-24`)
- Both hosts now have: - Both hosts now have:
- `~/hecate-tools/cluster_power_recovery.sh` - `~/ananke-tools/cluster_power_recovery.sh`
- `~/hecate-tools/cluster_power_console.sh` - `~/ananke-tools/cluster_power_console.sh`
- `~/hecate-tools/bootstrap/recovery-config.env` - `~/ananke-tools/bootstrap/recovery-config.env`
- `~/hecate-tools/bootstrap/harbor-bootstrap-images.txt` - `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
- `~/hecate-tools/kubeconfig` - `~/ananke-tools/kubeconfig`
- `~/hecate-cluster-power` - `~/ananke-cluster-power`
- `~/bin/hecate-cluster-power` - `~/bin/ananke-cluster-power`
- `~/hecate-repo/{infrastructure,services,scripts}` - `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at: - Both hosts also keep the Harbor bootstrap bundle at:
- `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst` - `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage: - Remote usage:
- `ssh titan-db` - `ssh titan-db`
- `~/hecate-cluster-power status` - `~/ananke-cluster-power status`
- `~/hecate-cluster-power prepare --execute` - `~/ananke-cluster-power prepare --execute`
- `~/hecate-cluster-power shutdown --execute` - `~/ananke-cluster-power shutdown --execute`
- `~/hecate-cluster-power startup --execute --force-flux-branch main` - `~/ananke-cluster-power startup --execute --force-flux-branch main`
- `ssh titan-24` - `ssh tethys`
- `~/hecate-cluster-power status` - `~/ananke-cluster-power status`
- `~/hecate-cluster-power prepare --execute` - `~/ananke-cluster-power prepare --execute`
- `~/hecate-cluster-power shutdown --execute` - `~/ananke-cluster-power shutdown --execute`
- `~/hecate-cluster-power startup --execute --force-flux-branch main` - `~/ananke-cluster-power startup --execute --force-flux-branch main`
Useful options Useful options
- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main` - `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main` - `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery) - `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred) - `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node) - `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
@ -81,8 +85,12 @@ Useful options
- `--require-ups-battery` - `--require-ups-battery`
- `--drain-timeout 180` - `--drain-timeout 180`
- `--emergency-drain-timeout 45` - `--emergency-drain-timeout 45`
- `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state` - `--flux-ready-timeout 1200`
- `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst` - `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
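The operational notes in this runbook describe worker drain escalating from a normal drain, to `--force`, to `--disable-eviction`. A rough sketch of that ladder is below. It rests on assumptions: `drain_with_escalation` is a hypothetical function, one timeout is applied per attempt, `--force` is kept when `--disable-eviction` is added, and the `--ignore-daemonsets` / `--delete-emptydir-data` flags are a guess at what the real script passes.

```bash
#!/usr/bin/env bash
# Sketch only: illustrates the escalation order, not the script's real drain code.
# KUBECTL can be overridden so the function can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: drain_with_escalation NODE [TIMEOUT_SECONDS]
drain_with_escalation() {
  local node="$1" timeout="${2:-180}"
  local base=(drain "${node}" --ignore-daemonsets --delete-emptydir-data "--timeout=${timeout}s")
  local stage
  for stage in normal force disable-eviction; do
    local args=("${base[@]}")
    case "${stage}" in
      force) args+=(--force) ;;
      disable-eviction) args+=(--force --disable-eviction) ;;
    esac
    if "${KUBECTL}" "${args[@]}"; then
      return 0
    fi
    echo "drain of ${node} failed at stage '${stage}'" >&2
  done
  return 1
}
```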
Controlled drill checklist (recommended) Controlled drill checklist (recommended)
- Operator host: use `titan-db` as canonical control host for the drill. - Operator host: use `titan-db` as canonical control host for the drill.
@ -91,37 +99,48 @@ Controlled drill checklist (recommended)
- Confirm they will manually power cluster nodes back on after shutdown completes. - Confirm they will manually power cluster nodes back on after shutdown completes.
- Confirm who will announce "all nodes powered on" to resume startup. - Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`: - Preflight on `titan-db`:
- `mkdir -p ~/hecate-logs` - `mkdir -p ~/ananke-logs`
- `~/hecate-cluster-power status` and verify: - `~/ananke-cluster-power status` and verify:
- `ups_host=pyrphoros@localhost` - `ups_host=pyrphoros@localhost`
- `ups_battery` is numeric - `ups_battery` is numeric
- `flux_source_ready=True` - `flux_source_ready=True`
- Warm helper image just before shutdown: - Warm helper image just before shutdown:
- `~/hecate-cluster-power prepare --execute` - `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs: - Run in a persistent shell and capture logs:
- `tmux new -s hecate-drill` - `tmux new -s ananke-drill`
- `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log` - `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement: - Execute controlled shutdown with telemetry enforcement:
- `~/hecate-cluster-power shutdown --execute --require-ups-battery` - `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup: - After on-site power-on confirmation, execute startup:
- `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery` - `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check: - Post-check:
- `~/hecate-cluster-power status` - `~/ananke-cluster-power status`
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures. - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
Operational notes Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn. - The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
- `host-poweroff` schedules host poweroff after service stop.
- `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted. - Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/hecate-repo`. - Startup fails fast if Flux source URL/branch drift from expected values (unless branch override is explicitly requested with `--force-flux-branch`).
- Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer. - Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services. - Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle: - Harbor bootstrap is now designed around a control-host bundle:
- Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`. - Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
- Stage it on the operator host at `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`. - Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
- Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`. - Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Hecate uses it across both arm64 and amd64 nodes during prepare/shutdown operations. - The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls. - Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap. - The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
- all non-optional Flux kustomizations must be `Ready=True`
- external service checklist must pass (defaults include Gitea, Grafana, Harbor)
- generated ingress reachability checks must pass (default accepted codes: `200,301,302,307,308,401,403,404`)
- stability soak must pass with no crashloop/pull-failure churn
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying reconcile.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster. - In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state. - Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by: - `harbor-seed --execute` was validated by:


@ -1,14 +1,36 @@
CANONICAL_CONTROL_HOST="titan-db" CANONICAL_CONTROL_HOST="titan-db"
DEFAULT_FLUX_BRANCH="main" DEFAULT_FLUX_BRANCH="main"
STATE_SUBDIR=".local/share/hecate" EXPECTED_FLUX_URL="ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git"
SHUTDOWN_MODE="host-poweroff"
STATE_SUBDIR=".local/share/ananke"
HARBOR_BUNDLE_BASENAME="harbor-bootstrap-v2.14.1-arm64.tar.zst" HARBOR_BUNDLE_BASENAME="harbor-bootstrap-v2.14.1-arm64.tar.zst"
HARBOR_TARGET_NODE="titan-05" HARBOR_TARGET_NODE=""
HARBOR_CANARY_NODE="titan-04" HARBOR_CANARY_NODE=""
HARBOR_HOST_LABEL_KEY="ananke.bstein.dev/harbor-bootstrap"
HARBOR_CANARY_IMAGE="registry.bstein.dev/bstein/kubectl:1.35.0" HARBOR_CANARY_IMAGE="registry.bstein.dev/bstein/kubectl:1.35.0"
NODE_HELPER_IMAGE="registry.bstein.dev/bstein/hecate-node-helper:0.1.0" NODE_HELPER_IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
NODE_HELPER_NAMESPACE="maintenance" NODE_HELPER_NAMESPACE="maintenance"
NODE_HELPER_SERVICE_ACCOUNT="default" NODE_HELPER_SERVICE_ACCOUNT="default"
REGISTRY_PULL_SECRET="harbor-regcred" REGISTRY_PULL_SECRET="harbor-regcred"
BUNDLE_HTTP_PORT="8877" BUNDLE_HTTP_PORT="8877"
UPS_HOST="pyrphoros@localhost" UPS_HOST="pyrphoros@localhost"
UPS_BATTERY_KEY="battery.charge" UPS_BATTERY_KEY="battery.charge"
FLUX_READY_TIMEOUT_SECONDS="1200"
FLUX_READY_POLL_SECONDS="10"
STARTUP_CHECKLIST_TIMEOUT_SECONDS="900"
STARTUP_CHECKLIST_POLL_SECONDS="10"
STARTUP_WORKLOAD_TIMEOUT_SECONDS="900"
STARTUP_WORKLOAD_POLL_SECONDS="10"
STARTUP_STABILITY_WINDOW_SECONDS="180"
STARTUP_STABILITY_TIMEOUT_SECONDS="900"
STARTUP_STABILITY_POLL_SECONDS="10"
STARTUP_OPTIONAL_KUSTOMIZATIONS=""
STARTUP_IGNORE_PODS_REGEX=""
STARTUP_IGNORE_WORKLOADS_REGEX=""
STARTUP_WORKLOAD_NAMESPACE_EXCLUDES_REGEX="^(kube-system|kube-public|kube-node-lease|flux-system)$"
STARTUP_SERVICE_CHECK_TIMEOUT_SECONDS="10"
STARTUP_INCLUDE_INGRESS_CHECKS="1"
STARTUP_INGRESS_ALLOWED_STATUSES="200,301,302,307,308,401,403,404"
STARTUP_IGNORE_INGRESS_HOSTS_REGEX=""
STARTUP_INGRESS_CHECK_TIMEOUT_SECONDS="10"
STARTUP_SERVICE_CHECKLIST='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||;grafana|https://metrics.bstein.dev/api/health|200|"database":"ok"||;harbor|https://registry.bstein.dev/v2/|200,401|||'
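`STARTUP_SERVICE_CHECKLIST` packs several checks into one string: entries separated by `;`, fields separated by `|`. The sketch below splits it apart. The field meanings (name, URL, accepted statuses, required body substring) are inferred from the default value, the two trailing fields are left uninterpreted, and `parse_service_checklist` is a hypothetical helper rather than the script's own parser.

```bash
#!/usr/bin/env bash
# Sketch only: field names are an interpretation of the default value above.
STARTUP_SERVICE_CHECKLIST='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||;grafana|https://metrics.bstein.dev/api/health|200|"database":"ok"||;harbor|https://registry.bstein.dev/v2/|200,401|||'

# Print one tab-separated line per entry: name, url, statuses, body substring.
parse_service_checklist() {
  local checklist="$1" entry name url statuses body rest
  local -a entries
  IFS=';' read -r -a entries <<< "${checklist}"
  for entry in "${entries[@]}"; do
    [[ -n "${entry}" ]] || continue
    # Remaining fields land in "rest" and are ignored here.
    IFS='|' read -r name url statuses body rest <<< "${entry}"
    printf '%s\t%s\t%s\t%s\n' "${name}" "${url}" "${statuses}" "${body}"
  done
}

parse_service_checklist "${STARTUP_SERVICE_CHECKLIST}"
```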


@ -1,10 +1,10 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -euo pipefail set -euo pipefail
IMAGE="registry.bstein.dev/bstein/hecate-node-helper:0.1.0" IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
DOCKER_CONFIG_PATH="" DOCKER_CONFIG_PATH=""
PLATFORMS="linux/amd64,linux/arm64" PLATFORMS="linux/amd64,linux/arm64"
BUILDER_NAME="hecate-node-helper-builder" BUILDER_NAME="ananke-node-helper-builder"
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case "$1" in case "$1" in
@ -26,7 +26,7 @@ while [[ $# -gt 0 ]]; do
;; ;;
-h|--help) -h|--help)
cat <<USAGE cat <<USAGE
Usage: scripts/build_hecate_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>] Usage: scripts/build_ananke_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>]
USAGE USAGE
exit 0 exit 0
;; ;;
@ -50,7 +50,7 @@ fi
docker buildx inspect --bootstrap >/dev/null docker buildx inspect --bootstrap >/dev/null
docker buildx build \ docker buildx build \
--platform "${PLATFORMS}" \ --platform "${PLATFORMS}" \
-f dockerfiles/Dockerfile.hecate-node-helper \ -f dockerfiles/Dockerfile.ananke-node-helper \
-t "${IMAGE}" \ -t "${IMAGE}" \
--push \ --push \
. .


@ -7,11 +7,11 @@ Usage:
scripts/cluster_power_console.sh [--repo-dir <path>] [--delegate-host <host>] [--allow-local] <prepare|status|shutdown|startup> [recovery-script-options...] scripts/cluster_power_console.sh [--repo-dir <path>] [--delegate-host <host>] [--allow-local] <prepare|status|shutdown|startup> [recovery-script-options...]
Purpose: Purpose:
Friendly manual entrypoint for running Hecate from a remote console. Friendly manual entrypoint for running Ananke from a remote console.
The canonical control host is titan-db by default so bundle/state handling stays in one place. The canonical control host is titan-db by default so bundle/state handling stays in one place.
Defaults: Defaults:
--repo-dir \$HOME/Development/titan-iac --repo-dir \$HOME/Development/ananke (fallback: \$HOME/Development/titan-iac)
--delegate-host titan-db --delegate-host titan-db
Examples: Examples:
@ -22,10 +22,14 @@ Examples:
USAGE USAGE
} }
REPO_DIR="${HOME}/Development/titan-iac" if [[ -d "${HOME}/Development/ananke" ]]; then
REPO_DIR="${HOME}/Development/ananke"
else
REPO_DIR="${HOME}/Development/titan-iac"
fi
DELEGATE_HOST="titan-db" DELEGATE_HOST="titan-db"
ALLOW_LOCAL=0 ALLOW_LOCAL=0
REMOTE_REPO_DIR="${HECATE_REMOTE_REPO_DIR:-}" REMOTE_REPO_DIR="${ANANKE_REMOTE_REPO_DIR:-}"
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case "$1" in case "$1" in
@ -73,6 +77,6 @@ fi
quoted_args="$(printf '%q ' "$@")" quoted_args="$(printf '%q ' "$@")"
remote_prefix="" remote_prefix=""
if [[ -n "${REMOTE_REPO_DIR}" ]]; then if [[ -n "${REMOTE_REPO_DIR}" ]]; then
remote_prefix="HECATE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") " remote_prefix="ANANKE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") "
fi fi
exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/hecate-tools/cluster_power_recovery.sh ${quoted_args}" exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/ananke-tools/cluster_power_recovery.sh ${quoted_args}"
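The `printf '%q '` step matters because ssh hands the remote side a single command string, which the remote shell parses again. The sketch below shows the round trip, using a local `bash -c` as a stand-in for ssh. `quote_args` and `run_remote_standin` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: demonstrates why arguments are re-quoted before delegation.

# Quote each argument so a second shell parses it back to the same word.
quote_args() {
  printf '%q ' "$@"
}

# Stand-in for ssh: a fresh bash parses the command string, as the remote shell would.
run_remote_standin() {
  bash -c "$1"
}

quoted="$(quote_args status --note "two words" 'semi;colon')"
# Each original argument should come back as exactly one line.
run_remote_standin "printf '%s\n' ${quoted}"
```

Note that `%q` emits bash-style quoting, so arguments containing control characters may not survive a remote login shell that is not bash.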

File diff suppressed because it is too large.