ananke: harden recovery checks and finalize naming migration
parent c1dc50cace
commit cc316c472b
79
README.md
@@ -1,3 +1,80 @@
# titan-iac

Flux-managed Kubernetes cluster for bstein.dev services.
Flux-managed Kubernetes cluster config for bstein.dev.

Canonical repo URL:
- `ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`

## Why `ananke`

`Ananke` is inevitability and constraint. That is exactly what this tooling is for:
- power events happen
- recovery windows are finite
- bootstrap has to be deterministic

The point is not clever automation. The point is boring, repeatable recovery.

## Power Domains

Two UPS domains matter during shutdown/startup drills:
- `Statera`: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros`: all other nodes

Default UPS checks in Ananke read from `Pyrphoros` (`pyrphoros@localhost`) unless overridden.
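
A minimal sketch of such a UPS check, assuming the battery level is read with the Network UPS Tools `upsc` client (the `pyrphoros@localhost` target suggests this, but the actual call is not shown in this diff). The function names and the 50 percent floor used in the usage note are hypothetical, not Ananke's implementation.

```bash
# Sketch only: function names and thresholds are hypothetical.
UPS_HOST="${UPS_HOST:-pyrphoros@localhost}"
UPS_BATTERY_KEY="${UPS_BATTERY_KEY:-battery.charge}"

# Read the raw value. Assumes the NUT client: upsc <ups>@<host> <variable>.
read_ups_battery() {
  upsc "${UPS_HOST}" "${UPS_BATTERY_KEY}" 2>/dev/null
}

# Succeed only when the reading is numeric and at or above the floor (percent).
ups_battery_ok() {
  local min_percent="$1" charge whole
  charge="$(read_ups_battery)" || return 1
  case "${charge}" in
    ''|*[!0-9.]*) return 1 ;;   # empty or non-numeric telemetry
  esac
  whole="${charge%%.*}"         # integer part is enough for a floor check
  [ -n "${whole}" ] && [ "${whole}" -ge "${min_percent}" ]
}
```

Calling `ups_battery_ok 50` would then gate a step on readable, sufficient telemetry. Setting `UPS_HOST` to another `<ups>@<host>` target is one way an override could work.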

## Breakglass

If primary operator access is lost, breakglass is on the remote Magic Mirror.

## Ananke Commands

Ananke is the recovery orchestrator. Flux desired-state source remains `titan-iac.git`.

Use `titan-db` as the canonical control host. `tethys` (`titan-24`) is the backup operator host.

From `titan-db`:

```bash
~/ananke-cluster-power status
~/ananke-cluster-power prepare --execute
~/ananke-cluster-power shutdown --execute --require-ups-battery
~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery
```

From `tethys` / `titan-24` (delegating to `titan-db`):

```bash
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db status
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db prepare --execute
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db shutdown --execute --require-ups-battery
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db startup --execute --force-flux-branch main --require-ups-battery
```

## Shutdown Modes

`cluster_power_recovery.sh` supports two shutdown behaviors:
- `--shutdown-mode host-poweroff` (default): graceful cluster shutdown plus scheduled host poweroff.
- `--shutdown-mode cluster-only`: graceful cluster shutdown without host poweroff (stops `k3s` / `k3s-agent` only).

## Startup Completion Rules

Ananke startup is not “done” just because Flux says green once.

Startup now completes only after:
- Flux source drift checks pass (expected URL and branch)
- all non-optional Flux kustomizations report `Ready=True`
- external service checklist passes (default includes Gitea, Grafana, Harbor)
- generated ingress reachability checks pass (default accepted statuses: `200,301,302,307,308,401,403,404`)
- a stability soak window passes with no `CrashLoopBackOff` / image-pull failures and checklist still healthy
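
The ingress reachability rule above can be sketched as an allow-list match on the HTTP status code. This is an illustration under assumptions: the helper names are hypothetical, and probing with `curl` is a guess at the mechanism. Only the default status list comes from the README.

```bash
# Sketch only: helper names are hypothetical; the default list mirrors the README.
ALLOWED_STATUSES="${STARTUP_INGRESS_ALLOWED_STATUSES:-200,301,302,307,308,401,403,404}"

# Return 0 if the status code appears in the comma-separated allow-list.
status_allowed() {
  local code="$1" allowed="$2"
  case ",${allowed}," in
    *",${code},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the HTTP status for one host; curl reports 000 when no response arrives.
probe_host() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$1/"
}

# Pass when the host answers with any allowed status.
check_ingress_host() {
  local host="$1" code
  code="$(probe_host "${host}")" || code="000"
  status_allowed "${code}" "${ALLOWED_STATUSES}"
}
```

Codes such as `401` and `404` are accepted by default, which suggests the check asks whether the ingress answers at all, not whether the application behind it is healthy.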

If you intentionally need to correct Flux source during recovery, use:
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`

`--force-flux-url` is breakglass-only and requires `--allow-flux-source-mutation`.
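
A minimal sketch of how that flag dependency could be enforced during argument parsing. The function name and error text are hypothetical, and the real script's parser is not shown in this diff.

```bash
# Sketch only: function name and message are hypothetical.
validate_flux_flags() {
  local force_url="" allow_mutation=0
  while [ $# -gt 0 ]; do
    case "$1" in
      --force-flux-url)
        # The flag needs a value; bail out instead of shifting past the end.
        [ $# -ge 2 ] || { echo "--force-flux-url needs a value" >&2; return 2; }
        force_url="$2"
        shift 2
        ;;
      --allow-flux-source-mutation)
        allow_mutation=1
        shift
        ;;
      *)
        shift   # other options are ignored by this sketch
        ;;
    esac
  done
  if [ -n "${force_url}" ] && [ "${allow_mutation}" -ne 1 ]; then
    echo "--force-flux-url requires --allow-flux-source-mutation" >&2
    return 1
  fi
}
```

With this shape, `--force-flux-branch main` alone passes, while `--force-flux-url` is rejected unless the mutation flag is also given.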

The defaults live in:
- `scripts/bootstrap/recovery-config.env`

Detailed runbook:
- `knowledge/runbooks/cluster-power-recovery.md`
@@ -9,7 +9,7 @@ metadata:
spec:
  interval: 1m0s
  ref:
    branch: feature/atlasbot
    branch: main
  secretRef:
    name: flux-system-gitea
  url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git
@@ -45,33 +45,37 @@ Execute examples
Manual remote console examples
- Canonical operator hosts:
- `titan-db`
- `titan-24`
- `tethys` (`titan-24`)
- Both hosts now have:
- `~/hecate-tools/cluster_power_recovery.sh`
- `~/hecate-tools/cluster_power_console.sh`
- `~/hecate-tools/bootstrap/recovery-config.env`
- `~/hecate-tools/bootstrap/harbor-bootstrap-images.txt`
- `~/hecate-tools/kubeconfig`
- `~/hecate-cluster-power`
- `~/bin/hecate-cluster-power`
- `~/hecate-repo/{infrastructure,services,scripts}`
- `~/ananke-tools/cluster_power_recovery.sh`
- `~/ananke-tools/cluster_power_console.sh`
- `~/ananke-tools/bootstrap/recovery-config.env`
- `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
- `~/ananke-tools/kubeconfig`
- `~/ananke-cluster-power`
- `~/bin/ananke-cluster-power`
- `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at:
- `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage:
- `ssh titan-db`
- `~/hecate-cluster-power status`
- `~/hecate-cluster-power prepare --execute`
- `~/hecate-cluster-power shutdown --execute`
- `~/hecate-cluster-power startup --execute --force-flux-branch main`
- `ssh titan-24`
- `~/hecate-cluster-power status`
- `~/hecate-cluster-power prepare --execute`
- `~/hecate-cluster-power shutdown --execute`
- `~/hecate-cluster-power startup --execute --force-flux-branch main`
- `~/ananke-cluster-power status`
- `~/ananke-cluster-power prepare --execute`
- `~/ananke-cluster-power shutdown --execute`
- `~/ananke-cluster-power startup --execute --force-flux-branch main`
- `ssh tethys`
- `~/ananke-cluster-power status`
- `~/ananke-cluster-power prepare --execute`
- `~/ananke-cluster-power shutdown --execute`
- `~/ananke-cluster-power startup --execute --force-flux-branch main`

Useful options
- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
@@ -81,8 +85,12 @@ Useful options
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--recovery-state-file ~/.local/share/hecate/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- `--flux-ready-timeout 1200`
- `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
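
The `--drain-timeout` option bounds the worker drain, which this runbook describes as escalating from a normal drain to `--force` and then to `--disable-eviction`. A minimal sketch of that escalation follows. The function names are hypothetical, and reusing the same timeout for every stage is an assumption. The flags themselves are standard `kubectl drain` options.

```bash
# Sketch only: function names are hypothetical; per-stage timeout reuse is assumed.
run_drain() {
  kubectl drain "$@"
}

drain_with_escalation() {
  local node="$1" timeout="${2:-180}"

  # Stage 1: normal eviction, honoring PodDisruptionBudgets.
  run_drain "${node}" --ignore-daemonsets --delete-emptydir-data \
    --timeout="${timeout}s" && return 0

  # Stage 2: also remove pods that no controller manages.
  run_drain "${node}" --ignore-daemonsets --delete-emptydir-data \
    --force --timeout="${timeout}s" && return 0

  # Stage 3: bypass the eviction API and delete pods directly.
  run_drain "${node}" --ignore-daemonsets --delete-emptydir-data \
    --force --disable-eviction --timeout="${timeout}s"
}
```

Each stage runs only if the previous one failed or timed out, so a healthy node is drained with the gentlest option.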

Controlled drill checklist (recommended)
- Operator host: use `titan-db` as canonical control host for the drill.
@@ -91,37 +99,48 @@ Controlled drill checklist (recommended)
- Confirm they will manually power cluster nodes back on after shutdown completes.
- Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
- `mkdir -p ~/hecate-logs`
- `~/hecate-cluster-power status` and verify:
- `mkdir -p ~/ananke-logs`
- `~/ananke-cluster-power status` and verify:
- `ups_host=pyrphoros@localhost`
- `ups_battery` is numeric
- `flux_source_ready=True`
- Warm helper image just before shutdown:
- `~/hecate-cluster-power prepare --execute`
- `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs:
- `tmux new -s hecate-drill`
- `script -q -a ~/hecate-logs/hecate-drill-$(date +%Y%m%d-%H%M%S).log`
- `tmux new -s ananke-drill`
- `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement:
- `~/hecate-cluster-power shutdown --execute --require-ups-battery`
- `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup:
- `~/hecate-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check:
- `~/hecate-cluster-power status`
- `~/ananke-cluster-power status`
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.

Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
- `host-poweroff` schedules host poweroff after service stop.
- `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/hecate-repo`.
- Startup fails fast if Flux source URL/branch drift from expected values (unless branch override is explicitly requested with `--force-flux-branch`).
- Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
- Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
- Stage it on the operator host at `~/.local/share/hecate/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
- Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
- Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Hecate uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Hecate uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/state/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
- all non-optional Flux kustomizations must be `Ready=True`
- external service checklist must pass (defaults include Gitea, Grafana, Harbor)
- generated ingress reachability checks must pass (default accepted codes: `200,301,302,307,308,401,403,404`)
- stability soak must pass with no crashloop/pull-failure churn
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying reconcile.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
@@ -1,14 +1,36 @@
CANONICAL_CONTROL_HOST="titan-db"
DEFAULT_FLUX_BRANCH="main"
STATE_SUBDIR=".local/share/hecate"
EXPECTED_FLUX_URL="ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git"
SHUTDOWN_MODE="host-poweroff"
STATE_SUBDIR=".local/share/ananke"
HARBOR_BUNDLE_BASENAME="harbor-bootstrap-v2.14.1-arm64.tar.zst"
HARBOR_TARGET_NODE="titan-05"
HARBOR_CANARY_NODE="titan-04"
HARBOR_TARGET_NODE=""
HARBOR_CANARY_NODE=""
HARBOR_HOST_LABEL_KEY="ananke.bstein.dev/harbor-bootstrap"
HARBOR_CANARY_IMAGE="registry.bstein.dev/bstein/kubectl:1.35.0"
NODE_HELPER_IMAGE="registry.bstein.dev/bstein/hecate-node-helper:0.1.0"
NODE_HELPER_IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
NODE_HELPER_NAMESPACE="maintenance"
NODE_HELPER_SERVICE_ACCOUNT="default"
REGISTRY_PULL_SECRET="harbor-regcred"
BUNDLE_HTTP_PORT="8877"
UPS_HOST="pyrphoros@localhost"
UPS_BATTERY_KEY="battery.charge"
FLUX_READY_TIMEOUT_SECONDS="1200"
FLUX_READY_POLL_SECONDS="10"
STARTUP_CHECKLIST_TIMEOUT_SECONDS="900"
STARTUP_CHECKLIST_POLL_SECONDS="10"
STARTUP_WORKLOAD_TIMEOUT_SECONDS="900"
STARTUP_WORKLOAD_POLL_SECONDS="10"
STARTUP_STABILITY_WINDOW_SECONDS="180"
STARTUP_STABILITY_TIMEOUT_SECONDS="900"
STARTUP_STABILITY_POLL_SECONDS="10"
STARTUP_OPTIONAL_KUSTOMIZATIONS=""
STARTUP_IGNORE_PODS_REGEX=""
STARTUP_IGNORE_WORKLOADS_REGEX=""
STARTUP_WORKLOAD_NAMESPACE_EXCLUDES_REGEX="^(kube-system|kube-public|kube-node-lease|flux-system)$"
STARTUP_SERVICE_CHECK_TIMEOUT_SECONDS="10"
STARTUP_INCLUDE_INGRESS_CHECKS="1"
STARTUP_INGRESS_ALLOWED_STATUSES="200,301,302,307,308,401,403,404"
STARTUP_IGNORE_INGRESS_HOSTS_REGEX=""
STARTUP_INGRESS_CHECK_TIMEOUT_SECONDS="10"
STARTUP_SERVICE_CHECKLIST='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||;grafana|https://metrics.bstein.dev/api/health|200|"database":"ok"||;harbor|https://registry.bstein.dev/v2/|200,401|||'
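
The checklist value above appears to pack one service per `;`-separated entry, with `|`-separated fields inside each entry. A minimal parsing sketch follows, assuming the first four fields are name, URL, accepted status codes, and an expected body substring. That reading is inferred from the default value only. The two trailing fields are empty in every default entry, so the sketch ignores them. The function name is hypothetical.

```bash
# Sketch only: field meanings are inferred from the default value, not confirmed.
parse_service_checklist() {
  local rest="$1" entry name url statuses body
  while [ -n "${rest}" ]; do
    entry="${rest%%;*}"               # text before the first ';'
    case "${rest}" in
      *";"*) rest="${rest#*;}" ;;     # drop the entry just taken
      *) rest="" ;;                   # that was the last entry
    esac
    [ -n "${entry}" ] || continue
    # Split on '|'; anything after the fourth field lands in the throwaway '_'.
    printf '%s\n' "${entry}" | {
      IFS='|' read -r name url statuses body _
      printf '%s\t%s\t%s\t%s\n' "${name}" "${url}" "${statuses}" "${body}"
    }
  done
}
```

Running `parse_service_checklist "${STARTUP_SERVICE_CHECKLIST}"` prints one tab-separated line per service, which a caller could feed into a status and body check.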
@@ -1,10 +1,10 @@
#!/usr/bin/env bash
set -euo pipefail

IMAGE="registry.bstein.dev/bstein/hecate-node-helper:0.1.0"
IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
DOCKER_CONFIG_PATH=""
PLATFORMS="linux/amd64,linux/arm64"
BUILDER_NAME="hecate-node-helper-builder"
BUILDER_NAME="ananke-node-helper-builder"

while [[ $# -gt 0 ]]; do
case "$1" in
@@ -26,7 +26,7 @@ while [[ $# -gt 0 ]]; do
;;
-h|--help)
cat <<USAGE
Usage: scripts/build_hecate_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>]
Usage: scripts/build_ananke_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>]
USAGE
exit 0
;;
@@ -50,7 +50,7 @@ fi
docker buildx inspect --bootstrap >/dev/null
docker buildx build \
--platform "${PLATFORMS}" \
-f dockerfiles/Dockerfile.hecate-node-helper \
-f dockerfiles/Dockerfile.ananke-node-helper \
-t "${IMAGE}" \
--push \
.
@@ -7,11 +7,11 @@ Usage:
scripts/cluster_power_console.sh [--repo-dir <path>] [--delegate-host <host>] [--allow-local] <prepare|status|shutdown|startup> [recovery-script-options...]

Purpose:
Friendly manual entrypoint for running Hecate from a remote console.
Friendly manual entrypoint for running Ananke from a remote console.
The canonical control host is titan-db by default so bundle/state handling stays in one place.

Defaults:
--repo-dir \$HOME/Development/titan-iac
--repo-dir \$HOME/Development/ananke (fallback: \$HOME/Development/titan-iac)
--delegate-host titan-db

Examples:
@@ -22,10 +22,14 @@ Examples:
USAGE
}

if [[ -d "${HOME}/Development/ananke" ]]; then
REPO_DIR="${HOME}/Development/ananke"
else
REPO_DIR="${HOME}/Development/titan-iac"
fi
DELEGATE_HOST="titan-db"
ALLOW_LOCAL=0
REMOTE_REPO_DIR="${HECATE_REMOTE_REPO_DIR:-}"
REMOTE_REPO_DIR="${ANANKE_REMOTE_REPO_DIR:-}"

while [[ $# -gt 0 ]]; do
case "$1" in
@@ -73,6 +77,6 @@ fi
quoted_args="$(printf '%q ' "$@")"
remote_prefix=""
if [[ -n "${REMOTE_REPO_DIR}" ]]; then
remote_prefix="HECATE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") "
remote_prefix="ANANKE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") "
fi
exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/hecate-tools/cluster_power_recovery.sh ${quoted_args}"
exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/ananke-tools/cluster_power_recovery.sh ${quoted_args}"
File diff suppressed because it is too large