Compare commits

...

10 Commits

33 changed files with 2937 additions and 103 deletions

View File

@ -1,3 +1,80 @@
# titan-iac
Flux-managed Kubernetes cluster for bstein.dev services.
Flux-managed Kubernetes cluster config for bstein.dev.
Canonical repo URL:
- `ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
## Why `ananke`
`Ananke` is inevitability and constraint. That is exactly what this tooling is for:
- power events happen
- recovery windows are finite
- bootstrap has to be deterministic
The point is not clever automation. The point is boring, repeatable recovery.
## Power Domains
Two UPS domains matter during shutdown/startup drills:
- `Statera`: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros`: all other nodes
Default UPS checks in Ananke read from `Pyrphoros` (`pyrphoros@localhost`) unless overridden.
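To confirm what Ananke will read before a drill, you can query the UPS directly (a quick check, assuming the standard NUT `upsc` client is available on the control host):
```bash
# Battery charge, the key Ananke checks (battery.charge)
upsc pyrphoros@localhost battery.charge
# Full variable dump if you want more than the charge value
upsc pyrphoros@localhost
```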
## Breakglass
If primary operator access is lost, breakglass is on the remote Magic Mirror.
## Ananke Commands
Ananke is the recovery orchestrator. Flux desired-state source remains `titan-iac.git`.
Use `titan-db` as the canonical control host. `tethys` (`titan-24`) is the backup operator host.
From `titan-db`:
```bash
~/ananke-cluster-power status
~/ananke-cluster-power prepare --execute
~/ananke-cluster-power shutdown --execute --require-ups-battery
~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery
```
From `tethys` / `titan-24` (delegating to `titan-db`):
```bash
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db status
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db prepare --execute
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db shutdown --execute --require-ups-battery
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db startup --execute --force-flux-branch main --require-ups-battery
```
## Shutdown Modes
`cluster_power_recovery.sh` supports two shutdown behaviors:
- `--shutdown-mode host-poweroff` (default): graceful cluster shutdown plus scheduled host poweroff.
- `--shutdown-mode cluster-only`: graceful cluster shutdown without host poweroff (stops `k3s` / `k3s-agent` only).
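For example, to keep hosts powered on during a maintenance window (a sketch; `--shutdown-mode` is passed through to `cluster_power_recovery.sh` the same way as the other flags shown here):
```bash
# Preview first: dry-run is the default
~/ananke-cluster-power shutdown --shutdown-mode cluster-only
# Then stop k3s / k3s-agent without powering hosts off
~/ananke-cluster-power shutdown --execute --shutdown-mode cluster-only
```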
## Startup Completion Rules
Ananke startup is not “done” just because Flux says green once.
Startup now completes only after:
- Flux source drift checks pass (expected URL and branch)
- all non-optional Flux kustomizations report `Ready=True`
- external service checklist passes (default includes Gitea, Grafana, Harbor)
- generated ingress reachability checks pass (default accepted statuses: `200,301,302,307,308,401,403,404`)
- a stability soak window passes with no `CrashLoopBackOff` / image-pull failures and checklist still healthy
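Each generated ingress check is a plain HTTPS status probe; conceptually it is close to the sketch below (the host is only an example, and the accepted-status list mirrors `STARTUP_INGRESS_ALLOWED_STATUSES`):
```bash
allowed="200,301,302,307,308,401,403,404"
code=$(curl -ks -o /dev/null -w '%{http_code}' --max-time 10 "https://metrics.bstein.dev/")
case ",${allowed}," in
  *",${code},"*) echo "ok (${code})" ;;
  *) echo "unexpected status ${code}" >&2 ;;
esac
```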
If you intentionally need to correct Flux source during recovery, use:
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
`--force-flux-url` is breakglass-only and requires `--allow-flux-source-mutation`.
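If that breakglass path is ever needed, the full invocation looks roughly like this (only during recovery, and only when the source has genuinely drifted):
```bash
~/ananke-cluster-power startup --execute \
  --allow-flux-source-mutation \
  --force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git \
  --force-flux-branch main \
  --require-ups-battery
```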
The defaults live in:
- `scripts/bootstrap/recovery-config.env`
Detailed runbook:
- `knowledge/runbooks/cluster-power-recovery.md`

View File

@ -9,7 +9,7 @@ metadata:
spec:
interval: 1m0s
ref:
branch: feature/atlasbot
branch: main
secretRef:
name: flux-system-gitea
url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git

View File

@ -0,0 +1,12 @@
# Minimal Ananke node-helper image: bash, ca-certificates, curl, util-linux, and zstd on debian-slim.
FROM debian:bookworm-slim
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
bash \
ca-certificates \
curl \
util-linux \
zstd \
&& rm -rf /var/lib/apt/lists/*
CMD ["/bin/sh"]

View File

@ -0,0 +1,152 @@
# Atlas Cluster Power Recovery (Graceful Shutdown/Startup)
## Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.
## Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
- Recovery path: bring control plane and workers up, then locally apply minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
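A quick way to tell whether Flux is actually stalled on source fetch during a cold boot (standard Flux CLI / kubectl, nothing Ananke-specific):
- `flux get sources git -n flux-system`
- `kubectl -n flux-system get gitrepository flux-system -o wide`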
## Scripts
- `scripts/cluster_power_recovery.sh`
- `scripts/cluster_power_console.sh`
- Modes:
- `prepare`
- `shutdown`
- `harbor-seed`
- `startup`
- `status`
- Default is dry-run. Add `--execute` to actually perform actions.
## Dry-run examples
- Shutdown preview:
- `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
- Startup preview:
- `scripts/cluster_power_recovery.sh startup`
- Harbor seed preview:
- `scripts/cluster_power_recovery.sh harbor-seed`
## Execute examples
- Prepare helper image on every node:
- `scripts/cluster_power_recovery.sh prepare --execute`
- Seed Harbor runtime images onto `titan-05` from the control-host bundle:
- `scripts/cluster_power_recovery.sh harbor-seed --execute`
- Planned shutdown:
- `scripts/cluster_power_recovery.sh shutdown --execute`
- Planned startup (canonical branch):
- `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`
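Bundle build/stage sketch (the Harbor bundle used by `harbor-seed` has to exist on the control host first; the commands below rely on the script defaults and the documented bundle path):
- Build locally: `scripts/build_harbor_bootstrap_bundle.sh`
- Stage on the operator host: `scp artifacts/harbor-bootstrap-v2.14.1-arm64.tar.zst titan-db:.local/share/ananke/bundles/`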
## Manual remote console examples
- Canonical operator hosts:
- `titan-db`
- `tethys` (`titan-24`)
- Both hosts now have:
- `~/ananke-tools/cluster_power_recovery.sh`
- `~/ananke-tools/cluster_power_console.sh`
- `~/ananke-tools/bootstrap/recovery-config.env`
- `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
- `~/ananke-tools/kubeconfig`
- `~/ananke-cluster-power`
- `~/bin/ananke-cluster-power`
- `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at:
- `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage:
- `ssh titan-db`
- `~/ananke-cluster-power status`
- `~/ananke-cluster-power prepare --execute`
- `~/ananke-cluster-power shutdown --execute`
- `~/ananke-cluster-power startup --execute --force-flux-branch main`
- `ssh tethys`
- `~/ananke-cluster-power status`
- `~/ananke-cluster-power prepare --execute`
- `~/ananke-cluster-power shutdown --execute`
- `~/ananke-cluster-power startup --execute --force-flux-branch main`
## Useful options
- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
- `--skip-helper-prewarm`
- `--min-startup-battery 35`
- `--ups-host pyrphoros@localhost`
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--flux-ready-timeout 1200`
- `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
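These options compose. For example, a more conservative startup after a long outage might look like this (values illustrative, not prescriptive):
- `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery --min-startup-battery 50 --startup-stability-window 300`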
## Controlled drill checklist (recommended)
- Operator host: use `titan-db` as the canonical control host for the drill.
- On-site coordination:
- Have on-site operator ready before shutdown starts.
- Confirm they will manually power cluster nodes back on after shutdown completes.
- Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
- `mkdir -p ~/ananke-logs`
- `~/ananke-cluster-power status` and verify:
- `ups_host=pyrphoros@localhost`
- `ups_battery` is numeric
- `flux_source_ready=True`
- Warm helper image just before shutdown:
- `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs:
- `tmux new -s ananke-drill`
- `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement:
- `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup:
- `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check:
- `~/ananke-cluster-power status`
- Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.
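A quick way to spot pull/crash churn during the post-check (plain kubectl):
- `kubectl get pods -A --no-headers | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull' || echo "no crashloop/pull failures"`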
## Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
- `host-poweroff` schedules host poweroff after service stop.
- `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from a normal drain to `--force`, and then to `--disable-eviction`, once the configured timeout is exhausted.
- Startup fails fast if Flux source URL/branch drift from expected values (unless branch override is explicitly requested with `--force-flux-branch`).
- Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if Flux source is not `Ready`, local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
- Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
- Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
- Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
- all non-optional Flux kustomizations must be `Ready=True`
- external service checklist must pass (defaults include Gitea, Grafana, Harbor)
- generated ingress reachability checks must pass (default accepted codes: `200,301,302,307,308,401,403,404`)
- stability soak must pass with no crashloop/pull-failure churn
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying reconcile.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
- prewarming the helper image across all nodes
- streaming the Harbor bootstrap bundle to `titan-05`
- importing Harbor runtime images into host `containerd`
- successfully running a Harbor-backed canary pod (`harbor-canary-ok`)
- After bootstrap, Flux resources are resumed and reconciled.
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.
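If the resume/reconcile step ever has to be done by hand, the manual equivalent is plain Flux CLI (Ananke normally handles this itself after bootstrap):
- `flux resume kustomization --all`
- `flux resume helmrelease --all -n <namespace>` (repeat per namespace that was suspended)
- `flux reconcile kustomization flux-system --with-source`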

View File

@ -0,0 +1,9 @@
# Harbor cold-start bootstrap images.
registry.bstein.dev/infra/harbor-core:v2.14.1-arm64
registry.bstein.dev/infra/harbor-jobservice:v2.14.1-arm64
registry.bstein.dev/infra/harbor-portal:v2.14.1-arm64
registry.bstein.dev/infra/harbor-registry:v2.14.1-arm64
registry.bstein.dev/infra/harbor-registryctl:v2.14.1-arm64
registry.bstein.dev/infra/harbor-redis:v2.14.1-arm64
registry.bstein.dev/infra/harbor-nginx:v2.14.1-arm64
registry.bstein.dev/infra/harbor-prepare:v2.14.1-arm64

View File

@ -0,0 +1,36 @@
CANONICAL_CONTROL_HOST="titan-db"
DEFAULT_FLUX_BRANCH="main"
EXPECTED_FLUX_URL="ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git"
SHUTDOWN_MODE="host-poweroff"
STATE_SUBDIR=".local/share/ananke"
HARBOR_BUNDLE_BASENAME="harbor-bootstrap-v2.14.1-arm64.tar.zst"
HARBOR_TARGET_NODE=""
HARBOR_CANARY_NODE=""
HARBOR_HOST_LABEL_KEY="ananke.bstein.dev/harbor-bootstrap"
HARBOR_CANARY_IMAGE="registry.bstein.dev/bstein/kubectl:1.35.0"
NODE_HELPER_IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
NODE_HELPER_NAMESPACE="maintenance"
NODE_HELPER_SERVICE_ACCOUNT="default"
REGISTRY_PULL_SECRET="harbor-regcred"
BUNDLE_HTTP_PORT="8877"
UPS_HOST="pyrphoros@localhost"
UPS_BATTERY_KEY="battery.charge"
FLUX_READY_TIMEOUT_SECONDS="1200"
FLUX_READY_POLL_SECONDS="10"
STARTUP_CHECKLIST_TIMEOUT_SECONDS="900"
STARTUP_CHECKLIST_POLL_SECONDS="10"
STARTUP_WORKLOAD_TIMEOUT_SECONDS="900"
STARTUP_WORKLOAD_POLL_SECONDS="10"
STARTUP_STABILITY_WINDOW_SECONDS="180"
STARTUP_STABILITY_TIMEOUT_SECONDS="900"
STARTUP_STABILITY_POLL_SECONDS="10"
STARTUP_OPTIONAL_KUSTOMIZATIONS=""
STARTUP_IGNORE_PODS_REGEX=""
STARTUP_IGNORE_WORKLOADS_REGEX=""
STARTUP_WORKLOAD_NAMESPACE_EXCLUDES_REGEX="^(kube-system|kube-public|kube-node-lease|flux-system)$"
STARTUP_SERVICE_CHECK_TIMEOUT_SECONDS="10"
STARTUP_INCLUDE_INGRESS_CHECKS="1"
STARTUP_INGRESS_ALLOWED_STATUSES="200,301,302,307,308,401,403,404"
STARTUP_IGNORE_INGRESS_HOSTS_REGEX=""
STARTUP_INGRESS_CHECK_TIMEOUT_SECONDS="10"
STARTUP_SERVICE_CHECKLIST='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||;grafana|https://metrics.bstein.dev/api/health|200|"database":"ok"||;harbor|https://registry.bstein.dev/v2/|200,401|||'

View File

@ -0,0 +1,56 @@
#!/usr/bin/env bash
set -euo pipefail
IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
DOCKER_CONFIG_PATH=""
PLATFORMS="linux/amd64,linux/arm64"
BUILDER_NAME="ananke-node-helper-builder"
while [[ $# -gt 0 ]]; do
case "$1" in
--image)
IMAGE="${2:?missing image}"
shift 2
;;
--docker-config)
DOCKER_CONFIG_PATH="${2:?missing docker config path}"
shift 2
;;
--platforms)
PLATFORMS="${2:?missing platforms}"
shift 2
;;
--builder)
BUILDER_NAME="${2:?missing builder}"
shift 2
;;
-h|--help)
cat <<USAGE
Usage: scripts/build_ananke_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>]
USAGE
exit 0
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ -n "${DOCKER_CONFIG_PATH}" ]]; then
export DOCKER_CONFIG="${DOCKER_CONFIG_PATH}"
fi
# Reuse the named buildx builder if it already exists; otherwise create a docker-container builder.
if ! docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
docker buildx create --name "${BUILDER_NAME}" --driver docker-container --use >/dev/null
else
docker buildx use "${BUILDER_NAME}" >/dev/null
fi
docker buildx inspect --bootstrap >/dev/null
docker buildx build \
--platform "${PLATFORMS}" \
-f dockerfiles/Dockerfile.ananke-node-helper \
-t "${IMAGE}" \
--push \
.

View File

@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail
IMAGES_FILE="scripts/bootstrap/harbor-bootstrap-images.txt"
BUNDLE_FILE="artifacts/harbor-bootstrap-v2.14.1-arm64.tar.zst"
DOCKER_CONFIG_PATH=""
PLATFORM="linux/arm64"
while [[ $# -gt 0 ]]; do
case "$1" in
--images-file)
IMAGES_FILE="${2:?missing images file}"
shift 2
;;
--bundle-file)
BUNDLE_FILE="${2:?missing bundle file}"
shift 2
;;
--docker-config)
DOCKER_CONFIG_PATH="${2:?missing docker config path}"
shift 2
;;
--platform)
PLATFORM="${2:?missing platform}"
shift 2
;;
-h|--help)
cat <<USAGE
Usage: scripts/build_harbor_bootstrap_bundle.sh [--images-file <path>] [--bundle-file <path>] [--docker-config <path>] [--platform <linux/arm64>]
USAGE
exit 0
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ -n "${DOCKER_CONFIG_PATH}" ]]; then
export DOCKER_CONFIG="${DOCKER_CONFIG_PATH}"
fi
# Read the image list, dropping comment and blank lines.
mapfile -t IMAGES < <(grep -v '^[[:space:]]*#' "${IMAGES_FILE}" | sed '/^[[:space:]]*$/d')
if [[ ${#IMAGES[@]} -eq 0 ]]; then
echo "No images found in ${IMAGES_FILE}" >&2
exit 1
fi
mkdir -p "$(dirname "${BUNDLE_FILE}")"
for image in "${IMAGES[@]}"; do
echo "Pulling ${image}" >&2
docker pull --platform "${PLATFORM}" "${image}" >/dev/null
done
docker save "${IMAGES[@]}" | zstd -T0 -19 -o "${BUNDLE_FILE}"
echo "Wrote ${BUNDLE_FILE}" >&2

View File

@ -0,0 +1,82 @@
#!/usr/bin/env bash
set -euo pipefail
usage() {
cat <<USAGE
Usage:
scripts/cluster_power_console.sh [--repo-dir <path>] [--delegate-host <host>] [--allow-local] <prepare|status|shutdown|startup> [recovery-script-options...]
Purpose:
Friendly manual entrypoint for running Ananke from a remote console.
The canonical control host is titan-db by default so bundle/state handling stays in one place.
Defaults:
--repo-dir \$HOME/Development/ananke (fallback: \$HOME/Development/titan-iac)
--delegate-host titan-db
Examples:
scripts/cluster_power_console.sh status
scripts/cluster_power_console.sh prepare --execute
scripts/cluster_power_console.sh shutdown --execute
scripts/cluster_power_console.sh startup --execute --force-flux-branch main
USAGE
}
if [[ -d "${HOME}/Development/ananke" ]]; then
REPO_DIR="${HOME}/Development/ananke"
else
REPO_DIR="${HOME}/Development/titan-iac"
fi
DELEGATE_HOST="titan-db"
ALLOW_LOCAL=0
REMOTE_REPO_DIR="${ANANKE_REMOTE_REPO_DIR:-}"
while [[ $# -gt 0 ]]; do
case "$1" in
--repo-dir)
REPO_DIR="${2:-}"
shift 2
;;
--delegate-host)
DELEGATE_HOST="${2:-}"
shift 2
;;
--allow-local)
ALLOW_LOCAL=1
shift
;;
-h|--help)
usage
exit 0
;;
*)
break
;;
esac
done
if [[ $# -lt 1 ]]; then
usage
exit 1
fi
LOCAL_SCRIPT="${REPO_DIR}/scripts/cluster_power_recovery.sh"
CURRENT_HOST="$(hostname -s 2>/dev/null || hostname)"
# Run locally only when this host has the script and kubectl and is the delegate host (or --allow-local was given).
if [[ -x "${LOCAL_SCRIPT}" ]] && command -v kubectl >/dev/null 2>&1; then
if [[ "${ALLOW_LOCAL}" -eq 1 || "${CURRENT_HOST}" == "${DELEGATE_HOST}" ]]; then
exec "${LOCAL_SCRIPT}" "$@"
fi
fi
if [[ -z "${DELEGATE_HOST}" ]]; then
echo "cluster-power-console: no delegate host configured" >&2
exit 1
fi
quoted_args="$(printf '%q ' "$@")"
remote_prefix=""
if [[ -n "${REMOTE_REPO_DIR}" ]]; then
remote_prefix="ANANKE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") "
fi
exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/ananke-tools/cluster_power_recovery.sh ${quoted_args}"

scripts/cluster_power_recovery.sh — new executable file, 1840 changed lines

File diff suppressed because it is too large

View File

@ -423,16 +423,17 @@ ARIADNE_SCHEDULE_LAST_ERROR_RANGE_HOURS = (
"(time() - max_over_time(ariadne_schedule_last_error_timestamp_seconds[$__range])) / 3600"
)
ARIADNE_ACCESS_REQUESTS = "ariadne_access_requests_total"
ARIADNE_CI_COVERAGE = 'ariadne_ci_coverage_percent{repo="ariadne"}'
ARIADNE_CI_TESTS = 'ariadne_ci_tests_total{repo="ariadne"}'
ARIADNE_TEST_SUCCESS_RATE = (
TEST_REPO_SELECTOR = 'repo=~"ariadne|metis"'
TEST_CI_COVERAGE = f'ariadne_ci_coverage_percent{{{TEST_REPO_SELECTOR}}}'
TEST_CI_TESTS = f'ariadne_ci_tests_total{{{TEST_REPO_SELECTOR}}}'
TEST_SUCCESS_RATE = (
"100 * "
'sum(max_over_time(ariadne_ci_tests_total{repo="ariadne",result="passed"}[30d])) '
f'sum(max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result="passed"}}[30d])) '
"/ clamp_min("
'sum(max_over_time(ariadne_ci_tests_total{repo="ariadne",result=~"passed|failed|error"}[30d])), 1)'
f'sum(max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result=~"passed|failed|error"}}[30d])), 1)'
)
ARIADNE_TEST_FAILURES_24H = (
'sum by (result) (max_over_time(ariadne_ci_tests_total{repo="ariadne",result=~"failed|error"}[24h]))'
TEST_FAILURES_24H = (
f'sum by (result) (max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result=~"failed|error"}}[24h]))'
)
POSTGRES_CONN_USED = (
'label_replace(sum(pg_stat_activity_count), "conn", "used", "__name__", ".*") '
@ -1294,48 +1295,53 @@ def build_overview():
},
}
)
panels.append(
timeseries_panel(
42,
"Ariadne Test Success Rate",
ARIADNE_TEST_SUCCESS_RATE,
{"h": 6, "w": 6, "x": 12, "y": 14},
unit="percent",
max_value=100,
legend=None,
legend_display="list",
)
test_success = timeseries_panel(
42,
"Platform Test Success Rate",
TEST_SUCCESS_RATE,
{"h": 6, "w": 6, "x": 12, "y": 14},
unit="percent",
max_value=100,
legend=None,
legend_display="list",
)
panels.append(
bargauge_panel(
43,
"Tests with Failures (24h)",
ARIADNE_TEST_FAILURES_24H,
{"h": 6, "w": 6, "x": 18, "y": 14},
unit="none",
instant=True,
legend="{{result}}",
overrides=[
{
"matcher": {"id": "byName", "options": "error"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "yellow"}}],
},
{
"matcher": {"id": "byName", "options": "failed"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}],
},
],
thresholds={
"mode": "absolute",
"steps": [
{"color": "green", "value": None},
{"color": "yellow", "value": 1},
{"color": "orange", "value": 5},
{"color": "red", "value": 10},
],
test_success["description"] = (
"Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. "
"Add new test series there first so they roll up here."
)
panels.append(test_success)
test_failures = bargauge_panel(
43,
"Platform Tests with Failures (24h)",
TEST_FAILURES_24H,
{"h": 6, "w": 6, "x": 18, "y": 14},
unit="none",
instant=True,
legend="{{result}}",
overrides=[
{
"matcher": {"id": "byName", "options": "error"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "yellow"}}],
},
)
{
"matcher": {"id": "byName", "options": "failed"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}],
},
],
thresholds={
"mode": "absolute",
"steps": [
{"color": "green", "value": None},
{"color": "yellow", "value": 1},
{"color": "orange", "value": 5},
{"color": "red", "value": 10},
],
},
)
test_failures["description"] = (
"This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
)
panels.append(test_failures)
cpu_scope = "$namespace_scope_cpu"
gpu_scope = "$namespace_scope_gpu"
@ -2653,29 +2659,31 @@ def build_jobs_dashboard():
legend="{{status}}",
)
)
panels.append(
stat_panel(
17,
"Ariadne CI Coverage (%)",
ARIADNE_CI_COVERAGE,
{"h": 6, "w": 4, "x": 8, "y": 11},
unit="percent",
decimals=1,
instant=True,
legend="{{branch}}",
)
coverage_panel = stat_panel(
17,
"Platform CI Coverage (%)",
TEST_CI_COVERAGE,
{"h": 6, "w": 4, "x": 8, "y": 11},
unit="percent",
decimals=1,
instant=True,
legend="{{branch}}",
)
panels.append(
table_panel(
18,
"Ariadne CI Tests (latest)",
ARIADNE_CI_TESTS,
{"h": 6, "w": 12, "x": 12, "y": 11},
unit="none",
transformations=[{"id": "labelsToFields", "options": {}}, {"id": "sortBy", "options": {"fields": ["Value"], "order": "desc"}}],
instant=True,
)
coverage_panel["description"] = "Internal source panel for Atlas Overview automation test rollups."
panels.append(coverage_panel)
tests_panel = table_panel(
18,
"Platform CI Tests (latest)",
TEST_CI_TESTS,
{"h": 6, "w": 12, "x": 12, "y": 11},
unit="none",
transformations=[{"id": "labelsToFields", "options": {}}, {"id": "sortBy", "options": {"fields": ["Value"], "order": "desc"}}],
instant=True,
)
tests_panel["description"] = (
"Atlas Overview test panels depend on these internal repo-tagged CI series."
)
panels.append(tests_panel)
return {
"uid": "atlas-jobs",

View File

@ -437,8 +437,7 @@ spec:
- $patch: replace
- name: VAULT_ENV_FILE
value: /vault/secrets/harbor-jobservice-env.sh
envFrom:
- $patch: replace
envFrom: []
- configMapRef:
name: harbor-jobservice-env
volumeMounts:

View File

@ -167,6 +167,58 @@ data:
}
}
}
pipelineJob('metis') {
properties {
pipelineTriggers {
triggers {
scmTrigger {
scmpoll_spec('H/5 * * * *')
ignorePostCommitHooks(false)
}
}
}
}
definition {
cpsScm {
scm {
git {
remote {
url('https://scm.bstein.dev/bstein/metis.git')
credentials('gitea-pat')
}
branches('*/master')
}
}
scriptPath('Jenkinsfile')
}
}
}
pipelineJob('atlasbot') {
properties {
pipelineTriggers {

View File

@ -302,11 +302,11 @@ spec:
- name: ARIADNE_SCHEDULE_FIREFLY_CRON
value: "0 3 * * *"
- name: ARIADNE_SCHEDULE_POD_CLEANER
value: "0 * * * *"
value: "*/30 * * * *"
- name: ARIADNE_SCHEDULE_OPENSEARCH_PRUNE
value: "23 3 * * *"
- name: ARIADNE_SCHEDULE_IMAGE_SWEEPER
value: "30 4 * * *"
value: "0 */4 * * *"
- name: ARIADNE_SCHEDULE_VAULT_K8S_AUTH
value: "*/15 * * * *"
- name: ARIADNE_SCHEDULE_VAULT_OIDC
@ -320,9 +320,9 @@ spec:
- name: ARIADNE_SCHEDULE_COMMS_SEED_ROOM
value: "*/10 * * * *"
- name: ARIADNE_SCHEDULE_CLUSTER_STATE
value: "*/15 * * * *"
value: "*/10 * * * *"
- name: ARIADNE_CLUSTER_STATE_KEEP
value: "168"
value: "720"
- name: WELCOME_EMAIL_ENABLED
value: "true"
- name: K8S_API_TIMEOUT_SEC
@ -339,6 +339,12 @@ spec:
value: "1099511627776"
- name: OPENSEARCH_INDEX_PATTERNS
value: kube-*,journald-*,trace-analytics-*
- name: METIS_BASE_URL
value: http://metis.maintenance.svc.cluster.local
- name: METIS_TIMEOUT_SEC
value: "15"
- name: ARIADNE_SCHEDULE_METIS_SENTINEL_WATCH
value: "*/30 * * * *"
- name: METRICS_PATH
value: "/metrics"
resources:

View File

@ -24,6 +24,52 @@ spec:
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: metis
namespace: maintenance
spec:
image: registry.bstein.dev/bstein/metis
interval: 1m0s
secretRef:
name: harbor-regcred
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: metis
namespace: maintenance
spec:
imageRepositoryRef:
name: metis
policy:
semver:
range: ">=0.1.0-0"
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: metis-sentinel
namespace: maintenance
spec:
image: registry.bstein.dev/bstein/metis-sentinel
interval: 1m0s
secretRef:
name: harbor-regcred
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: metis-sentinel
namespace: maintenance
spec:
imageRepositoryRef:
name: metis-sentinel
policy:
semver:
range: ">=0.1.0-0"
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: soteria
namespace: maintenance

View File

@ -6,32 +6,47 @@ resources:
- image.yaml
- secretproviderclass.yaml
- soteria-configmap.yaml
- metis-configmap.yaml
- metis-data-pvc.yaml
- vault-serviceaccount.yaml
- vault-sync-deployment.yaml
- ariadne-serviceaccount.yaml
- ariadne-rbac.yaml
- disable-k3s-traefik-serviceaccount.yaml
- k3s-traefik-cleanup-rbac.yaml
- metis-serviceaccount.yaml
- metis-rbac.yaml
- metis-token-sync-serviceaccount.yaml
- metis-token-sync-rbac.yaml
- node-nofile-serviceaccount.yaml
- pod-cleaner-rbac.yaml
- soteria-serviceaccount.yaml
- soteria-rbac.yaml
- ariadne-deployment.yaml
- metis-deployment.yaml
- oneoffs/ariadne-migrate-job.yaml
- ariadne-service.yaml
- soteria-deployment.yaml
- disable-k3s-traefik-daemonset.yaml
- oneoffs/k3s-traefik-cleanup-job.yaml
- node-nofile-daemonset.yaml
- metis-sentinel-daemonset.yaml
- metis-k3s-token-sync-cronjob.yaml
- k3s-agent-restart-daemonset.yaml
- pod-cleaner-cronjob.yaml
- node-image-sweeper-serviceaccount.yaml
- node-image-sweeper-daemonset.yaml
- image-sweeper-cronjob.yaml
- metis-service.yaml
- metis-ingress.yaml
- soteria-service.yaml
images:
- name: registry.bstein.dev/bstein/ariadne
newTag: 0.1.0-22 # {"$imagepolicy": "maintenance:ariadne:tag"}
- name: registry.bstein.dev/bstein/metis
newTag: 0.1.0-0 # {"$imagepolicy": "maintenance:metis:tag"}
- name: registry.bstein.dev/bstein/metis-sentinel
newTag: 0.1.0-0 # {"$imagepolicy": "maintenance:metis-sentinel:tag"}
- name: registry.bstein.dev/bstein/soteria
newTag: 0.1.0-11 # {"$imagepolicy": "maintenance:soteria:tag"}
configMapGenerator:

View File

@ -0,0 +1,20 @@
# services/maintenance/metis-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: metis
namespace: maintenance
data:
METIS_BIND_ADDR: :8080
METIS_INVENTORY_PATH: /app/inventory.titan-rpi4.yaml
METIS_DATA_DIR: /var/lib/metis
METIS_DEFAULT_FLASH_HOST: titan-22
METIS_FLASH_HOSTS: titan-22
METIS_LOCAL_HOST: titan-22
METIS_ALLOWED_GROUPS: admin,maintainer
METIS_MAX_DEVICE_BYTES: "300000000000"
METIS_SENTINEL_PUSH_URL: http://metis.maintenance.svc.cluster.local/internal/sentinel/snapshot
METIS_SENTINEL_INTERVAL_SEC: "1800"
METIS_SENTINEL_NSENTER: "1"
METIS_IMAGE_RPI4_ARMBIAN_LONGHORN: https://armbian.chi.auroradev.org/dl/rpi4b/archive/Armbian_26.2.1_Rpi4b_noble_current_6.18.9_minimal.img.xz
METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256: sha256:c450687adf4cc6a59725c43aefd58baf42ec71bdd379227d403cdde281768e46

View File

@ -0,0 +1,13 @@
# services/maintenance/metis-data-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: metis-data
namespace: maintenance
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 40Gi
storageClassName: local-path

View File

@ -0,0 +1,47 @@
# services/maintenance/metis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: metis
namespace: maintenance
spec:
replicas: 1
revisionHistoryLimit: 3
selector:
matchLabels:
app: metis
template:
metadata:
labels:
app: metis
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: metis
nodeSelector:
kubernetes.io/hostname: titan-22
kubernetes.io/arch: amd64
node-role.kubernetes.io/worker: "true"
containers:
- name: metis
image: registry.bstein.dev/bstein/metis:latest
imagePullPolicy: Always
envFrom:
- configMapRef:
name: metis
ports:
- name: http
containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]

View File

@ -0,0 +1,27 @@
# services/maintenance/metis-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: metis
namespace: maintenance
annotations:
kubernetes.io/ingress.class: traefik
cert-manager.io/cluster-issuer: letsencrypt
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
traefik.ingress.kubernetes.io/router.middlewares: sso-oauth2-proxy-forward-auth@kubernetescrd
spec:
tls:
- hosts: ["metis.bstein.dev"]
secretName: metis-tls
rules:
- host: metis.bstein.dev
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: metis
port:
number: 80

View File

@ -0,0 +1,51 @@
# services/maintenance/metis-k3s-token-sync-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: metis-k3s-token-sync
namespace: maintenance
spec:
schedule: "11 */6 * * *"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 2
jobTemplate:
spec:
template:
spec:
serviceAccountName: metis-token-sync
restartPolicy: OnFailure
nodeSelector:
kubernetes.io/arch: arm64
node-role.kubernetes.io/control-plane: "true"
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: sync
image: registry.bstein.dev/bstein/kubectl:1.35.0
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
args:
- |
set -euo pipefail
token="$(tr -d '\n' < /host/var/lib/rancher/k3s/server/node-token)"
kubectl -n maintenance create secret generic metis-runtime \
--from-literal=k3s_token="${token}" \
--dry-run=client -o yaml | kubectl apply -f -
securityContext:
runAsUser: 0
volumeMounts:
- name: k3s-server
mountPath: /host/var/lib/rancher/k3s/server
readOnly: true
volumes:
- name: k3s-server
hostPath:
path: /var/lib/rancher/k3s/server

View File

@ -0,0 +1,27 @@
# services/maintenance/metis-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: metis-node-manager
rules:
- apiGroups: [""]
resources:
- nodes
verbs:
- get
- list
- watch
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: metis-node-manager
subjects:
- kind: ServiceAccount
name: metis
namespace: maintenance
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: metis-node-manager

View File

@ -0,0 +1,133 @@
# services/maintenance/metis-sentinel-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: metis-sentinel
namespace: maintenance
spec:
selector:
matchLabels:
app: metis-sentinel
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app: metis-sentinel
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: metis
nodeSelector:
kubernetes.io/os: linux
node-role.kubernetes.io/worker: "true"
containers:
- name: metis-sentinel
image: registry.bstein.dev/bstein/metis-sentinel:latest
imagePullPolicy: Always
command:
- /bin/sh
- -c
args:
- |
set -eu
out_dir="${METIS_SENTINEL_OUT:-/var/run/metis-sentinel}"
interval="${METIS_SENTINEL_INTERVAL_SEC:-120}"
mkdir -p "${out_dir}"
while true; do
ts="$(date -u +%Y%m%dT%H%M%SZ)"
node="${METIS_SENTINEL_NODE:-unknown}"
tmp="${out_dir}/${node}-${ts}.json.tmp"
out="${out_dir}/${node}-${ts}.json"
if metis-sentinel > "${tmp}"; then
mv "${tmp}" "${out}"
else
rm -f "${tmp}" || true
fi
sleep "${interval}"
done
envFrom:
- configMapRef:
name: metis
env:
- name: METIS_SENTINEL_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
ports:
- name: http
containerPort: 8080
volumeMounts:
- name: sentinel-output
mountPath: /var/run/metis-sentinel
resources:
requests:
cpu: 25m
memory: 64Mi
limits:
cpu: 250m
memory: 256Mi
securityContext:
allowPrivilegeEscalation: false
runAsUser: 0
capabilities:
drop: ["ALL"]
- name: sentinel-pusher
image: curlimages/curl:8.12.1
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
args:
- |
set -eu
out_dir="${METIS_SENTINEL_OUT:-/var/run/metis-sentinel}"
push_url="${METIS_SENTINEL_PUSH_URL:-}"
interval="${METIS_SENTINEL_PUSH_INTERVAL_SEC:-120}"
timeout="${METIS_SENTINEL_PUSH_TIMEOUT_SEC:-10}"
mkdir -p "${out_dir}"
while true; do
for snapshot in "${out_dir}"/*.json; do
[ -f "${snapshot}" ] || continue
if [ -z "${push_url}" ]; then
break
fi
if curl -fsS --connect-timeout "${timeout}" --max-time "${timeout}" \
-X POST \
-H "Content-Type: application/json" \
-H "X-Metis-Node: ${METIS_SENTINEL_NODE:-unknown}" \
--data-binary "@${snapshot}" \
"${push_url}"; then
rm -f "${snapshot}"
fi
done
sleep "${interval}"
done
envFrom:
- configMapRef:
name: metis
env:
- name: METIS_SENTINEL_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: sentinel-output
mountPath: /var/run/metis-sentinel
resources:
requests:
cpu: 10m
memory: 32Mi
limits:
cpu: 100m
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
runAsUser: 0
capabilities:
drop: ["ALL"]
volumes:
- name: sentinel-output
emptyDir: {}

View File

@ -0,0 +1,18 @@
# services/maintenance/metis-service.yaml
apiVersion: v1
kind: Service
metadata:
name: metis
namespace: maintenance
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "80"
prometheus.io/path: "/metrics"
spec:
type: ClusterIP
selector:
app: metis
ports:
- name: http
port: 80
targetPort: http

View File

@ -0,0 +1,6 @@
# services/maintenance/metis-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: metis
namespace: maintenance

View File

@ -0,0 +1,30 @@
# services/maintenance/metis-token-sync-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: metis-token-sync
namespace: maintenance
rules:
- apiGroups: [""]
resources:
- secrets
verbs:
- get
- list
- create
- update
- patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: metis-token-sync
namespace: maintenance
subjects:
- kind: ServiceAccount
name: metis-token-sync
namespace: maintenance
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: metis-token-sync

View File

@ -0,0 +1,6 @@
# services/maintenance/metis-token-sync-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: metis-token-sync
namespace: maintenance

View File

@ -1125,7 +1125,7 @@
{
"id": 17,
"type": "stat",
"title": "Ariadne CI Coverage (%)",
"title": "Platform CI Coverage (%)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1138,7 +1138,7 @@
},
"targets": [
{
"expr": "ariadne_ci_coverage_percent{repo=\"ariadne\"}",
"expr": "ariadne_ci_coverage_percent{repo=~\"ariadne|metis\"}",
"refId": "A",
"legendFormat": "{{branch}}",
"instant": true
@ -1183,12 +1183,13 @@
"values": false
},
"textMode": "value"
}
},
"description": "Internal source panel for Atlas Overview automation test rollups."
},
{
"id": 18,
"type": "table",
"title": "Ariadne CI Tests (latest)",
"title": "Platform CI Tests (latest)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1201,7 +1202,7 @@
},
"targets": [
{
"expr": "ariadne_ci_tests_total{repo=\"ariadne\"}",
"expr": "ariadne_ci_tests_total{repo=~\"ariadne|metis\"}",
"refId": "A",
"instant": true
}
@ -1233,7 +1234,8 @@
"order": "desc"
}
}
]
],
"description": "Atlas Overview test panels depend on these internal repo-tagged CI series."
}
],
"time": {

View File

@ -1677,7 +1677,7 @@
{
"id": 42,
"type": "timeseries",
"title": "Ariadne Test Success Rate",
"title": "Platform Test Success Rate",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1690,7 +1690,7 @@
},
"targets": [
{
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"passed|failed|error\"}[30d])), 1)",
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"passed|failed|error\"}[30d])), 1)",
"refId": "A"
}
],
@ -1709,12 +1709,13 @@
"tooltip": {
"mode": "multi"
}
}
},
"description": "Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. Add new test series there first so they roll up here."
},
{
"id": 43,
"type": "bargauge",
"title": "Tests with Failures (24h)",
"title": "Platform Tests with Failures (24h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1727,7 +1728,7 @@
},
"targets": [
{
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"failed|error\"}[24h])))",
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"failed|error\"}[24h])))",
"refId": "A",
"legendFormat": "{{result}}",
"instant": true
@ -1814,7 +1815,8 @@
"order": "desc"
}
}
]
],
"description": "This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
},
{
"id": 11,

View File

@ -49,7 +49,7 @@ data:
interval: 1m
rules:
- uid: disk-pressure-root
title: "Node rootfs high (>80%)"
title: "Node rootfs high (>85%)"
condition: C
for: "10m"
data:
@ -83,7 +83,7 @@ data:
type: threshold
conditions:
- evaluator:
params: [80]
params: [85]
type: gt
operator:
type: and
@ -93,7 +93,7 @@ data:
noDataState: NoData
execErrState: Error
annotations:
summary: "{{ $labels.node }} rootfs >80% for 10m"
summary: "{{ $labels.node }} rootfs >85% for 10m"
labels:
severity: warning
- uid: disk-growth-1h

View File

@ -1134,7 +1134,7 @@ data:
{
"id": 17,
"type": "stat",
"title": "Ariadne CI Coverage (%)",
"title": "Platform CI Coverage (%)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1147,7 +1147,7 @@ data:
},
"targets": [
{
"expr": "ariadne_ci_coverage_percent{repo=\"ariadne\"}",
"expr": "ariadne_ci_coverage_percent{repo=~\"ariadne|metis\"}",
"refId": "A",
"legendFormat": "{{branch}}",
"instant": true
@ -1192,12 +1192,13 @@ data:
"values": false
},
"textMode": "value"
}
},
"description": "Internal source panel for Atlas Overview automation test rollups."
},
{
"id": 18,
"type": "table",
"title": "Ariadne CI Tests (latest)",
"title": "Platform CI Tests (latest)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1210,7 +1211,7 @@ data:
},
"targets": [
{
"expr": "ariadne_ci_tests_total{repo=\"ariadne\"}",
"expr": "ariadne_ci_tests_total{repo=~\"ariadne|metis\"}",
"refId": "A",
"instant": true
}
@ -1242,7 +1243,8 @@ data:
"order": "desc"
}
}
]
],
"description": "Atlas Overview test panels depend on these internal repo-tagged CI series."
}
],
"time": {

View File

@ -1686,7 +1686,7 @@ data:
{
"id": 42,
"type": "timeseries",
"title": "Ariadne Test Success Rate",
"title": "Platform Test Success Rate",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1699,7 +1699,7 @@ data:
},
"targets": [
{
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"passed|failed|error\"}[30d])), 1)",
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"passed|failed|error\"}[30d])), 1)",
"refId": "A"
}
],
@ -1718,12 +1718,13 @@ data:
"tooltip": {
"mode": "multi"
}
}
},
"description": "Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. Add new test series there first so they roll up here."
},
{
"id": 43,
"type": "bargauge",
"title": "Tests with Failures (24h)",
"title": "Platform Tests with Failures (24h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
@ -1736,7 +1737,7 @@ data:
},
"targets": [
{
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"failed|error\"}[24h])))",
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"failed|error\"}[24h])))",
"refId": "A",
"legendFormat": "{{result}}",
"instant": true
@ -1823,7 +1824,8 @@ data:
"order": "desc"
}
}
]
],
"description": "This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
},
{
"id": 11,

View File

@ -286,7 +286,7 @@ spec:
podAnnotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "monitoring"
monitoring.bstein.dev/restart-rev: "5"
monitoring.bstein.dev/restart-rev: "6"
vault.hashicorp.com/agent-inject-secret-grafana-env.sh: "kv/data/atlas/monitoring/grafana-admin"
vault.hashicorp.com/agent-inject-template-grafana-env.sh: |
{{ with secret "kv/data/atlas/monitoring/grafana-admin" }}