Compare commits: df5ba74ab7...cc316c472b (10 commits)

Commits: cc316c472b, c1dc50cace, 65de56b2ac, 31f5709929, e4a074f53e, b56222f40b, 30c677e6ed, 6c3c1342cd, 7b43043838, af74172b2d

README.md (79 lines changed)
@@ -1,3 +1,80 @@
 # titan-iac

-Flux-managed Kubernetes cluster for bstein.dev services.
+Flux-managed Kubernetes cluster config for bstein.dev.

Canonical repo URL:

- `ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`

## Why `ananke`

`Ananke` is inevitability and constraint. That is exactly what this tooling is for:

- power events happen
- recovery windows are finite
- bootstrap has to be deterministic

The point is not clever automation. The point is boring, repeatable recovery.

## Power Domains

Two UPS domains matter during shutdown/startup drills:

- `Statera`: `titan-23`, `titan-24`, `titan-jh`
- `Pyrphoros`: all other nodes

Default UPS checks in Ananke read from `Pyrphoros` (`pyrphoros@localhost`) unless overridden.
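For reference, the default UPS telemetry can be spot-checked by hand with the standard NUT client tools; a minimal sketch, assuming `upsc` is installed on the control host:

```bash
# Quick manual check of the telemetry Ananke reads by default (NUT's upsc client).
upsc pyrphoros@localhost battery.charge   # numeric charge percentage
upsc pyrphoros@localhost ups.status       # e.g. "OL" on line power, "OB" on battery
```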
## Breakglass

If primary operator access is lost, breakglass is on the remote Magic Mirror.

## Ananke Commands

Ananke is the recovery orchestrator. The Flux desired-state source remains `titan-iac.git`.

Use `titan-db` as the canonical control host. `tethys` (`titan-24`) is the backup operator host.

From `titan-db`:

```bash
~/ananke-cluster-power status
~/ananke-cluster-power prepare --execute
~/ananke-cluster-power shutdown --execute --require-ups-battery
~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery
```

From `tethys` / `titan-24` (delegating to `titan-db`):

```bash
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db status
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db prepare --execute
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db shutdown --execute --require-ups-battery
~/ananke-tools/cluster_power_console.sh --delegate-host titan-db startup --execute --force-flux-branch main --require-ups-battery
```

## Shutdown Modes

`cluster_power_recovery.sh` supports two shutdown behaviors:

- `--shutdown-mode host-poweroff` (default): graceful cluster shutdown plus scheduled host poweroff.
- `--shutdown-mode cluster-only`: graceful cluster shutdown without host poweroff (stops `k3s` / `k3s-agent` only).

## Startup Completion Rules

Ananke startup is not “done” just because Flux says green once.

Startup now completes only after:

- Flux source drift checks pass (expected URL and branch)
- all non-optional Flux kustomizations report `Ready=True`
- the external service checklist passes (default includes Gitea, Grafana, Harbor)
- generated ingress reachability checks pass (default accepted statuses: `200,301,302,307,308,401,403,404`; see the sketch below)
- a stability soak window passes with no `CrashLoopBackOff` / image-pull failures and the checklist still healthy
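The generated ingress checks boil down to HTTP status probes against each published host; a minimal sketch, with an illustrative host list (the real list is derived from the cluster's Ingress objects by the recovery script):

```bash
# Illustrative ingress reachability probe; hosts shown here are examples only.
allowed="200|301|302|307|308|401|403|404"
for host in scm.bstein.dev metrics.bstein.dev registry.bstein.dev; do
  code="$(curl -ks -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/")"
  if echo "${code}" | grep -Eq "^(${allowed})$"; then
    echo "ok   ${host} -> ${code}"
  else
    echo "FAIL ${host} -> ${code}"
  fi
done
```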
If you intentionally need to correct Flux source during recovery, use:

- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`

`--force-flux-url` is breakglass-only and requires `--allow-flux-source-mutation`.

The defaults live in:

- `scripts/bootstrap/recovery-config.env`

Detailed runbook:

- `knowledge/runbooks/cluster-power-recovery.md`
@@ -9,7 +9,7 @@ metadata:
 spec:
   interval: 1m0s
   ref:
-    branch: feature/atlasbot
+    branch: main
   secretRef:
     name: flux-system-gitea
   url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git
dockerfiles/Dockerfile.ananke-node-helper (new file, 12 lines)
@@ -0,0 +1,12 @@
FROM debian:bookworm-slim

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        bash \
        ca-certificates \
        curl \
        util-linux \
        zstd \
    && rm -rf /var/lib/apt/lists/*

CMD ["/bin/sh"]
knowledge/runbooks/cluster-power-recovery.md (new file, 152 lines)
@@ -0,0 +1,152 @@
Atlas Cluster Power Recovery (Graceful Shutdown/Startup)

Purpose
- Provide a safe operator flow for planned power events and cold-boot recovery.
- Avoid the Flux/Gitea bootstrap deadlock by using a local bootstrap fallback path.
- Break the Harbor self-hosting deadlock by seeding Harbor runtime images from a control-host bundle.
- Refuse bootstrap when UPS charge is too low, and fall back to fast shutdown if a second outage hits mid-recovery.

Bootstrapping risk to remember
- Flux source is Git over SSH to `scm.bstein.dev` (Gitea).
- Gitea itself is a Flux-managed workload and depends on storage + database.
- Harbor is also critical, but it is not part of the first recovery stage because Harbor serves its own runtime images.
- On cold boot, if Flux cannot fetch source before Gitea is up, reconciliation can stall.
- Recovery path: bring control plane and workers up, then locally apply the minimal platform stack (`core -> helm -> longhorn -> metallb -> traefik -> vault-csi -> vault-injector -> vault -> postgres -> gitea`), then seed Harbor images onto the Harbor node from a control-host bundle, then resume/reconcile Flux. Harbor is a later recovery stage after storage, Vault, Postgres, and Gitea are back.
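Conceptually, the staged local apply above reduces to a loop like this sketch; the `~/ananke-repo/infrastructure/<stage>` layout is an assumption for illustration, and the real logic (with readiness waits and error handling between stages) lives in `scripts/cluster_power_recovery.sh`:

```bash
# Illustrative only: apply the minimal platform stack in dependency order from
# the local repo snapshot, without waiting on the Gitea-hosted Flux source.
repo="${HOME}/ananke-repo"
for stage in core helm longhorn metallb traefik vault-csi vault-injector vault postgres gitea; do
  echo "bootstrapping stage: ${stage}"
  kubectl apply -k "${repo}/infrastructure/${stage}"   # path layout is an assumption
done
```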
Script
- `scripts/cluster_power_recovery.sh`
- `scripts/cluster_power_console.sh`
- Modes:
  - `prepare`
  - `shutdown`
  - `harbor-seed`
  - `startup`
  - `status`
- Default is dry-run. Add `--execute` to actually perform actions.

Dry-run examples
- Shutdown preview:
  - `scripts/cluster_power_recovery.sh shutdown --skip-etcd-snapshot --skip-drain`
- Startup preview:
  - `scripts/cluster_power_recovery.sh startup`
- Harbor seed preview:
  - `scripts/cluster_power_recovery.sh harbor-seed`

Execute examples
- Prepare helper image on every node:
  - `scripts/cluster_power_recovery.sh prepare --execute`
- Seed Harbor runtime images onto `titan-05` from the control-host bundle:
  - `scripts/cluster_power_recovery.sh harbor-seed --execute`
- Planned shutdown:
  - `scripts/cluster_power_recovery.sh shutdown --execute`
- Planned startup (canonical branch):
  - `scripts/cluster_power_recovery.sh startup --execute --force-flux-branch main`

Manual remote console examples
- Canonical operator hosts:
  - `titan-db`
  - `tethys` (`titan-24`)
- Both hosts now have:
  - `~/ananke-tools/cluster_power_recovery.sh`
  - `~/ananke-tools/cluster_power_console.sh`
  - `~/ananke-tools/bootstrap/recovery-config.env`
  - `~/ananke-tools/bootstrap/harbor-bootstrap-images.txt`
  - `~/ananke-tools/kubeconfig`
  - `~/ananke-cluster-power`
  - `~/bin/ananke-cluster-power`
  - `~/ananke-repo/{infrastructure,services,scripts}`
- Both hosts also keep the Harbor bootstrap bundle at:
  - `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
- Remote usage:
  - `ssh titan-db`
    - `~/ananke-cluster-power status`
    - `~/ananke-cluster-power prepare --execute`
    - `~/ananke-cluster-power shutdown --execute`
    - `~/ananke-cluster-power startup --execute --force-flux-branch main`
  - `ssh tethys`
    - `~/ananke-cluster-power status`
    - `~/ananke-cluster-power prepare --execute`
    - `~/ananke-cluster-power shutdown --execute`
    - `~/ananke-cluster-power startup --execute --force-flux-branch main`

Useful options
- `--shutdown-mode host-poweroff|cluster-only`
- `--expected-flux-branch main`
- `--expected-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-url ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git`
- `--force-flux-branch main`
- `--allow-flux-source-mutation` (required with `--force-flux-url`; breakglass only)
- `--skip-local-bootstrap` (not recommended for cold-start recovery)
- `--skip-harbor-bootstrap` (skip the Harbor recovery stage if you know Harbor should stay deferred)
- `--skip-harbor-seed` (skip bundle import if Harbor images are already cached on the target node)
- `--skip-helper-prewarm`
- `--min-startup-battery 35`
- `--ups-host pyrphoros@localhost`
- `--require-ups-battery`
- `--drain-timeout 180`
- `--emergency-drain-timeout 45`
- `--flux-ready-timeout 1200`
- `--startup-checklist-timeout 900`
- `--startup-stability-window 180`
- `--startup-stability-timeout 900`
- `--recovery-state-file ~/.local/share/ananke/cluster_power_recovery.state`
- `--harbor-bundle-file ~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`
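The UPS gate behind `--min-startup-battery` and `--require-ups-battery` amounts to a comparison like the following sketch; variable names are illustrative, not the script's internals:

```bash
# Illustrative battery gate: refuse to continue bootstrap when charge is low
# or when telemetry is unreadable and --require-ups-battery was requested.
ups_host="pyrphoros@localhost"
min_charge=35
charge="$(upsc "${ups_host}" battery.charge 2>/dev/null || echo "")"
if ! [[ "${charge}" =~ ^[0-9]+$ ]]; then
  echo "UPS telemetry unavailable; refusing startup (--require-ups-battery)" >&2
  exit 1
fi
if (( charge < min_charge )); then
  echo "UPS charge ${charge}% below ${min_charge}%; aborting startup" >&2
  exit 1
fi
```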
Controlled drill checklist (recommended)
- Operator host: use `titan-db` as the canonical control host for the drill.
- On-site coordination:
  - Have the on-site operator ready before shutdown starts.
  - Confirm they will manually power cluster nodes back on after shutdown completes.
  - Confirm who will announce "all nodes powered on" to resume startup.
- Preflight on `titan-db`:
  - `mkdir -p ~/ananke-logs`
  - `~/ananke-cluster-power status` and verify:
    - `ups_host=pyrphoros@localhost`
    - `ups_battery` is numeric
    - `flux_source_ready=True`
- Warm the helper image just before shutdown:
  - `~/ananke-cluster-power prepare --execute`
- Run in a persistent shell and capture logs:
  - `tmux new -s ananke-drill`
  - `script -q -a ~/ananke-logs/ananke-drill-$(date +%Y%m%d-%H%M%S).log`
- Execute controlled shutdown with telemetry enforcement:
  - `~/ananke-cluster-power shutdown --execute --require-ups-battery`
- After on-site power-on confirmation, execute startup:
  - `~/ananke-cluster-power startup --execute --force-flux-branch main --require-ups-battery`
- Post-check:
  - `~/ananke-cluster-power status`
  - Verify critical services (`longhorn`, `vault`, `postgres`, `gitea`, `harbor`, `pegasus`) and no widespread pull/crash failures.

Operational notes
- The flow suspends Flux Kustomizations/HelmReleases during shutdown to prevent churn.
- Shutdown behavior is explicit:
  - `host-poweroff` schedules host poweroff after service stop.
  - `cluster-only` stops `k3s`/`k3s-agent` without powering hosts off.
- Worker drain is no longer best-effort only. The script now escalates from a normal drain, to `--force`, to `--disable-eviction` once the configured timeout is exhausted (a rough sketch follows this runbook excerpt).
- Startup fails fast if the Flux source URL/branch drift from expected values (unless a branch override is explicitly requested with `--force-flux-branch`).
- The Flux desired-state source remains `titan-iac.git`. Ananke orchestrates runtime recovery and should not be used as the normal Flux source repo.
- During startup, if the Flux source is not `Ready`, the local bootstrap fallback is applied first using the repo snapshot under `~/ananke-repo`.
- Longhorn is reconciled before Vault/Postgres/Gitea so storage-backed services are not racing the volume layer.
- Harbor is reconciled after the first critical stateful services.
- Harbor bootstrap is now designed around a control-host bundle:
  - Build the Harbor bundle locally with `scripts/build_harbor_bootstrap_bundle.sh`.
  - Stage it on the operator host at `~/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst`.
  - Use `harbor-seed --execute` or a full `startup --execute` to stream/import that bundle onto `titan-05`.
- The Harbor bundle remains arm64-only because Harbor is pinned to arm64 nodes. The node-helper image is multi-arch because Ananke uses it across both arm64 and amd64 nodes during prepare/shutdown operations.
- Ananke uses a temporary privileged helper pod for host-side operations. The helper image is prewarmed with `prepare --execute` so later shutdown/startup steps do not stall on image pulls.
- The script persists outage state in `~/.local/share/ananke/cluster_power_recovery.state` by default. If startup is attempted during an outage window and power becomes unstable again, rerunning startup with insufficient UPS charge will flip into the emergency shutdown path instead of continuing to bootstrap.
- Startup completion is strict now:
  - all non-optional Flux kustomizations must be `Ready=True`
  - the external service checklist must pass (defaults include Gitea, Grafana, Harbor)
  - generated ingress reachability checks must pass (default accepted codes: `200,301,302,307,308,401,403,404`)
  - the stability soak must pass with no crashloop/pull-failure churn
- If Flux hits immutable one-off Job drift during reconcile, Ananke now attempts self-heal by pruning failed Flux-managed Jobs and retrying the reconcile.
- In dry-run mode, the script now skips the live API wait step so preview runs do not stall on an offline cluster.
- Dry-run mode no longer mutates outage recovery state.
- `harbor-seed --execute` was validated by:
  - prewarming the helper image across all nodes
  - streaming the Harbor bootstrap bundle to `titan-05`
  - importing Harbor runtime images into host `containerd`
  - successfully running a Harbor-backed canary pod (`harbor-canary-ok`)
- After bootstrap, Flux resources are resumed and reconciled.
- Keep this runbook aligned with `clusters/atlas/flux-system/gotk-sync.yaml`.
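The drain escalation mentioned in the operational notes maps to roughly this kubectl sequence; the node name and per-stage timeouts are illustrative, and the real logic is driven by `--drain-timeout` inside `cluster_power_recovery.sh`:

```bash
# Illustrative drain escalation for one node: normal drain, then --force,
# then --disable-eviction as a last resort once the timeout is exhausted.
node="titan-07"                              # example node name only
timeout=180
common="--ignore-daemonsets --delete-emptydir-data"
kubectl drain "${node}" ${common} --timeout="${timeout}s" \
  || kubectl drain "${node}" ${common} --force --timeout="${timeout}s" \
  || kubectl drain "${node}" ${common} --force --disable-eviction --timeout=60s
```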
scripts/bootstrap/harbor-bootstrap-images.txt (new file, 9 lines)
@@ -0,0 +1,9 @@
# Harbor cold-start bootstrap images.
registry.bstein.dev/infra/harbor-core:v2.14.1-arm64
registry.bstein.dev/infra/harbor-jobservice:v2.14.1-arm64
registry.bstein.dev/infra/harbor-portal:v2.14.1-arm64
registry.bstein.dev/infra/harbor-registry:v2.14.1-arm64
registry.bstein.dev/infra/harbor-registryctl:v2.14.1-arm64
registry.bstein.dev/infra/harbor-redis:v2.14.1-arm64
registry.bstein.dev/infra/harbor-nginx:v2.14.1-arm64
registry.bstein.dev/infra/harbor-prepare:v2.14.1-arm64
scripts/bootstrap/recovery-config.env (new file, 36 lines)
@@ -0,0 +1,36 @@
CANONICAL_CONTROL_HOST="titan-db"
DEFAULT_FLUX_BRANCH="main"
EXPECTED_FLUX_URL="ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git"
SHUTDOWN_MODE="host-poweroff"
STATE_SUBDIR=".local/share/ananke"
HARBOR_BUNDLE_BASENAME="harbor-bootstrap-v2.14.1-arm64.tar.zst"
HARBOR_TARGET_NODE=""
HARBOR_CANARY_NODE=""
HARBOR_HOST_LABEL_KEY="ananke.bstein.dev/harbor-bootstrap"
HARBOR_CANARY_IMAGE="registry.bstein.dev/bstein/kubectl:1.35.0"
NODE_HELPER_IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
NODE_HELPER_NAMESPACE="maintenance"
NODE_HELPER_SERVICE_ACCOUNT="default"
REGISTRY_PULL_SECRET="harbor-regcred"
BUNDLE_HTTP_PORT="8877"
UPS_HOST="pyrphoros@localhost"
UPS_BATTERY_KEY="battery.charge"
FLUX_READY_TIMEOUT_SECONDS="1200"
FLUX_READY_POLL_SECONDS="10"
STARTUP_CHECKLIST_TIMEOUT_SECONDS="900"
STARTUP_CHECKLIST_POLL_SECONDS="10"
STARTUP_WORKLOAD_TIMEOUT_SECONDS="900"
STARTUP_WORKLOAD_POLL_SECONDS="10"
STARTUP_STABILITY_WINDOW_SECONDS="180"
STARTUP_STABILITY_TIMEOUT_SECONDS="900"
STARTUP_STABILITY_POLL_SECONDS="10"
STARTUP_OPTIONAL_KUSTOMIZATIONS=""
STARTUP_IGNORE_PODS_REGEX=""
STARTUP_IGNORE_WORKLOADS_REGEX=""
STARTUP_WORKLOAD_NAMESPACE_EXCLUDES_REGEX="^(kube-system|kube-public|kube-node-lease|flux-system)$"
STARTUP_SERVICE_CHECK_TIMEOUT_SECONDS="10"
STARTUP_INCLUDE_INGRESS_CHECKS="1"
STARTUP_INGRESS_ALLOWED_STATUSES="200,301,302,307,308,401,403,404"
STARTUP_IGNORE_INGRESS_HOSTS_REGEX=""
STARTUP_INGRESS_CHECK_TIMEOUT_SECONDS="10"
STARTUP_SERVICE_CHECKLIST='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||;grafana|https://metrics.bstein.dev/api/health|200|"database":"ok"||;harbor|https://registry.bstein.dev/v2/|200,401|||'
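`STARTUP_SERVICE_CHECKLIST` packs one service per `;`-separated entry, with `|`-separated fields (name, URL, accepted statuses, required body substring, then trailing fields not spelled out in this diff). A rough reading of one entry, as a sketch only:

```bash
# Illustrative parse and probe of a single checklist entry; the trailing
# fields are ignored here because their semantics are not documented above.
entry='gitea|https://scm.bstein.dev/api/healthz|200|"status":"pass"||'
IFS='|' read -r name url statuses substring _rest <<< "${entry}"
code="$(curl -ks -o /tmp/checklist-body -w '%{http_code}' --max-time 10 "${url}")"
if echo ",${statuses}," | grep -q ",${code}," && grep -qF "${substring}" /tmp/checklist-body; then
  echo "${name}: healthy (${code})"
else
  echo "${name}: unhealthy (${code})"
fi
```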
scripts/build_ananke_node_helper.sh (new executable file, 56 lines)
@@ -0,0 +1,56 @@
#!/usr/bin/env bash
set -euo pipefail

IMAGE="registry.bstein.dev/bstein/ananke-node-helper:0.1.0"
DOCKER_CONFIG_PATH=""
PLATFORMS="linux/amd64,linux/arm64"
BUILDER_NAME="ananke-node-helper-builder"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --image)
      IMAGE="${2:?missing image}"
      shift 2
      ;;
    --docker-config)
      DOCKER_CONFIG_PATH="${2:?missing docker config path}"
      shift 2
      ;;
    --platforms)
      PLATFORMS="${2:?missing platforms}"
      shift 2
      ;;
    --builder)
      BUILDER_NAME="${2:?missing builder}"
      shift 2
      ;;
    -h|--help)
      cat <<USAGE
Usage: scripts/build_ananke_node_helper.sh [--image <image>] [--docker-config <path>] [--platforms <csv>] [--builder <name>]
USAGE
      exit 0
      ;;
    *)
      echo "Unknown option: $1" >&2
      exit 1
      ;;
  esac
done

if [[ -n "${DOCKER_CONFIG_PATH}" ]]; then
  export DOCKER_CONFIG="${DOCKER_CONFIG_PATH}"
fi

if ! docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
  docker buildx create --name "${BUILDER_NAME}" --driver docker-container --use >/dev/null
else
  docker buildx use "${BUILDER_NAME}" >/dev/null
fi

docker buildx inspect --bootstrap >/dev/null
docker buildx build \
  --platform "${PLATFORMS}" \
  -f dockerfiles/Dockerfile.ananke-node-helper \
  -t "${IMAGE}" \
  --push \
  .
scripts/build_harbor_bootstrap_bundle.sh (new executable file, 58 lines)
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail

IMAGES_FILE="scripts/bootstrap/harbor-bootstrap-images.txt"
BUNDLE_FILE="artifacts/harbor-bootstrap-v2.14.1-arm64.tar.zst"
DOCKER_CONFIG_PATH=""
PLATFORM="linux/arm64"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --images-file)
      IMAGES_FILE="${2:?missing images file}"
      shift 2
      ;;
    --bundle-file)
      BUNDLE_FILE="${2:?missing bundle file}"
      shift 2
      ;;
    --docker-config)
      DOCKER_CONFIG_PATH="${2:?missing docker config path}"
      shift 2
      ;;
    --platform)
      PLATFORM="${2:?missing platform}"
      shift 2
      ;;
    -h|--help)
      cat <<USAGE
Usage: scripts/build_harbor_bootstrap_bundle.sh [--images-file <path>] [--bundle-file <path>] [--docker-config <path>] [--platform <linux/arm64>]
USAGE
      exit 0
      ;;
    *)
      echo "Unknown option: $1" >&2
      exit 1
      ;;
  esac
done

if [[ -n "${DOCKER_CONFIG_PATH}" ]]; then
  export DOCKER_CONFIG="${DOCKER_CONFIG_PATH}"
fi

mapfile -t IMAGES < <(grep -v '^[[:space:]]*#' "${IMAGES_FILE}" | sed '/^[[:space:]]*$/d')
if [[ ${#IMAGES[@]} -eq 0 ]]; then
  echo "No images found in ${IMAGES_FILE}" >&2
  exit 1
fi

mkdir -p "$(dirname "${BUNDLE_FILE}")"
for image in "${IMAGES[@]}"; do
  echo "Pulling ${image}" >&2
  docker pull --platform "${PLATFORM}" "${image}" >/dev/null
done

docker save "${IMAGES[@]}" | zstd -T0 -19 -o "${BUNDLE_FILE}"
echo "Wrote ${BUNDLE_FILE}" >&2
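On the Harbor node, the bundle produced above ultimately has to land in the host `containerd` image store. A minimal sketch of that import, assuming `zstd` and a `ctr` binary are available on the node (on k3s hosts this may be `k3s ctr`; the real flow streams the bundle through the Ananke helper pod rather than copying it by hand):

```bash
# Illustrative import of the Harbor bootstrap bundle into containerd.
# "k8s.io" is the containerd namespace the kubelet pulls images from.
bundle="${HOME}/.local/share/ananke/bundles/harbor-bootstrap-v2.14.1-arm64.tar.zst"
zstd -dc "${bundle}" | ctr -n k8s.io images import -
```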
scripts/cluster_power_console.sh (new executable file, 82 lines)
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<USAGE
Usage:
  scripts/cluster_power_console.sh [--repo-dir <path>] [--delegate-host <host>] [--allow-local] <prepare|status|shutdown|startup> [recovery-script-options...]

Purpose:
  Friendly manual entrypoint for running Ananke from a remote console.
  The canonical control host is titan-db by default so bundle/state handling stays in one place.

Defaults:
  --repo-dir \$HOME/Development/ananke (fallback: \$HOME/Development/titan-iac)
  --delegate-host titan-db

Examples:
  scripts/cluster_power_console.sh status
  scripts/cluster_power_console.sh prepare --execute
  scripts/cluster_power_console.sh shutdown --execute
  scripts/cluster_power_console.sh startup --execute --force-flux-branch main
USAGE
}

if [[ -d "${HOME}/Development/ananke" ]]; then
  REPO_DIR="${HOME}/Development/ananke"
else
  REPO_DIR="${HOME}/Development/titan-iac"
fi
DELEGATE_HOST="titan-db"
ALLOW_LOCAL=0
REMOTE_REPO_DIR="${ANANKE_REMOTE_REPO_DIR:-}"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --repo-dir)
      REPO_DIR="${2:-}"
      shift 2
      ;;
    --delegate-host)
      DELEGATE_HOST="${2:-}"
      shift 2
      ;;
    --allow-local)
      ALLOW_LOCAL=1
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      break
      ;;
  esac
done

if [[ $# -lt 1 ]]; then
  usage
  exit 1
fi

LOCAL_SCRIPT="${REPO_DIR}/scripts/cluster_power_recovery.sh"
CURRENT_HOST="$(hostname -s 2>/dev/null || hostname)"

if [[ -x "${LOCAL_SCRIPT}" ]] && command -v kubectl >/dev/null 2>&1; then
  if [[ "${ALLOW_LOCAL}" -eq 1 || "${CURRENT_HOST}" == "${DELEGATE_HOST}" ]]; then
    exec "${LOCAL_SCRIPT}" "$@"
  fi
fi

if [[ -z "${DELEGATE_HOST}" ]]; then
  echo "cluster-power-console: no delegate host configured" >&2
  exit 1
fi

quoted_args="$(printf '%q ' "$@")"
remote_prefix=""
if [[ -n "${REMOTE_REPO_DIR}" ]]; then
  remote_prefix="ANANKE_REPO_DIR=$(printf '%q' "${REMOTE_REPO_DIR}") "
fi
exec ssh -o BatchMode=yes -o ConnectTimeout=8 "${DELEGATE_HOST}" "${remote_prefix}~/ananke-tools/cluster_power_recovery.sh ${quoted_args}"
scripts/cluster_power_recovery.sh (new executable file, 1840 lines)
File diff suppressed because it is too large.
@@ -423,16 +423,17 @@ ARIADNE_SCHEDULE_LAST_ERROR_RANGE_HOURS = (
     "(time() - max_over_time(ariadne_schedule_last_error_timestamp_seconds[$__range])) / 3600"
 )
 ARIADNE_ACCESS_REQUESTS = "ariadne_access_requests_total"
-ARIADNE_CI_COVERAGE = 'ariadne_ci_coverage_percent{repo="ariadne"}'
-ARIADNE_CI_TESTS = 'ariadne_ci_tests_total{repo="ariadne"}'
-ARIADNE_TEST_SUCCESS_RATE = (
+TEST_REPO_SELECTOR = 'repo=~"ariadne|metis"'
+TEST_CI_COVERAGE = f'ariadne_ci_coverage_percent{{{TEST_REPO_SELECTOR}}}'
+TEST_CI_TESTS = f'ariadne_ci_tests_total{{{TEST_REPO_SELECTOR}}}'
+TEST_SUCCESS_RATE = (
     "100 * "
-    'sum(max_over_time(ariadne_ci_tests_total{repo="ariadne",result="passed"}[30d])) '
+    f'sum(max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result="passed"}}[30d])) '
     "/ clamp_min("
-    'sum(max_over_time(ariadne_ci_tests_total{repo="ariadne",result=~"passed|failed|error"}[30d])), 1)'
+    f'sum(max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result=~"passed|failed|error"}}[30d])), 1)'
 )
-ARIADNE_TEST_FAILURES_24H = (
-    'sum by (result) (max_over_time(ariadne_ci_tests_total{repo="ariadne",result=~"failed|error"}[24h]))'
+TEST_FAILURES_24H = (
+    f'sum by (result) (max_over_time(ariadne_ci_tests_total{{{TEST_REPO_SELECTOR},result=~"failed|error"}}[24h]))'
 )
 POSTGRES_CONN_USED = (
     'label_replace(sum(pg_stat_activity_count), "conn", "used", "__name__", ".*") '
@ -1294,48 +1295,53 @@ def build_overview():
|
||||
},
|
||||
}
|
||||
)
|
||||
panels.append(
|
||||
timeseries_panel(
|
||||
42,
|
||||
"Ariadne Test Success Rate",
|
||||
ARIADNE_TEST_SUCCESS_RATE,
|
||||
{"h": 6, "w": 6, "x": 12, "y": 14},
|
||||
unit="percent",
|
||||
max_value=100,
|
||||
legend=None,
|
||||
legend_display="list",
|
||||
)
|
||||
test_success = timeseries_panel(
|
||||
42,
|
||||
"Platform Test Success Rate",
|
||||
TEST_SUCCESS_RATE,
|
||||
{"h": 6, "w": 6, "x": 12, "y": 14},
|
||||
unit="percent",
|
||||
max_value=100,
|
||||
legend=None,
|
||||
legend_display="list",
|
||||
)
|
||||
panels.append(
|
||||
bargauge_panel(
|
||||
43,
|
||||
"Tests with Failures (24h)",
|
||||
ARIADNE_TEST_FAILURES_24H,
|
||||
{"h": 6, "w": 6, "x": 18, "y": 14},
|
||||
unit="none",
|
||||
instant=True,
|
||||
legend="{{result}}",
|
||||
overrides=[
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "error"},
|
||||
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "yellow"}}],
|
||||
},
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "failed"},
|
||||
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}],
|
||||
},
|
||||
],
|
||||
thresholds={
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"color": "green", "value": None},
|
||||
{"color": "yellow", "value": 1},
|
||||
{"color": "orange", "value": 5},
|
||||
{"color": "red", "value": 10},
|
||||
],
|
||||
test_success["description"] = (
|
||||
"Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. "
|
||||
"Add new test series there first so they roll up here."
|
||||
)
|
||||
panels.append(test_success)
|
||||
test_failures = bargauge_panel(
|
||||
43,
|
||||
"Platform Tests with Failures (24h)",
|
||||
TEST_FAILURES_24H,
|
||||
{"h": 6, "w": 6, "x": 18, "y": 14},
|
||||
unit="none",
|
||||
instant=True,
|
||||
legend="{{result}}",
|
||||
overrides=[
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "error"},
|
||||
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "yellow"}}],
|
||||
},
|
||||
)
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "failed"},
|
||||
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}],
|
||||
},
|
||||
],
|
||||
thresholds={
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"color": "green", "value": None},
|
||||
{"color": "yellow", "value": 1},
|
||||
{"color": "orange", "value": 5},
|
||||
{"color": "red", "value": 10},
|
||||
],
|
||||
},
|
||||
)
|
||||
test_failures["description"] = (
|
||||
"This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
|
||||
)
|
||||
panels.append(test_failures)
|
||||
|
||||
cpu_scope = "$namespace_scope_cpu"
|
||||
gpu_scope = "$namespace_scope_gpu"
|
||||
@ -2653,29 +2659,31 @@ def build_jobs_dashboard():
|
||||
legend="{{status}}",
|
||||
)
|
||||
)
|
||||
panels.append(
|
||||
stat_panel(
|
||||
17,
|
||||
"Ariadne CI Coverage (%)",
|
||||
ARIADNE_CI_COVERAGE,
|
||||
{"h": 6, "w": 4, "x": 8, "y": 11},
|
||||
unit="percent",
|
||||
decimals=1,
|
||||
instant=True,
|
||||
legend="{{branch}}",
|
||||
)
|
||||
coverage_panel = stat_panel(
|
||||
17,
|
||||
"Platform CI Coverage (%)",
|
||||
TEST_CI_COVERAGE,
|
||||
{"h": 6, "w": 4, "x": 8, "y": 11},
|
||||
unit="percent",
|
||||
decimals=1,
|
||||
instant=True,
|
||||
legend="{{branch}}",
|
||||
)
|
||||
panels.append(
|
||||
table_panel(
|
||||
18,
|
||||
"Ariadne CI Tests (latest)",
|
||||
ARIADNE_CI_TESTS,
|
||||
{"h": 6, "w": 12, "x": 12, "y": 11},
|
||||
unit="none",
|
||||
transformations=[{"id": "labelsToFields", "options": {}}, {"id": "sortBy", "options": {"fields": ["Value"], "order": "desc"}}],
|
||||
instant=True,
|
||||
)
|
||||
coverage_panel["description"] = "Internal source panel for Atlas Overview automation test rollups."
|
||||
panels.append(coverage_panel)
|
||||
tests_panel = table_panel(
|
||||
18,
|
||||
"Platform CI Tests (latest)",
|
||||
TEST_CI_TESTS,
|
||||
{"h": 6, "w": 12, "x": 12, "y": 11},
|
||||
unit="none",
|
||||
transformations=[{"id": "labelsToFields", "options": {}}, {"id": "sortBy", "options": {"fields": ["Value"], "order": "desc"}}],
|
||||
instant=True,
|
||||
)
|
||||
tests_panel["description"] = (
|
||||
"Atlas Overview test panels depend on these internal repo-tagged CI series."
|
||||
)
|
||||
panels.append(tests_panel)
|
||||
|
||||
return {
|
||||
"uid": "atlas-jobs",
|
||||
|
||||
@@ -437,8 +437,7 @@ spec:
             - $patch: replace
             - name: VAULT_ENV_FILE
               value: /vault/secrets/harbor-jobservice-env.sh
-            envFrom:
-            - $patch: replace
+            envFrom: []
             - configMapRef:
                 name: harbor-jobservice-env
             volumeMounts:
@ -167,6 +167,58 @@ data:
|
||||
}
|
||||
}
|
||||
}
|
||||
pipelineJob('metis') {
|
||||
properties {
|
||||
pipelineTriggers {
|
||||
triggers {
|
||||
scmTrigger {
|
||||
scmpoll_spec('H/2 * * * *')
|
||||
ignorePostCommitHooks(false)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
definition {
|
||||
cpsScm {
|
||||
scm {
|
||||
git {
|
||||
remote {
|
||||
url('https://scm.bstein.dev/bstein/metis.git')
|
||||
credentials('gitea-pat')
|
||||
}
|
||||
branches('*/master')
|
||||
}
|
||||
}
|
||||
scriptPath('Jenkinsfile')
|
||||
}
|
||||
}
|
||||
}
|
||||
pipelineJob('metis') {
|
||||
properties {
|
||||
pipelineTriggers {
|
||||
triggers {
|
||||
scmTrigger {
|
||||
scmpoll_spec('H/5 * * * *')
|
||||
ignorePostCommitHooks(false)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
definition {
|
||||
cpsScm {
|
||||
scm {
|
||||
git {
|
||||
remote {
|
||||
url('https://scm.bstein.dev/bstein/metis.git')
|
||||
credentials('gitea-pat')
|
||||
}
|
||||
branches('*/master')
|
||||
}
|
||||
}
|
||||
scriptPath('Jenkinsfile')
|
||||
}
|
||||
}
|
||||
}
|
||||
pipelineJob('atlasbot') {
|
||||
properties {
|
||||
pipelineTriggers {
|
||||
|
||||
@@ -302,11 +302,11 @@ spec:
         - name: ARIADNE_SCHEDULE_FIREFLY_CRON
           value: "0 3 * * *"
         - name: ARIADNE_SCHEDULE_POD_CLEANER
-          value: "0 * * * *"
+          value: "*/30 * * * *"
         - name: ARIADNE_SCHEDULE_OPENSEARCH_PRUNE
           value: "23 3 * * *"
         - name: ARIADNE_SCHEDULE_IMAGE_SWEEPER
-          value: "30 4 * * *"
+          value: "0 */4 * * *"
         - name: ARIADNE_SCHEDULE_VAULT_K8S_AUTH
           value: "*/15 * * * *"
         - name: ARIADNE_SCHEDULE_VAULT_OIDC
@@ -320,9 +320,9 @@ spec:
         - name: ARIADNE_SCHEDULE_COMMS_SEED_ROOM
           value: "*/10 * * * *"
         - name: ARIADNE_SCHEDULE_CLUSTER_STATE
-          value: "*/15 * * * *"
+          value: "*/10 * * * *"
         - name: ARIADNE_CLUSTER_STATE_KEEP
-          value: "168"
+          value: "720"
         - name: WELCOME_EMAIL_ENABLED
           value: "true"
         - name: K8S_API_TIMEOUT_SEC
@@ -339,6 +339,12 @@ spec:
           value: "1099511627776"
         - name: OPENSEARCH_INDEX_PATTERNS
           value: kube-*,journald-*,trace-analytics-*
+        - name: METIS_BASE_URL
+          value: http://metis.maintenance.svc.cluster.local
+        - name: METIS_TIMEOUT_SEC
+          value: "15"
+        - name: ARIADNE_SCHEDULE_METIS_SENTINEL_WATCH
+          value: "*/30 * * * *"
         - name: METRICS_PATH
           value: "/metrics"
         resources:
@ -24,6 +24,52 @@ spec:
|
||||
---
|
||||
apiVersion: image.toolkit.fluxcd.io/v1beta2
|
||||
kind: ImageRepository
|
||||
metadata:
|
||||
name: metis
|
||||
namespace: maintenance
|
||||
spec:
|
||||
image: registry.bstein.dev/bstein/metis
|
||||
interval: 1m0s
|
||||
secretRef:
|
||||
name: harbor-regcred
|
||||
---
|
||||
apiVersion: image.toolkit.fluxcd.io/v1beta2
|
||||
kind: ImagePolicy
|
||||
metadata:
|
||||
name: metis
|
||||
namespace: maintenance
|
||||
spec:
|
||||
imageRepositoryRef:
|
||||
name: metis
|
||||
policy:
|
||||
semver:
|
||||
range: ">=0.1.0-0"
|
||||
---
|
||||
apiVersion: image.toolkit.fluxcd.io/v1beta2
|
||||
kind: ImageRepository
|
||||
metadata:
|
||||
name: metis-sentinel
|
||||
namespace: maintenance
|
||||
spec:
|
||||
image: registry.bstein.dev/bstein/metis-sentinel
|
||||
interval: 1m0s
|
||||
secretRef:
|
||||
name: harbor-regcred
|
||||
---
|
||||
apiVersion: image.toolkit.fluxcd.io/v1beta2
|
||||
kind: ImagePolicy
|
||||
metadata:
|
||||
name: metis-sentinel
|
||||
namespace: maintenance
|
||||
spec:
|
||||
imageRepositoryRef:
|
||||
name: metis-sentinel
|
||||
policy:
|
||||
semver:
|
||||
range: ">=0.1.0-0"
|
||||
---
|
||||
apiVersion: image.toolkit.fluxcd.io/v1beta2
|
||||
kind: ImageRepository
|
||||
metadata:
|
||||
name: soteria
|
||||
namespace: maintenance
|
||||
|
||||
@ -6,32 +6,47 @@ resources:
|
||||
- image.yaml
|
||||
- secretproviderclass.yaml
|
||||
- soteria-configmap.yaml
|
||||
- metis-configmap.yaml
|
||||
- metis-data-pvc.yaml
|
||||
- vault-serviceaccount.yaml
|
||||
- vault-sync-deployment.yaml
|
||||
- ariadne-serviceaccount.yaml
|
||||
- ariadne-rbac.yaml
|
||||
- disable-k3s-traefik-serviceaccount.yaml
|
||||
- k3s-traefik-cleanup-rbac.yaml
|
||||
- metis-serviceaccount.yaml
|
||||
- metis-rbac.yaml
|
||||
- metis-token-sync-serviceaccount.yaml
|
||||
- metis-token-sync-rbac.yaml
|
||||
- node-nofile-serviceaccount.yaml
|
||||
- pod-cleaner-rbac.yaml
|
||||
- soteria-serviceaccount.yaml
|
||||
- soteria-rbac.yaml
|
||||
- ariadne-deployment.yaml
|
||||
- metis-deployment.yaml
|
||||
- oneoffs/ariadne-migrate-job.yaml
|
||||
- ariadne-service.yaml
|
||||
- soteria-deployment.yaml
|
||||
- disable-k3s-traefik-daemonset.yaml
|
||||
- oneoffs/k3s-traefik-cleanup-job.yaml
|
||||
- node-nofile-daemonset.yaml
|
||||
- metis-sentinel-daemonset.yaml
|
||||
- metis-k3s-token-sync-cronjob.yaml
|
||||
- k3s-agent-restart-daemonset.yaml
|
||||
- pod-cleaner-cronjob.yaml
|
||||
- node-image-sweeper-serviceaccount.yaml
|
||||
- node-image-sweeper-daemonset.yaml
|
||||
- image-sweeper-cronjob.yaml
|
||||
- metis-service.yaml
|
||||
- metis-ingress.yaml
|
||||
- soteria-service.yaml
|
||||
images:
|
||||
- name: registry.bstein.dev/bstein/ariadne
|
||||
newTag: 0.1.0-22 # {"$imagepolicy": "maintenance:ariadne:tag"}
|
||||
- name: registry.bstein.dev/bstein/metis
|
||||
newTag: 0.1.0-0 # {"$imagepolicy": "maintenance:metis:tag"}
|
||||
- name: registry.bstein.dev/bstein/metis-sentinel
|
||||
newTag: 0.1.0-0 # {"$imagepolicy": "maintenance:metis-sentinel:tag"}
|
||||
- name: registry.bstein.dev/bstein/soteria
|
||||
newTag: 0.1.0-11 # {"$imagepolicy": "maintenance:soteria:tag"}
|
||||
configMapGenerator:
|
||||
|
||||
services/maintenance/metis-configmap.yaml (new file, 20 lines)
@@ -0,0 +1,20 @@
# services/maintenance/metis-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metis
  namespace: maintenance
data:
  METIS_BIND_ADDR: :8080
  METIS_INVENTORY_PATH: /app/inventory.titan-rpi4.yaml
  METIS_DATA_DIR: /var/lib/metis
  METIS_DEFAULT_FLASH_HOST: titan-22
  METIS_FLASH_HOSTS: titan-22
  METIS_LOCAL_HOST: titan-22
  METIS_ALLOWED_GROUPS: admin,maintainer
  METIS_MAX_DEVICE_BYTES: "300000000000"
  METIS_SENTINEL_PUSH_URL: http://metis.maintenance.svc.cluster.local/internal/sentinel/snapshot
  METIS_SENTINEL_INTERVAL_SEC: "1800"
  METIS_SENTINEL_NSENTER: "1"
  METIS_IMAGE_RPI4_ARMBIAN_LONGHORN: https://armbian.chi.auroradev.org/dl/rpi4b/archive/Armbian_26.2.1_Rpi4b_noble_current_6.18.9_minimal.img.xz
  METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256: sha256:c450687adf4cc6a59725c43aefd58baf42ec71bdd379227d403cdde281768e46
services/maintenance/metis-data-pvc.yaml (new file, 13 lines)
@@ -0,0 +1,13 @@
# services/maintenance/metis-data-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: metis-data
  namespace: maintenance
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi
  storageClassName: local-path
services/maintenance/metis-deployment.yaml (new file, 47 lines)
@@ -0,0 +1,47 @@
# services/maintenance/metis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metis
  namespace: maintenance
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: metis
  template:
    metadata:
      labels:
        app: metis
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: metis
      nodeSelector:
        kubernetes.io/hostname: titan-22
        kubernetes.io/arch: amd64
        node-role.kubernetes.io/worker: "true"
      containers:
        - name: metis
          image: registry.bstein.dev/bstein/metis:latest
          imagePullPolicy: Always
          envFrom:
            - configMapRef:
                name: metis
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
services/maintenance/metis-ingress.yaml (new file, 27 lines)
@@ -0,0 +1,27 @@
# services/maintenance/metis-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: metis
  namespace: maintenance
  annotations:
    kubernetes.io/ingress.class: traefik
    cert-manager.io/cluster-issuer: letsencrypt
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.middlewares: sso-oauth2-proxy-forward-auth@kubernetescrd
spec:
  tls:
    - hosts: ["metis.bstein.dev"]
      secretName: metis-tls
  rules:
    - host: metis.bstein.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: metis
                port:
                  number: 80
51
services/maintenance/metis-k3s-token-sync-cronjob.yaml
Normal file
51
services/maintenance/metis-k3s-token-sync-cronjob.yaml
Normal file
@ -0,0 +1,51 @@
|
||||
# services/maintenance/metis-k3s-token-sync-cronjob.yaml
|
||||
apiVersion: batch/v1
|
||||
kind: CronJob
|
||||
metadata:
|
||||
name: metis-k3s-token-sync
|
||||
namespace: maintenance
|
||||
spec:
|
||||
schedule: "11 */6 * * *"
|
||||
concurrencyPolicy: Forbid
|
||||
successfulJobsHistoryLimit: 1
|
||||
failedJobsHistoryLimit: 2
|
||||
jobTemplate:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
serviceAccountName: metis-token-sync
|
||||
restartPolicy: OnFailure
|
||||
nodeSelector:
|
||||
kubernetes.io/arch: arm64
|
||||
node-role.kubernetes.io/control-plane: "true"
|
||||
tolerations:
|
||||
- key: node-role.kubernetes.io/control-plane
|
||||
operator: Exists
|
||||
effect: NoSchedule
|
||||
- key: node-role.kubernetes.io/master
|
||||
operator: Exists
|
||||
effect: NoSchedule
|
||||
containers:
|
||||
- name: sync
|
||||
image: registry.bstein.dev/bstein/kubectl:1.35.0
|
||||
imagePullPolicy: IfNotPresent
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
args:
|
||||
- |
|
||||
set -euo pipefail
|
||||
token="$(tr -d '\n' < /host/var/lib/rancher/k3s/server/node-token)"
|
||||
kubectl -n maintenance create secret generic metis-runtime \
|
||||
--from-literal=k3s_token="${token}" \
|
||||
--dry-run=client -o yaml | kubectl apply -f -
|
||||
securityContext:
|
||||
runAsUser: 0
|
||||
volumeMounts:
|
||||
- name: k3s-server
|
||||
mountPath: /host/var/lib/rancher/k3s/server
|
||||
readOnly: true
|
||||
volumes:
|
||||
- name: k3s-server
|
||||
hostPath:
|
||||
path: /var/lib/rancher/k3s/server
|
||||
services/maintenance/metis-rbac.yaml (new file, 27 lines)
@@ -0,0 +1,27 @@
# services/maintenance/metis-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metis-node-manager
rules:
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metis-node-manager
subjects:
  - kind: ServiceAccount
    name: metis
    namespace: maintenance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metis-node-manager
133
services/maintenance/metis-sentinel-daemonset.yaml
Normal file
133
services/maintenance/metis-sentinel-daemonset.yaml
Normal file
@ -0,0 +1,133 @@
|
||||
# services/maintenance/metis-sentinel-daemonset.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: metis-sentinel
|
||||
namespace: maintenance
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
app: metis-sentinel
|
||||
updateStrategy:
|
||||
type: RollingUpdate
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: metis-sentinel
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
prometheus.io/path: "/metrics"
|
||||
spec:
|
||||
serviceAccountName: metis
|
||||
nodeSelector:
|
||||
kubernetes.io/os: linux
|
||||
node-role.kubernetes.io/worker: "true"
|
||||
containers:
|
||||
- name: metis-sentinel
|
||||
image: registry.bstein.dev/bstein/metis-sentinel:latest
|
||||
imagePullPolicy: Always
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
args:
|
||||
- |
|
||||
set -eu
|
||||
out_dir="${METIS_SENTINEL_OUT:-/var/run/metis-sentinel}"
|
||||
interval="${METIS_SENTINEL_INTERVAL_SEC:-120}"
|
||||
mkdir -p "${out_dir}"
|
||||
while true; do
|
||||
ts="$(date -u +%Y%m%dT%H%M%SZ)"
|
||||
node="${METIS_SENTINEL_NODE:-unknown}"
|
||||
tmp="${out_dir}/${node}-${ts}.json.tmp"
|
||||
out="${out_dir}/${node}-${ts}.json"
|
||||
if metis-sentinel > "${tmp}"; then
|
||||
mv "${tmp}" "${out}"
|
||||
else
|
||||
rm -f "${tmp}" || true
|
||||
fi
|
||||
sleep "${interval}"
|
||||
done
|
||||
envFrom:
|
||||
- configMapRef:
|
||||
name: metis
|
||||
env:
|
||||
- name: METIS_SENTINEL_NODE
|
||||
valueFrom:
|
||||
fieldRef:
|
||||
fieldPath: spec.nodeName
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 8080
|
||||
volumeMounts:
|
||||
- name: sentinel-output
|
||||
mountPath: /var/run/metis-sentinel
|
||||
resources:
|
||||
requests:
|
||||
cpu: 25m
|
||||
memory: 64Mi
|
||||
limits:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: false
|
||||
runAsUser: 0
|
||||
capabilities:
|
||||
drop: ["ALL"]
|
||||
- name: sentinel-pusher
|
||||
image: curlimages/curl:8.12.1
|
||||
imagePullPolicy: IfNotPresent
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
args:
|
||||
- |
|
||||
set -eu
|
||||
out_dir="${METIS_SENTINEL_OUT:-/var/run/metis-sentinel}"
|
||||
push_url="${METIS_SENTINEL_PUSH_URL:-}"
|
||||
interval="${METIS_SENTINEL_PUSH_INTERVAL_SEC:-120}"
|
||||
timeout="${METIS_SENTINEL_PUSH_TIMEOUT_SEC:-10}"
|
||||
mkdir -p "${out_dir}"
|
||||
while true; do
|
||||
for snapshot in "${out_dir}"/*.json; do
|
||||
[ -f "${snapshot}" ] || continue
|
||||
if [ -z "${push_url}" ]; then
|
||||
break
|
||||
fi
|
||||
if curl -fsS --connect-timeout "${timeout}" --max-time "${timeout}" \
|
||||
-X POST \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "X-Metis-Node: ${METIS_SENTINEL_NODE:-unknown}" \
|
||||
--data-binary "@${snapshot}" \
|
||||
"${push_url}"; then
|
||||
rm -f "${snapshot}"
|
||||
fi
|
||||
done
|
||||
sleep "${interval}"
|
||||
done
|
||||
envFrom:
|
||||
- configMapRef:
|
||||
name: metis
|
||||
env:
|
||||
- name: METIS_SENTINEL_NODE
|
||||
valueFrom:
|
||||
fieldRef:
|
||||
fieldPath: spec.nodeName
|
||||
volumeMounts:
|
||||
- name: sentinel-output
|
||||
mountPath: /var/run/metis-sentinel
|
||||
resources:
|
||||
requests:
|
||||
cpu: 10m
|
||||
memory: 32Mi
|
||||
limits:
|
||||
cpu: 100m
|
||||
memory: 128Mi
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: false
|
||||
runAsUser: 0
|
||||
capabilities:
|
||||
drop: ["ALL"]
|
||||
volumes:
|
||||
- name: sentinel-output
|
||||
emptyDir: {}
|
||||
services/maintenance/metis-service.yaml (new file, 18 lines)
@@ -0,0 +1,18 @@
# services/maintenance/metis-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: metis
  namespace: maintenance
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "80"
    prometheus.io/path: "/metrics"
spec:
  type: ClusterIP
  selector:
    app: metis
  ports:
    - name: http
      port: 80
      targetPort: http
services/maintenance/metis-serviceaccount.yaml (new file, 6 lines)
@@ -0,0 +1,6 @@
# services/maintenance/metis-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metis
  namespace: maintenance
services/maintenance/metis-token-sync-rbac.yaml (new file, 30 lines)
@@ -0,0 +1,30 @@
# services/maintenance/metis-token-sync-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metis-token-sync
  namespace: maintenance
rules:
  - apiGroups: [""]
    resources:
      - secrets
    verbs:
      - get
      - list
      - create
      - update
      - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metis-token-sync
  namespace: maintenance
subjects:
  - kind: ServiceAccount
    name: metis-token-sync
    namespace: maintenance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: metis-token-sync
@@ -0,0 +1,6 @@
# services/maintenance/metis-token-sync-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metis-token-sync
  namespace: maintenance
@ -1125,7 +1125,7 @@
|
||||
{
|
||||
"id": 17,
|
||||
"type": "stat",
|
||||
"title": "Ariadne CI Coverage (%)",
|
||||
"title": "Platform CI Coverage (%)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1138,7 +1138,7 @@
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "ariadne_ci_coverage_percent{repo=\"ariadne\"}",
|
||||
"expr": "ariadne_ci_coverage_percent{repo=~\"ariadne|metis\"}",
|
||||
"refId": "A",
|
||||
"legendFormat": "{{branch}}",
|
||||
"instant": true
|
||||
@ -1183,12 +1183,13 @@
|
||||
"values": false
|
||||
},
|
||||
"textMode": "value"
|
||||
}
|
||||
},
|
||||
"description": "Internal source panel for Atlas Overview automation test rollups."
|
||||
},
|
||||
{
|
||||
"id": 18,
|
||||
"type": "table",
|
||||
"title": "Ariadne CI Tests (latest)",
|
||||
"title": "Platform CI Tests (latest)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1201,7 +1202,7 @@
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "ariadne_ci_tests_total{repo=\"ariadne\"}",
|
||||
"expr": "ariadne_ci_tests_total{repo=~\"ariadne|metis\"}",
|
||||
"refId": "A",
|
||||
"instant": true
|
||||
}
|
||||
@ -1233,7 +1234,8 @@
|
||||
"order": "desc"
|
||||
}
|
||||
}
|
||||
]
|
||||
],
|
||||
"description": "Atlas Overview test panels depend on these internal repo-tagged CI series."
|
||||
}
|
||||
],
|
||||
"time": {
|
||||
|
||||
@ -1677,7 +1677,7 @@
|
||||
{
|
||||
"id": 42,
|
||||
"type": "timeseries",
|
||||
"title": "Ariadne Test Success Rate",
|
||||
"title": "Platform Test Success Rate",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1690,7 +1690,7 @@
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"passed|failed|error\"}[30d])), 1)",
|
||||
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"passed|failed|error\"}[30d])), 1)",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
@ -1709,12 +1709,13 @@
|
||||
"tooltip": {
|
||||
"mode": "multi"
|
||||
}
|
||||
}
|
||||
},
|
||||
"description": "Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. Add new test series there first so they roll up here."
|
||||
},
|
||||
{
|
||||
"id": 43,
|
||||
"type": "bargauge",
|
||||
"title": "Tests with Failures (24h)",
|
||||
"title": "Platform Tests with Failures (24h)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1727,7 +1728,7 @@
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"failed|error\"}[24h])))",
|
||||
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"failed|error\"}[24h])))",
|
||||
"refId": "A",
|
||||
"legendFormat": "{{result}}",
|
||||
"instant": true
|
||||
@ -1814,7 +1815,8 @@
|
||||
"order": "desc"
|
||||
}
|
||||
}
|
||||
]
|
||||
],
|
||||
"description": "This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
|
||||
},
|
||||
{
|
||||
"id": 11,
|
||||
|
||||
@ -49,7 +49,7 @@ data:
|
||||
interval: 1m
|
||||
rules:
|
||||
- uid: disk-pressure-root
|
||||
title: "Node rootfs high (>80%)"
|
||||
title: "Node rootfs high (>85%)"
|
||||
condition: C
|
||||
for: "10m"
|
||||
data:
|
||||
@ -83,7 +83,7 @@ data:
|
||||
type: threshold
|
||||
conditions:
|
||||
- evaluator:
|
||||
params: [80]
|
||||
params: [85]
|
||||
type: gt
|
||||
operator:
|
||||
type: and
|
||||
@ -93,7 +93,7 @@ data:
|
||||
noDataState: NoData
|
||||
execErrState: Error
|
||||
annotations:
|
||||
summary: "{{ $labels.node }} rootfs >80% for 10m"
|
||||
summary: "{{ $labels.node }} rootfs >85% for 10m"
|
||||
labels:
|
||||
severity: warning
|
||||
- uid: disk-growth-1h
|
||||
|
||||
@ -1134,7 +1134,7 @@ data:
|
||||
{
|
||||
"id": 17,
|
||||
"type": "stat",
|
||||
"title": "Ariadne CI Coverage (%)",
|
||||
"title": "Platform CI Coverage (%)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1147,7 +1147,7 @@ data:
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "ariadne_ci_coverage_percent{repo=\"ariadne\"}",
|
||||
"expr": "ariadne_ci_coverage_percent{repo=~\"ariadne|metis\"}",
|
||||
"refId": "A",
|
||||
"legendFormat": "{{branch}}",
|
||||
"instant": true
|
||||
@ -1192,12 +1192,13 @@ data:
|
||||
"values": false
|
||||
},
|
||||
"textMode": "value"
|
||||
}
|
||||
},
|
||||
"description": "Internal source panel for Atlas Overview automation test rollups."
|
||||
},
|
||||
{
|
||||
"id": 18,
|
||||
"type": "table",
|
||||
"title": "Ariadne CI Tests (latest)",
|
||||
"title": "Platform CI Tests (latest)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1210,7 +1211,7 @@ data:
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "ariadne_ci_tests_total{repo=\"ariadne\"}",
|
||||
"expr": "ariadne_ci_tests_total{repo=~\"ariadne|metis\"}",
|
||||
"refId": "A",
|
||||
"instant": true
|
||||
}
|
||||
@ -1242,7 +1243,8 @@ data:
|
||||
"order": "desc"
|
||||
}
|
||||
}
|
||||
]
|
||||
],
|
||||
"description": "Atlas Overview test panels depend on these internal repo-tagged CI series."
|
||||
}
|
||||
],
|
||||
"time": {
|
||||
|
||||
@ -1686,7 +1686,7 @@ data:
|
||||
{
|
||||
"id": 42,
|
||||
"type": "timeseries",
|
||||
"title": "Ariadne Test Success Rate",
|
||||
"title": "Platform Test Success Rate",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1699,7 +1699,7 @@ data:
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"passed|failed|error\"}[30d])), 1)",
|
||||
"expr": "100 * sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=\"passed\"}[30d])) / clamp_min(sum(max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"passed|failed|error\"}[30d])), 1)",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
@ -1718,12 +1718,13 @@ data:
|
||||
"tooltip": {
|
||||
"mode": "multi"
|
||||
}
|
||||
}
|
||||
},
|
||||
"description": "Atlas Overview mirrors the Atlas Jobs internal dashboard for automation test health. Add new test series there first so they roll up here."
|
||||
},
|
||||
{
|
||||
"id": 43,
|
||||
"type": "bargauge",
|
||||
"title": "Tests with Failures (24h)",
|
||||
"title": "Platform Tests with Failures (24h)",
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "atlas-vm"
|
||||
@ -1736,7 +1737,7 @@ data:
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=\"ariadne\",result=~\"failed|error\"}[24h])))",
|
||||
"expr": "sort_desc(sum by (result) (max_over_time(ariadne_ci_tests_total{repo=~\"ariadne|metis\",result=~\"failed|error\"}[24h])))",
|
||||
"refId": "A",
|
||||
"legendFormat": "{{result}}",
|
||||
"instant": true
|
||||
@ -1823,7 +1824,8 @@ data:
|
||||
"order": "desc"
|
||||
}
|
||||
}
|
||||
]
|
||||
],
|
||||
"description": "This summary is sourced from the Atlas Jobs internal dashboard rather than a separate overview-only query."
|
||||
},
|
||||
{
|
||||
"id": 11,
|
||||
|
||||
@ -286,7 +286,7 @@ spec:
|
||||
podAnnotations:
|
||||
vault.hashicorp.com/agent-inject: "true"
|
||||
vault.hashicorp.com/role: "monitoring"
|
||||
monitoring.bstein.dev/restart-rev: "5"
|
||||
monitoring.bstein.dev/restart-rev: "6"
|
||||
vault.hashicorp.com/agent-inject-secret-grafana-env.sh: "kv/data/atlas/monitoring/grafana-admin"
|
||||
vault.hashicorp.com/agent-inject-template-grafana-env.sh: |
|
||||
{{ with secret "kv/data/atlas/monitoring/grafana-admin" }}
|
||||
|
||||