2026-05-22 06:57:01 -03:00

Soteria PVC Restore Drill (backup.bstein.dev)

Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

Production Restore Drill Checklist

  1. Verify baseline health before touching restores.
    • flux get kustomizations -n flux-system maintenance
    • kubectl -n maintenance get deploy soteria oauth2-proxy-soteria
  2. Confirm operator access and source safety.
    • Operator must be in Keycloak group admin or maintenance.
    • Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
  3. Run the UI flow at https://backup.bstein.dev.
    • Sign in via Keycloak.
    • In PVC Inventory, select source namespace and PVC.
    • Click Backup now and wait for success in Last Action.
    • Click Restore and pick a completed snapshot.
    • Set Target namespace and a unique Target PVC name (restore-<source-pvc>-<date>).
    • Click Create restore PVC.
  4. Validate restore output.
    • kubectl -n <target-namespace> get pvc <target-pvc>
    • If workload-level validation is required, attach a temporary pod and inspect expected files/data.
  5. Clean up.
    • kubectl -n <target-namespace> delete pvc <target-pvc>
    • If a detached Longhorn restore volume remains, remove it through the Longhorn UI or API.
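
The naming and workload-level validation steps above can be scripted. The following is a minimal sketch, not part of Soteria: the function names, the YYYYMMDD UTC date format, the busybox image, the inspect-<pvc> pod name, and the /restore mount path are all assumptions chosen for illustration.

```shell
# Build a target PVC name following the restore-<source-pvc>-<date> pattern.
# Assumption: the date is rendered as YYYYMMDD in UTC; the checklist does not fix a format.
restore_pvc_name() {
  local source_pvc="$1"
  local stamp="${2:-$(date -u +%Y%m%d)}"
  printf 'restore-%s-%s\n' "$source_pvc" "$stamp"
}

# Print a manifest for a throwaway pod that mounts the restored PVC read-only.
# Assumption: pod name, image, and mount path are illustrative, not Soteria conventions.
inspector_pod_manifest() {
  local target_pvc="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: inspect-${target_pvc}
spec:
  restartPolicy: Never
  containers:
    - name: inspect
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /restore
          readOnly: true
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ${target_pvc}
        readOnly: true
EOF
}

# Usage (not executed here; needs cluster access):
#   target="$(restore_pvc_name <source-pvc>)"
#   inspector_pod_manifest "$target" | kubectl -n <target-namespace> apply -f -
#   kubectl -n <target-namespace> wait --for=condition=Ready "pod/inspect-$target" --timeout=300s
#   kubectl -n <target-namespace> exec "inspect-$target" -- ls -la /restore
#   kubectl -n <target-namespace> delete pod "inspect-$target"
```

Delete the inspector pod before deleting the restore PVC in step 5; a PVC that is still mounted by a pod stays in Terminating until the pod is gone.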

Alert Query Verification (maint-soteria-*)

Start a local query endpoint:

kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428

Validate each alert expression directly.

  1. maint-soteria-refresh-stale (time() - soteria_inventory_refresh_timestamp_seconds, threshold > 900 seconds).
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'
    • Healthy expectation: age is below 900 seconds and the threshold query returns 0.
  2. maint-soteria-backup-unhealthy (sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0), threshold > 0).
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'
    • Healthy expectation: unhealthy count is 0; no series should be 1 in the per-PVC unhealthy query.
  3. maint-soteria-authz-denials (sum(increase(soteria_authz_denials_total[15m])) or on() vector(0), threshold > 9 for 10m).
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'
    • curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'
    • Healthy expectation: total remains below 10 in normal operation; spikes should map to expected reason labels.
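
The three checks above can be wrapped in small helpers so the thresholds are compared mechanically. This is a sketch under assumptions: the helper names and the VM_URL variable are hypothetical, jq is assumed to be installed, and the response is assumed to follow the Prometheus-compatible instant-query JSON shape. It evaluates a single point in time, so it does not model the 10m hold on maint-soteria-authz-denials.

```shell
# Assumption: the port-forward from the previous step is running on this address.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run an instant query and print the raw JSON response.
vm_query() {
  curl -fsS --get "${VM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Read an instant-query response on stdin and print the largest sample value.
# Prints "null" when the query returned no series. Requires jq.
vm_max_value() {
  jq -r '[.data.result[].value[1] | tonumber] | max'
}

# Compare a sample value against an alert threshold.
# Prints "firing" when value > threshold, "ok" otherwise, "no-data" for empty input.
alert_state() {
  case "$1" in
    ''|null) echo no-data; return 0 ;;
  esac
  awk -v v="$1" -v t="$2" 'BEGIN { s = (v + 0 > t + 0) ? "firing" : "ok"; print s }'
}

# Usage (not executed here; needs the port-forward):
#   alert_state "$(vm_query 'time() - soteria_inventory_refresh_timestamp_seconds' | vm_max_value)" 900
#   alert_state "$(vm_query 'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' | vm_max_value)" 0
#   alert_state "$(vm_query 'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)' | vm_max_value)" 9
```

Treat a "no-data" result for the refresh-stale query as a problem in its own right, since a missing metric cannot exceed the threshold.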

Failure Triage

  • 401/403 on UI or API:
    • Verify oauth2-proxy group claims include admin or maintenance.
  • Restore conflict:
    • Target PVC already exists; choose a new target PVC name.
  • maint-soteria-refresh-stale firing:
    • Check Soteria pod health and /metrics scrape reachability from monitoring.
  • maint-soteria-backup-unhealthy firing:
    • Inspect pvc_backup_health and pvc_backup_age_hours to identify stale or missing backups.
  • maint-soteria-authz-denials firing:
    • Confirm expected OIDC groups and inspect denial reason labels for policy or header regressions.
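
For the restore-conflict case, a free target name can be derived from the PVCs that already exist in the target namespace. The helper below is a hypothetical sketch, not a Soteria feature; the numeric suffix scheme (-2, -3, ...) is an assumption.

```shell
# Read `kubectl get pvc -o name` output on stdin and print the first unused name,
# trying the base name first and then base-2, base-3, and so on.
next_free_pvc_name() {
  local base="$1" existing candidate n=1
  # kubectl prints names as persistentvolumeclaim/<name>; strip the prefix.
  existing="$(sed 's|^persistentvolumeclaim/||')"
  candidate="$base"
  while printf '%s\n' "$existing" | grep -qxF -- "$candidate"; do
    n=$((n + 1))
    candidate="${base}-${n}"
  done
  printf '%s\n' "$candidate"
}

# Usage (not executed here; needs cluster access):
#   kubectl -n <target-namespace> get pvc -o name | next_free_pvc_name <target-pvc>
#
# Related first checks for the other triage items, using the deployments named earlier:
#   kubectl -n maintenance rollout status deploy/soteria
#   kubectl -n maintenance logs deploy/oauth2-proxy-soteria --tail=50
```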

Emergency Recovery Notes (2026-05-22)

  • titan-04 reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on titan-08 and titan-11. A temporary recovery-suspect taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
  • titan-05 stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
  • titan-06 remained unreachable (No route to host) and needs out-of-band power or network recovery.
  • titan-14 passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in ContainerCreating. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the longhorn-host label was removed because it is not one of the intended HDD Longhorn nodes.
  • titan-22 stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
  • Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and maintenance-vault-sync moved off titan-0a. Remaining titan-0a/titan-0c pods are daemonsets or HA control/storage helpers, not generic app load.
  • Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by atlas.bstein.dev/crypto-mining-enabled=true.
  • Ananke should not treat node Ready as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a bound canary that actually reaches Completed; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
  • Ananke should detect under-requested Jenkins agents during recovery. Several agents requested 25m CPU per build container while allowing 1500m, which let the scheduler place real CI load on fragile nodes.
  • Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
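
Two of the Ananke recommendations above, the multi-signal recovery gate and the under-requested agent check, can be sketched in shell. This is an illustration only and does not describe how Ananke is implemented: all function names are hypothetical, the canary pod is assumed to already exist and to be pinned to the node under test, and the ratio at which an agent counts as under-requested is a policy choice these notes do not specify.

```shell
# Recovery gate: pass only when SSH, kubelet metrics, and the canary all check out.
# The canary argument is the pod phase; kubectl's "Completed" status is phase Succeeded.
node_recovery_verdict() {
  local ssh_state="$1" metrics_state="$2" canary_phase="$3"
  if [ "$ssh_state" = ok ] && [ "$metrics_state" = ok ] && [ "$canary_phase" = Succeeded ]; then
    echo pass
  else
    echo hold
  fi
}

# Probes (need SSH and cluster access; shown for context, not executed here).
probe_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1 && echo ok || echo fail
}
probe_kubelet_metrics() {
  kubectl get --raw "/api/v1/nodes/$1/proxy/metrics" >/dev/null 2>&1 && echo ok || echo fail
}
canary_phase() {
  kubectl -n "$1" get pod "$2" -o jsonpath='{.status.phase}' 2>/dev/null
}

# Convert a Kubernetes CPU quantity ("25m", "1500m", "2", "0.5") to millicores.
cpu_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Print the integer limit/request ratio for one container (rounded down).
cpu_burst_ratio() {
  local req lim
  req="$(cpu_millicores "$1")"
  lim="$(cpu_millicores "$2")"
  [ "$req" -gt 0 ] || { echo inf; return 0; }
  echo $(( lim / req ))
}

# Usage (not executed here):
#   node_recovery_verdict "$(probe_ssh <node>)" "$(probe_kubelet_metrics <node>)" \
#     "$(canary_phase <namespace> <canary-pod>)"
#   cpu_burst_ratio 25m 1500m
```

For the Jenkins agents described above, a 25m request against a 1500m limit gives a ratio of 60, meaning the scheduler reserves one sixtieth of the CPU a build container is allowed to consume.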