Soteria PVC Restore Drill (backup.bstein.dev)

Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

Production Restore Drill Checklist

Verify baseline health before touching restores.
- flux get kustomizations -n flux-system maintenance
- kubectl -n maintenance get deploy soteria oauth2-proxy-soteria
Confirm operator access and source safety.
- Operator must be in Keycloak group admin or maintenance.
- Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
Run the UI flow at https://backup.bstein.dev.
- Sign in via Keycloak.
- In PVC Inventory, select source namespace and PVC.
- Click Backup now and wait for success in Last Action.
- Click Restore and pick a completed snapshot.
- Set Target namespace and a unique Target PVC name (restore-<source-pvc>-<date>).
- Click Create restore PVC.
Validate restore output.
- kubectl -n <target-namespace> get pvc <target-pvc>
- If workload-level validation is required, attach a temporary pod and inspect expected files/data.
Clean up.
- kubectl -n <target-namespace> delete pvc <target-pvc>
- Remove detached restore Longhorn volume from Longhorn UI/API if one remains.
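
The naming and workload-level validation steps above can be sketched as two small shell helpers. This is a minimal sketch, not part of Soteria: the function names, the YYYYMMDD date format, the pod name restore-check, the busybox:1.36 image, and the /restore mount path are all assumptions chosen for illustration.

```shell
# Build a target PVC name following the restore-<source-pvc>-<date> pattern.
# The date format (YYYYMMDD, UTC) is an assumption; the checklist only says <date>.
restore_pvc_name() {
  src_pvc=$1
  stamp=${2:-$(date -u +%Y%m%d)}
  printf 'restore-%s-%s\n' "$src_pvc" "$stamp"
}

# Print a manifest for a throwaway pod that mounts the restored PVC read-only.
# Pod name, image, and mount path are hypothetical choices, not Soteria defaults.
restore_check_pod_manifest() {
  ns=$1
  pvc=$2
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: restore-check
  namespace: ${ns}
spec:
  restartPolicy: Never
  containers:
    - name: inspect
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: restored
          mountPath: /restore
          readOnly: true
  volumes:
    - name: restored
      persistentVolumeClaim:
        claimName: ${pvc}
EOF
}

# Usage (NS and PVC are the target namespace and target PVC from the drill):
#   restore_check_pod_manifest "$NS" "$PVC" | kubectl apply -f -
#   kubectl -n "$NS" wait --for=condition=Ready pod/restore-check --timeout=120s
#   kubectl -n "$NS" exec restore-check -- ls -la /restore
#   kubectl -n "$NS" delete pod restore-check
```

Mounting read-only keeps the inspection from modifying the restored data. Delete the pod before deleting the target PVC, since a PVC that is still mounted stays in Terminating until its consumer is gone.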

Alert Query Verification (`maint-soteria-*`)

Start a local query endpoint:

kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428

Validate each alert expression directly.

maint-soteria-refresh-stale (time() - soteria_inventory_refresh_timestamp_seconds, threshold > 900).
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'
- Healthy expectation: age is below 900 seconds and the threshold query returns 0.
maint-soteria-backup-unhealthy (sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0), threshold > 0).
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'
- Healthy expectation: unhealthy count is 0; no series should be 1 in the per-PVC unhealthy query.
maint-soteria-authz-denials (sum(increase(soteria_authz_denials_total[15m])) or on() vector(0), threshold > 9 for 10m).
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'
- curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'
- Healthy expectation: total remains below 10 in normal operation; spikes should map to expected reason labels.
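
To avoid retyping the curl invocation for each alert, the three top-level expressions can be wrapped in a small helper. This is a sketch under the assumption that the port-forward above is running; the function names and the VM_URL variable are hypothetical, and the expressions are copied from the alert definitions listed above.

```shell
# Base URL of the port-forwarded query endpoint; override VM_URL if needed.
VM_URL=${VM_URL:-http://127.0.0.1:8428}

# Run one instant query and print the raw JSON response.
vmq() {
  curl -fsS --get "${VM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Print the alert expressions, one per line, as "<alert-name><TAB><expression>".
soteria_alert_exprs() {
  printf '%s\t%s\n' \
    maint-soteria-refresh-stale 'time() - soteria_inventory_refresh_timestamp_seconds' \
    maint-soteria-backup-unhealthy 'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' \
    maint-soteria-authz-denials 'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'
}

# Query every alert expression in turn and label each response with its alert name.
check_soteria_alerts() {
  soteria_alert_exprs | while IFS="$(printf '\t')" read -r name expr; do
    printf '== %s\n' "$name"
    vmq "$expr" || printf 'query failed for %s\n' "$name" >&2
    printf '\n'
  done
}

# Usage:
#   check_soteria_alerts
#   vmq 'sum by (reason) (increase(soteria_authz_denials_total[15m]))'
```

The helper only prints the raw responses; compare the returned values against the healthy expectations listed for each alert.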

Failure Triage

401/403 on UI or API:
- Verify oauth2-proxy group claims include admin or maintenance.
Restore conflict:
- Target PVC already exists; choose a new target PVC name.
maint-soteria-refresh-stale firing:
- Check Soteria pod health and /metrics scrape reachability from monitoring.
maint-soteria-backup-unhealthy firing:
- Inspect pvc_backup_health and pvc_backup_age_hours to identify stale or missing backups.
maint-soteria-authz-denials firing:
- Confirm expected OIDC groups and inspect denial reason labels for policy or header regressions.
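
A first-look pass for most of the cases above can be collected into one helper. This is a minimal sketch: the namespace and deployment names come from the baseline health step of this checklist, while the function name, the SOTERIA_NS and TAIL variables, and the choice of 50 log lines are assumptions.

```shell
# Namespace used by the baseline health step; override if the deployment moves.
SOTERIA_NS=${SOTERIA_NS:-maintenance}

# Show rollout state, pod placement, then recent logs from both deployments.
soteria_triage() {
  kubectl -n "$SOTERIA_NS" get deploy soteria oauth2-proxy-soteria
  kubectl -n "$SOTERIA_NS" get pods -o wide
  for d in soteria oauth2-proxy-soteria; do
    printf '== logs: %s\n' "$d"
    kubectl -n "$SOTERIA_NS" logs "deploy/$d" --tail="${TAIL:-50}"
  done
}

# Usage:
#   soteria_triage
#   TAIL=200 soteria_triage
```

The pod listing uses -o wide so that the node for each pod is visible, which helps when a failure lines up with a cordoned or suspect node.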

Emergency Recovery Notes (2026-05-22)

titan-04 reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on titan-08 and titan-11. A temporary recovery-suspect taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
titan-05 stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
titan-06 remained unreachable (No route to host) and needs out-of-band power or network recovery.
titan-14 passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in ContainerCreating. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the longhorn-host label was removed because it is not one of the intended HDD Longhorn nodes.
titan-22 stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and maintenance-vault-sync moved off titan-0a. Remaining titan-0a/titan-0c pods are daemonsets or HA control/storage helpers, not generic app load.
Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by atlas.bstein.dev/crypto-mining-enabled=true.
Ananke should not treat node Ready as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a bound canary that actually reaches Completed; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
Ananke should detect under-requested Jenkins agents during recovery. Several agents requested 25m CPU per build container while allowing 1500m, which let the scheduler place real CI load on fragile nodes.
Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
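
The under-requested agent case noted above (25m requested, 1500m allowed) comes down to comparing a container's CPU limit with its request. The following is a sketch of that comparison, not Ananke's implementation: the function names and the default ratio of 10 are assumptions, and only plain ("2", "1.5") and millicore ("25m") CPU quantities are handled.

```shell
# Convert a Kubernetes CPU quantity ("25m", "2", "1.5") to whole millicores.
cpu_to_millicores() {
  case $1 in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Exit 0 when limit exceeds request * max_ratio, meaning the container may burst
# far beyond what the scheduler accounted for. max_ratio defaults to 10 (assumed).
is_under_requested() {
  _req_m=$(cpu_to_millicores "$1")
  _lim_m=$(cpu_to_millicores "$2")
  _max_ratio=${3:-10}
  [ "$_req_m" -gt 0 ] && [ "$_lim_m" -gt $((_req_m * _max_ratio)) ]
}

# Usage:
#   is_under_requested 25m 1500m && echo "flag: limit is far above request"
#
# Candidate input can be listed per container (the namespace "jenkins" is hypothetical):
#   kubectl -n jenkins get pods -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.name}{" "}{.resources.requests.cpu}{" "}{.resources.limits.cpu}{"\n"}{end}{end}'
```

With the values from the incident, 1500m is 60 times the 25m request, so the check flags the container at the default ratio.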
