Soteria PVC Restore Drill (backup.bstein.dev)
Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.
Production Restore Drill Checklist
- Verify baseline health before touching restores.
    flux get kustomizations -n flux-system maintenance
    kubectl -n maintenance get deploy soteria oauth2-proxy-soteria
- Confirm operator access and source safety.
  - Operator must be in Keycloak group admin or maintenance.
  - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
- Run the UI flow at https://backup.bstein.dev.
  - Sign in via Keycloak.
  - In PVC Inventory, select source namespace and PVC.
  - Click Backup now and wait for success in Last Action.
  - Click Restore and pick a completed snapshot.
  - Set Target namespace and a unique Target PVC name (restore-<source-pvc>-<date>).
  - Click Create restore PVC.
- Validate restore output.
    kubectl -n <target-namespace> get pvc <target-pvc>
  - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
- Clean up.
    kubectl -n <target-namespace> delete pvc <target-pvc>
  - Remove the detached restore Longhorn volume from the Longhorn UI/API if one remains.
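The naming and validation steps above can be sketched as two small shell helpers. This is a minimal sketch, not part of Soteria: the function names are hypothetical, and the YYYYMMDD date format is an assumption, since the checklist only specifies `<date>`.

```shell
# Build a target PVC name of the form restore-<source-pvc>-<date>.
# The YYYYMMDD stamp is an assumed format; pass a second argument to override it.
restore_pvc_name() {
  local source_pvc="$1"
  local stamp="${2:-$(date +%Y%m%d)}"
  printf 'restore-%s-%s\n' "$source_pvc" "$stamp"
}

# Succeed only when the restored PVC reports phase Bound.
# Needs kubectl access to the cluster; returns non-zero if the PVC is missing.
restored_pvc_is_bound() {
  local namespace="$1"
  local pvc="$2"
  local phase
  phase="$(kubectl -n "$namespace" get pvc "$pvc" -o jsonpath='{.status.phase}')" || return 1
  [ "$phase" = "Bound" ]
}
```

For example, `restore_pvc_name data-gitea 20260522` prints `restore-data-gitea-20260522` (the source PVC name here is made up for illustration). A Bound phase only confirms that the claim has a volume; it does not replace the workload-level file check.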
Alert Query Verification (maint-soteria-*)
Start a local query endpoint:
kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428
Validate each alert expression directly.
- maint-soteria-refresh-stale (time() - soteria_inventory_refresh_timestamp_seconds, threshold > 900).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'
  - Healthy expectation: age is below 900 and threshold query returns 0.
- maint-soteria-backup-unhealthy (sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0), threshold > 0).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'
  - Healthy expectation: unhealthy count is 0; no series should be 1 in the per-PVC unhealthy query.
- maint-soteria-authz-denials (sum(increase(soteria_authz_denials_total[15m])) or on() vector(0), threshold > 9 for 10m).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'
  - Healthy expectation: total remains below 10 in normal operation; spikes should map to expected reason labels.
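To compare a query result against its threshold without reading raw JSON, the curl calls above can be wrapped as follows. This is a sketch under assumptions: the helper names and the `VM_URL` variable are hypothetical, and `first_value` relies on `jq`, which the checklist itself does not require.

```shell
# Base URL of the port-forwarded query endpoint (same address as the curl examples).
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run an instant query and print the raw JSON response.
vm_query() {
  curl -fsS --get "$VM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Read an instant-query response on stdin and print the first sample value.
# Prints nothing when the result set is empty. Requires jq (assumption).
first_value() {
  jq -r '.data.result[0].value[1] // empty'
}

# Succeed when the numeric value ($1) is strictly below the limit ($2).
# An empty value is treated as an error (exit 2) rather than as zero.
is_below() {
  [ -n "$1" ] || return 2
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 < limit + 0) }'
}

# Example (needs the port-forward to be running):
#   age="$(vm_query 'time() - soteria_inventory_refresh_timestamp_seconds' | first_value)"
#   is_below "$age" 900 && echo "refresh age OK"
```

Treating an empty result as an error matters for the refresh-stale check: a missing metric would otherwise look like an age of zero and pass.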
Failure Triage
- 401/403 on UI or API:
  - Verify oauth2-proxy group claims include admin or maintenance.
- Restore conflict:
  - Target PVC already exists; choose a new target PVC name.
- maint-soteria-refresh-stale firing:
  - Check Soteria pod health and /metrics scrape reachability from monitoring.
- maint-soteria-backup-unhealthy firing:
  - Inspect pvc_backup_health and pvc_backup_age_hours to identify stale or missing backups.
- maint-soteria-authz-denials firing:
  - Confirm expected OIDC groups and inspect denial reason labels for policy or header regressions.
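For the 401/403 case, the group check can be scripted once the operator's groups are available as text. The sketch below assumes a comma-separated list (for example, as copied from a forwarded groups header or a decoded token) and tolerates a leading `/` in case Keycloak emits full group paths. The function name is hypothetical, and how the list is obtained is left open.

```shell
# Succeed when a comma-separated group list contains admin or maintenance.
# Spaces and a leading "/" on each entry are ignored (assumed input shapes).
has_required_group() {
  local list
  # Normalise to ",group1,group2," so each entry can be matched exactly.
  list=",$(printf '%s' "$1" | tr -d ' ' | sed 's#,/#,#g; s#^/##'),"
  case "$list" in
    *,admin,*|*,maintenance,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   has_required_group "dev, /maintenance" && echo "group claim OK"
```

Matching whole entries avoids false positives from groups that merely contain the word, such as a hypothetical `administrators` group.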
Emergency Recovery Notes (2026-05-22)
- titan-04 reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on titan-08 and titan-11. A temporary recovery-suspect taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
- titan-05 stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
- titan-06 remained unreachable (No route to host) and needs out-of-band power or network recovery.
- titan-14 passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in ContainerCreating. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the longhorn-host label was removed because it is not one of the intended HDD Longhorn nodes.
- titan-22 stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
- Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and maintenance-vault-sync moved off titan-0a. Remaining titan-0a/titan-0c pods are daemonsets or HA control/storage helpers, not generic app load.
- Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by atlas.bstein.dev/crypto-mining-enabled=true.
- Ananke should not treat node Ready as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a bound canary that actually reaches Completed; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
- Ananke should detect under-requested Jenkins agents during recovery. Several agents requested 25m CPU per build container while allowing 1500m, which let the scheduler place real CI load on fragile nodes.
- Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
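The under-request problem in the Jenkins note can be expressed as a ratio between CPU limit and CPU request. The following is an illustrative sketch only, not Ananke's implementation: the function names are hypothetical and the default cutoff of 10 is an assumed value chosen for the example.

```shell
# Convert a Kubernetes CPU quantity ("25m", "1500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# Print limit/request as an integer ratio (rounded down).
# Arguments: <cpu-request> <cpu-limit>. Fails when the request is zero.
cpu_burst_ratio() {
  local req lim
  req="$(cpu_to_millicores "$1")"
  lim="$(cpu_to_millicores "$2")"
  [ "$req" -gt 0 ] || return 2
  echo $(( lim / req ))
}

# Succeed when the limit exceeds the request by more than max_ratio times.
# max_ratio defaults to 10, an assumed cutoff for illustration.
is_under_requested() {
  local ratio
  ratio="$(cpu_burst_ratio "$1" "$2")" || return 2
  [ "$ratio" -gt "${3:-10}" ]
}

# Example with the values from the notes above:
#   cpu_burst_ratio 25m 1500m     # prints 60
#   is_under_requested 25m 1500m && echo "agent is under-requested"
```

With the values from the incident, the limit is 60 times the request, so the scheduler reserves almost nothing for a container that may use one and a half cores.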