Soteria PVC Restore Drill (backup.bstein.dev)
Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.
Production Restore Drill Checklist
- Verify baseline health before touching restores.
    flux get kustomizations -n flux-system maintenance
    kubectl -n maintenance get deploy soteria oauth2-proxy-soteria
- Confirm operator access and source safety.
  - Operator must be in Keycloak group admin or maintenance.
  - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
- Run the UI flow at https://backup.bstein.dev.
  - Sign in via Keycloak.
  - In PVC Inventory, select source namespace and PVC.
  - Click Backup now and wait for success in Last Action.
  - Click Restore and pick a completed snapshot.
  - Set Target namespace and a unique Target PVC name (restore-<source-pvc>-<date>).
  - Click Create restore PVC.
- Validate restore output.
    kubectl -n <target-namespace> get pvc <target-pvc>
  - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
- Clean up.
    kubectl -n <target-namespace> delete pvc <target-pvc>
  - Remove the detached restore Longhorn volume from the Longhorn UI/API if one remains.
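The naming and validation steps of the checklist can be scripted. The following is a minimal sketch, not part of Soteria itself: the function names restore_pvc_name and wait_for_bound are hypothetical, the YYYYMMDD date format is an assumption (the checklist only requires a unique name), and the retry count and delay are arbitrary defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the restore drill; names and defaults are assumptions.

# Build a target PVC name following the restore-<source-pvc>-<date> pattern.
# The YYYYMMDD date format is an assumption; any unique suffix works.
restore_pvc_name() {
  local source_pvc="$1"
  local stamp="${2:-$(date +%Y%m%d)}"
  printf 'restore-%s-%s\n' "$source_pvc" "$stamp"
}

# Poll the restored PVC until its phase is Bound, or give up.
# Usage: wait_for_bound <namespace> <pvc> [attempts] [delay-seconds]
wait_for_bound() {
  local ns="$1" pvc="$2" attempts="${3:-30}" delay="${4:-10}"
  local phase="" i=1
  while [ "$i" -le "$attempts" ]; do
    # An empty phase means the PVC was not found or kubectl failed.
    phase="$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null)" || phase=""
    if [ "$phase" = "Bound" ]; then
      echo "PVC $ns/$pvc is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "PVC $ns/$pvc not Bound after $attempts attempts (last phase: ${phase:-unknown})" >&2
  return 1
}

# Example (not executed here):
#   target="$(restore_pvc_name <source-pvc>)"
#   wait_for_bound <target-namespace> "$target"
```

If the storage class uses WaitForFirstConsumer volume binding, the PVC stays Pending until a pod mounts it, so a timeout here is not necessarily a restore failure.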
Alert Query Verification (maint-soteria-*)
Start a local query endpoint:
kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428
Validate each alert expression directly.
- maint-soteria-refresh-stale (time() - soteria_inventory_refresh_timestamp_seconds, threshold > 900).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'
  - Healthy expectation: age is below 900 and threshold query returns 0.
- maint-soteria-backup-unhealthy (sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0), threshold > 0).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'
  - Healthy expectation: unhealthy count is 0; no series should be 1 in the per-PVC unhealthy query.
- maint-soteria-authz-denials (sum(increase(soteria_authz_denials_total[15m])) or on() vector(0), threshold > 9 for 10m).
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'
    curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'
  - Healthy expectation: total remains below 10 in normal operation; spikes should map to expected reason labels.
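To avoid retyping each curl invocation, the three alert expressions can be run in one pass. This is a rough sketch under stated assumptions: vm_query and run_soteria_alert_queries are hypothetical names, and VM_URL defaults to the local port-forward address used in this section. The expressions are copied from the alert definitions listed here; the sketch only prints the raw JSON responses and leaves interpretation to the operator.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; assumes the port-forward from this section is running.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run one instant query against the /api/v1/query endpoint.
vm_query() {
  curl -fsS --get "$VM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Run the threshold form of each maint-soteria-* alert expression
# and print the raw JSON response under a labelled header.
run_soteria_alert_queries() {
  echo "== maint-soteria-refresh-stale (healthy: 0)"
  vm_query '(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900' || return 1
  echo
  echo "== maint-soteria-backup-unhealthy (healthy: 0)"
  vm_query 'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' || return 1
  echo
  echo "== maint-soteria-authz-denials (healthy: below 10)"
  vm_query 'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)' || return 1
  echo
}

# Example (not executed here):
#   run_soteria_alert_queries
```

Because curl runs with -f, an HTTP error from the query endpoint stops the run at the failing query instead of printing an error body as if it were a result.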
Failure Triage
- 401/403 on UI or API:
  - Verify oauth2-proxy group claims include admin or maintenance.
- Restore conflict:
  - Target PVC already exists; choose a new target PVC name.
- maint-soteria-refresh-stale firing:
  - Check Soteria pod health and /metrics scrape reachability from monitoring.
- maint-soteria-backup-unhealthy firing:
  - Inspect pvc_backup_health and pvc_backup_age_hours to identify stale or missing backups.
- maint-soteria-authz-denials firing:
  - Confirm expected OIDC groups and inspect denial reason labels for policy or header regressions.
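When starting triage, it can help to capture deployment state and recent logs in one step. The sketch below is illustrative only: soteria_triage_snapshot is a hypothetical name, the deployment names and namespace are taken from the baseline health check in this runbook, and the 50-line log tail is an arbitrary default.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper; namespace and tail length are overridable defaults.
# Usage: soteria_triage_snapshot [namespace] [log-tail-lines]
soteria_triage_snapshot() {
  local ns="${1:-maintenance}" tail_lines="${2:-50}"

  echo "== deployments"
  kubectl -n "$ns" get deploy soteria oauth2-proxy-soteria

  echo "== pods"
  kubectl -n "$ns" get pods -o wide

  echo "== soteria logs (last $tail_lines lines)"
  kubectl -n "$ns" logs deploy/soteria --tail="$tail_lines"

  echo "== oauth2-proxy-soteria logs (last $tail_lines lines)"
  kubectl -n "$ns" logs deploy/oauth2-proxy-soteria --tail="$tail_lines"
}

# Example (not executed here):
#   soteria_triage_snapshot > soteria-triage.txt
```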