# Soteria PVC Restore Drill (backup.bstein.dev)

Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

## Production Restore Drill Checklist

1. Verify baseline health before touching restores.
   - `flux get kustomizations -n flux-system maintenance`
   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
2. Confirm operator access and source safety.
   - Operator must be in Keycloak group `admin` or `maintenance`.
   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
3. Run the UI flow at `https://backup.bstein.dev`.
   - Sign in via Keycloak.
   - In `PVC Inventory`, select source namespace and PVC.
   - Click `Backup now` and wait for success in `Last Action`.
   - Click `Restore` and pick a completed snapshot.
   - Set `Target namespace` and unique `Target PVC name` (`restore--`).
   - Click `Create restore PVC`.
4. Validate restore output.
   - `kubectl -n get pvc `
   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
5. Clean up.
   - `kubectl -n delete pvc `
   - Remove detached restore Longhorn volume from Longhorn UI/API if one remains.

## Alert Query Verification (`maint-soteria-*`)

Start a local query endpoint:

`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`

Validate each alert expression directly.

1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
   - Healthy expectation: age is below `900` and threshold query returns `0`.
2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
   - Healthy expectation: unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
   - Healthy expectation: total remains below `10` in normal operation; spikes should map to expected `reason` labels.

## Failure Triage

- `401/403` on UI or API:
  - Verify oauth2-proxy group claims include `admin` or `maintenance`.
- Restore conflict:
  - Target PVC already exists; choose a new target PVC name.
- `maint-soteria-refresh-stale` firing:
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- `maint-soteria-backup-unhealthy` firing:
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
- `maint-soteria-authz-denials` firing:
  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
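
## Scripted Alert Check (Sketch)

The three alert queries in this runbook can be wrapped in a small shell helper so a drill yields one pass/fail line per alert. This is a minimal sketch, not part of Soteria: the function names (`vm_query`, `extract_values`, `all_at_most`, `check_alert`, `run_soteria_checks`) and the `VM_URL` variable are hypothetical. It assumes the `kubectl port-forward` from the alert verification section is already running and that the endpoint returns the usual Prometheus-style instant-query JSON. Values are extracted with `grep`/`sed` to avoid extra tooling; a proper JSON parser would be more robust.

```shell
# Assumption: VM_URL is a hypothetical override; the default matches the port-forward in this runbook.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run an instant query and print the raw JSON response.
vm_query() {
  curl -fsS --get "${VM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Read a Prometheus-style instant-query response on stdin and print
# each sample value on its own line (crude text extraction, no JSON parser).
extract_values() {
  grep -o '"value":\[[^]]*]' | sed -E 's/.*,"([^"]*)"\]$/\1/'
}

# Exit 0 only if at least one value arrives on stdin and every value is <= $1.
# No data at all is treated as a failure (missing metric or unreachable endpoint).
all_at_most() {
  awk -v limit="$1" 'BEGIN { bad = 0 }
    ($1 + 0) > (limit + 0) { bad = 1 }
    END { exit (bad || NR == 0) }'
}

# Evaluate one alert expression against its threshold and print OK/FAIL.
check_alert() {
  local name="$1" limit="$2" query="$3"
  if vm_query "$query" | extract_values | all_at_most "$limit"; then
    echo "OK   ${name}"
  else
    echo "FAIL ${name}"
    return 1
  fi
}

# Point-in-time check of all three alerts; it does not model the
# "for 10m" hold on maint-soteria-authz-denials.
run_soteria_checks() {
  local rc=0
  check_alert maint-soteria-refresh-stale 900 \
    'time() - soteria_inventory_refresh_timestamp_seconds' || rc=1
  check_alert maint-soteria-backup-unhealthy 0 \
    'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' || rc=1
  check_alert maint-soteria-authz-denials 9 \
    'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)' || rc=1
  return "$rc"
}
```

With the port-forward active, source the snippet and call `run_soteria_checks`; a non-zero exit status means at least one expression exceeded its threshold or returned no data. Any `FAIL` line should be followed up with the per-PVC or per-`reason` queries listed under alert verification.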