Soteria PVC Restore Drill (backup.bstein.dev)
Use this runbook for a minimal production-safe restore drill after each meaningful Soteria change.
Preconditions
- `maintenance` kustomization is reconciled and healthy in Flux.
- `soteria` and `oauth2-proxy-soteria` Deployments are ready in `maintenance`.
- Operator account is in Keycloak group `admin` or `maintenance`.
- Source PVC is not ephemeral/test throwaway storage that should be excluded from backup policy.
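The cluster-side preconditions can be checked from a terminal before opening the UI. The following is a minimal sketch, not part of the Soteria tooling: the function name `check_preconditions` is hypothetical, and it assumes the Flux Kustomization named `maintenance` lives in the `flux-system` namespace (pass another namespace as the first argument if it does not). Keycloak group membership and backup-policy scope still need a manual check.

```shell
# Sketch: verify the Flux Kustomization and both Deployments are healthy.
# Assumption: the "maintenance" Kustomization is in flux-system by default.
check_preconditions() {
  local flux_ns="${1:-flux-system}"

  # Flux sets a Ready condition on a reconciled, healthy Kustomization.
  kubectl -n "$flux_ns" wait \
    kustomizations.kustomize.toolkit.fluxcd.io/maintenance \
    --for=condition=Ready --timeout=60s || return 1

  # Both Deployments must report the Available condition.
  kubectl -n maintenance wait \
    deployment/soteria deployment/oauth2-proxy-soteria \
    --for=condition=Available --timeout=60s || return 1
}
```

Calling `check_preconditions && echo "ok to drill"` returns non-zero as soon as either check fails or times out.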
Operator Flow (UI)
- Open `https://backup.bstein.dev` and sign in through Keycloak.
- In `PVC Inventory`, pick source namespace/PVC.
- Click `Backup now` and wait for success response in `Last Action`.
- Click `Restore`, choose a completed backup snapshot, and set:
  - `Target namespace`: destination namespace (defaults to source)
  - `Target PVC name`: unique drill PVC name (`restore-<source-pvc>-<date>`)
- Click `Create restore PVC`.
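To keep drill PVC names consistent between operators, the naming pattern can be generated instead of typed by hand. This is a small sketch under assumptions: the helper name `drill_pvc_name` is hypothetical, and the `YYYYMMDD` date format is a choice made here, since the runbook does not fix one.

```shell
# Sketch: build a drill PVC name of the form restore-<source-pvc>-<date>.
# Assumption: the date is rendered as YYYYMMDD unless passed explicitly.
drill_pvc_name() {
  local source_pvc="$1"
  local stamp="${2:-$(date +%Y%m%d)}"
  printf 'restore-%s-%s\n' "$source_pvc" "$stamp"
}
```

For example, `drill_pvc_name data 20240131` prints `restore-data-20240131`, which can be pasted into the target PVC name field.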
Verification
- Confirm restore target exists:
kubectl -n <target-namespace> get pvc <target-pvc>
- Confirm backup telemetry is present:
kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428
curl -fsS 'http://127.0.0.1:8428/api/v1/query?query=max%20by%20(namespace%2Cpvc)(pvc_backup_age_hours)'
- Confirm alerting input stays healthy:
pvc_backup_health{namespace="<source-namespace>",pvc="<source-pvc>"} == 1
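The verification steps above can be wrapped in a few helpers so the queries do not need manual URL encoding. This is a sketch, not an official script: the function names and the `VM_URL` variable are hypothetical, and it assumes the port-forward from the telemetry step is already running in another terminal.

```shell
# Sketch: helpers for the verification checks.
# Assumption: VictoriaMetrics is reachable through the local port-forward.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Print the phase of the restored PVC (for example Bound or Pending).
restore_pvc_phase() {
  kubectl -n "$1" get pvc "$2" -o jsonpath='{.status.phase}'
}

# Build the health selector for one source PVC.
health_query() {
  printf 'pvc_backup_health{namespace="%s",pvc="%s"}' "$1" "$2"
}

# Run an instant query; curl handles the URL encoding of the expression.
query_vm() {
  curl -fsS -G "$VM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

A typical call is `query_vm "$(health_query <source-namespace> <source-pvc>)"`; in the Prometheus-compatible JSON response, the sample value for a healthy PVC should be `1`.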
Cleanup
- Remove drill PVC after validation:
kubectl -n <target-namespace> delete pvc <target-pvc>
- If a detached restore Longhorn volume remains, remove it in Longhorn UI/API.
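Because the cleanup step deletes a PVC in what may be a production namespace, a guard against deleting the wrong object can help. The sketch below is an illustration only: `cleanup_drill_pvc` is a hypothetical helper, and it assumes drill PVCs follow the `restore-<source-pvc>-<date>` naming from the operator flow.

```shell
# Sketch: delete a drill PVC, refusing any name that is not restore-*.
# Assumption: all drill PVCs use the restore-<source-pvc>-<date> pattern.
cleanup_drill_pvc() {
  local ns="$1" pvc="$2"
  case "$pvc" in
    restore-*) ;;
    *)
      echo "refusing to delete '$pvc': not a restore-* drill PVC" >&2
      return 1
      ;;
  esac
  kubectl -n "$ns" delete pvc "$pvc"
}
```

Any leftover detached Longhorn volume still has to be removed through the Longhorn UI/API, as noted above; this helper does not touch it.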
Failure Triage
- `401/403` on UI/API:
  - Verify oauth2-proxy group claims include `admin` or `maintenance`.
- Restore conflict:
  - Target PVC already exists; pick a new target PVC name.
- Freshness alert firing (`maint-soteria-refresh-stale`):
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- Unhealthy PVC alert firing (`maint-soteria-backup-unhealthy`):
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` for stale/missing backup coverage.
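A first triage pass for the cases above can be scripted. This is a rough sketch with hypothetical helper names, assuming Soteria and its oauth2-proxy run as the `soteria` and `oauth2-proxy-soteria` Deployments in `maintenance`, as listed in the preconditions.

```shell
# Sketch: gather basic state for the failure cases listed above.
triage_soteria() {
  # Pod health, relevant to the maint-soteria-refresh-stale alert.
  kubectl -n maintenance get pods -o wide
  # Recent proxy logs help when the UI/API answers 401/403.
  kubectl -n maintenance logs deployment/oauth2-proxy-soteria --tail=50
  kubectl -n maintenance logs deployment/soteria --tail=50
}

# Returns 0 when the target PVC name is already taken (restore conflict).
# Caveat: a failed API call also returns non-zero, so "not taken" is a
# hint and not proof.
target_pvc_exists() {
  kubectl -n "$1" get pvc "$2" >/dev/null 2>&1
}
```

For example, `target_pvc_exists <target-namespace> <target-pvc> && echo "pick a new name"` flags a conflict before the restore is submitted.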