# Soteria PVC Restore Drill (backup.bstein.dev)

Use this runbook to run a minimal, production-safe restore drill after each meaningful Soteria change.
## Preconditions

- The `maintenance` kustomization is reconciled and healthy in Flux.
- The `soteria` and `oauth2-proxy-soteria` Deployments are ready in the `maintenance` namespace.
- The operator account is in the Keycloak group `admin` or `maintenance`.
- The source PVC is not ephemeral or throwaway test storage that should be excluded from backup policy.
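
The cluster-side preconditions can be checked from a terminal before starting. The following is a minimal sketch, assuming the Flux Kustomization object is named `maintenance` and lives in the `flux-system` namespace (Flux's default, not stated in this runbook). `check_preconditions` is a hypothetical helper name.

```shell
# Sketch: verify the cluster-side drill preconditions.
# Assumption: the Flux Kustomization `maintenance` lives in `flux-system`.
check_preconditions() {
  # The Flux Kustomization must report the Ready condition.
  kubectl -n flux-system wait kustomizations.kustomize.toolkit.fluxcd.io/maintenance \
    --for=condition=Ready --timeout=60s || return 1
  # Both Deployments must have completed their rollout.
  kubectl -n maintenance rollout status deployment/soteria --timeout=60s || return 1
  kubectl -n maintenance rollout status deployment/oauth2-proxy-soteria --timeout=60s || return 1
}
```

Run it as `check_preconditions && echo "preconditions OK"`. Keycloak group membership and the suitability of the source PVC still need a manual check.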
## Operator Flow (UI)

1. Open `https://backup.bstein.dev` and sign in through Keycloak.
2. In `PVC Inventory`, select the source namespace and PVC.
3. Click `Backup now` and wait for a success response in `Last Action`.
4. Click `Restore`, choose a completed backup snapshot, and set:
   - `Target namespace`: destination namespace (defaults to the source namespace)
   - `Target PVC name`: a unique drill PVC name (`restore-<source-pvc>-<date>`)
5. Click `Create restore PVC`.
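
To keep drill PVC names consistent, the `restore-<source-pvc>-<date>` pattern can be generated rather than typed. The sketch below assumes `<date>` is rendered as a UTC `YYYYMMDD` string, which this runbook does not specify. `drill_pvc_name` and the example PVC name `my-data` are hypothetical.

```shell
# Sketch: build a drill PVC name of the form restore-<source-pvc>-<date>.
# Assumption: <date> is UTC in YYYYMMDD form; the runbook does not fix a format.
drill_pvc_name() {
  src="${1:-}"
  # An optional second argument overrides today's date.
  day="${2:-$(date -u +%Y%m%d)}"
  if [ -z "$src" ]; then
    echo "usage: drill_pvc_name <source-pvc> [YYYYMMDD]" >&2
    return 1
  fi
  printf 'restore-%s-%s\n' "$src" "$day"
}
```

For example, `drill_pvc_name my-data 20250101` prints `restore-my-data-20250101`, which can be pasted into `Target PVC name`.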
## Verification

1. Confirm the restore target exists:
   - `kubectl -n <target-namespace> get pvc <target-pvc>`
2. Confirm backup telemetry is present:
   - `kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
   - `curl -fsS 'http://127.0.0.1:8428/api/v1/query?query=max%20by%20(namespace%2Cpvc)(pvc_backup_age_hours)'`
3. Confirm the alerting input stays healthy:
   - `pvc_backup_health{namespace="<source-namespace>",pvc="<source-pvc>"} == 1`
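
The health expression in step 3 can be run through the same port-forward used in step 2. This is a sketch only, assuming the port-forward is still listening on `127.0.0.1:8428`. The helper names `backup_health_query` and `check_backup_health` are hypothetical.

```shell
# Sketch: run the step 3 health expression against the port-forwarded endpoint.
# Assumption: the port-forward from step 2 is still running on 127.0.0.1:8428.

# Print the PromQL expression for one source namespace and PVC.
backup_health_query() {
  printf 'pvc_backup_health{namespace="%s",pvc="%s"} == 1\n' "$1" "$2"
}

# Send the expression as an instant query.
# -G moves the urlencoded data into the GET query string.
check_backup_health() {
  curl -fsS -G 'http://127.0.0.1:8428/api/v1/query' \
    --data-urlencode "query=$(backup_health_query "$1" "$2")"
}
```

Call it as `check_backup_health <source-namespace> <source-pvc>`. Because `== 1` acts as a filter, an empty `result` array in the JSON response means the series is missing or its value is not 1.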
## Cleanup

1. Remove the drill PVC after validation:
   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
2. If a detached Longhorn restore volume remains, remove it through the Longhorn UI or API.
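
Since the delete in step 1 is destructive, a small guard can reduce the chance of removing a production PVC by mistake. The sketch below assumes drill PVCs follow the `restore-` naming prefix from the operator flow. `delete_drill_pvc` is a hypothetical helper, not part of Soteria.

```shell
# Sketch: delete a PVC only if its name carries the drill prefix.
# Assumption: drill PVCs are named restore-<source-pvc>-<date>.
delete_drill_pvc() {
  ns="$1"
  pvc="$2"
  case "$pvc" in
    restore-*) ;;  # looks like a drill PVC, continue
    *)
      echo "refusing to delete non-drill PVC: $pvc" >&2
      return 1
      ;;
  esac
  kubectl -n "$ns" delete pvc "$pvc"
}
```

Call it as `delete_drill_pvc <target-namespace> <target-pvc>`. The prefix check is only a naming convention, so confirm the namespace and name before running it.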
## Failure Triage

- `401/403` on UI/API:
  - Verify that the oauth2-proxy group claims include `admin` or `maintenance`.
- Restore conflict:
  - The target PVC already exists; pick a new target PVC name.
- Freshness alert firing (`maint-soteria-refresh-stale`):
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- Unhealthy PVC alert firing (`maint-soteria-backup-unhealthy`):
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` for stale or missing backup coverage.
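
For the alert-related items, a first look at Soteria's health might start with the commands below. This is a hedged sketch: it uses only the Deployment names and namespace from the preconditions, the log tail length is arbitrary, and `soteria_triage` is a hypothetical helper name.

```shell
# Sketch: first-look commands when a Soteria alert fires.
soteria_triage() {
  # Readiness of Soteria and its auth proxy.
  kubectl -n maintenance get deployment soteria oauth2-proxy-soteria
  # Recent Soteria logs; 100 lines is an arbitrary choice.
  kubectl -n maintenance logs deployment/soteria --tail=100
}
```

Scrape reachability from `monitoring` depends on the metrics port and Service, which this runbook does not name, so that check is left out of the sketch.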