# Soteria PVC Restore Drill (backup.bstein.dev)

Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

## Production Restore Drill Checklist

1. Verify baseline health before touching restores.
   - `flux get kustomizations -n flux-system maintenance`
   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
2. Confirm operator access and source safety.
   - Operator must be in Keycloak group `admin` or `maintenance`.
   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
3. Run the UI flow at `https://backup.bstein.dev`.
   - Sign in via Keycloak.
   - In `PVC Inventory`, select source namespace and PVC.
   - Click `Backup now` and wait for success in `Last Action`.
   - Click `Restore` and pick a completed snapshot.
   - Set `Target namespace` and unique `Target PVC name` (`restore-<source-pvc>-<date>`).
   - Click `Create restore PVC`.
4. Validate restore output.
   - `kubectl -n <target-namespace> get pvc <target-pvc>`
   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
5. Clean up.
   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
   - Remove detached restore Longhorn volume from Longhorn UI/API if one remains.
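
The naming and validation steps above can be scripted around the UI flow. The following is a minimal sketch, not part of Soteria: the helper names `restore_pvc_name` and `wait_for_pvc_bound`, the `YYYYMMDD` date format, and the retry defaults are assumptions. Only the `restore-<source-pvc>-<date>` pattern and the `kubectl get pvc` check come from the checklist.

```shell
# Sketch only: helper names, date format, and retry defaults are assumptions.

# Build a target PVC name following the restore-<source-pvc>-<date> pattern.
# The date defaults to today in YYYYMMDD form (an assumed format).
restore_pvc_name() {
  local source_pvc="$1"
  local stamp="${2:-$(date +%Y%m%d)}"
  printf 'restore-%s-%s\n' "$source_pvc" "$stamp"
}

# Poll the restored PVC until its phase is Bound, or give up.
# Arguments: namespace, PVC name, attempts (default 30), delay in seconds (default 10).
wait_for_pvc_bound() {
  local namespace="$1" pvc="$2" attempts="${3:-30}" delay="${4:-10}"
  local phase=""
  local i=1
  while [ "$i" -le "$attempts" ]; do
    phase="$(kubectl -n "$namespace" get pvc "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Bound" ]; then
      return 0
    fi
    # Skip the pause after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "PVC $namespace/$pvc not Bound after $attempts attempts (last phase: ${phase:-unknown})" >&2
  return 1
}
```

Typical use is `target="$(restore_pvc_name <source-pvc>)"` before step 3, then `wait_for_pvc_bound <target-namespace> "$target"` in step 4. If the storage class uses `WaitForFirstConsumer` binding, the PVC stays `Pending` until a pod mounts it, so a `Pending` result alone does not prove the restore failed.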

## Alert Query Verification (`maint-soteria-*`)

Start a local query endpoint:

`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
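
The port-forward blocks the terminal, so it can be convenient to run it in the background and wait until the endpoint answers. This is a rough sketch under assumptions: the function names, the `PORT_FORWARD_PID` variable, the retry counts, and the `vector(1)` probe query are illustrative and not taken from this runbook.

```shell
# Sketch only: function names, retry defaults, and the probe query are assumptions.

# Return 0 once the query endpoint answers, or 1 after the given attempts.
# Arguments: base URL, attempts (default 20), delay in seconds (default 1).
wait_for_query_endpoint() {
  local base_url="$1" attempts="${2:-20}" delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --get "$base_url/api/v1/query" \
      --data-urlencode 'query=vector(1)' >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Start the port-forward in the background and wait for it to accept queries.
# The PID is left in PORT_FORWARD_PID so the caller can stop it later.
start_query_port_forward() {
  kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428 >/dev/null 2>&1 &
  PORT_FORWARD_PID=$!
  if ! wait_for_query_endpoint 'http://127.0.0.1:8428'; then
    kill "$PORT_FORWARD_PID" 2>/dev/null
    return 1
  fi
}
```

After `start_query_port_forward` succeeds, `trap 'kill "$PORT_FORWARD_PID"' EXIT` stops the forward when the shell exits.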

Validate each alert expression directly.

1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
   - Healthy expectation: age is below `900` and threshold query returns `0`.
2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
   - Healthy expectation: unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
   - Healthy expectation: total remains below `10` in normal operation; spikes should map to expected `reason` labels.
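
The three checks above can be wrapped in one loop that compares each result to its threshold. This is a hedged sketch, not an existing tool: the helper names, the `VM_URL` variable, and the dependency on `jq` are assumptions. The expressions and thresholds are copied from the list above.

```shell
# Sketch only: helper names, VM_URL, and the jq dependency are assumptions.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run an instant query and print the value of the first series (empty if none).
vm_first_value() {
  curl -fsS --get "$VM_URL/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1] // empty'
}

# Succeed when value <= limit, i.e. the "> limit" alert condition is not met.
# awk does the comparison because test(1) cannot compare decimals.
within_threshold() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 <= limit + 0) }'
}

# Print one status line for an alert. Arguments: alert name, expression, limit.
# Returns 0 when healthy, 1 when over the limit, 2 when no value came back.
check_alert() {
  local name="$1" expr="$2" limit="$3" value
  value="$(vm_first_value "$expr")"
  if [ -z "$value" ]; then
    echo "$name: NO DATA"
    return 2
  fi
  if within_threshold "$value" "$limit"; then
    echo "$name: OK ($value <= $limit)"
  else
    echo "$name: FIRING ($value > $limit)"
    return 1
  fi
}

# Evaluate all three alert expressions against their thresholds.
run_alert_checks() {
  check_alert maint-soteria-refresh-stale \
    'time() - soteria_inventory_refresh_timestamp_seconds' 900
  check_alert maint-soteria-backup-unhealthy \
    'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' 0
  check_alert maint-soteria-authz-denials \
    'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)' 9
}
```

The sketch reads only the first returned series and ignores the `for 10m` hold on the denial alert, so it reports the instantaneous condition rather than the exact alert state. The per-PVC and per-`reason` breakdown queries remain the place to look when a check reports `FIRING`.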

## Failure Triage

- `401/403` on UI or API:
  - Verify oauth2-proxy group claims include `admin` or `maintenance`.
- Restore conflict:
  - Target PVC already exists; choose a new target PVC name.
- `maint-soteria-refresh-stale` firing:
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- `maint-soteria-backup-unhealthy` firing:
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
- `maint-soteria-authz-denials` firing:
  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
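
A few of the triage checks above map onto plain `kubectl` calls. The helpers below are a sketch under assumptions: the function names, the `--tail` size, and the rollout timeout are illustrative, and whether group claims actually appear in the oauth2-proxy logs depends on its logging configuration. The deployment names and the `maintenance` namespace come from the baseline health step.

```shell
# Sketch only: helper names, tail size, and timeout are assumptions.

# Restore conflict: succeed (0) if no PVC with this name exists yet,
# fail (1) if it exists, and return 2 if kubectl itself errors.
# Arguments: namespace, PVC name.
pvc_name_free() {
  local found
  found="$(kubectl -n "$1" get pvc "$2" --ignore-not-found -o name)" || return 2
  [ -z "$found" ]
}

# 401/403: print recent oauth2-proxy log lines to look for denied requests.
# Argument: number of lines (default 100).
oauth_proxy_recent_logs() {
  kubectl -n maintenance logs deploy/oauth2-proxy-soteria --tail="${1:-100}"
}

# Stale refresh: confirm the Soteria deployment has finished rolling out.
# Argument: timeout (default 30s).
soteria_rollout_ok() {
  kubectl -n maintenance rollout status deploy/soteria --timeout="${1:-30s}"
}
```

Running `pvc_name_free <target-namespace> <target-pvc>` before clicking `Create restore PVC` avoids the conflict case entirely.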

## Emergency Recovery Notes (2026-05-22)

- `titan-04` reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on `titan-08` and `titan-11`. A temporary `recovery-suspect` taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
- `titan-05` stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
- `titan-06` remained unreachable (`No route to host`) and needs out-of-band power or network recovery.
- `titan-14` passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in `ContainerCreating`. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the `longhorn-host` label was removed because it is not one of the intended HDD Longhorn nodes.
- `titan-22` stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
- Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and `maintenance-vault-sync` moved off `titan-0a`. Remaining `titan-0a`/`titan-0c` pods are daemonsets or HA control/storage helpers, not generic app load.
- Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by `atlas.bstein.dev/crypto-mining-enabled=true`.
- Ananke should not treat node `Ready` as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a bound canary that actually reaches `Completed`; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
- Ananke should detect under-requested Jenkins agents during recovery. Several agents requested `25m` CPU per build container while allowing `1500m`, which let the scheduler place real CI load on fragile nodes.
- Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
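
Two of the recommendations above, the stricter recovery gate and the under-requested agent check, can be illustrated in shell. This is a hypothetical sketch and not Ananke code: every function name is made up, the canary pod is assumed to exist already, and the kubelet/metrics scrape check is left out because the relevant metric labels are not given here.

```shell
# Sketch only: these helpers are hypothetical and are not Ananke code.

# The node condition Ready must be "True". Argument: node name.
node_ready() {
  [ "$(kubectl get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]
}

# SSH must answer quickly and without prompting. Argument: host.
ssh_reachable() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# The canary must have run on this node and finished: kubectl shows such a
# pod as Completed, which corresponds to phase Succeeded.
# Arguments: node, canary namespace, canary pod.
canary_completed() {
  [ "$(kubectl -n "$2" get pod "$3" -o jsonpath='{.spec.nodeName} {.status.phase}')" = "$1 Succeeded" ]
}

# Gate: Ready alone is not enough; all three checks must pass.
# Arguments: node, canary namespace, canary pod.
node_recovered() {
  node_ready "$1" && ssh_reachable "$1" && canary_completed "$1" "$2" "$3"
}

# Convert a CPU quantity such as 25m, 1500m, or 1.5 to millicores.
cpu_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%.0f\n", cores * 1000 }' ;;
  esac
}

# Integer ratio of CPU limit to CPU request. Arguments: request, limit.
cpu_burst_ratio() {
  local request limit
  request="$(cpu_millicores "$1")"
  limit="$(cpu_millicores "$2")"
  [ "$request" -gt 0 ] || return 1
  echo $((limit / request))
}
```

For the agents described above, `cpu_burst_ratio 25m 1500m` gives a ratio of 60, meaning a container can use sixty times the CPU the scheduler accounted for. What ratio should count as under-requested is a policy choice that these notes do not specify.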