maintenance(soteria): tighten oauth2 ingress and drill validation

This commit is contained in:
Brad Stein 2026-04-12 14:58:25 -03:00
parent a87a5f7bff
commit 75a992b829
3 changed files with 72 additions and 35 deletions


@@ -1,47 +1,60 @@
 # Soteria PVC Restore Drill (backup.bstein.dev)
 
-Use this runbook for a minimal production-safe restore drill after each meaningful Soteria change.
+Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.
 
-## Preconditions
+## Production Restore Drill Checklist
 
-- `maintenance` kustomization is reconciled and healthy in Flux.
-- `soteria` and `oauth2-proxy-soteria` Deployments are ready in `maintenance`.
-- Operator account is in Keycloak group `admin` or `maintenance`.
-- Source PVC is not ephemeral/test throwaway storage that should be excluded from backup policy.
-
-## Operator Flow (UI)
-
-1. Open `https://backup.bstein.dev` and sign in through Keycloak.
-2. In `PVC Inventory`, pick source namespace/PVC.
-3. Click `Backup now` and wait for success response in `Last Action`.
-4. Click `Restore`, choose a completed backup snapshot, and set:
-   - `Target namespace`: destination namespace (defaults to source)
-   - `Target PVC name`: unique drill PVC name (`restore-<source-pvc>-<date>`)
-5. Click `Create restore PVC`.
-
-## Verification
-
-1. Confirm restore target exists:
-   - `kubectl -n <target-namespace> get pvc <target-pvc>`
-2. Confirm backup telemetry is present:
-   - `kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
-   - `curl -fsS 'http://127.0.0.1:8428/api/v1/query?query=max%20by%20(namespace%2Cpvc)(pvc_backup_age_hours)'`
-3. Confirm alerting input stays healthy:
-   - `pvc_backup_health{namespace="<source-namespace>",pvc="<source-pvc>"} == 1`
-
-## Cleanup
-
-1. Remove drill PVC after validation:
-   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
-2. If a detached restore Longhorn volume remains, remove it in Longhorn UI/API.
+1. Verify baseline health before touching restores.
+   - `flux get kustomizations -n flux-system maintenance`
+   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
+2. Confirm operator access and source safety.
+   - Operator must be in Keycloak group `admin` or `maintenance`.
+   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
+3. Run the UI flow at `https://backup.bstein.dev`.
+   - Sign in via Keycloak.
+   - In `PVC Inventory`, select source namespace and PVC.
+   - Click `Backup now` and wait for success in `Last Action`.
+   - Click `Restore` and pick a completed snapshot.
+   - Set `Target namespace` and unique `Target PVC name` (`restore-<source-pvc>-<date>`).
+   - Click `Create restore PVC`.
+4. Validate restore output.
+   - `kubectl -n <target-namespace> get pvc <target-pvc>`
+   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
+5. Clean up.
+   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
+   - Remove detached restore Longhorn volume from Longhorn UI/API if one remains.
+
+## Alert Query Verification (`maint-soteria-*`)
+
+Start a local query endpoint:
+
+`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
+
+Validate each alert expression directly.
+
+1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
+   - Healthy expectation: age is below `900` and the threshold query returns `0`.
+2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
+   - Healthy expectation: unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
+3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
+   - Healthy expectation: total remains below `10` in normal operation; spikes should map to expected `reason` labels.
 
 ## Failure Triage
 
-- `401/403` on UI/API:
+- `401/403` on UI or API:
   - Verify oauth2-proxy group claims include `admin` or `maintenance`.
 - Restore conflict:
-  - Target PVC already exists; pick a new target PVC name.
+  - Target PVC already exists; choose a new target PVC name.
-- Freshness alert firing (`maint-soteria-refresh-stale`):
+- `maint-soteria-refresh-stale` firing:
   - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
-- Unhealthy PVC alert firing (`maint-soteria-backup-unhealthy`):
-  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` for stale/missing backup coverage.
+- `maint-soteria-backup-unhealthy` firing:
+  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
+- `maint-soteria-authz-denials` firing:
+  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
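The three `maint-soteria-*` alerts verified in the runbook all reduce to simple threshold checks over the queried metric values. A minimal Python sketch of that logic (the helper functions and sample values are illustrative, not part of this commit; real inputs would come from the VictoriaMetrics queries above):

```python
import time

# Illustrative mirror of the maint-soteria-* alert thresholds from the
# runbook. Metric values would come from /api/v1/query responses.

def refresh_stale(refresh_ts: float, now: float, threshold: float = 900.0) -> bool:
    """maint-soteria-refresh-stale: inventory refresh older than 900s."""
    return (now - refresh_ts) > threshold

def backup_unhealthy(health_by_pvc: dict) -> bool:
    """maint-soteria-backup-unhealthy: any PVC with pvc_backup_health != 1."""
    return sum(1 - h for h in health_by_pvc.values()) > 0

def authz_denials_high(denials_15m: float, threshold: float = 9.0) -> bool:
    """maint-soteria-authz-denials: more than 9 denials in the 15m window."""
    return denials_15m > threshold

now = time.time()
print(refresh_stale(now - 120, now))                # refreshed 2m ago -> False
print(backup_unhealthy({"pvc-a": 1, "pvc-b": 1}))   # all healthy -> False
print(authz_denials_high(12))                       # denial spike -> True
```

The `or on() vector(0)` in the real expressions only ensures the query returns `0` instead of no data when no series exist; the firing condition itself is the comparison shown here.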


@@ -38,6 +38,7 @@ resources:
 - node-image-sweeper-daemonset.yaml
 - metis-service.yaml
 - soteria-networkpolicy.yaml
+- oauth2-proxy-soteria-networkpolicy.yaml
 - soteria-ingress.yaml
 - soteria-certificate.yaml
 - oauth2-proxy-soteria.yaml


@@ -0,0 +1,23 @@
+# services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: oauth2-proxy-soteria-ingress
+  namespace: maintenance
+spec:
+  podSelector:
+    matchLabels:
+      app: oauth2-proxy-soteria
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: traefik
+          podSelector:
+            matchLabels:
+              app: traefik
+      ports:
+        - protocol: TCP
+          port: 4180
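Because the `namespaceSelector` and `podSelector` sit in the same `from` entry, a peer is admitted only when both selectors match (AND semantics); listing them as separate `from` entries would OR them instead. A small Python sketch of that matching rule (hypothetical label dicts, not a Kubernetes API call):

```python
# Sketch of the ingress rule's AND semantics: traffic is allowed only
# from pods whose namespace labels AND pod labels both match, because
# namespaceSelector and podSelector share one `from` element.

def selector_matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present."""
    return all(labels.get(k) == v for k, v in selector.items())

def peer_allowed(ns_labels: dict, pod_labels: dict) -> bool:
    ns_selector = {"kubernetes.io/metadata.name": "traefik"}
    pod_selector = {"app": "traefik"}
    return (selector_matches(ns_selector, ns_labels)
            and selector_matches(pod_selector, pod_labels))

# Traefik pod in the traefik namespace: allowed.
print(peer_allowed({"kubernetes.io/metadata.name": "traefik"}, {"app": "traefik"}))
# Same pod labels from any other namespace: denied.
print(peer_allowed({"kubernetes.io/metadata.name": "default"}, {"app": "traefik"}))
```

Everything else hitting oauth2-proxy-soteria on TCP 4180 is dropped once this policy selects the pod, since `Ingress` is the only declared policy type and this is the sole rule.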