diff --git a/services/maintenance/NOTES.md b/services/maintenance/NOTES.md
index 965d1ab5..1049cb59 100644
--- a/services/maintenance/NOTES.md
+++ b/services/maintenance/NOTES.md
@@ -1,47 +1,60 @@
 # Soteria PVC Restore Drill (backup.bstein.dev)
 
-Use this runbook for a minimal production-safe restore drill after each meaningful Soteria change.
+Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.
 
-## Preconditions
+## Production Restore Drill Checklist
 
-- `maintenance` kustomization is reconciled and healthy in Flux.
-- `soteria` and `oauth2-proxy-soteria` Deployments are ready in `maintenance`.
-- Operator account is in Keycloak group `admin` or `maintenance`.
-- Source PVC is not ephemeral/test throwaway storage that should be excluded from backup policy.
-
-## Operator Flow (UI)
-
-1. Open `https://backup.bstein.dev` and sign in through Keycloak.
-2. In `PVC Inventory`, pick source namespace/PVC.
-3. Click `Backup now` and wait for success response in `Last Action`.
-4. Click `Restore`, choose a completed backup snapshot, and set:
-   - `Target namespace`: destination namespace (defaults to source)
-   - `Target PVC name`: unique drill PVC name (`restore-<pvc>-<date>`)
-5. Click `Create restore PVC`.
-
-## Verification
-
-1. Confirm restore target exists:
+1. Verify baseline health before touching restores.
+   - `flux get kustomizations -n flux-system maintenance`
+   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
+2. Confirm operator access and source safety.
+   - Operator must be in Keycloak group `admin` or `maintenance`.
+   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
+3. Run the UI flow at `https://backup.bstein.dev`.
+   - Sign in via Keycloak.
+   - In `PVC Inventory`, select source namespace and PVC.
+   - Click `Backup now` and wait for success in `Last Action`.
+   - Click `Restore` and pick a completed snapshot.
+   - Set `Target namespace` and a unique `Target PVC name` (`restore-<pvc>-<date>`).
+   - Click `Create restore PVC`.
+4. Validate restore output.
    - `kubectl -n <namespace> get pvc <restore-pvc-name>`
-2. Confirm backup telemetry is present:
-   - `kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
-   - `curl -fsS 'http://127.0.0.1:8428/api/v1/query?query=max%20by%20(namespace%2Cpvc)(pvc_backup_age_hours)'`
-3. Confirm alerting input stays healthy:
-   - `pvc_backup_health{namespace="<namespace>",pvc="<pvc>"} == 1`
-
-## Cleanup
-
-1. Remove drill PVC after validation:
+   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
+5. Clean up.
    - `kubectl -n <namespace> delete pvc <restore-pvc-name>`
-2. If a detached restore Longhorn volume remains, remove it in Longhorn UI/API.
+   - Remove the detached restore Longhorn volume via the Longhorn UI/API if one remains.
+
+## Alert Query Verification (`maint-soteria-*`)
+
+Start a local query endpoint:
+
+`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
+
+Validate each alert expression directly.
+
+1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
+   - Healthy expectation: age is below `900` and the threshold query returns `0`.
+2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
+   - Healthy expectation: the unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
+3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
+   - Healthy expectation: the total remains below `10` in normal operation; spikes should map to expected `reason` labels.
 
 ## Failure Triage
 
-- `401/403` on UI/API:
+- `401/403` on UI or API:
   - Verify oauth2-proxy group claims include `admin` or `maintenance`.
 - Restore conflict:
-  - Target PVC already exists; pick a new target PVC name.
-- Freshness alert firing (`maint-soteria-refresh-stale`):
+  - Target PVC already exists; choose a new target PVC name.
+- `maint-soteria-refresh-stale` firing:
   - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
-- Unhealthy PVC alert firing (`maint-soteria-backup-unhealthy`):
-  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` for stale/missing backup coverage.
+- `maint-soteria-backup-unhealthy` firing:
+  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
+- `maint-soteria-authz-denials` firing:
+  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
diff --git a/services/maintenance/kustomization.yaml b/services/maintenance/kustomization.yaml
index aca4c9be..37818a64 100644
--- a/services/maintenance/kustomization.yaml
+++ b/services/maintenance/kustomization.yaml
@@ -36,6 +36,7 @@ resources:
   - image-sweeper-cronjob.yaml
   - metis-service.yaml
   - soteria-networkpolicy.yaml
+  - oauth2-proxy-soteria-networkpolicy.yaml
   - soteria-ingress.yaml
   - soteria-certificate.yaml
   - oauth2-proxy-soteria.yaml
diff --git a/services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml b/services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
new file mode 100644
index 00000000..b25eda10
--- /dev/null
+++ b/services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
@@ -0,0 +1,23 @@
+# services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: oauth2-proxy-soteria-ingress
+  namespace: maintenance
+spec:
+  podSelector:
+    matchLabels:
+      app: oauth2-proxy-soteria
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: traefik
+          podSelector:
+            matchLabels:
+              app: traefik
+      ports:
+        - protocol: TCP
+          port: 4180
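Reviewer note: the "Healthy expectation" checks in the alert-query section lend themselves to scripting. Below is a minimal sketch, not part of the patch, with hypothetical helper names (`instant_query_value`, `drill_is_healthy`); it assumes only the standard Prometheus-compatible `/api/v1/query` instant-query JSON shape that VictoriaMetrics returns.

```python
import json

def instant_query_value(response_text: str, default: float = 0.0) -> float:
    """Extract the value of the first series from a /api/v1/query response.

    An empty result set falls back to `default`, mirroring the
    `or on() vector(0)` idiom used in the alert expressions.
    """
    payload = json.loads(response_text)
    results = payload.get("data", {}).get("result", [])
    if not results:
        return default
    # Instant vectors carry [timestamp, "value"]; the value arrives as a string.
    return float(results[0]["value"][1])

def drill_is_healthy(refresh_age_seconds: float,
                     unhealthy_count: float,
                     denial_total: float) -> bool:
    """Mirror the three thresholds: > 900s stale, > 0 unhealthy, > 9 denials."""
    return (refresh_age_seconds <= 900
            and unhealthy_count == 0
            and denial_total <= 9)

# Example: a response shaped like the refresh-age query might return.
sample = ('{"status":"success","data":{"resultType":"vector",'
          '"result":[{"metric":{},"value":[1700000000,"42.5"]}]}}')
age = instant_query_value(sample)
print(drill_is_healthy(age, 0.0, 3.0))  # True: all three thresholds hold
```

In practice the three inputs would come from the three `curl` queries above; the point of the sketch is that "healthy" is the conjunction of all three threshold checks, not any single one.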
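One subtlety worth flagging in the new NetworkPolicy: because `namespaceSelector` and `podSelector` appear in the same `from` element, they are ANDed, so only pods labeled `app: traefik` inside the `traefik` namespace may reach oauth2-proxy on 4180. A toy model of that matching logic (not the real Kubernetes evaluator, and `policy_admits` is a hypothetical name):

```python
def policy_admits(peer_namespace_labels: dict, peer_pod_labels: dict) -> bool:
    """Toy model of the oauth2-proxy-soteria-ingress rule.

    In a single `from` element, namespaceSelector AND podSelector must
    both match the connecting peer for traffic to be admitted.
    """
    ns_ok = peer_namespace_labels.get("kubernetes.io/metadata.name") == "traefik"
    pod_ok = peer_pod_labels.get("app") == "traefik"
    return ns_ok and pod_ok

# Traefik pod in the traefik namespace: admitted (on TCP/4180).
print(policy_admits({"kubernetes.io/metadata.name": "traefik"},
                    {"app": "traefik"}))  # True
# Same pod labels but a different namespace: denied.
print(policy_admits({"kubernetes.io/metadata.name": "default"},
                    {"app": "traefik"}))  # False
```

Had the two selectors been written as separate `from` list items, they would instead be ORed, admitting every pod in the `traefik` namespace plus any `app: traefik` pod anywhere; the single-item form in the patch is the stricter and intended one.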