maintenance(soteria): tighten oauth2 ingress and drill validation

monitoring: fix typhon low-threshold alert semantics
2026-04-12 14:58:25 -03:00 · 2026-04-12 14:56:34 -03:00
4 changed files with 511 additions and 37 deletions
--- a/services/maintenance/NOTES.md
+++ b/services/maintenance/NOTES.md
@@ -1,47 +1,60 @@
 # Soteria PVC Restore Drill (backup.bstein.dev)

-Use this runbook for a minimal production-safe restore drill after each meaningful Soteria change.
+Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

-## Preconditions
+## Production Restore Drill Checklist

- `maintenance` kustomization is reconciled and healthy in Flux.
- `soteria` and `oauth2-proxy-soteria` Deployments are ready in `maintenance`.
- Operator account is in Keycloak group `admin` or `maintenance`.
- Source PVC is not ephemeral/test throwaway storage that should be excluded from backup policy.
-
-## Operator Flow (UI)
-
-1. Open `https://backup.bstein.dev` and sign in through Keycloak.
-2. In `PVC Inventory`, pick source namespace/PVC.
-3. Click `Backup now` and wait for success response in `Last Action`.
-4. Click `Restore`, choose a completed backup snapshot, and set:
-   - `Target namespace`: destination namespace (defaults to source)
-   - `Target PVC name`: unique drill PVC name (`restore-<source-pvc>-<date>`)
-5. Click `Create restore PVC`.
-
-## Verification
-
-1. Confirm restore target exists:
+1. Verify baseline health before touching restores.
+   - `flux get kustomizations -n flux-system maintenance`
+   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
+2. Confirm operator access and source safety.
+   - Operator must be in Keycloak group `admin` or `maintenance`.
+   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
+3. Run the UI flow at `https://backup.bstein.dev`.
+   - Sign in via Keycloak.
+   - In `PVC Inventory`, select source namespace and PVC.
+   - Click `Backup now` and wait for success in `Last Action`.
+   - Click `Restore` and pick a completed snapshot.
+   - Set `Target namespace` and unique `Target PVC name` (`restore-<source-pvc>-<date>`).
+   - Click `Create restore PVC`.
+4. Validate restore output.
   - `kubectl -n <target-namespace> get pvc <target-pvc>`
-2. Confirm backup telemetry is present:
-   - `kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
-   - `curl -fsS 'http://127.0.0.1:8428/api/v1/query?query=max%20by%20(namespace%2Cpvc)(pvc_backup_age_hours)'`
-3. Confirm alerting input stays healthy:
-   - `pvc_backup_health{namespace="<source-namespace>",pvc="<source-pvc>"} == 1`
-
-## Cleanup
-
-1. Remove drill PVC after validation:
+   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
+5. Clean up.
   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
-2. If a detached restore Longhorn volume remains, remove it in Longhorn UI/API.
+   - Remove detached restore Longhorn volume from Longhorn UI/API if one remains.
+
+## Alert Query Verification (`maint-soteria-*`)
+
+Start a local query endpoint:
+
+`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`
+
+Validate each alert expression directly.
+
+1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
+   - Healthy expectation: age is below `900` and threshold query returns `0`.
+2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
+   - Healthy expectation: unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
+3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
+   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
+   - Healthy expectation: total remains below `10` in normal operation; spikes should map to expected `reason` labels.

 ## Failure Triage

- `401/403` on UI/API:
+- `401/403` on UI or API:
  - Verify oauth2-proxy group claims include `admin` or `maintenance`.
 - Restore conflict:
-  - Target PVC already exists; pick a new target PVC name.
- Freshness alert firing (`maint-soteria-refresh-stale`):
+  - Target PVC already exists; choose a new target PVC name.
+- `maint-soteria-refresh-stale` firing:
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- Unhealthy PVC alert firing (`maint-soteria-backup-unhealthy`):
-  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` for stale/missing backup coverage.
+- `maint-soteria-backup-unhealthy` firing:
+  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
+- `maint-soteria-authz-denials` firing:
+  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
--- a/services/maintenance/kustomization.yaml
+++ b/services/maintenance/kustomization.yaml
@@ -38,6 +38,7 @@ resources:
  - node-image-sweeper-daemonset.yaml
  - metis-service.yaml
  - soteria-networkpolicy.yaml
+  - oauth2-proxy-soteria-networkpolicy.yaml
  - soteria-ingress.yaml
  - soteria-certificate.yaml
  - oauth2-proxy-soteria.yaml
--- a/services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
+++ b/services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
@@ -0,0 +1,23 @@
+# services/maintenance/oauth2-proxy-soteria-networkpolicy.yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: oauth2-proxy-soteria-ingress
+  namespace: maintenance
+spec:
+  podSelector:
+    matchLabels:
+      app: oauth2-proxy-soteria
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: traefik
+          podSelector:
+            matchLabels:
+              app: traefik
+      ports:
+        - protocol: TCP
+          port: 4180
--- a/services/monitoring/grafana-alerting-config.yaml
+++ b/services/monitoring/grafana-alerting-config.yaml
@@ -131,7 +131,7 @@ data:
                  type: threshold
                  conditions:
                    - evaluator:
-                        params: [3]
+                        params: [2]
                        type: gt
                      operator:
                        type: and
@@ -578,7 +578,7 @@ data:
                  type: threshold
                  conditions:
                    - evaluator:
-                        params: [10]
+                        params: [9]
                        type: gt
                      operator:
                        type: and
@@ -793,3 +793,440 @@ data:
              summary: "Postmark exporter reports sustained API outage"
            labels:
              severity: warning
+      - orgId: 1
+        name: typhon
+        folder: Alerts
+        interval: 1m
+        rules:
+          - uid: typhon-exporter-down
+            title: "Typhon exporter down (>10m)"
+            condition: C
+            for: "10m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 600
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: max(typhon_up) or on() vector(0)
+                  legendFormat: typhon_up
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [1]
+                        type: lt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: Alerting
+            execErrState: Alerting
+            annotations:
+              summary: "Typhon has been down for >10m"
+            labels:
+              severity: critical
+          - uid: typhon-data-stale
+            title: "Typhon data stale (>180s for 10m)"
+            condition: C
+            for: "10m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 600
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: max(typhon_data_age_seconds) or on() vector(0)
+                  legendFormat: data age
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [180]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: NoData
+            execErrState: Error
+            annotations:
+              summary: "Typhon data age >180s for >10m"
+            labels:
+              severity: warning
+          - uid: typhon-auth-failures
+            title: "Typhon auth failures burst"
+            condition: C
+            for: "5m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 600
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: sum(increase(typhon_poll_errors_total{reason=\"auth\"}[10m])) or on() vector(0)
+                  legendFormat: auth failures 10m
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [3]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: NoData
+            execErrState: Error
+            annotations:
+              summary: "Typhon auth failures exceeded threshold in 10m"
+            labels:
+              severity: critical
+          - uid: typhon-api-errors
+            title: "Typhon API/timeouts burst"
+            condition: C
+            for: "15m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 900
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: sum(increase(typhon_poll_errors_total{reason=~\"api|timeout|unknown\"}[15m])) or on() vector(0)
+                  legendFormat: poll errors 15m
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [10]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon API/timeouts exceeded threshold in 15m"
+            labels:
+              severity: warning
+          - uid: typhon-temp-critical
+            title: "Tent temperature critical (>34C)"
+            condition: C
+            for: "10m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 600
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: max(typhon_temperature_celsius) or on() vector(0)
+                  legendFormat: max temp
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [34]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon reports tent temperature >34C for >10m"
+            labels:
+              severity: critical
+          - uid: typhon-humidity-high
+            title: "Tent humidity high (>75%)"
+            condition: C
+            for: "20m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 1200
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: max(typhon_relative_humidity_percent) or on() vector(0)
+                  legendFormat: max humidity
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [75]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon reports relative humidity >75% for >20m"
+            labels:
+              severity: warning
+          - uid: typhon-humidity-low
+            title: "Tent humidity low (<30%)"
+            condition: C
+            for: "20m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 1200
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: min(typhon_relative_humidity_percent)
+                  legendFormat: min humidity
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [30]
+                        type: lt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon reports relative humidity <30% for >20m"
+            labels:
+              severity: warning
+          - uid: typhon-vpd-high
+            title: "Tent VPD high (>2.0 kPa)"
+            condition: C
+            for: "20m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 1200
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: max(typhon_vpd_kpa) or on() vector(0)
+                  legendFormat: max vpd
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [2.0]
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon reports VPD >2.0 kPa for >20m"
+            labels:
+              severity: warning
+          - uid: typhon-vpd-low
+            title: "Tent VPD low (<0.4 kPa)"
+            condition: C
+            for: "20m"
+            data:
+              - refId: A
+                relativeTimeRange:
+                  from: 1200
+                  to: 0
+                datasourceUid: atlas-vm
+                model:
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  expr: min(typhon_vpd_kpa)
+                  legendFormat: min vpd
+                  datasource:
+                    type: prometheus
+                    uid: atlas-vm
+              - refId: B
+                datasourceUid: __expr__
+                model:
+                  expression: A
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  reducer: last
+                  type: reduce
+              - refId: C
+                datasourceUid: __expr__
+                model:
+                  expression: B
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  type: threshold
+                  conditions:
+                    - evaluator:
+                        params: [0.4]
+                        type: lt
+                      operator:
+                        type: and
+                      reducer:
+                        type: last
+                      type: query
+            noDataState: OK
+            execErrState: Error
+            annotations:
+              summary: "Typhon reports VPD <0.4 kPa for >20m"
+            labels:
+              severity: warning
Author	SHA1	Message	Date
Brad Stein	75a992b829	maintenance(soteria): tighten oauth2 ingress and drill validation	2026-04-12 14:58:25 -03:00
Brad Stein	a87a5f7bff	monitoring: fix typhon low-threshold alert semantics	2026-04-12 14:56:34 -03:00
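
Note on the low-threshold rules: in the diff above, the `gt` rules query `max(...) or on() vector(0)`, while `typhon-humidity-low` and `typhon-vpd-low` query a bare `min(...)` and set `noDataState: OK`. A plausible reading of "fix typhon low-threshold alert semantics" (an assumption, since the earlier expressions are not shown here) is that a `0` fallback would satisfy an `lt` threshold whenever the exporter returns no series. The sketch below models that interaction in plain shell; `threshold_state` and `with_fallback` are hypothetical helpers for illustration, not part of Grafana or this repository.

```shell
#!/usr/bin/env bash
# Sketch only: models how a threshold evaluator reacts to a `vector(0)` fallback.
# threshold_state and with_fallback are hypothetical names, not Grafana APIs.

# Usage: threshold_state <gt|lt> <limit> <value-or-empty>
# Prints "firing", "ok", or "nodata" (an empty value models a query with no series).
threshold_state() {
  local op="$1" limit="$2" value="$3"
  if [ -z "$value" ]; then
    echo "nodata"
    return 0
  fi
  # awk performs the floating-point comparison that test(1) cannot do.
  awk -v op="$op" -v limit="$limit" -v value="$value" 'BEGIN {
    fired = (op == "gt") ? (value > limit) : (value < limit)
    print (fired ? "firing" : "ok")
  }'
}

# Models `<expr> or on() vector(0)`: an empty result is replaced by 0.
with_fallback() {
  if [ -z "$1" ]; then echo "0"; else echo "$1"; fi
}

# Missing humidity data, low threshold, WITH fallback: 0 < 30, so the rule fires.
threshold_state lt 30 "$(with_fallback "")"

# Missing humidity data, low threshold, WITHOUT fallback: the rule sees no data.
threshold_state lt 30 ""

# Missing humidity data, high threshold, WITH fallback: 0 > 75 is false, so it stays quiet.
threshold_state gt 75 "$(with_fallback "")"
```

Under these assumptions the three calls print `firing`, `nodata`, and `ok`. That matches the shape of the rules in the diff: the fallback is harmless for `gt` thresholds, and the `lt` rules rely on `noDataState: OK` instead of a synthetic zero.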