titan-iac/services/maintenance/NOTES.md

# Soteria PVC Restore Drill (backup.bstein.dev)

Use this checklist after meaningful Soteria backup, restore, auth, or alerting changes.

## Production Restore Drill Checklist

1. Verify baseline health before touching restores.
   - `flux get kustomizations -n flux-system maintenance`
   - `kubectl -n maintenance get deploy soteria oauth2-proxy-soteria`
2. Confirm operator access and source safety.
   - Operator must be in Keycloak group `admin` or `maintenance`.
   - Choose a real source PVC that is expected to be backed up, not a throwaway test PVC.
3. Run the UI flow at `https://backup.bstein.dev`.
   - Sign in via Keycloak.
   - In `PVC Inventory`, select source namespace and PVC.
   - Click `Backup now` and wait for success in `Last Action`.
   - Click `Restore` and pick a completed snapshot.
   - Set `Target namespace` and a unique `Target PVC name` (`restore-<source-pvc>-<date>`).
   - Click `Create restore PVC`.
4. Validate restore output.
   - `kubectl -n <target-namespace> get pvc <target-pvc>`
   - If workload-level validation is required, attach a temporary pod and inspect expected files/data.
5. Clean up.
   - `kubectl -n <target-namespace> delete pvc <target-pvc>`
   - Remove the detached Longhorn restore volume through the Longhorn UI or API if one remains.
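
The naming and validation steps above can be sketched as two small shell helpers. This is a minimal sketch, not part of Soteria: the function names are hypothetical, the `YYYYMMDD` date format is an assumption, and the `jsonpath` form of `kubectl wait` needs a reasonably recent kubectl.

```shell
# Build a target PVC name following the restore-<source-pvc>-<date> pattern.
# The date defaults to today in UTC as YYYYMMDD (an assumed format).
restore_pvc_name() {
  local src=$1
  local stamp=${2:-$(date -u +%Y%m%d)}
  printf 'restore-%s-%s\n' "$src" "$stamp"
}

# Wait until the restored claim reports phase Bound, then print it.
# Third argument is an optional timeout (defaults to 300s).
wait_for_restore_pvc() {
  local ns=$1
  local pvc=$2
  kubectl -n "$ns" wait --for=jsonpath='{.status.phase}'=Bound \
    "pvc/$pvc" --timeout="${3:-300s}" &&
    kubectl -n "$ns" get pvc "$pvc"
}
```

For example, `wait_for_restore_pvc <target-namespace> "$(restore_pvc_name <source-pvc>)"`. If the storage class binds only on first consumer, the claim stays `Pending` until a pod mounts it, so the wait would time out even though the restore is fine.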

## Alert Query Verification (`maint-soteria-*`)

Start a local query endpoint:

`kubectl -n monitoring port-forward svc/victoria-metrics-k8s-stack 8428:8428`

Validate each alert expression directly.

1. `maint-soteria-refresh-stale` (`time() - soteria_inventory_refresh_timestamp_seconds`, threshold `> 900`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=time() - soteria_inventory_refresh_timestamp_seconds'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(time() - soteria_inventory_refresh_timestamp_seconds) > bool 900'`
   - Healthy expectation: age is below `900` seconds and the threshold query returns `0`.
2. `maint-soteria-backup-unhealthy` (`sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)`, threshold `> 0`).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=(1 - pvc_backup_health{driver="longhorn"}) > bool 0'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=max by (namespace,pvc) (pvc_backup_age_hours{driver="longhorn"})'`
   - Healthy expectation: unhealthy count is `0`; no series should be `1` in the per-PVC unhealthy query.
3. `maint-soteria-authz-denials` (`sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)`, threshold `> 9` for 10m).
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)'`
   - `curl -fsS --get 'http://127.0.0.1:8428/api/v1/query' --data-urlencode 'query=sum by (reason) (increase(soteria_authz_denials_total[15m]))'`
   - Healthy expectation: total remains below `10` in normal operation; spikes should map to expected `reason` labels.
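
To avoid retyping the `curl` invocation, the three alert expressions above can be wrapped in a small helper. This is a sketch under the assumption that the port-forward from this section is running; `VM_URL`, `vm_query`, and `check_soteria_alerts` are hypothetical names, and the output is the raw JSON response, left for the operator to read.

```shell
# Base URL of the port-forwarded query endpoint; override VM_URL if the port differs.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run one instant query and print the raw JSON response.
vm_query() {
  curl -fsS --get "${VM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Evaluate the three alert expressions exactly as written in this section,
# one response per line. Stops at the first failed request.
check_soteria_alerts() {
  vm_query 'time() - soteria_inventory_refresh_timestamp_seconds' || return 1
  echo
  vm_query 'sum((1 - pvc_backup_health{driver="longhorn"}) > bool 0) or on() vector(0)' || return 1
  echo
  vm_query 'sum(increase(soteria_authz_denials_total[15m])) or on() vector(0)' || return 1
  echo
}
```

Run `check_soteria_alerts` and compare each value with the healthy expectations listed above; the per-PVC and per-`reason` breakdown queries can be passed to `vm_query` directly when a total looks wrong.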

## Failure Triage

- `401/403` on UI or API:
  - Verify oauth2-proxy group claims include `admin` or `maintenance`.
- Restore conflict:
  - Target PVC already exists; choose a new target PVC name.
- `maint-soteria-refresh-stale` firing:
  - Check Soteria pod health and `/metrics` scrape reachability from `monitoring`.
- `maint-soteria-backup-unhealthy` firing:
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
- `maint-soteria-authz-denials` firing:
  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.
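
For the `401/403` and authz-denial cases, it can help to look at the claims in a token directly. The following is a rough sketch that assumes you already have a raw JWT in hand and that group membership is carried in the payload as quoted strings (for example in a claim named `groups`, which is an assumption about the Keycloak mapper). It only decodes the payload; it does not verify the signature. `jwt_payload` and `has_soteria_group` are hypothetical helper names.

```shell
# Print the decoded JSON payload (second segment) of a JWT.
jwt_payload() {
  local seg
  # Take the second dot-separated segment and convert base64url to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Re-add the '=' padding that JWT encoding strips.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Succeed if the payload mentions either allowed group as a quoted string.
# This is a crude text match for a quick look, not a JSON-aware check.
has_soteria_group() {
  jwt_payload "$1" | grep -Eq '"(admin|maintenance)"'
}
```

If `has_soteria_group "$TOKEN"` fails for an operator who should have access, the group mapping upstream of oauth2-proxy is the first place to look.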

## Emergency Recovery Notes (2026-05-22)

- `titan-04` reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on `titan-08` and `titan-11`. It was temporarily cordoned during containment and later returned to the worker pool after kubelet health, metrics, and real workload placement stayed stable.
- `titan-05` was temporarily cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. It was returned to the worker pool after Jenkins load moved off and node pressure stayed below alert levels.
- `titan-06` remained unreachable (`No route to host`) and needs out-of-band power or network recovery.
- `titan-14` passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in `ContainerCreating`. It was temporarily cordoned during containment and later returned to the worker pool after Longhorn/system pods converged.
- `titan-22` was temporarily cordoned because SSH and kubelet/metrics access were flaky or timing out. Kernel logs showed the real failure mode: `igc` link flapped between up/down every few seconds on `enp5s0`, which caused k3s lease updates and kubelet proxy calls to fail. The link keeper now disables EEE, tries 1G-only autoneg, and falls back to 100M-only autoneg when 1G is unstable. This keeps the node usable, but the cable/switch/NIC path still needs physical follow-up.
- `node-prefer-noschedule` now runs every minute with short kubectl timeouts, keeps ordinary workers uncordoned and free of the `atlas.bstein.dev/spillover` label, marks `titan-22` as worker/amd64 last-resort compute, and keeps only `titan-13`, `titan-15`, `titan-17`, and `titan-19` soft-tainted as Longhorn spillover nodes.
- Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and `maintenance-vault-sync` moved off `titan-0a`. Remaining `titan-0a`/`titan-0c` pods are daemonsets or HA control/storage helpers, not generic app load.
- Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by `atlas.bstein.dev/crypto-mining-enabled=true`.
- Ananke should not treat node `Ready` as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a canary pod bound to that node that actually reaches `Completed`; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
- Ananke should pair every emergency cordon with an explicit revalidation and uncordon loop. Temporary cordons must not outlive the condition that justified them.
- Ananke should recognize repeated node `Ready -> Unknown -> Ready` transitions plus kubelet proxy EOF/502 errors as a likely host-network/link issue and run link-stability checks before scheduling regular workloads.
- Ananke should detect under-requested Jenkins agents during recovery. Several agents requested `25m` CPU per build container while allowing `1500m`, which let the scheduler place real CI load on fragile nodes.
- Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
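
The recovery gate and the revalidate-then-uncordon pairing described in the notes above could look roughly like this. It is an illustrative sketch, not Ananke's implementation: it assumes the SSH host name equals the node name, and the canary pod name `canary-<node>` and the `CANARY_NS` namespace are hypothetical. A pod that `kubectl get` shows as `Completed` has phase `Succeeded`, which is what the wait checks.

```shell
CANARY_NS="${CANARY_NS:-maintenance}"   # hypothetical namespace for canary pods

# Return 0 only if the node passes all three recovery checks.
node_passes_gate() {
  local node=$1
  # 1. SSH reachability (BatchMode avoids hanging on a password prompt).
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true || return 1
  # 2. Kubelet health through the API server's node proxy.
  kubectl get --raw "/api/v1/nodes/${node}/proxy/healthz" \
    --request-timeout=10s >/dev/null || return 1
  # 3. A canary pod bound to this node must actually finish.
  kubectl -n "$CANARY_NS" wait --for=jsonpath='{.status.phase}'=Succeeded \
    "pod/canary-${node}" --timeout=120s >/dev/null || return 1
}

# Revalidate each named node and uncordon only those that pass the gate.
revalidate_cordons() {
  local node
  for node in "$@"; do
    if node_passes_gate "$node"; then
      kubectl uncordon "$node"
    else
      echo "keeping ${node} cordoned: recovery gate failed" >&2
    fi
  done
}
```

Calling `revalidate_cordons titan-04 titan-14` on a schedule keeps a temporary cordon from outliving the condition behind it, while a node that is still unreachable simply stays cordoned. Nodes are passed explicitly so that cordons placed for other reasons are not lifted by accident.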