maintenance: document node recovery guardrails

2026-05-22 17:21:59 -03:00 · 2026-05-22 17:21:59 -03:00 · cf8baafed1
commit cf8baafed1
parent c7edc81239
1 changed files with 7 additions and 4 deletions
--- a/services/maintenance/NOTES.md
+++ b/services/maintenance/NOTES.md
@@ -61,13 +61,16 @@ Validate each alert expression directly.

 ## Emergency Recovery Notes (2026-05-22)

- `titan-04` reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on `titan-08` and `titan-11`. A temporary `recovery-suspect` taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
- `titan-05` stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
+- `titan-04` reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on `titan-08` and `titan-11`. It was temporarily cordoned during containment and later returned to the worker pool after kubelet health, metrics, and real workload placement stayed stable.
+- `titan-05` was temporarily cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. It was returned to the worker pool after Jenkins load moved off and node pressure stayed below alert levels.
 - `titan-06` remained unreachable (`No route to host`) and needs out-of-band power or network recovery.
- `titan-14` passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in `ContainerCreating`. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the `longhorn-host` label was removed because it is not one of the intended HDD Longhorn nodes.
- `titan-22` stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
+- `titan-14` passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in `ContainerCreating`. It was temporarily cordoned during containment and later returned to the worker pool after Longhorn/system pods converged.
+- `titan-22` was temporarily cordoned because SSH and kubelet/metrics access were flaky or timing out. Kernel logs showed the real failure mode: `igc` link flapped between up/down every few seconds on `enp5s0`, which caused k3s lease updates and kubelet proxy calls to fail. The link keeper now disables EEE, tries 1G-only autoneg, and falls back to 100M-only autoneg when 1G is unstable. This keeps the node usable, but the cable/switch/NIC path still needs physical follow-up.
+- `node-prefer-noschedule` now runs every minute with short kubectl timeouts, keeps ordinary workers uncordoned and free of the `atlas.bstein.dev/spillover` label, marks `titan-22` as worker/amd64 last-resort compute, and keeps only `titan-13`, `titan-15`, `titan-17`, and `titan-19` soft-tainted as Longhorn spillover nodes.
 - Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and `maintenance-vault-sync` moved off `titan-0a`. Remaining `titan-0a`/`titan-0c` pods are daemonsets or HA control/storage helpers, not generic app load.
 - Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero / gated by `atlas.bstein.dev/crypto-mining-enabled=true`.
 - Ananke should not treat node `Ready` as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a bound canary that actually reaches `Completed`; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
+- Ananke should pair every emergency cordon with an explicit revalidation and uncordon loop. Temporary cordons must not outlive the condition that justified them.
+- Ananke should recognize repeated node `Ready -> Unknown -> Ready` transitions plus kubelet proxy EOF/502 errors as a likely host-network/link issue and run link-stability checks before scheduling regular workloads.
 - Ananke should detect under-requested Jenkins agents during recovery. Several agents requested `25m` CPU per build container while allowing `1500m`, which let the scheduler place real CI load on fragile nodes.
 - Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
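
The Ananke guardrails in these notes could be sketched roughly as follows. This is an illustrative outline only, not Ananke's actual code: `NodeChecks`, `recovery_passed`, `count_ready_flaps`, `is_under_requested`, and the `max_ratio` threshold are all hypothetical names and values, and gathering the inputs (SSH probes, kubelet scrapes, canary status) is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NodeChecks:
    """Hypothetical snapshot of one node's recovery signals."""
    ready: bool               # node condition Ready is True
    ssh_reachable: bool       # SSH probe succeeded
    kubelet_metrics_ok: bool  # kubelet/metrics scrape succeeded
    canary_status: str        # status as kubectl displays it, e.g. "Completed"


def recovery_passed(checks: NodeChecks) -> bool:
    """Ready alone is not sufficient: every signal must hold."""
    return (
        checks.ready
        and checks.ssh_reachable
        and checks.kubelet_metrics_ok
        and checks.canary_status == "Completed"
    )


def count_ready_flaps(history: Sequence[str]) -> int:
    """Count Ready -> Unknown -> Ready round trips in a status history."""
    flaps = 0
    pending = False  # True once Ready has been followed by Unknown
    previous = None
    for status in history:
        if status == "Unknown" and previous == "Ready":
            pending = True
        elif status == "Ready" and pending:
            flaps += 1
            pending = False
        elif status != "Unknown":
            # Any other status (e.g. NotReady) breaks the pattern.
            pending = False
        previous = status
    return flaps


def is_under_requested(request_m: int, limit_m: int, max_ratio: float = 4.0) -> bool:
    """Flag containers whose CPU limit far exceeds the request (millicores).

    max_ratio is an assumed threshold, not a value taken from the notes.
    """
    if request_m <= 0:
        return limit_m > 0
    return limit_m / request_m > max_ratio
```

With these definitions, a node that is Ready but whose canary is stuck in `ContainerCreating` fails `recovery_passed`, and the Jenkins agents described above (`25m` requested, `1500m` allowed) are flagged by `is_under_requested(25, 1500)`. A non-zero `count_ready_flaps` result would be a cue to run link-stability checks before uncordoning.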