nextcloud: keep collabora off descheduler
parent 1fe125b8b3
commit f383818f93
@ -58,3 +58,16 @@ Validate each alert expression directly.
  - Inspect `pvc_backup_health` and `pvc_backup_age_hours` to identify stale or missing backups.
- `maint-soteria-authz-denials` firing:
  - Confirm expected OIDC groups and inspect denial `reason` labels for policy or header regressions.

## Emergency Recovery Notes (2026-05-22)

- `titan-04` reached Ready after reboot and simple canaries passed, but Collabora's default entrypoint exited immediately on that node while the same image stayed healthy on `titan-08` and `titan-11`. A temporary `recovery-suspect` taint did not stick, so the node is cordoned until the runtime anomaly and taint management behavior are understood.
- `titan-05` stayed cordoned because real Jenkins agents overloaded it despite a simple runtime canary passing. The data-prepper job was temporarily disabled in Jenkins after repeated respawns and aborts.
- `titan-06` remained unreachable (`No route to host`) and needs out-of-band power or network recovery.
- `titan-14` passed a simple runtime canary after reboot, but Longhorn's instance-manager became stuck in `ContainerCreating`. It was cordoned, Longhorn scheduling was disabled on the Longhorn node object, and the `longhorn-host` label was removed because it is not one of the intended HDD Longhorn nodes.
- `titan-22` stayed cordoned because SSH and kubelet/metrics access were flaky or timing out.
- Worker-selected pods stranded on control-plane nodes after label cleanup were bounced: Traefik and `maintenance-vault-sync` moved off `titan-0a`. Remaining `titan-0a`/`titan-0c` pods are daemonsets or HA control/storage helpers, not generic app load.
- Crypto mining was throttled through Flux-tracked manifests: monerod and xmrig are scaled to zero or gated by `atlas.bstein.dev/crypto-mining-enabled=true`.
- Ananke should not treat node `Ready` as sufficient recovery. A node should pass SSH reachability, kubelet/metrics scrape health, and a canary pod bound to that node that actually reaches `Completed`; service-specific canaries are useful when the incident involves large images or storage/runtime paths.
- Ananke should detect under-requested Jenkins agents during recovery. Several agents requested `25m` CPU per build container while allowing `1500m`, which let the scheduler place real CI load on fragile nodes.
- Ananke should pause or constrain descheduler behavior during recovery; Collabora was evicted from a healthy node and then landed on a suspect node.
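
The Ananke notes above describe two checks that could be automated during recovery: a node gate that requires more than `Ready`, and detection of under-requested Jenkins agents. The following is a minimal Python sketch under stated assumptions, not Ananke's actual implementation. All names (`NodeChecks`, `node_recovered`, `parse_millicores`, `under_requested`) and the `max_ratio` threshold of 10 are hypothetical, and "allowing `1500m`" is read here as the container CPU limit.

```python
from dataclasses import dataclass


@dataclass
class NodeChecks:
    """Hypothetical per-node recovery signals; field names are illustrative."""
    ready: bool              # node reports the Ready condition
    ssh_reachable: bool      # an SSH probe to the node succeeded
    metrics_scrape_ok: bool  # kubelet/metrics scrape is healthy
    canary_status: str       # status of a canary pod bound to this node


def node_recovered(checks: NodeChecks) -> bool:
    """Treat a node as recovered only if every signal passes, not just Ready."""
    return (
        checks.ready
        and checks.ssh_reachable
        and checks.metrics_scrape_ok
        and checks.canary_status == "Completed"
    )


def parse_millicores(value: str) -> int:
    """Parse a CPU quantity such as '25m' or '1.5' into millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def under_requested(request: str, limit: str, max_ratio: float = 10.0) -> bool:
    """Flag a container whose CPU limit exceeds its request by more than
    max_ratio (an assumed threshold), or whose request is zero."""
    req = parse_millicores(request)
    lim = parse_millicores(limit)
    return req == 0 or lim / req > max_ratio
```

With the figures from the notes, `under_requested("25m", "1500m")` flags the agent because the limit is 60 times the request, while a node that is `Ready` but whose canary is still `ContainerCreating` does not pass `node_recovered`.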
@ -13,6 +13,8 @@ spec:
      app: collabora
  template:
    metadata:
      annotations:
        descheduler.alpha.kubernetes.io/evict: "false"
      labels:
        app: collabora
    spec: