391 Commits

SHA1 Message Date
3096e0d7de monitoring(overview): tighten climate labels and drop duplicate temp line 2026-04-12 18:50:25 -03:00
6b0d6b017c monitoring(overview): tune climate row and restore ups card density 2026-04-12 18:35:15 -03:00
de3272e160 merge: atlas jobs ariadne schedule observability 2026-04-12 18:33:07 -03:00
cb27592272 monitoring(overview): reflow UPS/climate rows and add jenkins weather 2026-04-12 18:14:54 -03:00
f67ca30f94 monitoring(climate): add C/F history and dedupe typhon series 2026-04-12 17:56:54 -03:00
b6b1e533ed monitoring(jobs): add Ariadne schedule inventory signals 2026-04-12 17:29:27 -03:00
58ccbfb130 monitoring: add humidity and dew point to climate panels 2026-04-12 17:28:15 -03:00
a20fd995a1 monitoring: switch climate dashboards to typhon metrics 2026-04-12 17:20:05 -03:00
c325744540 monitoring(alerts): watch soteria authz denial spikes 2026-04-12 15:07:54 -03:00
241a405c05 maintenance(soteria): harden ingress path and add backup alerts 2026-04-12 15:07:54 -03:00
091e743d0e maintenance(soteria): add protected UI, OIDC bootstrap, and backup health panel wiring 2026-04-12 15:07:53 -03:00
3774b600ee scheduling: keep app workloads off control-plane 2026-04-12 04:27:43 -03:00
3ea296b552 maintenance: enforce Astraios + tmpfs /tmp on worker Pis 2026-04-11 11:55:49 -03:00
b723382ff4 dashboards: unify suite pass-rate metrics on platform counters 2026-04-10 16:39:55 -03:00
32b6e55467 monitoring: use CI-only series for platform test success panels 2026-04-10 04:52:57 -03:00
99eda351df monitoring/jenkins: add pegasus CI job and separate health probe suite 2026-04-10 03:26:51 -03:00
5f4641553c monitoring: replace failure table with 24h suite pass snapshot 2026-04-09 20:16:44 -03:00
530f440679 monitoring: add suite probe metrics and align fan labels 2026-04-09 20:10:52 -03:00
5e3aadc640 monitoring: set overview platform test panel to 7d 2026-04-09 20:05:10 -03:00
12b85f4597 monitoring: add platform quality push gateway for test metrics 2026-04-09 19:30:16 -03:00
ad1cbd6f85 monitoring: make test panel point-based and failure-by-suite 2026-04-09 19:27:48 -03:00
5cf9a16d97 monitoring: align overview panels with jobs and point-based suite rates 2026-04-09 16:35:14 -03:00
f8c1243dfd monitoring: add generic suite metric slots for platform tests 2026-04-09 16:16:35 -03:00
7b0e9acbb1 monitoring: make suite pass rate 30d rolling for sparse tests 2026-04-09 16:14:26 -03:00
0273727cb4 monitoring: make platform test success one line per suite 2026-04-09 15:21:59 -03:00
09fa3e716c monitoring/atlas: merge top rows and fix platform test pass-rate panel 2026-04-09 14:56:43 -03:00
293cd83999 monitoring/atlas: resize test/ops rows and source overview tests from atlas-jobs 2026-04-09 13:39:55 -03:00
764bfe189e monitoring/recovery: harden ananke checks and OIDC-gated service validation 2026-04-09 01:44:26 -03:00
e0b124ca4e monitoring: switch power telemetry to ananke metrics 2026-04-08 23:33:17 -03:00
3ce7b2eeb7 maintenance/monitoring: wire reciprocal metis hecate key + dampen alert flapping 2026-04-05 13:51:57 -03:00
96bc93670b monitoring(power): rename hecate UPS peers to Pyrphoros and Statera 2026-04-04 05:54:16 -03:00
82e1b87b8f monitoring(overview): refine ups-climate row and climate/fan stat display 2026-04-04 04:40:22 -03:00
1b682cc60f monitoring(grafana): restart to pick up latest overview layout 2026-04-04 04:35:26 -03:00
5059d2918d monitoring(overview): swap jobs and power rows; tighten climate/fan display 2026-04-04 04:34:18 -03:00
d5fc6c89c4 monitoring(grafana): bump restart revision for overview dashboard reload 2026-04-04 01:34:36 -03:00
55b96c0675 monitoring(overview): place six power/climate panels on one row and fix test/job data 2026-04-04 01:33:15 -03:00
cdc3c081f5 monitoring(overview): replace power/climate summary row with six-panel layout 2026-04-03 22:16:02 -03:00
7ef4c895ba monitoring(grafana): bump restart revision to reload provisioned dashboards 2026-04-03 20:54:12 -03:00
69a02a3352 monitoring(power): implement six-panel UPS and climate layout 2026-04-03 20:45:40 -03:00
4167f0f988 monitoring(power): add UPS status snapshot table and climate placeholders 2026-04-03 17:53:42 -03:00
fd71c6644b monitoring(power): wire generated power dashboard and split per-UPS panels 2026-04-03 17:49:09 -03:00
7ae4746d10 monitoring: scope hecate power queries to hecate-power job 2026-04-03 15:23:27 -03:00
bc9bf0310a monitoring: add power dashboard and reorder atlas overview rows 2026-04-03 14:55:16 -03:00
5a577630df platform: expose metis on sentinel and move gitea to rpi5 2026-03-31 16:44:41 -03:00
9a8030bf68 maintenance: harden metis recovery and fix harbor rollout 2026-03-31 14:55:48 -03:00
10ae47110a monitoring: combine Ariadne and Metis tests 2026-03-31 14:54:54 -03:00
03ae79df3e maintenance: harden sd-write controls and recovery workflow 2026-03-31 00:06:44 -03:00
3cf426e23a monitoring: roll grafana to apply latest alert rules 2026-03-30 18:41:26 -03:00
8006540645 monitoring: raise rootfs warning threshold to 85 percent 2026-03-30 18:41:05 -03:00
0aeb08d375 monitoring: fix noisy grafana email alerts and reload rules 2026-03-30 18:33:02 -03:00