94 Commits

SHA1 Message Date
a255c60aed monitoring: fix gpu idle label 2026-01-27 21:46:58 -03:00
b4f5fbeb2b monitoring: unify gpu namespace usage 2026-01-27 21:43:37 -03:00
577e2a158d monitoring: keep idle label in gpu share 2026-01-27 18:44:58 -03:00
86cd5194ea monitoring: fix gpu idle share 2026-01-27 17:51:13 -03:00
995050f544 monitoring: unify jetson gpu metrics 2026-01-26 22:26:24 -03:00
f7d4425740 ariadne: reduce comms noise, fix gpu labels 2026-01-26 20:54:33 -03:00
8b8766b0f0 monitoring: add postgres metrics and update overview 2026-01-22 18:23:26 -03:00
307d1bf7a6 ops: restore portal/ariadne and add postgres panels 2026-01-22 15:23:23 -03:00
e0308b89fd monitoring: enforce sorted job lists 2026-01-21 15:12:53 -03:00
9db260e482 monitoring: tighten jobs/overview ordering 2026-01-21 15:01:02 -03:00
2fd87aea45 monitoring: refine jobs/overview panels 2026-01-21 14:31:11 -03:00
fc87432fdf monitoring: refresh jobs dashboards 2026-01-21 13:37:36 -03:00
b0698887a4 monitoring: add testing dashboard and switch postmark apikey 2026-01-18 09:21:33 -03:00
2b9a8eb8eb monitoring: add glue row and fix mail dns 2026-01-18 08:12:06 -03:00
84710b99e8 monitoring: add glue dashboard and tag cronjobs 2026-01-18 02:50:07 -03:00
fddf58346d monitoring: treat cert-manager as infrastructure 2026-01-12 00:26:46 -03:00
98d405bc42 monitoring: regenerate dashboards with expanded infra namespaces 2026-01-11 23:55:43 -03:00
879ff7c16b monitoring: fix infra scopes and add jetson metrics 2026-01-11 23:46:24 -03:00
f500e81606 monitoring: maintenance panels, extra alerts, update overview 2026-01-11 02:28:39 -03:00
25907da229 monitoring: remove titan-16 and add titan-20/21 to worker dashboards 2026-01-11 02:20:47 -03:00
4a01632f6b monitoring: add alert rules and include titan-20/21 in dashboards 2026-01-11 02:02:47 -03:00
7225e28712 mailu: harden relay + fix postmark exporter 2026-01-06 14:00:14 -03:00
29e8cb5857 monitoring: add titan-jh control plane node 2026-01-06 09:50:40 -03:00
c58583fd74 monitoring: refine mail overview panels 2026-01-06 02:34:52 -03:00
aa58115318 monitoring: refine mail stats and add send-limit usage 2026-01-06 02:06:20 -03:00
7e4b0e1eb0 monitoring: add Postmark mail dashboard 2026-01-05 21:55:59 -03:00
05a888aeb6 monitoring(dashboards): tune namespace share metrics 2026-01-05 13:30:51 -03:00
ceea2539bc monitoring: per-panel namespace share filters 2026-01-01 14:44:33 -03:00
bcc1ceef6d monitoring: ensure gpu idle share renders 2026-01-01 14:21:43 -03:00
91de1c1d8d gpu: enable time-slicing and refresh dashboards 2026-01-01 14:16:08 -03:00
1b57ea7adb Increase Atlas availability stat to 4 decimals 2025-12-19 15:18:14 -03:00
2ab38d6205 Reduce Atlas availability query density 2025-12-19 14:56:29 -03:00
2f6988189b Expand Atlas availability window to 1y 2025-12-19 13:46:34 -03:00
c85961e1fe Regenerate dashboards after availability thresholds tweak 2025-12-15 22:14:26 -03:00
c9c13372a8 atlas overview: include titan-db in control plane panels 2025-12-12 21:55:53 -03:00
f884ce8146 atlas dashboards: align percent thresholds and disk bars 2025-12-12 21:13:31 -03:00
755a6926ab atlas overview: refine alert thresholds and availability colors 2025-12-12 20:50:41 -03:00
73deee09af atlas dashboards: use threshold colors for stats 2025-12-12 20:44:20 -03:00
2e18a4e1c5 atlas dashboards: fix pod share display and zero/red stat thresholds 2025-12-12 20:40:32 -03:00
da8ed7a3b0 atlas dashboards: show pod counts (not %) and make zero-friendly stats 2025-12-12 20:30:00 -03:00
ca1b2351c0 atlas dashboards: show pod counts with top12 bars 2025-12-12 20:20:13 -03:00
0a520e1d4b atlas dashboards: drop empty nodes and enforce top12 pod bars 2025-12-12 19:09:51 -03:00
1fefca3b3e atlas dashboards: cap pod count bars at top12 2025-12-12 18:56:13 -03:00
8ed23c673c atlas dashboards: sort pod counts and add pod row to overview 2025-12-12 18:51:43 -03:00
c093f98522 atlas dashboards: fix overview links and add pods-by-node pie 2025-12-12 18:32:45 -03:00
1a38bffdf3 atlas overview: fix availability scaling 2025-12-12 16:36:47 -03:00
92a7688a2f atlas overview: show availability percent with 3 decimals 2025-12-12 16:15:37 -03:00
72d4fd60d2 atlas overview: show availability percent and keep uptime centered 2025-12-12 16:11:28 -03:00
9320d809f4 atlas overview: center uptime and reorder top row 2025-12-12 15:56:33 -03:00
27f4e60f30 atlas overview: add uptime and crashloop panels 2025-12-12 15:23:51 -03:00