99 Commits

SHA1 Message Date
f0265d6b94 atlas pods: add namespace plurality by node table 2025-12-13 03:57:20 -03:00
6f8a70fd58 atlas overview: include titan-db in control plane panels 2025-12-12 21:55:53 -03:00
580d1731f9 monitoring: drop duplicate titan-db scrape job 2025-12-12 21:48:03 -03:00
4def298b83 monitoring: scrape titan-db node_exporter 2025-12-12 21:38:10 -03:00
1166069640 atlas dashboards: align percent thresholds and disk bars 2025-12-12 21:13:31 -03:00
e56bed284e atlas overview: refine alert thresholds and availability colors 2025-12-12 20:50:41 -03:00
24376594ff atlas dashboards: use threshold colors for stats 2025-12-12 20:44:20 -03:00
5277c98385 atlas dashboards: fix pod share display and zero/red stat thresholds 2025-12-12 20:40:32 -03:00
056b7b7770 atlas dashboards: show pod counts (not %) and make zero-friendly stats 2025-12-12 20:30:00 -03:00
b770575b42 atlas dashboards: show pod counts with top12 bars 2025-12-12 20:20:13 -03:00
9e76277c22 atlas dashboards: drop empty nodes and enforce top12 pod bars 2025-12-12 19:09:51 -03:00
93b3c6d2ec atlas dashboards: cap pod count bars at top12 2025-12-12 18:56:13 -03:00
596bf46863 atlas dashboards: sort pod counts and add pod row to overview 2025-12-12 18:51:43 -03:00
8b703f8655 atlas pods: add pod count bar and tidy pie 2025-12-12 18:45:29 -03:00
ec59d25ad8 atlas dashboards: fix overview links and add pods-by-node pie 2025-12-12 18:32:45 -03:00
bf6179f907 atlas internal dashboards: add SLO/burn and api health panels 2025-12-12 18:00:43 -03:00
0a0966db78 atlas overview: fix availability scaling 2025-12-12 16:36:47 -03:00
87fbba0d3e atlas overview: show availability percent with 3 decimals 2025-12-12 16:15:37 -03:00
b200dba5b9 atlas overview: show availability percent and keep uptime centered 2025-12-12 16:11:28 -03:00
697ce3c18f atlas overview: center uptime and reorder top row 2025-12-12 15:56:33 -03:00
8e39c6a28b atlas overview: add uptime and crashloop panels 2025-12-12 15:23:51 -03:00
6c77b8e7f8 restore docs after gitignore change 2025-12-12 00:50:02 -03:00
78195c4685 mailu: fix admin dns and tame vip 2025-12-12 00:49:45 -03:00
8d5e6c267c auth: wire oauth2-proxy and enable grafana oidc 2025-12-07 02:01:21 -03:00
0db149605d monitoring: show GPU share over dashboard range 2025-12-02 20:28:35 -03:00
6eba26b359 monitoring: show top12 root disks 2025-12-02 15:21:02 -03:00
ace383bedd monitoring: expand worker/control/root rows 2025-12-02 15:15:21 -03:00
b93636ecb9 monitoring: shrink hottest node row height 2025-12-02 15:12:16 -03:00
5df94a7937 monitoring: fix gpu share query and root bar labels 2025-12-02 14:56:36 -03:00
a3dc9391ee monitoring: polish dashboards and folders 2025-12-02 14:41:39 -03:00
eed67b3db0 monitoring: regen dashboards with gpu details 2025-12-02 13:16:00 -03:00
f1d0970aa0 monitoring: mirror dcgm-exporter as multi-arch 2025-12-02 12:36:24 -03:00
e26ef44d1a monitoring: run dcgm-exporter with nvidia runtime 2025-12-02 12:25:30 -03:00
a18c3e6f67 monitoring: always pull dcgm-exporter tag 2025-12-02 12:19:16 -03:00
ee923df567 monitoring: add registry pull secret for dcgm-exporter 2025-12-02 12:07:11 -03:00
d87a1dbc47 monitoring: allow dcgm rollout with unavailable node 2025-12-02 11:59:55 -03:00
5b89b0533e monitoring: use mirrored dcgm-exporter tag 2025-12-02 11:54:53 -03:00
d99bb06eeb monitoring: reenable dcgm exporter 2025-11-20 13:11:13 -03:00
e4f93e85d2 monitoring: control-plane stat and namespace share tweaks 2025-11-18 17:09:13 -03:00
f06be37f44 monitoring: refine network metrics and control-plane allowance 2025-11-18 16:18:52 -03:00
c7b7bc7a6d monitoring: adjust overview spacing and net panels 2025-11-18 15:55:24 -03:00
7b2a69cfe3 monitoring: disable dcgm exporter 2025-11-18 15:10:58 -03:00
46410c9a9d monitoring: fix dcgm image 2025-11-18 14:19:23 -03:00
ff056551c7 monitoring: refresh overview dashboards 2025-11-18 14:08:33 -03:00
8e6c0a3cfe monitoring: rework gpu share + gauges 2025-11-18 12:11:47 -03:00
497164a1ad monitoring: clean namespace gpu share and layout 2025-11-18 11:42:24 -03:00
fab5552039 monitoring: resolve pie errors and network data 2025-11-18 11:30:33 -03:00
7009a4f9ff monitoring: fix namespace gpu share and network stats 2025-11-18 11:12:03 -03:00
d7e4bcd533 monitoring: add gpu node fallback 2025-11-18 10:47:24 -03:00
ec76563a86 monitoring: source gpu pie from limits and node nets 2025-11-18 01:01:10 -03:00