From 839fb94836dec85164c3cf680e00732707ea86bf Mon Sep 17 00:00:00 2001 From: Brad Stein Date: Tue, 2 Dec 2025 17:01:32 -0300 Subject: [PATCH] notes: update monitoring and next steps --- AGENTS.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index 05838aa..d660e75 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -40,3 +40,14 @@ Repository Guidelines - Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie. - Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned. - Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview. + +## Monitoring state (2025-12-03) +- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady. +- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub). +- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity. +- Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed. + +## Upcoming priorities (SSO/storage/mail) +- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe. +- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress. +- Plan mail: mostly self-hosted, relay through trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows.