From b703e66b9848236e660bfb8724eb62c1c6130ce9 Mon Sep 17 00:00:00 2001
From: Brad Stein
Date: Sat, 13 Dec 2025 15:11:50 -0300
Subject: [PATCH] monitoring: restore README

---
 services/monitoring/README.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 services/monitoring/README.md

diff --git a/services/monitoring/README.md b/services/monitoring/README.md
new file mode 100644
index 0000000..835ae1d
--- /dev/null
+++ b/services/monitoring/README.md
@@ -0,0 +1,28 @@
+# services/monitoring
+
+## Grafana admin secret
+
+The Grafana Helm release expects a pre-existing secret named `grafana-admin`
+in the `monitoring` namespace. Create or rotate it with:
+
+```bash
+kubectl create secret generic grafana-admin \
+  --namespace monitoring \
+  --from-literal=admin-user=admin \
+  --from-literal=admin-password='REPLACE_ME'
+```
+
+Update the password whenever you rotate credentials.
+
+## DCGM exporter image
+
+The NVIDIA GPU metrics DaemonSet expects `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`, mirrored from `docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`. Refresh it in Zot when bumping versions:
+
+```bash
+skopeo copy \
+  --all \
+  docker://docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 \
+  docker://registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
+```
+
+When finished mirroring from the control-plane, you can remove temporary tooling with `sudo apt-get purge -y skopeo && sudo apt-get autoremove -y` and clear `~/.config/containers/auth.json`.