diff --git a/services/monitoring/README.md b/services/monitoring/README.md
index 74baf08..0e8885a 100644
--- a/services/monitoring/README.md
+++ b/services/monitoring/README.md
@@ -13,3 +13,15 @@ kubectl create secret generic grafana-admin \
 ```
 
 Update the password whenever you rotate credentials.
+
+## DCGM exporter image
+
+The NVIDIA GPU metrics DaemonSet expects `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`, mirrored from `docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`. Refresh it in Zot when bumping versions:
+
+```bash
+skopeo copy \
+  docker://docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 \
+  docker://registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
+```
+
+When finished mirroring from the control-plane, you can remove temporary tooling with `sudo apt-get purge -y skopeo && sudo apt-get autoremove -y` and clear `~/.config/containers/auth.json`.
diff --git a/services/monitoring/dcgm-exporter.yaml b/services/monitoring/dcgm-exporter.yaml
index 9a4a1d4..766cf7b 100644
--- a/services/monitoring/dcgm-exporter.yaml
+++ b/services/monitoring/dcgm-exporter.yaml
@@ -35,7 +35,7 @@ spec:
         - operator: Exists
       containers:
         - name: dcgm-exporter
-          image: registry.bstein.dev/monitoring/dcgm:4.4.2-1-ubuntu22.04
+          image: registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
           imagePullPolicy: IfNotPresent
           ports:
             - name: metrics