titan-iac/knowledge/runbooks/observability.md

---
title: "Observability: Grafana + VictoriaMetrics (how to query safely)"
tags: ["atlas", "monitoring", "grafana", "victoriametrics"]
owners: ["brad"]
entrypoints: ["metrics.bstein.dev", "alerts.bstein.dev"]
source_paths: ["services/monitoring"]
---

# Observability: Grafana + VictoriaMetrics (how to query safely)

## Where it is configured
- `services/monitoring/helmrelease.yaml` (Grafana + Alertmanager + VM values)
- `services/monitoring/grafana-dashboard-*.yaml` (dashboards and their PromQL)

## Using metrics as a “tool” for Atlas assistants
The safest pattern is to map a small set of intents to fixed PromQL queries and then summarize the results, so the assistant selects from known queries rather than composing PromQL on the fly.

Examples (intents)
- “Is the cluster healthy?” → node readiness + pod restart rate
- “Why is Element Call failing?” → LiveKit/coturn pod restarts + synapse errors + ingress 5xx
- “Is Jenkins slow?” → pod CPU/memory + HTTP latency metrics (if exported)
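
The intent mapping above can be sketched as a fixed lookup table plus a URL builder. This is a minimal illustration, not the actual Atlas implementation. The base URL, the intent keys, and the choice of metrics are all assumptions. The metric names are the standard kube-state-metrics and cAdvisor ones, which may or may not be scraped in this cluster. The `/api/v1/query` path is the Prometheus-compatible instant-query endpoint that VictoriaMetrics serves.

```python
from typing import Dict, List
from urllib.parse import urlencode

# Hypothetical base URL; replace with the real VictoriaMetrics query endpoint.
VM_BASE_URL = "http://victoria-metrics.monitoring.svc:8428"

# Fixed intent -> PromQL table. Intent keys and label matchers are illustrative;
# metric names assume kube-state-metrics and cAdvisor are being scraped.
INTENT_QUERIES: Dict[str, List[str]] = {
    "cluster_health": [
        'sum(kube_node_status_condition{condition="Ready",status="true"})',
        "sum(increase(kube_pod_container_status_restarts_total[1h]))",
    ],
    "jenkins_slow": [
        'sum(rate(container_cpu_usage_seconds_total{pod=~"jenkins.*"}[5m]))',
        'sum(container_memory_working_set_bytes{pod=~"jenkins.*"})',
    ],
}


def build_query_urls(intent: str) -> List[str]:
    """Return instant-query URLs for a known intent; reject anything else."""
    if intent not in INTENT_QUERIES:
        # Unknown intents are refused instead of being turned into ad hoc PromQL.
        raise KeyError(f"unknown intent: {intent!r}")
    urls = []
    for promql in INTENT_QUERIES[intent]:
        params = urlencode({"query": promql})
        urls.append(f"{VM_BASE_URL}/api/v1/query?{params}")
    return urls
```

Because the table is the only source of queries, adding a new question means reviewing and committing a new entry, which keeps the set of queries the assistant can run small and auditable.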

## Why dashboards are not the KB
Dashboards are great references, but the assistant should query VictoriaMetrics directly for live answers and keep the
KB focused on wiring, runbooks, and stable conventions.
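
For the live-answer step, a small helper can condense a query response into one line for the assistant to relay. This is a sketch under assumptions: it expects the Prometheus-style instant-query JSON (`status`, `data.result`, and `value` pairs) with a `vector` result type, and the function name and output wording are hypothetical.

```python
import json


def summarize_instant_query(label: str, response_body: str) -> str:
    """Condense a Prometheus-style instant-query JSON response into one line."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        reason = payload.get("error", "no error message")
        return f"{label}: query failed ({reason})"
    results = payload.get("data", {}).get("result", [])
    if not results:
        return f"{label}: no data"
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    values = [float(sample["value"][1]) for sample in results]
    return f"{label}: {len(values)} series, min={min(values):g}, max={max(values):g}"
```

Returning a short string rather than the raw samples keeps the assistant's answer grounded in the current query result without copying dashboard content into the KB.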

knowledge: add runbooks skeleton 2026-01-06 14:53:19 -03:00
			`title: "Observability: Grafana + VictoriaMetrics (how to query safely)"`
			`tags: ["atlas", "monitoring", "grafana", "victoriametrics"]`
			`owners: ["brad"]`
			`entrypoints: ["metrics.bstein.dev", "alerts.bstein.dev"]`
			`source_paths: ["services/monitoring"]`
			`---`

			`# Observability: Grafana + VictoriaMetrics (how to query safely)`

			`## Where it is configured`
			- `services/monitoring/helmrelease.yaml` (Grafana + Alertmanager + VM values)
			- `services/monitoring/grafana-dashboard-*.yaml` (dashboards and their PromQL)

			`## Using metrics as a “tool” for Atlas assistants`
			`The safest pattern is: map a small set of intents → fixed PromQL queries, then summarize results.`

			`Examples (intents)`
			`- “Is the cluster healthy?” → node readiness + pod restart rate`
			`- “Why is Element Call failing?” → LiveKit/coturn pod restarts + synapse errors + ingress 5xx`
			`- “Is Jenkins slow?” → pod CPU/memory + HTTP latency metrics (if exported)`

			`## Why dashboards are not the KB`
			`Dashboards are great references, but the assistant should query VictoriaMetrics directly for live answers and keep the`
			`KB focused on wiring, runbooks, and stable conventions.`