---
title: "Observability: Grafana + VictoriaMetrics (how to query safely)"
tags: ["atlas", "monitoring", "grafana", "victoriametrics"]
owners: ["brad"]
entrypoints: ["metrics.bstein.dev", "alerts.bstein.dev"]
source_paths: ["services/monitoring"]
---
# Observability: Grafana + VictoriaMetrics (how to query safely)
## Where it is configured
- `services/monitoring/helmrelease.yaml` (Helm values for Grafana, Alertmanager, and VictoriaMetrics)
- `services/monitoring/grafana-dashboard-*.yaml` (dashboards and their PromQL)
## Using metrics as a “tool” for Atlas assistants
The safest pattern is to map a small set of intents to fixed PromQL queries, run only those queries, and then summarize the results, instead of letting the assistant write free-form PromQL.

Example intents:
- “Is the cluster healthy?” → node readiness + pod restart rate
- “Why is Element Call failing?” → LiveKit/coturn pod restarts + Synapse errors + ingress 5xx
- “Is Jenkins slow?” → pod CPU/memory + HTTP latency metrics (if exported)
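The intent-to-query pattern above can be sketched in Python as an allow-list: each intent name maps to a fixed set of PromQL expressions, anything outside the list is rejected, and results come back through the Prometheus-compatible `/api/v1/query` endpoint that VictoriaMetrics serves. Everything concrete here is an assumption: the intent names, the `livekit`/`coturn` pod patterns, and the `jenkins` namespace are hypothetical, the metric names assume kube-state-metrics and cAdvisor are being scraped, and the Synapse and ingress signals are left out because their metric names depend on the deployment. The URL path may also differ for a clustered VictoriaMetrics setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical allow-list. Intent names, series names, and label values are
# illustrative; metric names assume kube-state-metrics and cAdvisor are scraped.
INTENT_QUERIES: dict[str, dict[str, str]] = {
    "cluster_health": {
        # Each node has one series per status value; the current status is 1.
        "nodes_not_ready": 'sum(kube_node_status_condition{condition="Ready",status!="true"})',
        "pod_restarts_1h": "sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h]))",
    },
    "element_call": {
        # Pod name patterns are assumptions about how LiveKit/coturn are deployed.
        "rtc_pod_restarts_1h": 'sum by (pod) (increase(kube_pod_container_status_restarts_total{pod=~"livekit.*|coturn.*"}[1h]))',
    },
    "jenkins_resources": {
        # The "jenkins" namespace is an assumption.
        "cpu_cores": 'sum(rate(container_cpu_usage_seconds_total{namespace="jenkins",container!=""}[5m]))',
        "memory_bytes": 'sum(container_memory_working_set_bytes{namespace="jenkins",container!=""})',
    },
}


def query_urls(base_url: str, intent: str) -> dict[str, str]:
    """Build instant-query URLs for an allow-listed intent; reject anything else."""
    if intent not in INTENT_QUERIES:
        raise ValueError(f"unknown intent: {intent!r}")
    base = base_url.rstrip("/")
    urls: dict[str, str] = {}
    for name, promql in INTENT_QUERIES[intent].items():
        query_string = urlencode({"query": promql})
        urls[name] = f"{base}/api/v1/query?{query_string}"
    return urls


def run_intent(base_url: str, intent: str, timeout: float = 10.0) -> dict[str, list[dict]]:
    """Run each fixed query and return the raw result vectors keyed by series name."""
    results: dict[str, list[dict]] = {}
    for name, url in query_urls(base_url, intent).items():
        with urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
        if body.get("status") != "success":
            raise RuntimeError(f"{name}: query failed: {body.get('error')}")
        results[name] = body["data"]["result"]
    return results


def summarize(results: dict[str, list[dict]]) -> str:
    """Flatten instant-vector results into short 'name{labels} = value' lines."""
    lines: list[str] = []
    for name, series in results.items():
        if not series:
            lines.append(f"{name}: no data")
        for s in series:
            labels = ",".join(f"{k}={v}" for k, v in sorted(s["metric"].items()))
            lines.append(f"{name}{{{labels}}} = {s['value'][1]}")
    return "\n".join(lines)
```

With a hypothetical in-cluster address, a call would look like `print(summarize(run_intent("http://victoria-metrics.monitoring.svc:8428", "cluster_health")))`. Keeping the queries in data rather than generating them means the assistant can only ever run PromQL that someone has already reviewed.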
## Why dashboards are not the KB
Dashboards are useful references, but the assistant should query VictoriaMetrics directly for live answers. Keep the KB focused on wiring, runbooks, and stable conventions, since those change far less often than the metric values themselves.
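Since the dashboards in `services/monitoring/grafana-dashboard-*.yaml` already contain PromQL, one way to use them as references is to pull the `expr` fields out of a dashboard's JSON model and review them as candidates for the fixed query list. This is a minimal sketch, assuming the dashboard JSON has already been extracted from its YAML wrapper and that panels use a Prometheus-compatible datasource (whose targets store the query in `expr`); the function name `dashboard_queries` is hypothetical.

```python
import json


def dashboard_queries(dashboard_json: str) -> dict[str, list[str]]:
    """Map each panel title to the PromQL expressions found in its targets."""
    model = json.loads(dashboard_json)
    found: dict[str, list[str]] = {}

    def walk(panels: list[dict]) -> None:
        for panel in panels:
            # Targets without an "expr" (other datasource types) are skipped.
            exprs = [t["expr"] for t in panel.get("targets", []) if t.get("expr")]
            if exprs:
                found.setdefault(panel.get("title", "(untitled)"), []).extend(exprs)
            # Collapsed row panels keep their children in a nested "panels" list.
            walk(panel.get("panels", []))

    walk(model.get("panels", []))
    return found
```

Very old dashboards that use the legacy top-level `rows` layout are not handled by this sketch.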
|