diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..a8d49c8 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,68 @@ + + +Repository Guidelines + +## Project Structure & Module Organization +- `infrastructure/`: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout. +- `services/`: workload manifests per app (`services/gitea/`, etc.) with `kustomization.yaml` plus one file per kind; keep diffs small and focused. +- `dockerfiles/` hosts bespoke images, while `scripts/` stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands. + +## Build, Test, and Development Commands +- `kustomize build services/` (or `kubectl kustomize ...`) renders manifests exactly as Flux will. +- `kubectl apply --server-side --dry-run=client -k services/` checks schema compatibility without touching the cluster. +- `flux reconcile kustomization --namespace flux-system --with-source` pulls the latest Git state after merges or hotfixes. +- `fish scripts/flux_hammer.fish --help` explains the recovery tool; read it before running against production workloads. + +## Coding Style & Naming Conventions +- YAML uses two-space indents; retain the leading path comment (e.g. `# services/gitea/deployment.yaml`) to speed code review. +- Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names. +- List resources in `kustomization.yaml` from namespace/config, through storage, then workloads and networking for predictable diffs. +- Scripts start with `#!/usr/bin/env fish` or bash, stay executable, and follow snake_case names such as `flux_hammer.fish`. + +## Testing Guidelines +- Run `kustomize build` and the dry-run apply for every service you touch; capture failures before opening a PR. +- `flux diff kustomization --path services/` previews reconciliations—link notable output when behavior shifts. 
+- Docker edits: `docker build -f dockerfiles/Dockerfile.monerod .` (swap the file you changed) to verify image builds. + +## Commit & Pull Request Guidelines +- Keep commit subjects short, present-tense, and optionally scoped (`gpu(titan-24): add RuntimeClass`); squash fixups before review. +- Describe linked issues, affected services, and required operator steps (e.g. `flux reconcile kustomization services-gitea`) in the PR body. +- Focus each PR on one kustomization or service and update `infrastructure/flux-system` when Flux must track new folders. +- Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes. + +## Security & Configuration Tips +- Never commit credentials; use Vault workflows (`services/vault/`) or SOPS-encrypted manifests wired through `infrastructure/flux-system`. +- Node selectors and tolerations gate workloads to hardware like `hardware: rpi4`; confirm labels before scaling or renaming nodes. +- Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift. + +## Dashboard roadmap / context (2025-12-02) +- Atlas dashboards are generated via `scripts/dashboards_render_atlas.py --build`, which writes JSON under `services/monitoring/dashboards/` and ConfigMaps under `services/monitoring/`. Keep the Grafana manifests in sync by regenerating after edits. +- Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie. +- Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned. +- Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview. 
+ +## Monitoring state (2025-12-03) +- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady. +- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub). +- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity. +- Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed. + +## Upcoming priorities (SSO/storage/mail) +- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe. +- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress. +- Plan mail: mostly self-hosted, relay through trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows. + +## SSO plan sketch (2025-12-03) +- IdP: use Keycloak (preferred) in a new `sso` namespace, Bitnami or codecentric chart with Postgres backing store (single PVC), ingress `sso.bstein.dev`, admin user bound to brad@bstein.dev; stick with local DB initially (no external IdP). +- Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via Jellyfin OAuth/OpenID plugin (map existing usernames; run migration to pre-create users in Keycloak with same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens. +- Steps to implement: + 1) Add service folder `services/keycloak/` (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile. 
+ 2) Seed realm `atlas` with users (import CSV/realm). Create client for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set email for brad@bstein.dev as admin. + 3) Reconfigure Grafana to OIDC (disable anonymous to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini). + 4) Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth. + 5) Deploy Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate password reset path. +- Migration caution: do not delete existing local creds until SSO validated; keep Pegasus working via Jellyfin tokens during transition. + +## Postgres centralization (2025-12-03) +- Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes. Use it for services that can easily point at an external DB. +- Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported. 
diff --git a/clusters/atlas/flux-system/gotk-sync.yaml b/clusters/atlas/flux-system/gotk-sync.yaml index 473ab99..46f65d3 100644 --- a/clusters/atlas/flux-system/gotk-sync.yaml +++ b/clusters/atlas/flux-system/gotk-sync.yaml @@ -8,7 +8,7 @@ metadata: spec: interval: 1m0s ref: - branch: main + branch: feature/atlas-monitoring secretRef: name: flux-system-gitea url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git diff --git a/clusters/atlas/flux-system/platform/monitoring/kustomization.yaml b/clusters/atlas/flux-system/platform/monitoring/kustomization.yaml index 2899531..82ad672 100644 --- a/clusters/atlas/flux-system/platform/monitoring/kustomization.yaml +++ b/clusters/atlas/flux-system/platform/monitoring/kustomization.yaml @@ -11,4 +11,4 @@ spec: sourceRef: kind: GitRepository name: flux-system - wait: true + wait: false diff --git a/infrastructure/traefik/deployment.yaml b/infrastructure/traefik/deployment.yaml index ba16909..a34307a 100644 --- a/infrastructure/traefik/deployment.yaml +++ b/infrastructure/traefik/deployment.yaml @@ -39,6 +39,12 @@ items: - --metrics.prometheus.addEntryPointsLabels=true - --metrics.prometheus.addRoutersLabels=true - --metrics.prometheus.addServicesLabels=true + - --entrypoints.web.transport.respondingTimeouts.readTimeout=0s + - --entrypoints.web.transport.respondingTimeouts.writeTimeout=0s + - --entrypoints.web.transport.respondingTimeouts.idleTimeout=0s + - --entrypoints.websecure.transport.respondingTimeouts.readTimeout=0s + - --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=0s + - --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=0s - --entrypoints.metrics.address=:9100 - --metrics.prometheus.entryPoint=metrics image: traefik:v3.3.3 diff --git a/scripts/dashboards_render_atlas.py b/scripts/dashboards_render_atlas.py new file mode 100644 index 0000000..93de006 --- /dev/null +++ b/scripts/dashboards_render_atlas.py @@ -0,0 +1,1434 @@ +#!/usr/bin/env python3 +"""Generate Atlas Grafana 
dashboards and render them into ConfigMaps. + +Usage: + scripts/dashboards_render_atlas.py --build # rebuild JSON + ConfigMaps + scripts/dashboards_render_atlas.py # re-render ConfigMaps from JSON +""" + +import argparse +import json +import textwrap +from pathlib import Path + +# --------------------------------------------------------------------------- +# Paths, folders, and shared metadata +# --------------------------------------------------------------------------- + +ROOT = Path(__file__).resolve().parents[1] +DASHBOARD_DIR = ROOT / "services" / "monitoring" / "dashboards" +CONFIG_TEMPLATE = textwrap.dedent( + """# {relative_path} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {name} + labels: + grafana_dashboard: "1" +data: + {key}: | +{payload} +""" +) + +PROM_DS = {"type": "prometheus", "uid": "atlas-vm"} +PUBLIC_FOLDER = "overview" +PRIVATE_FOLDER = "atlas-internal" + +PERCENT_THRESHOLDS = { + "mode": "percentage", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 70}, + {"color": "red", "value": 85}, + ], +} + +# --------------------------------------------------------------------------- +# Cluster metadata +# --------------------------------------------------------------------------- + +CONTROL_PLANE_NODES = ["titan-0a", "titan-0b", "titan-0c"] +CONTROL_DEPENDENCIES = ["titan-db"] +CONTROL_ALL = CONTROL_PLANE_NODES + CONTROL_DEPENDENCIES +WORKER_NODES = [ + "titan-04", + "titan-05", + "titan-06", + "titan-07", + "titan-08", + "titan-09", + "titan-10", + "titan-11", + "titan-12", + "titan-13", + "titan-14", + "titan-15", + "titan-16", + "titan-17", + "titan-18", + "titan-19", + "titan-22", + "titan-24", +] + +CONTROL_REGEX = "|".join(CONTROL_PLANE_NODES) +CONTROL_ALL_REGEX = "|".join(CONTROL_ALL) +WORKER_REGEX = "|".join(WORKER_NODES) +CONTROL_TOTAL = len(CONTROL_PLANE_NODES) +WORKER_TOTAL = len(WORKER_NODES) +CONTROL_SUFFIX = f"/{CONTROL_TOTAL}" +WORKER_SUFFIX = f"/{WORKER_TOTAL}" +CP_ALLOWED_NS = 
"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system" +LONGHORN_NODE_REGEX = "titan-1[2-9]|titan-2[24]" +GAUGE_WIDTHS = [5, 5, 5, 5, 4] +CONTROL_WORKLOADS_EXPR = ( + f'sum(kube_pod_info{{node=~"{CONTROL_REGEX}",namespace!~"{CP_ALLOWED_NS}"}}) or on() vector(0)' +) + +# --------------------------------------------------------------------------- +# PromQL helpers +# --------------------------------------------------------------------------- + +NODE_INFO = 'label_replace(node_uname_info{nodename!=""}, "node", "$1", "nodename", "(.*)")' + + +def node_filter(regex): + """Return a selector that evaluates to 1 for nodes matching the regex.""" + return ( + f'label_replace(node_uname_info{{nodename=~"{regex}"}}, ' + '"node", "$1", "nodename", "(.*)")' + ) + + +def scoped_node_expr(base, scope=""): + """Attach nodename metadata and optionally filter to a scope regex.""" + expr = f"avg by (node) (({base}) * on(instance) group_left(node) {NODE_INFO})" + if scope: + expr = f"({expr}) * on(node) group_left() {node_filter(scope)}" + return expr + + +def node_cpu_expr(scope=""): + idle = 'avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))' + base = f"(1 - {idle}) * 100" + return scoped_node_expr(base, scope) + + +def node_mem_expr(scope=""): + usage = ( + "avg by (instance) (" + "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) " + "/ node_memory_MemTotal_bytes * 100)" + ) + return scoped_node_expr(usage, scope) + + +def filesystem_usage_expr(mount, scope=""): + base = ( + f'avg by (instance) (' + f'(1 - (node_filesystem_avail_bytes{{mountpoint="{mount}",fstype!~"tmpfs|overlay"}} ' + f'/ node_filesystem_size_bytes{{mountpoint="{mount}",fstype!~"tmpfs|overlay"}})) * 100)' + ) + return scoped_node_expr(base, scope) + + +def root_usage_expr(scope=""): + return filesystem_usage_expr("/", scope) + + +def astreae_usage_expr(mount): + return ( + f"100 - 
(sum(node_filesystem_avail_bytes{{mountpoint=\"{mount}\",fstype!~\"tmpfs|overlay\"}}) / " + f"sum(node_filesystem_size_bytes{{mountpoint=\"{mount}\",fstype!~\"tmpfs|overlay\"}}) * 100)" + ) + + +def astreae_free_expr(mount): + return f"sum(node_filesystem_avail_bytes{{mountpoint=\"{mount}\",fstype!~\"tmpfs|overlay\"}})" + + +def topk_with_node(expr): + return f'label_replace(topk(1, {expr}), "__name__", "$1", "node", "(.*)")' + + +def node_net_expr(scope=""): + base = ( + 'sum by (instance) (' + 'rate(node_network_receive_bytes_total{device!~"lo"}[5m]) ' + '+ rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))' + ) + return scoped_node_expr(base, scope) + + +def node_io_expr(scope=""): + base = ( + "sum by (instance) (rate(node_disk_read_bytes_total[5m]) " + "+ rate(node_disk_written_bytes_total[5m]))" + ) + return scoped_node_expr(base, scope) + + +def namespace_share_expr(resource_expr): + selected = f"( {resource_expr} ) and on(namespace) ( {NAMESPACE_TOP_FILTER} )" + total = f"clamp_min(sum( {selected} ), 1)" + return f"100 * ( {selected} ) / {total}" + + +def namespace_cpu_share_expr(): + return namespace_share_expr(NAMESPACE_CPU_RAW) + + +def namespace_ram_share_expr(): + return namespace_share_expr(NAMESPACE_RAM_RAW) + + +def namespace_gpu_share_expr(): + return namespace_share_expr(NAMESPACE_GPU_RAW) + + +PROBLEM_PODS_EXPR = 'sum(max by (namespace,pod) (kube_pod_status_phase{phase!~"Running|Succeeded"}))' +CRASHLOOP_EXPR = ( + 'sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason' + '{reason=~"CrashLoopBackOff|ImagePullBackOff"}))' +) +STUCK_TERMINATING_EXPR = ( + 'sum(max by (namespace,pod) (' + '((time() - kube_pod_deletion_timestamp{pod!=""}) > bool 600)' + ' and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=""} > bool 0)' + '))' +) +PROBLEM_TABLE_EXPR = ( + "(time() - kube_pod_created{pod!=\"\"}) " + "* on(namespace,pod) group_left(node) kube_pod_info " + "* on(namespace,pod) group_left(phase) " + "max by 
(namespace,pod,phase) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})" +) +CRASHLOOP_TABLE_EXPR = ( + "(time() - kube_pod_created{pod!=\"\"}) " + "* on(namespace,pod) group_left(node) kube_pod_info " + "* on(namespace,pod,container) group_left(reason) " + "max by (namespace,pod,container,reason) " + "(kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})" +) +STUCK_TABLE_EXPR = ( + "(" + "((time() - kube_pod_deletion_timestamp{pod!=\"\"}) " + "and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)) " + "* on(namespace,pod) group_left(node) kube_pod_info" + ")" +) + +NAMESPACE_CPU_RAW = ( + 'sum(rate(container_cpu_usage_seconds_total{namespace!="",pod!="",container!=""}[5m])) by (namespace)' +) +NAMESPACE_RAM_RAW = ( + 'sum(container_memory_working_set_bytes{namespace!="",pod!="",container!=""}) by (namespace)' +) +GPU_NODES = ["titan-20", "titan-21", "titan-22", "titan-24"] +GPU_NODE_REGEX = "|".join(GPU_NODES) +NAMESPACE_GPU_ALLOC = ( + 'sum((kube_pod_container_resource_requests{namespace!="",resource="nvidia.com/gpu"}' + ' or kube_pod_container_resource_limits{namespace!="",resource="nvidia.com/gpu"})) by (namespace)' +) +NAMESPACE_GPU_USAGE_SHARE = ( + 'sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))' +) +NAMESPACE_GPU_USAGE_INSTANT = 'sum(DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}) by (namespace)' +NAMESPACE_GPU_RAW = ( + "(" + + NAMESPACE_GPU_USAGE_SHARE + + ") or on(namespace) (" + + NAMESPACE_CPU_RAW + + " * 0)" +) +NAMESPACE_GPU_WEIGHT = ( + "(" + + NAMESPACE_GPU_ALLOC + + ") or on(namespace) (" + + NAMESPACE_CPU_RAW + + " * 0)" +) +NAMESPACE_ACTIVITY_SCORE = ( + "( " + + NAMESPACE_CPU_RAW + + " ) + (" + + NAMESPACE_RAM_RAW + + " / 1e9) + (" + + NAMESPACE_GPU_WEIGHT + + " * 100)" +) +NAMESPACE_TOP_FILTER = "(topk(10, " + NAMESPACE_ACTIVITY_SCORE + ") >= bool 0)" +TRAEFIK_ROUTER_EXPR = "sum by (router) (rate(traefik_router_requests_total[5m]))" 
+TRAEFIK_NET_INGRESS = ( + 'sum(rate(container_network_receive_bytes_total{namespace="traefik",pod=~"traefik-.*"}[5m]))' + " or on() vector(0)" +) +TRAEFIK_NET_EGRESS = ( + 'sum(rate(container_network_transmit_bytes_total{namespace="traefik",pod=~"traefik-.*"}[5m]))' + " or on() vector(0)" +) +NET_CLUSTER_RX = ( + 'sum(rate(container_network_receive_bytes_total{namespace!="",pod!="",container!=""}[5m]))' + " or on() vector(0)" +) +NET_CLUSTER_TX = ( + 'sum(rate(container_network_transmit_bytes_total{namespace!="",pod!="",container!=""}[5m]))' + " or on() vector(0)" +) +PHYSICAL_NET_FILTER = 'device!~"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*"' +NET_NODE_RX_PHYS = ( + f'sum(rate(node_network_receive_bytes_total{{{PHYSICAL_NET_FILTER}}}[5m])) or on() vector(0)' +) +NET_NODE_TX_PHYS = ( + f'sum(rate(node_network_transmit_bytes_total{{{PHYSICAL_NET_FILTER}}}[5m])) or on() vector(0)' +) +NET_TOTAL_EXPR = NET_NODE_TX_PHYS +NET_INGRESS_EXPR = NET_NODE_RX_PHYS +NET_EGRESS_EXPR = NET_NODE_TX_PHYS +NET_INTERNAL_EXPR = ( + 'sum(rate(container_network_receive_bytes_total{namespace!="traefik",pod!=""}[5m]) ' + '+ rate(container_network_transmit_bytes_total{namespace!="traefik",pod!=""}[5m]))' + ' or on() vector(0)' +) + +# --------------------------------------------------------------------------- +# Panel factories +# --------------------------------------------------------------------------- + + +def stat_panel( + panel_id, + title, + expr, + grid, + *, + unit="none", + thresholds=None, + text_mode="value", + legend=None, + instant=False, + value_suffix=None, + links=None, +): + """Return a Grafana stat panel definition.""" + defaults = { + "color": {"mode": "palette-classic"}, + "mappings": [], + "thresholds": thresholds + or { + "mode": "absolute", + "steps": [ + {"color": "rgba(115, 115, 115, 1)", "value": None}, + {"color": "green", "value": 1}, + ], + }, + "unit": unit, + "custom": {"displayMode": "auto"}, + } + if value_suffix: + 
defaults["custom"]["valueSuffix"] = value_suffix + panel = { + "id": panel_id, + "type": "stat", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A"}], + "fieldConfig": {"defaults": defaults, "overrides": []}, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False}, + "textMode": text_mode, + }, + } + if legend: + panel["targets"][0]["legendFormat"] = legend + if instant: + panel["targets"][0]["instant"] = True + if links: + panel["links"] = links + return panel + + +def gauge_panel( + panel_id, + title, + expr, + grid, + *, + min_value=0, + max_value=1, + thresholds=None, + links=None, +): + return { + "id": panel_id, + "type": "gauge", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A"}], + "fieldConfig": { + "defaults": { + "min": min_value, + "max": max_value, + "thresholds": thresholds + or { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": max_value}, + ], + }, + }, + "overrides": [], + }, + "options": { + "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False}, + "orientation": "auto", + "showThresholdMarkers": False, + "showThresholdLabels": False, + }, + **({"links": links} if links else {}), + } + + +def timeseries_panel( + panel_id, + title, + expr, + grid, + *, + unit="none", + legend=None, + legend_display="table", + legend_placement="bottom", + legend_calcs=None, + time_from=None, + links=None, +): + """Return a Grafana time-series panel definition.""" + panel = { + "id": panel_id, + "type": "timeseries", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A"}], + "fieldConfig": {"defaults": {"unit": unit}, "overrides": []}, + "options": { + "legend": { + "displayMode": legend_display, + "placement": legend_placement, + }, 
+ "tooltip": {"mode": "multi"}, + }, + } + if legend: + panel["targets"][0]["legendFormat"] = legend + if legend_calcs: + panel["options"]["legend"]["calcs"] = legend_calcs + if time_from: + panel["timeFrom"] = time_from + if links: + panel["links"] = links + return panel + + +def table_panel( + panel_id, + title, + expr, + grid, + *, + unit="none", + transformations=None, +): + """Return a Grafana table panel definition.""" + panel = { + "id": panel_id, + "type": "table", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A"}], + "fieldConfig": {"defaults": {"unit": unit}, "overrides": []}, + "options": {"showHeader": True}, + } + if transformations: + panel["transformations"] = transformations + return panel + + +def pie_panel(panel_id, title, expr, grid): + """Return a pie chart panel with readable namespace labels.""" + return { + "id": panel_id, + "type": "piechart", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A", "legendFormat": "{{namespace}}"}], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": {"mode": "palette-classic"}, + }, + "overrides": [], + }, + "options": { + "legend": {"displayMode": "list", "placement": "right"}, + "pieType": "pie", + "displayLabels": ["percent"], + "tooltip": {"mode": "single"}, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False}, + }, + } + + +def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None): + """Return a bar gauge panel with label-aware reduction.""" + panel = { + "id": panel_id, + "type": "bargauge", + "title": title, + "datasource": PROM_DS, + "gridPos": grid, + "targets": [{"expr": expr, "refId": "A", "legendFormat": "{{node}}"}], + "fieldConfig": { + "defaults": { + "unit": unit, + "min": 0, + "max": 100 if unit == "percent" else None, + "thresholds": { + "mode": "absolute", + "steps": [ 
+ {"color": "green", "value": None}, + {"color": "yellow", "value": 50}, + {"color": "orange", "value": 70}, + {"color": "red", "value": 85}, + ], + }, + }, + "overrides": [], + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": ["lastNotNull"], + "fields": "", + "values": False, + }, + }, + } + if links: + panel["links"] = links + return panel + + +def text_panel(panel_id, title, content, grid): + return { + "id": panel_id, + "type": "text", + "title": title, + "gridPos": grid, + "datasource": None, + "options": {"mode": "markdown", "content": content}, + } + + +def link_to(uid): + return [{"title": f"Open {uid} dashboard", "url": f"/d/{uid}", "targetBlank": True}] + + +# --------------------------------------------------------------------------- +# Dashboard builders +# --------------------------------------------------------------------------- + + +def build_overview(): + panels = [] + + row1_stats = [ + ( + 1, + "Workers Ready", + f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})', + WORKER_SUFFIX, + WORKER_TOTAL, + None, + ), + ( + 2, + "Control Plane Ready", + f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})', + CONTROL_SUFFIX, + CONTROL_TOTAL, + None, + ), + ( + 3, + "Control Plane Workloads", + CONTROL_WORKLOADS_EXPR, + None, + 4, + link_to("atlas-pods"), + ), + ( + 4, + "Problem Pods", + PROBLEM_PODS_EXPR, + None, + 1, + link_to("atlas-pods"), + ), + ( + 5, + "Stuck Terminating", + STUCK_TERMINATING_EXPR, + None, + 1, + link_to("atlas-pods"), + ), + ] + + def gauge_grid(idx): + width = GAUGE_WIDTHS[idx] if idx < len(GAUGE_WIDTHS) else 4 + x = sum(GAUGE_WIDTHS[:idx]) + return width, x + + for idx, (panel_id, title, expr, suffix, ok_value, links) in enumerate(row1_stats): + thresholds = None + min_value = 0 + max_value = ok_value or 5 + if panel_id == 1: + max_value = WORKER_TOTAL + thresholds = { + "mode": 
"absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "orange", "value": WORKER_TOTAL - 2}, + {"color": "yellow", "value": WORKER_TOTAL - 1}, + {"color": "green", "value": WORKER_TOTAL}, + ], + } + elif panel_id == 2: + max_value = CONTROL_TOTAL + thresholds = { + "mode": "absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "green", "value": CONTROL_TOTAL}, + ], + } + elif panel_id in (3, 4, 5): + max_value = 4 + thresholds = { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 3}, + ], + } + else: + thresholds = { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": max_value}, + ], + } + width, x = gauge_grid(idx) + if panel_id in (3, 4, 5): + panels.append( + stat_panel( + panel_id, + title, + expr, + {"h": 5, "w": width, "x": x, "y": 0}, + thresholds=thresholds, + legend=None, + links=links, + text_mode="value", + ) + ) + else: + panels.append( + gauge_panel( + panel_id, + title, + expr, + {"h": 5, "w": width, "x": x, "y": 0}, + min_value=min_value, + max_value=max_value, + thresholds=thresholds, + links=links, + ) + ) + + hottest = [ + (7, "Hottest node: CPU", topk_with_node(node_cpu_expr()), "percent"), + (8, "Hottest node: RAM", topk_with_node(node_mem_expr()), "percent"), + (9, "Hottest node: NET (rx+tx)", topk_with_node(node_net_expr()), "Bps"), + (10, "Hottest node: I/O (r+w)", topk_with_node(node_io_expr()), "Bps"), + ] + for idx, (panel_id, title, expr, unit) in enumerate(hottest): + panels.append( + stat_panel( + panel_id, + title, + f"{expr}", + {"h": 3, "w": 6, "x": 6 * idx, "y": 5}, + unit=unit, + thresholds=PERCENT_THRESHOLDS if unit == "percent" else None, + text_mode="name_and_value", + legend="{{node}}", + instant=True, + links=link_to("atlas-nodes"), + ) + ) + + storage_panels = [ + (23, "Astreae Usage", astreae_usage_expr("/mnt/astreae"), 
"percent"), + (24, "Asteria Usage", astreae_usage_expr("/mnt/asteria"), "percent"), + (25, "Astreae Free", astreae_free_expr("/mnt/astreae"), "decbytes"), + (26, "Asteria Free", astreae_free_expr("/mnt/asteria"), "decbytes"), + ] + for idx, (panel_id, title, expr, unit) in enumerate(storage_panels): + panels.append( + stat_panel( + panel_id, + title, + expr, + {"h": 6, "w": 6, "x": 6 * idx, "y": 10}, + unit=unit, + thresholds=PERCENT_THRESHOLDS if unit == "percent" else None, + links=link_to("atlas-storage"), + ) + ) + + panels.append( + pie_panel( + 11, + "Namespace CPU Share", + namespace_cpu_share_expr(), + {"h": 9, "w": 8, "x": 0, "y": 16}, + ) + ) + panels.append( + pie_panel( + 12, + "Namespace GPU Share", + namespace_gpu_share_expr(), + {"h": 9, "w": 8, "x": 8, "y": 16}, + ) + ) + panels.append( + pie_panel( + 13, + "Namespace RAM Share", + namespace_ram_share_expr(), + {"h": 9, "w": 8, "x": 16, "y": 16}, + ) + ) + + worker_filter = f"{WORKER_REGEX}" + panels.append( + timeseries_panel( + 14, + "Worker Node CPU", + node_cpu_expr(worker_filter), + {"h": 12, "w": 12, "x": 0, "y": 32}, + unit="percent", + legend="{{node}}", + legend_calcs=["last"], + legend_display="table", + legend_placement="right", + links=link_to("atlas-nodes"), + ) + ) + panels.append( + timeseries_panel( + 15, + "Worker Node RAM", + node_mem_expr(worker_filter), + {"h": 12, "w": 12, "x": 12, "y": 32}, + unit="percent", + legend="{{node}}", + legend_calcs=["last"], + legend_display="table", + legend_placement="right", + links=link_to("atlas-nodes"), + ) + ) + + panels.append( + timeseries_panel( + 16, + "Control plane CPU", + node_cpu_expr(CONTROL_REGEX), + {"h": 10, "w": 12, "x": 0, "y": 44}, + unit="percent", + legend="{{node}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 17, + "Control plane RAM", + node_mem_expr(CONTROL_REGEX), + {"h": 10, "w": 12, "x": 12, "y": 44}, + unit="percent", + legend="{{node}}", + 
legend_display="table", + legend_placement="right", + ) + ) + + panels.append( + timeseries_panel( + 18, + "Cluster Ingress Throughput", + NET_INGRESS_EXPR, + {"h": 7, "w": 8, "x": 0, "y": 25}, + unit="Bps", + legend="Ingress (Traefik)", + legend_display="list", + legend_placement="bottom", + links=link_to("atlas-network"), + ) + ) + panels.append( + timeseries_panel( + 19, + "Cluster Egress Throughput", + NET_EGRESS_EXPR, + {"h": 7, "w": 8, "x": 8, "y": 25}, + unit="Bps", + legend="Egress (Traefik)", + legend_display="list", + legend_placement="bottom", + links=link_to("atlas-network"), + ) + ) + panels.append( + timeseries_panel( + 20, + "Intra-Cluster Throughput", + NET_INTERNAL_EXPR, + {"h": 7, "w": 8, "x": 16, "y": 25}, + unit="Bps", + legend="Internal traffic", + legend_display="list", + legend_placement="bottom", + links=link_to("atlas-network"), + ) + ) + + panels.append( + timeseries_panel( + 21, + "Root Filesystem Usage", + root_usage_expr(), + {"h": 16, "w": 12, "x": 0, "y": 54}, + unit="percent", + legend="{{node}}", + legend_calcs=["last"], + legend_display="table", + legend_placement="right", + time_from="30d", + links=link_to("atlas-storage"), + ) + ) + panels.append( + bargauge_panel( + 22, + "Nodes Closest to Full Root Disks", + f"topk(12, {root_usage_expr()})", + {"h": 16, "w": 12, "x": 12, "y": 54}, + unit="percent", + links=link_to("atlas-storage"), + ) + ) + + return { + "uid": "atlas-overview", + "title": "Atlas Overview", + "folderUid": PUBLIC_FOLDER, + "editable": False, + "annotations": {"list": []}, + "panels": panels, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "overview"], + "templating": {"list": []}, + "time": {"from": "now-1h", "to": "now"}, + "refresh": "1m", + "links": [ + {"title": "Atlas Pods", "type": "dashboard", "dashboardUid": "atlas-pods", "keepTime": False}, + {"title": "Atlas Nodes", "type": "dashboard", "dashboardUid": "atlas-nodes", "keepTime": False}, + {"title": "Atlas Storage", "type": "dashboard", 
"dashboardUid": "atlas-storage", "keepTime": False}, + {"title": "Atlas Network", "type": "dashboard", "dashboardUid": "atlas-network", "keepTime": False}, + {"title": "Atlas GPU", "type": "dashboard", "dashboardUid": "atlas-gpu", "keepTime": False}, + ], + } + + +def build_pods_dashboard(): + panels = [] + panels.append( + stat_panel( + 1, + "Problem Pods", + PROBLEM_PODS_EXPR, + {"h": 4, "w": 6, "x": 0, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": 1}, + ], + }, + ) + ) + panels.append( + stat_panel( + 2, + "CrashLoop / ImagePull", + CRASHLOOP_EXPR, + {"h": 4, "w": 6, "x": 6, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": 1}, + ], + }, + ) + ) + panels.append( + stat_panel( + 3, + "Stuck Terminating (>10m)", + STUCK_TERMINATING_EXPR, + {"h": 4, "w": 6, "x": 12, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": 1}, + ], + }, + ) + ) + panels.append( + stat_panel( + 4, + "Control Plane Workloads", + f'sum(kube_pod_info{{node=~"{CONTROL_REGEX}",namespace!~"{CP_ALLOWED_NS}"}})', + {"h": 4, "w": 6, "x": 18, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "red", "value": 1}, + ], + }, + ) + ) + + panels.append( + table_panel( + 5, + "Pods Not Running", + PROBLEM_TABLE_EXPR, + {"h": 10, "w": 24, "x": 0, "y": 4}, + unit="s", + transformations=[{"id": "labelsToFields", "options": {}}], + ) + ) + panels.append( + table_panel( + 6, + "CrashLoop / ImagePull", + CRASHLOOP_TABLE_EXPR, + {"h": 10, "w": 24, "x": 0, "y": 14}, + unit="s", + transformations=[{"id": "labelsToFields", "options": {}}], + ) + ) + panels.append( + table_panel( + 7, + "Terminating >10m", + STUCK_TABLE_EXPR, + {"h": 10, "w": 24, "x": 0, "y": 24}, + unit="s", + transformations=[ + {"id": "labelsToFields", "options": {}}, + 
{"id": "filterByValue", "options": {"match": "Value", "operator": "gt", "value": 600}}, + ], + ) + ) + return { + "uid": "atlas-pods", + "title": "Atlas Pods", + "folderUid": PRIVATE_FOLDER, + "editable": True, + "panels": panels, + "time": {"from": "now-12h", "to": "now"}, + "annotations": {"list": []}, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "pods"], + } + + +def build_nodes_dashboard(): + panels = [] + panels.append( + stat_panel( + 1, + "Worker Nodes Ready", + f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})', + {"h": 4, "w": 8, "x": 0, "y": 0}, + value_suffix=WORKER_SUFFIX, + ) + ) + panels.append( + stat_panel( + 2, + "Control Plane Ready", + f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})', + {"h": 4, "w": 8, "x": 8, "y": 0}, + value_suffix=CONTROL_SUFFIX, + ) + ) + panels.append( + stat_panel( + 3, + "Control Plane Workloads", + f'sum(kube_pod_info{{node=~"{CONTROL_REGEX}",namespace!~"{CP_ALLOWED_NS}"}})', + {"h": 4, "w": 8, "x": 16, "y": 0}, + ) + ) + panels.append( + timeseries_panel( + 4, + "Node CPU", + node_cpu_expr(), + {"h": 9, "w": 24, "x": 0, "y": 4}, + unit="percent", + legend="{{node}}", + legend_calcs=["last"], + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 5, + "Node RAM", + node_mem_expr(), + {"h": 9, "w": 24, "x": 0, "y": 13}, + unit="percent", + legend="{{node}}", + legend_calcs=["last"], + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 6, + "Control Plane (incl. titan-db) CPU", + node_cpu_expr(CONTROL_ALL_REGEX), + {"h": 9, "w": 12, "x": 0, "y": 22}, + unit="percent", + legend="{{node}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 7, + "Control Plane (incl. 
titan-db) RAM", + node_mem_expr(CONTROL_ALL_REGEX), + {"h": 9, "w": 12, "x": 12, "y": 22}, + unit="percent", + legend="{{node}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 8, + "Root Filesystem Usage", + root_usage_expr(), + {"h": 9, "w": 24, "x": 0, "y": 31}, + unit="percent", + legend="{{node}}", + legend_display="table", + legend_placement="right", + time_from="30d", + ) + ) + return { + "uid": "atlas-nodes", + "title": "Atlas Nodes", + "folderUid": PRIVATE_FOLDER, + "editable": True, + "panels": panels, + "time": {"from": "now-12h", "to": "now"}, + "annotations": {"list": []}, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "nodes"], + } + + +def build_storage_dashboard(): + panels = [] + panels.append( + stat_panel( + 1, + "Astreae Usage", + astreae_usage_expr("/mnt/astreae"), + {"h": 5, "w": 6, "x": 0, "y": 0}, + unit="percent", + thresholds=PERCENT_THRESHOLDS, + ) + ) + panels.append( + stat_panel( + 2, + "Asteria Usage", + astreae_usage_expr("/mnt/asteria"), + {"h": 5, "w": 6, "x": 6, "y": 0}, + unit="percent", + thresholds=PERCENT_THRESHOLDS, + ) + ) + panels.append( + stat_panel( + 3, + "Astreae Free", + astreae_free_expr("/mnt/astreae"), + {"h": 5, "w": 6, "x": 12, "y": 0}, + unit="decbytes", + ) + ) + panels.append( + stat_panel( + 4, + "Asteria Free", + astreae_free_expr("/mnt/asteria"), + {"h": 5, "w": 6, "x": 18, "y": 0}, + unit="decbytes", + ) + ) + panels.append( + timeseries_panel( + 5, + "Astreae Per-Node Usage", + filesystem_usage_expr("/mnt/astreae", LONGHORN_NODE_REGEX), + {"h": 9, "w": 12, "x": 0, "y": 5}, + unit="percent", + legend="{{node}}", + legend_display="table", + legend_placement="right", + time_from="30d", + ) + ) + panels.append( + timeseries_panel( + 6, + "Asteria Per-Node Usage", + filesystem_usage_expr("/mnt/asteria", LONGHORN_NODE_REGEX), + {"h": 9, "w": 12, "x": 12, "y": 5}, + unit="percent", + legend="{{node}}", + legend_display="table", + 
legend_placement="right", + time_from="30d", + ) + ) + panels.append( + timeseries_panel( + 7, + "Astreae Usage History", + astreae_usage_expr("/mnt/astreae"), + {"h": 9, "w": 12, "x": 0, "y": 14}, + unit="percent", + time_from="90d", + ) + ) + panels.append( + timeseries_panel( + 8, + "Asteria Usage History", + astreae_usage_expr("/mnt/asteria"), + {"h": 9, "w": 12, "x": 12, "y": 14}, + unit="percent", + time_from="90d", + ) + ) + return { + "uid": "atlas-storage", + "title": "Atlas Storage", + "folderUid": PRIVATE_FOLDER, + "editable": True, + "panels": panels, + "time": {"from": "now-12h", "to": "now"}, + "annotations": {"list": []}, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "storage"], + } + + +def build_network_dashboard(): + panels = [] + panels.append( + stat_panel( + 1, + "Ingress Traffic", + NET_INGRESS_EXPR, + {"h": 4, "w": 8, "x": 0, "y": 0}, + unit="Bps", + ) + ) + panels.append( + stat_panel( + 2, + "Egress Traffic", + NET_EGRESS_EXPR, + {"h": 4, "w": 8, "x": 8, "y": 0}, + unit="Bps", + ) + ) + panels.append( + stat_panel( + 3, + "Intra-Cluster Traffic", + NET_INTERNAL_EXPR, + {"h": 4, "w": 8, "x": 16, "y": 0}, + unit="Bps", + ) + ) + panels.append( + stat_panel( + 4, + "Top Router req/s", + f"topk(1, {TRAEFIK_ROUTER_EXPR})", + {"h": 4, "w": 8, "x": 0, "y": 4}, + unit="req/s", + legend="{{router}}", + ) + ) + panels.append( + timeseries_panel( + 5, + "Per-Node Throughput", + f'avg by (node) (({NET_NODE_TX_PHYS} + {NET_NODE_RX_PHYS}) * on(instance) group_left(node) {NODE_INFO})', + {"h": 8, "w": 24, "x": 0, "y": 8}, + unit="Bps", + legend="{{node}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + table_panel( + 6, + "Top Namespaces", + 'topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=""}[5m]) ' + '+ rate(container_network_receive_bytes_total{namespace!=""}[5m])) by (namespace))', + {"h": 9, "w": 12, "x": 0, "y": 16}, + unit="Bps", + transformations=[{"id": "labelsToFields", 
"options": {}}], + ) + ) + panels.append( + table_panel( + 7, + "Top Pods", + 'topk(10, sum(rate(container_network_transmit_bytes_total{pod!=""}[5m]) ' + '+ rate(container_network_receive_bytes_total{pod!=""}[5m])) by (namespace,pod))', + {"h": 9, "w": 12, "x": 12, "y": 16}, + unit="Bps", + transformations=[{"id": "labelsToFields", "options": {}}], + ) + ) + panels.append( + timeseries_panel( + 8, + "Traefik Routers (req/s)", + f"topk(10, {TRAEFIK_ROUTER_EXPR})", + {"h": 9, "w": 12, "x": 0, "y": 25}, + unit="req/s", + legend="{{router}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 9, + "Traefik Entrypoints (req/s)", + 'sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))', + {"h": 9, "w": 12, "x": 12, "y": 25}, + unit="req/s", + legend="{{entrypoint}}", + legend_display="table", + legend_placement="right", + ) + ) + return { + "uid": "atlas-network", + "title": "Atlas Network", + "folderUid": PRIVATE_FOLDER, + "editable": True, + "panels": panels, + "time": {"from": "now-12h", "to": "now"}, + "annotations": {"list": []}, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "network"], + } + + +def build_gpu_dashboard(): + panels = [] + panels.append( + pie_panel( + 1, + "Namespace GPU Share", + namespace_gpu_share_expr(), + {"h": 8, "w": 12, "x": 0, "y": 0}, + ) + ) + panels.append( + timeseries_panel( + 2, + "GPU Util by Namespace", + NAMESPACE_GPU_USAGE_INSTANT, + {"h": 8, "w": 12, "x": 12, "y": 0}, + unit="percent", + legend="{{namespace}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + timeseries_panel( + 3, + "GPU Util by Node", + 'sum by (Hostname) (DCGM_FI_DEV_GPU_UTIL{pod!=""})', + {"h": 8, "w": 12, "x": 0, "y": 8}, + unit="percent", + legend="{{Hostname}}", + legend_display="table", + legend_placement="right", + ) + ) + panels.append( + table_panel( + 4, + "Top Pods by GPU Util", + 'topk(10, sum(DCGM_FI_DEV_GPU_UTIL{pod!=""}) by 
(namespace,pod,Hostname))', + {"h": 8, "w": 12, "x": 12, "y": 8}, + unit="percent", + transformations=[{"id": "labelsToFields", "options": {}}], + ) + ) + return { + "uid": "atlas-gpu", + "title": "Atlas GPU", + "folderUid": PRIVATE_FOLDER, + "editable": True, + "panels": panels, + "time": {"from": "now-12h", "to": "now"}, + "annotations": {"list": []}, + "schemaVersion": 39, + "style": "dark", + "tags": ["atlas", "gpu"], + } + + +DASHBOARDS = { + "atlas-overview": { + "builder": build_overview, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-overview.yaml", + }, + "atlas-pods": { + "builder": build_pods_dashboard, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-pods.yaml", + }, + "atlas-nodes": { + "builder": build_nodes_dashboard, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-nodes.yaml", + }, + "atlas-storage": { + "builder": build_storage_dashboard, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-storage.yaml", + }, + "atlas-network": { + "builder": build_network_dashboard, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-network.yaml", + }, + "atlas-gpu": { + "builder": build_gpu_dashboard, + "configmap": ROOT / "services" / "monitoring" / "grafana-dashboard-gpu.yaml", + }, +} + + +def write_json(uid, data): + DASHBOARD_DIR.mkdir(parents=True, exist_ok=True) + path = DASHBOARD_DIR / f"{uid}.json" + path.write_text(json.dumps(data, indent=2) + "\n") + + +def render_configmap(uid, info): + json_path = DASHBOARD_DIR / f"{uid}.json" + payload = json.dumps(json.loads(json_path.read_text()), indent=2) + indented = "\n".join(" " + line for line in payload.splitlines()) + output_path = info["configmap"] + content = CONFIG_TEMPLATE.format( + relative_path=output_path.relative_to(ROOT), + name=output_path.stem, + key=json_path.name, + payload=indented, + ) + output_path.write_text(content) + print(f"Rendered {json_path.name} -> {output_path.relative_to(ROOT)}") + + 
+def main(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--build", action="store_true", help="Regenerate dashboard JSON files from builders") + args = parser.parse_args() + + if args.build: + for uid, info in DASHBOARDS.items(): + write_json(uid, info["builder"]()) + + for uid, info in DASHBOARDS.items(): + render_configmap(uid, info) + + +if __name__ == "__main__": + main() diff --git a/scripts/styx_prep_nvme_luks.sh b/scripts/styx_prep_nvme_luks.sh new file mode 100755 index 0000000..d5ea0c5 --- /dev/null +++ b/scripts/styx_prep_nvme_luks.sh @@ -0,0 +1,575 @@ +#!/usr/bin/env bash +set -euo pipefail + +# --- CONFIG (edit if needed) --- +# Leave NVME empty → script will auto-detect the SSK dock. +NVME="${NVME:-}" +FLAVOR="${FLAVOR:-desktop}" +# Persistent cache so the image survives reboots. +IMG_DIR="${IMG_DIR:-/var/cache/styx-rpi}" +IMG_FILE="${IMG_FILE:-ubuntu-24.04.3-preinstalled-${FLAVOR}-arm64+raspi.img}" +IMG_BOOT_MNT="${IMG_BOOT_MNT:-/mnt/img-boot}" +IMG_ROOT_MNT="${IMG_ROOT_MNT:-/mnt/img-root}" +TGT_ROOT="/mnt/target-root" +TGT_BOOT="/mnt/target-boot" + +STYX_USER="styx" +STYX_HOSTNAME="titan-ag" +STYX_PASS="TempPass#123" # will be forced to change on first login via cloud-init +SSH_PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOb8oMX6u0z3sH/p/WBGlvPXXdbGETCKzWYwR/dd6fZb titan-bastion" + +# Video / input prefs +DSI_FLAGS="video=DSI-1:800x480@60D video=HDMI-A-1:off video=HDMI-A-2:off" + +# --- Helpers --- +fatal(){ echo "ERROR: $*" >&2; exit 1; } +need(){ command -v "$1" >/dev/null || fatal "Missing tool: $1"; } + +require_root(){ [[ $EUID -eq 0 ]] || exec sudo -E "$0" "$@"; } + +part() { + local n="$1" + if [[ "$NVME" =~ [0-9]$ ]]; then + echo "${NVME}p${n}" + else + echo "${NVME}${n}" + fi +} + +auto_detect_target_disk() { + # If user already set NVME, validate and return + if [[ -n "${NVME:-}" ]]; then + [[ -b "$NVME" ]] || fatal "NVME='$NVME' is not a block device" + return + fi + + # Prefer stable by-id symlinks + 
local byid + byid=$(ls -1 /dev/disk/by-id/usb-SSK* 2>/dev/null | head -n1 || true) + if [[ -n "$byid" ]]; then + NVME=$(readlink -f "$byid") + else + # Heuristic via lsblk -S: look for USB with SSK/Ingram/Storage in vendor/model (case-insensitive) + NVME=$(lsblk -S -p -o NAME,TRAN,VENDOR,MODEL | \ + awk '/ usb / && (tolower($3) ~ /ssk|ingram/ || tolower($4) ~ /ssk|storage/){print $1; exit}') + fi + + [[ -n "${NVME:-}" && -b "$NVME" ]] || fatal "Could not auto-detect SSK USB NVMe dock. Export NVME=/dev/sdX and re-run." + echo "Auto-detected target disk: $NVME" +} + +preflight_cleanup() { + local img="$IMG_DIR/$IMG_FILE" + + # 1) Unmount image mountpoints and detach only loops for this IMG + umount -lf "$IMG_BOOT_MNT" "$IMG_ROOT_MNT" 2>/dev/null || true + # losetup -j exits non-zero if no association → tolerate it + { losetup -j "$img" | cut -d: -f1 | xargs -r losetup -d; } 2>/dev/null || true + + # 2) Unmount our target mounts + umount -lf "$TGT_ROOT/boot/firmware" "$TGT_BOOT" "$TGT_ROOT" 2>/dev/null || true + + # 3) Unmount the actual target partitions if mounted anywhere (tolerate 'not found') + for p in "$(part 1)" "$(part 2)"; do + # findmnt returns 1 when no match → capture and iterate if any + while read -r mnt; do + [ -n "$mnt" ] && umount -lf "$mnt" 2>/dev/null || true + done < <(findmnt -rno TARGET -S "$p" 2>/dev/null || true) + done + + # 4) Close dm-crypt mapping (if it exists) + cryptsetup luksClose cryptroot 2>/dev/null || true + dmsetup remove -f cryptroot 2>/dev/null || true + + # 5) Let udev settle + command -v udevadm >/dev/null && udevadm settle || true +} + +guard_target_device() { + # Refuse to operate if NVME appears to be the current system disk + local root_src root_disk + root_src=$(findmnt -no SOURCE /) + root_disk=$(lsblk -no pkname "$root_src" 2>/dev/null || true) + if [[ -n "$root_disk" && "/dev/$root_disk" == "$NVME" ]]; then + fatal "Refusing to operate on system disk ($NVME). Pick the external NVMe." + fi +} + +need_host_fido2() { + if ! 
command -v fido2-token >/dev/null 2>&1; then + echo "Host is missing fido2-token. On Arch: sudo pacman -S libfido2" + echo "On Debian/Ubuntu host: sudo apt-get install fido2-tools" + exit 1 + fi +} + +ensure_image() { + mkdir -p "$IMG_DIR" + chmod 755 "$IMG_DIR" + + local BASE="https://cdimage.ubuntu.com/releases/noble/release" + local XZ="ubuntu-24.04.3-preinstalled-${FLAVOR}-arm64+raspi.img.xz" + + # If the decompressed .img is missing, fetch/decompress into the cache. + if [[ ! -f "$IMG_DIR/$IMG_FILE" ]]; then + need curl # unxz/xz is checked below. Arch: pacman -S curl xz | Ubuntu: apt-get install curl xz-utils + if [[ ! -f "$IMG_DIR/$XZ" ]]; then + echo "Fetching image…" + curl -fL -o "$IMG_DIR/$XZ" "$BASE/$XZ" + fi + echo "Decompressing to $IMG_DIR/$IMG_FILE …" + # Keep the .xz for future runs; stream-decompress to the .img + if command -v unxz >/dev/null 2>&1; then + unxz -c "$IMG_DIR/$XZ" > "$IMG_DIR/$IMG_FILE" + else + need xz + xz -dc "$IMG_DIR/$XZ" > "$IMG_DIR/$IMG_FILE" + fi + sync + else + echo "Using cached image: $IMG_DIR/$IMG_FILE" + fi +} + +ensure_binfmt_aarch64(){ + # Register qemu-aarch64 for chrooted ARM64 apt runs + if [[ ! -e /proc/sys/fs/binfmt_misc/qemu-aarch64 ]]; then + need docker + systemctl enable --now docker >/dev/null 2>&1 || true + docker run --rm --privileged tonistiigi/binfmt --install arm64 >/dev/null + fi + if [[ ! 
-x /usr/local/bin/qemu-aarch64-static ]]; then + docker rm -f qemu-static >/dev/null 2>&1 || true + docker create --name qemu-static docker.io/multiarch/qemu-user-static:latest >/dev/null + docker cp qemu-static:/usr/bin/qemu-aarch64-static /usr/local/bin/ + chmod 755 /usr/local/bin/qemu-aarch64-static + docker rm qemu-static >/dev/null + fi +} + +open_image() { + [[ -r "$IMG_DIR/$IMG_FILE" ]] || fatal "Image not found: $IMG_DIR/$IMG_FILE" + mkdir -p "$IMG_BOOT_MNT" "$IMG_ROOT_MNT" + + # Pre-clean: detach any previous loop(s) for this image (tolerate absence) + umount -lf "$IMG_BOOT_MNT" 2>/dev/null || true + umount -lf "$IMG_ROOT_MNT" 2>/dev/null || true + # If no loop is attached, losetup -j returns non-zero → swallow it + mapfile -t OLD < <({ losetup -j "$IMG_DIR/$IMG_FILE" | cut -d: -f1; } 2>/dev/null || true) + for L in "${OLD[@]:-}"; do losetup -d "$L" 2>/dev/null || true; done + command -v udevadm >/dev/null && udevadm settle || true + + # Attach with partition scan; wait for partition nodes to exist + LOOP=$(losetup --find --show --partscan "$IMG_DIR/$IMG_FILE") || fatal "losetup failed" + command -v udevadm >/dev/null && udevadm settle || true + for _ in {1..25}; do + [[ -b "${LOOP}p1" && -b "${LOOP}p2" ]] && break + sleep 0.1 + command -v udevadm >/dev/null && udevadm settle || true + done + [[ -b "${LOOP}p1" ]] || fatal "loop partitions not present for $LOOP" + + # Cleanup on exit: unmount first, then detach loop (tolerate absence) + trap 'umount -lf "'"$IMG_BOOT_MNT"'" "'"$IMG_ROOT_MNT"'" 2>/dev/null; losetup -d "'"$LOOP"'" 2>/dev/null' EXIT + + # Mount image partitions read-only + mount -o ro "${LOOP}p1" "$IMG_BOOT_MNT" + mount -o ro "${LOOP}p2" "$IMG_ROOT_MNT" + + # Sanity checks without using failing pipelines + # start*.elf must exist + if ! compgen -G "$IMG_BOOT_MNT/start*.elf" > /dev/null; then + fatal "start*.elf not found in image" + fi + # vmlinuz-* must exist + if ! 
compgen -G "$IMG_ROOT_MNT/boot/vmlinuz-*" > /dev/null; then + fatal "vmlinuz-* not found in image root" + fi +} + +confirm_and_wipe(){ + lsblk -o NAME,SIZE,MODEL,TRAN,LABEL "$NVME" + read -rp "Type EXACTLY 'WIPE' to destroy ALL DATA on $NVME: " ACK + [[ "$ACK" == "WIPE" ]] || fatal "Aborted" + wipefs -a "$NVME" + sgdisk -Zo "$NVME" + # GPT: 1: 1MiB..513MiB vfat ESP; 2: rest LUKS + parted -s "$NVME" mklabel gpt \ + mkpart system-boot fat32 1MiB 513MiB set 1 esp on \ + mkpart cryptroot 513MiB 100% + partprobe "$NVME"; sleep 1 + mkfs.vfat -F32 -n system-boot "$(part 1)" +} + +setup_luks(){ + echo "Create LUKS2 on $(part 2) (you will be prompted for a passphrase; keep it as fallback)" + need cryptsetup + cryptsetup luksFormat --type luks2 "$(part 2)" + cryptsetup open "$(part 2)" cryptroot + mkfs.ext4 -L rootfs /dev/mapper/cryptroot +} + +mount_targets(){ + mkdir -p "$TGT_ROOT" "$TGT_BOOT" + mount /dev/mapper/cryptroot "$TGT_ROOT" + mkdir -p "$TGT_ROOT/boot/firmware" + mount "$(part 1)" "$TGT_BOOT" + mount --bind "$TGT_BOOT" "$TGT_ROOT/boot/firmware" +} + +rsync_root_and_boot(){ + need rsync + rsync -aAXH --numeric-ids --delete \ + --exclude='/boot/firmware' --exclude='/boot/firmware/**' \ + --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' \ + --exclude='/run/*' --exclude='/tmp/*' --exclude='/mnt/*' \ + --exclude='/media/*' --exclude='/lost+found' \ + "$IMG_ROOT_MNT"/ "$TGT_ROOT"/ + rsync -aH --delete "$IMG_BOOT_MNT"/ "$TGT_ROOT/boot/firmware"/ +} + +write_crypttab_fstab(){ + LUUID=$(blkid -s UUID -o value "$(part 2)") + printf 'cryptroot UUID=%s none luks,discard,fido2-device=auto\n' "$LUUID" > "$TGT_ROOT/etc/crypttab" + cat > "$TGT_ROOT/etc/fstab" <> "$C" + grep -q '^cmdline=cmdline.txt' "$C" || sed -i '1i cmdline=cmdline.txt' "$C" + + # Display & buses (Pi 5) + grep -q '^dtoverlay=vc4-kms-v3d-pi5' "$C" || echo 'dtoverlay=vc4-kms-v3d-pi5' >> "$C" + grep -q '^dtparam=i2c_arm=on' "$C" || echo 'dtparam=i2c_arm=on' >> "$C" + grep -q '^dtparam=pciex1=on' "$C" || 
echo 'dtparam=pciex1=on' >> "$C" + grep -q '^dtparam=pciex1_gen=2' "$C" || echo 'dtparam=pciex1_gen=2' >> "$C" + grep -q '^enable_uart=1' "$C" || echo 'enable_uart=1' >> "$C" + + # Minimal, correct dracut hints using the bare UUID + local LUUID; LUUID=$(blkid -s UUID -o value "$(part 2)") + : > "$CL" + { + echo -n "rd.luks.uuid=$LUUID rd.luks.name=$LUUID=cryptroot " + echo -n "root=/dev/mapper/cryptroot rootfstype=ext4 rootwait fixrtc " + echo "console=serial0,115200 console=tty1 ds=nocloud;s=file:///boot/firmware/ ${DSI_FLAGS} rd.debug" + } >> "$CL" +} + +seed_cloud_init(){ + # NoCloud seed to create user, lock down SSH, set hostname, and enable avahi. + cat > "$TGT_ROOT/boot/firmware/user-data" < "$TGT_ROOT/boot/firmware/meta-data" +} + +prep_chroot_mounts(){ + for d in dev proc sys; do mount --bind "/$d" "$TGT_ROOT/$d"; done + mount -t devpts devpts "$TGT_ROOT/dev/pts" + # Replace the usual resolv.conf symlink with a real file for apt to work + rm -f "$TGT_ROOT/etc/resolv.conf" + cp /etc/resolv.conf "$TGT_ROOT/etc/resolv.conf" + + # Block service starts (no systemd in chroot) + cat > "$TGT_ROOT/usr/sbin/policy-rc.d" <<'EOP' +#!/bin/sh +exit 101 +EOP + chmod +x "$TGT_ROOT/usr/sbin/policy-rc.d" + + # Ensure qemu static is present inside chroot + install -D -m755 /usr/local/bin/qemu-aarch64-static "$TGT_ROOT/usr/bin/qemu-aarch64-static" +} + +in_chroot(){ + chroot "$TGT_ROOT" /usr/bin/qemu-aarch64-static /bin/bash -lc ' +set -euo pipefail +export DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC + +# --- APT sources (ports) --- +cat > /etc/apt/sources.list <<'"'"'EOS'"'"' +deb http://ports.ubuntu.com/ubuntu-ports noble main restricted universe multiverse +deb http://ports.ubuntu.com/ubuntu-ports noble-updates main restricted universe multiverse +deb http://ports.ubuntu.com/ubuntu-ports noble-security main restricted universe multiverse +EOS + +apt-get update + +# --- Remove snaps and pin them off --- +apt-get -y purge snapd || true +rm -rf /snap /var/snap /var/lib/snapd 
/home/*/snap || true +mkdir -p /etc/apt/preferences.d +cat > /etc/apt/preferences.d/nosnap.pref <<'"'"'EOS'"'"' +Package: snapd +Pin: release * +Pin-Priority: -10 +EOS + +# --- Base tools (no flash-kernel; we use dracut) --- +apt-get install -y --no-install-recommends \ + openssh-client openssh-server openssh-sftp-server avahi-daemon \ + cryptsetup dracut fido2-tools libfido2-1 i2c-tools \ + python3-smbus python3-pil zbar-tools qrencode lm-sensors \ + file zstd lz4 || true + +# Camera apps: try rpicam-apps; otherwise basic libcamera tools +apt-get install -y rpicam-apps || apt-get install -y libcamera-tools || true + +# --- Persistent journal so we can read logs after failed boot --- +mkdir -p /etc/systemd/journald.conf.d +cat > /etc/systemd/journald.conf.d/99-persistent.conf <<'"'"'EOS'"'"' +[Journal] +Storage=persistent +EOS + +# --- SSH hardening (ensure file exists even if package was half-installed) --- +if [ ! -f /etc/ssh/sshd_config ]; then + mkdir -p /etc/ssh + cat > /etc/ssh/sshd_config <<'"'"'EOS'"'"' +PermitRootLogin no +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +# Accept defaults for the rest +EOS +fi +sed -i -e "s/^#\?PasswordAuthentication .*/PasswordAuthentication no/" \ + -e "s/^#\?KbdInteractiveAuthentication .*/KbdInteractiveAuthentication no/" \ + -e "s/^#\?PermitRootLogin .*/PermitRootLogin no/" \ + -e "s/^#\?PubkeyAuthentication .*/PubkeyAuthentication yes/" /etc/ssh/sshd_config || true + +# --- Hostname & hosts --- +echo "'"$STYX_HOSTNAME"'" > /etc/hostname +if grep -q "^127\\.0\\.1\\.1" /etc/hosts; then + sed -i "s/^127\\.0\\.1\\.1.*/127.0.1.1\t'"$STYX_HOSTNAME"'/" /etc/hosts +else + echo -e "127.0.1.1\t'"$STYX_HOSTNAME"'" >> /etc/hosts +fi + +# --- Enable services on first boot --- +mkdir -p /etc/systemd/system/multi-user.target.wants +ln -sf /lib/systemd/system/ssh.service /etc/systemd/system/multi-user.target.wants/ssh.service +ln -sf /lib/systemd/system/avahi-daemon.service 
/etc/systemd/system/multi-user.target.wants/avahi-daemon.service || true + +# --- Ensure i2c group --- +getent group i2c >/dev/null || groupadd i2c + +# --- Dracut configuration (generic, not host-only) --- +mkdir -p /etc/dracut.conf.d +cat > /etc/dracut.conf.d/00-hostonly.conf <<'"'"'EOS'"'"' +hostonly=no +EOS +cat > /etc/dracut.conf.d/10-systemd-crypt.conf <<'"'"'EOS'"'"' +add_dracutmodules+=" systemd crypt " +EOS +cat > /etc/dracut.conf.d/20-drivers.conf <<'"'"'EOS'"'"' +add_drivers+=" nvme xhci_pci xhci_hcd usbhid hid_generic hid " +EOS +cat > /etc/dracut.conf.d/30-fido2.conf <<'"'"'EOS'"'"' +install_items+="/usr/bin/systemd-cryptsetup /usr/bin/fido2-token /usr/lib/*/libfido2.so* /usr/lib/*/libcbor.so*" +EOS + +# --- Build initramfs and place it where firmware expects it --- +KVER=$(ls -1 /lib/modules | sort -V | tail -n1) +dracut --force /boot/initramfs-$KVER.img $KVER +ln -sf initramfs-$KVER.img /boot/initrd.img +ln -sf initramfs-$KVER.img /boot/initrd.img-$KVER +cp -a /boot/initramfs-$KVER.img /boot/firmware/initrd.img + +# --- Create uncompressed kernel for Pi 5 firmware --- +if [ -f "/usr/lib/linux-image-$KVER/Image" ]; then + cp -a "/usr/lib/linux-image-$KVER/Image" /boot/firmware/kernel_2712.img +else + FMT=$(file -b "/boot/vmlinuz-$KVER" || true) + case "$FMT" in + *Zstandard*|*zstd*) zstd -dc "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;; + *LZ4*) lz4 -dc "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;; + *gzip*) zcat "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;; + *) cp -a "/boot/vmlinuz-$KVER" /boot/firmware/kernel_2712.img ;; + esac +fi + +# --- Ensure Pi 5 DTB is present on the boot partition --- +DTB=$(find /lib/firmware -type f -name "bcm2712-rpi-5-b.dtb" | sort | tail -n1 || true) +[ -n "$DTB" ] && cp -a "$DTB" /boot/firmware/ + +# --- Dracut hook to copy rdsosreport.txt to the FAT partition on failure --- +mkdir -p /usr/lib/dracut/modules.d/99copylog +cat > /usr/lib/dracut/modules.d/99copylog/module-setup.sh 
<<'"'"'EOS'"'"' +#!/bin/bash +check() { return 0; } +depends() { echo base; return 0; } +install() { + # Guard $moddir for nounset; derive if absent + local mdir="${moddir:-$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)}" + inst_hook emergency 99 "$mdir/copylog.sh" +} +EOS +chmod +x /usr/lib/dracut/modules.d/99copylog/module-setup.sh + +cat > /usr/lib/dracut/modules.d/99copylog/copylog.sh <<'"'"'EOS'"'"' +#!/bin/sh +set -e +for dev in /dev/nvme0n1p1 /dev/sda1 /dev/sdb1 /dev/mmcblk0p1; do + [ -b "$dev" ] || continue + mkdir -p /mnt/bootfat + if mount -t vfat "$dev" /mnt/bootfat 2>/dev/null; then + if [ -s /run/initramfs/rdsosreport.txt ]; then + cp -f /run/initramfs/rdsosreport.txt /mnt/bootfat/rdsosreport.txt 2>/dev/null || true + sync || true + fi + umount /mnt/bootfat || true + break + fi +done +EOS +chmod +x /usr/lib/dracut/modules.d/99copylog/copylog.sh + +# Rebuild to ensure the copylog module is included +dracut --force /boot/initramfs-$KVER.img $KVER +ln -sf initramfs-$KVER.img /boot/initrd.img +cp -a /boot/initramfs-$KVER.img /boot/firmware/initrd.img + +true +' +} + +verify_boot_assets(){ + echo "---- verify boot assets on FAT ----" + file "$TGT_ROOT/boot/firmware/kernel_2712.img" || true + ls -lh "$TGT_ROOT/boot/firmware/initrd.img" || true + echo "-- config.txt (key lines) --" + grep -E '^(kernel|initramfs|cmdline)=|^dtoverlay=|^dtparam=' "$TGT_ROOT/boot/firmware/config.txt" || true + echo "-- cmdline.txt --" + cat "$TGT_ROOT/boot/firmware/cmdline.txt" || true + echo "-- firmware blobs (sample) --" + ls -1 "$TGT_ROOT/boot/firmware"/start*.elf "$TGT_ROOT/boot/firmware"/fixup*.dat | head -n 8 || true + echo "-- Pi5 DTB --" + ls -l "$TGT_ROOT/boot/firmware/"*rpi-5-b.dtb || true +} + +enroll_fido_tokens(){ + echo "Enrolling FIDO2 Solo keys into $(part 2) ..." 
+ need systemd-cryptenroll + need fido2-token + + # Collect all hidraw paths from both output styles (some distros print 'Device: /dev/hidrawX') + mapfile -t DEVS < <( + fido2-token -L \ + | sed -n 's,^\(/dev/hidraw[0-9]\+\):.*,\1,p; s,^Device:[[:space:]]\+/dev/hidraw\([0-9]\+\).*,/dev/hidraw\1,p' \ + | sort -u + ) + + if (( ${#DEVS[@]} == 0 )); then + echo "No FIDO2 tokens detected; skipping enrollment (you can enroll later)." + echo "Example later: systemd-cryptenroll $(part 2) --fido2-device=/dev/hidrawX --fido2-with-client-pin=no" + return 0 + fi + + # Recommend keeping exactly ONE key plugged during first enrollment to avoid ambiguity. + if (( ${#DEVS[@]} > 1 )); then + echo "Note: multiple FIDO2 tokens present: ${DEVS[*]}" + echo "If enrollment fails, try with only one key inserted." + fi + + local rc=0 + for D in "${DEVS[@]}"; do + echo "-> Enrolling $D (you should be asked to touch the key)" + if ! SYSTEMD_LOG_LEVEL=debug systemd-cryptenroll "$(part 2)" \ + --fido2-device="$D" \ + --fido2-with-client-pin=no \ + --fido2-with-user-presence=yes \ + --fido2-with-user-verification=no \ + --label="solo-$(basename "$D")"; then + echo "WARN: enrollment failed for $D" + rc=1 + fi + done + + echo "Tokens enrolled (if any):" + systemd-cryptenroll "$(part 2)" --list || true + return $rc +} + +cleanup(){ + rm -f "$TGT_ROOT/usr/sbin/policy-rc.d" || true + umount -lf "$TGT_ROOT/dev/pts" 2>/dev/null || true + for d in dev proc sys; do umount -lf "$TGT_ROOT/$d" 2>/dev/null || true; done + umount -lf "$TGT_ROOT/boot/firmware" 2>/dev/null || true + umount -lf "$TGT_BOOT" 2>/dev/null || true + umount -lf "$TGT_ROOT" 2>/dev/null || true + cryptsetup close cryptroot 2>/dev/null || true + umount -lf "$IMG_BOOT_MNT" 2>/dev/null || true + umount -lf "$IMG_ROOT_MNT" 2>/dev/null || true +} + +main(){ + require_root + need losetup; need parted; need rsync + auto_detect_target_disk + echo "Target disk: $NVME" + ensure_binfmt_aarch64 + ensure_image + preflight_cleanup + 
guard_target_device + open_image + confirm_and_wipe + setup_luks + mount_targets + rsync_root_and_boot + write_crypttab_fstab + fix_firmware_files + seed_cloud_init + prep_chroot_mounts + in_chroot + verify_boot_assets + need_host_fido2 + enroll_fido_tokens + cleanup + echo "✅ NVMe prepared." + echo " Install in the Pi 5 and boot with no SD." + echo " Expect LUKS to unlock automatically with a Solo key inserted;" + echo " passphrase fallback remains. Hostname: ${STYX_HOSTNAME} User: ${STYX_USER}" + echo " On first boot, reach it via: ssh -i ~/.ssh/id_ed25519_titan styx@titan-ag.local" +} + +main "$@" diff --git a/services/monitoring/README.md b/services/monitoring/README.md new file mode 100644 index 0000000..835ae1d --- /dev/null +++ b/services/monitoring/README.md @@ -0,0 +1,28 @@ +# services/monitoring + +## Grafana admin secret + +The Grafana Helm release expects a pre-existing secret named `grafana-admin` +in the `monitoring` namespace. Create or rotate it with: + +```bash +kubectl create secret generic grafana-admin \ + --namespace monitoring \ + --from-literal=admin-user=admin \ + --from-literal=admin-password='REPLACE_ME' +``` + +Update the password whenever you rotate credentials. + +## DCGM exporter image + +The NVIDIA GPU metrics DaemonSet expects `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`, mirrored from `docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`. Refresh it in Zot when bumping versions: + +```bash +skopeo copy \ + --all \ + docker://docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 \ + docker://registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 +``` + +When finished mirroring from the control-plane, you can remove temporary tooling with `sudo apt-get purge -y skopeo && sudo apt-get autoremove -y` and clear `~/.config/containers/auth.json`. 
diff --git a/services/monitoring/dashboards/atlas-gpu.json b/services/monitoring/dashboards/atlas-gpu.json new file mode 100644 index 0000000..e67b3d2 --- /dev/null +++ b/services/monitoring/dashboards/atlas-gpu.json @@ -0,0 +1,184 @@ +{ + "uid": "atlas-gpu", + "title": "Atlas GPU", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "piechart", + "title": "Namespace GPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "100 * ( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or 
kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 2, + "type": "timeseries", + "title": "GPU Util by Namespace", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}) by (namespace)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 3, + "type": "timeseries", + "title": "GPU Util by Node", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "targets": [ + { + "expr": "sum by (Hostname) (DCGM_FI_DEV_GPU_UTIL{pod!=\"\"})", + "refId": "A", + "legendFormat": "{{Hostname}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 4, + "type": "table", + "title": "Top Pods by GPU Util", + "datasource": { 
+ "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "targets": [ + { + "expr": "topk(10, sum(DCGM_FI_DEV_GPU_UTIL{pod!=\"\"}) by (namespace,pod,Hostname))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "gpu" + ] +} diff --git a/services/monitoring/dashboards/atlas-network.json b/services/monitoring/dashboards/atlas-network.json new file mode 100644 index 0000000..ff0af9b --- /dev/null +++ b/services/monitoring/dashboards/atlas-network.json @@ -0,0 +1,445 @@ +{ + "uid": "atlas-network", + "title": "Atlas Network", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Ingress Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "Egress Traffic", + 
"datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 0 + }, + "targets": [ + { + "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Intra-Cluster Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 0 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Top Router req/s", + "datasource": { + "type": "prometheus", + 
"uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))", + "refId": "A", + "legendFormat": "{{router}}" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "req/s", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Per-Node Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 8 + }, + "targets": [ + { + "expr": "avg by (node) ((sum by (instance) (rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m]) + rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 6, + "type": "table", + "title": "Top Namespaces", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 16 + }, + "targets": [ + { + "expr": "topk(10, 
sum(rate(container_network_transmit_bytes_total{namespace!=\"\"}[5m]) + rate(container_network_receive_bytes_total{namespace!=\"\"}[5m])) by (namespace))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 7, + "type": "table", + "title": "Top Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 16 + }, + "targets": [ + { + "expr": "topk(10, sum(rate(container_network_transmit_bytes_total{pod!=\"\"}[5m]) + rate(container_network_receive_bytes_total{pod!=\"\"}[5m])) by (namespace,pod))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 8, + "type": "timeseries", + "title": "Traefik Routers (req/s)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 25 + }, + "targets": [ + { + "expr": "topk(10, sum by (router) (rate(traefik_router_requests_total[5m])))", + "refId": "A", + "legendFormat": "{{router}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "req/s" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 9, + "type": "timeseries", + "title": "Traefik Entrypoints (req/s)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 25 + }, + "targets": [ + { + "expr": "sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))", + "refId": "A", + "legendFormat": "{{entrypoint}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "req/s" + }, + "overrides": [] + 
}, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "network" + ] +} diff --git a/services/monitoring/dashboards/atlas-nodes.json b/services/monitoring/dashboards/atlas-nodes.json new file mode 100644 index 0000000..802fe5a --- /dev/null +++ b/services/monitoring/dashboards/atlas-nodes.json @@ -0,0 +1,395 @@ +{ + "uid": "atlas-nodes", + "title": "Atlas Nodes", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Worker Nodes Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto", + "valueSuffix": "/18" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "Control Plane Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 0 + }, + "targets": [ + { + "expr": 
"sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto", + "valueSuffix": "/3" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "timeseries", + "title": "Node CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "avg by (node) (((1 - avg by (instance) 
(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Node RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 13 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 6, + "type": "timeseries", + "title": "Control Plane (incl. 
titan-db) CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 22 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 7, + "type": "timeseries", + "title": "Control Plane (incl. titan-db) RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 22 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 8, + "type": "timeseries", + "title": "Root Filesystem Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 31 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) 
((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "nodes" + ] +} diff --git a/services/monitoring/dashboards/atlas-overview.json b/services/monitoring/dashboards/atlas-overview.json new file mode 100644 index 0000000..9eda81d --- /dev/null +++ b/services/monitoring/dashboards/atlas-overview.json @@ -0,0 +1,1532 @@ +{ + "uid": "atlas-overview", + "title": "Atlas Overview", + "folderUid": "overview", + "editable": false, + "annotations": { + "list": [] + }, + "panels": [ + { + "id": 1, + "type": "gauge", + "title": "Workers Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 18, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 16 + }, + { + "color": "yellow", + "value": 17 + }, + { + "color": "green", + "value": 18 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { 
+ "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, + { + "id": 2, + "type": "gauge", + "title": "Control Plane Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 5, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 3, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 3 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, + { + "id": 3, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 10, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"}) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + 
"textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 4, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 15, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 5, + "type": "stat", + "title": "Stuck Terminating", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + 
"displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 7, + "type": "stat", + "title": "Hottest node: CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 8, + "type": "stat", + "title": "Hottest node: RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 6, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - 
node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 9, + "type": "stat", + "title": "Hottest node: NET (rx+tx)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 12, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((sum by (instance) (rate(node_network_receive_bytes_total{device!~\"lo\"}[5m]) + rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": 
"auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 10, + "type": "stat", + "title": "Hottest node: I/O (r+w)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 18, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((sum by (instance) (rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 23, + "type": "stat", + "title": "Astreae Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 10 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / 
sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 24, + "type": "stat", + "title": "Asteria Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 10 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + 
"id": 25, + "type": "stat", + "title": "Astreae Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 12, + "y": 10 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 26, + "type": "stat", + "title": "Asteria Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 18, + "y": 10 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": 
"/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 11, + "type": "piechart", + "title": "Namespace CPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 0, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + 
"pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 12, + "type": "piechart", + "title": "Namespace GPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 8, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or 
on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 13, + "type": "piechart", + "title": "Namespace RAM Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 16, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + 
((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 14, + "type": "timeseries", + "title": "Worker Node CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 12, + "w": 12, + "x": 0, + "y": 32 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": 
true + } + ] + }, + { + "id": 15, + "type": "timeseries", + "title": "Worker Node RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 12, + "w": 12, + "x": 12, + "y": 32 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 16, + "type": "timeseries", + "title": "Control plane CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 44 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" 
+ }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 17, + "type": "timeseries", + "title": "Control plane RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 44 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 18, + "type": "timeseries", + "title": "Cluster Ingress Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 0, + "y": 25 + }, + "targets": [ + { + "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Ingress (Traefik)" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 19, + "type": "timeseries", + "title": "Cluster Egress Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 8, + "y": 25 + }, + "targets": [ + { + "expr": 
"sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Egress (Traefik)" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 20, + "type": "timeseries", + "title": "Intra-Cluster Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 16, + "y": 25 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Internal traffic" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 21, + "type": "timeseries", + "title": "Root Filesystem Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 16, + "w": 12, + "x": 0, + "y": 54 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + 
"unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d", + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 22, + "type": "bargauge", + "title": "Nodes Closest to Full Root Disks", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 16, + "w": 12, + "x": 12, + "y": 54 + }, + "targets": [ + { + "expr": "topk(12, avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "min": 0, + "max": 100, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + } + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + } + ], + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "overview" + ], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "refresh": "1m", + "links": [ + { + "title": "Atlas Pods", + "type": "dashboard", + "dashboardUid": "atlas-pods", + "keepTime": false + }, + { + "title": "Atlas Nodes", + "type": "dashboard", + "dashboardUid": 
"atlas-nodes", + "keepTime": false + }, + { + "title": "Atlas Storage", + "type": "dashboard", + "dashboardUid": "atlas-storage", + "keepTime": false + }, + { + "title": "Atlas Network", + "type": "dashboard", + "dashboardUid": "atlas-network", + "keepTime": false + }, + { + "title": "Atlas GPU", + "type": "dashboard", + "dashboardUid": "atlas-gpu", + "keepTime": false + } + ] +} diff --git a/services/monitoring/dashboards/atlas-pods.json b/services/monitoring/dashboards/atlas-pods.json new file mode 100644 index 0000000..ef616e0 --- /dev/null +++ b/services/monitoring/dashboards/atlas-pods.json @@ -0,0 +1,377 @@ +{ + "uid": "atlas-pods", + "title": "Atlas Pods", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 6, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))", + "refId": "A" + } + ], + "fieldConfig": { + 
"defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Stuck Terminating (>10m)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + 
"thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "table", + "title": "Pods Not Running", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod) group_left(phase) max by (namespace,pod,phase) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 6, + "type": "table", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 14 + }, + "targets": [ + { + "expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod,container) group_left(reason) max by (namespace,pod,container,reason) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 7, + "type": "table", + "title": "Terminating >10m", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, 
+ "x": 0, + "y": 24 + }, + "targets": [ + { + "expr": "(((time() - kube_pod_deletion_timestamp{pod!=\"\"}) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)) * on(namespace,pod) group_left(node) kube_pod_info)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + }, + { + "id": "filterByValue", + "options": { + "match": "Value", + "operator": "gt", + "value": 600 + } + } + ] + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "pods" + ] +} diff --git a/services/monitoring/dashboards/atlas-storage.json b/services/monitoring/dashboards/atlas-storage.json new file mode 100644 index 0000000..1d07040 --- /dev/null +++ b/services/monitoring/dashboards/atlas-storage.json @@ -0,0 +1,419 @@ +{ + "uid": "atlas-storage", + "title": "Atlas Storage", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Astreae Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + 
"justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "Asteria Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 6, + "y": 0 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Astreae Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + 
"lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Asteria Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Astreae Per-Node Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 5 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + }, + { + 
"id": 6, + "type": "timeseries", + "title": "Asteria Per-Node Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 5 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + }, + { + "id": 7, + "type": "timeseries", + "title": "Astreae Usage History", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 14 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "90d" + }, + { + "id": 8, + "type": "timeseries", + "title": "Asteria Usage History", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 14 + }, + "targets": [ + { + "expr": "100 - 
(sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "90d" + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "storage" + ] +} diff --git a/services/monitoring/dcgm-exporter.yaml b/services/monitoring/dcgm-exporter.yaml new file mode 100644 index 0000000..06152e7 --- /dev/null +++ b/services/monitoring/dcgm-exporter.yaml @@ -0,0 +1,80 @@ +# services/monitoring/dcgm-exporter.yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: dcgm-exporter + namespace: monitoring + labels: + app: dcgm-exporter +spec: + selector: + matchLabels: + app: dcgm-exporter + updateStrategy: + rollingUpdate: + maxUnavailable: 2 + template: + metadata: + labels: + app: dcgm-exporter + annotations: + prometheus.io/scrape: "true" + prometheus.io/port: "9400" + spec: + serviceAccountName: default + runtimeClassName: nvidia + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + - titan-20 + - titan-21 + - titan-22 + - titan-24 + tolerations: + - operator: Exists + containers: + - name: dcgm-exporter + image: registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 + imagePullPolicy: Always + ports: + - name: metrics + containerPort: 9400 + env: + - name: DCGM_EXPORTER_KUBERNETES + value: "true" + securityContext: + privileged: true + resources: + requests: + cpu: 50m + memory: 64Mi + volumeMounts: + - name: pod-resources + mountPath: /var/lib/kubelet/pod-resources + 
imagePullSecrets: + - name: zot-regcred + volumes: + - name: pod-resources + hostPath: + path: /var/lib/kubelet/pod-resources + type: Directory +--- +apiVersion: v1 +kind: Service +metadata: + name: dcgm-exporter + namespace: monitoring + labels: + app: dcgm-exporter +spec: + selector: + app: dcgm-exporter + ports: + - name: metrics + port: 9400 + targetPort: metrics diff --git a/services/monitoring/grafana-dashboard-gpu.yaml b/services/monitoring/grafana-dashboard-gpu.yaml new file mode 100644 index 0000000..3af8717 --- /dev/null +++ b/services/monitoring/grafana-dashboard-gpu.yaml @@ -0,0 +1,193 @@ +# services/monitoring/grafana-dashboard-gpu.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-gpu + labels: + grafana_dashboard: "1" +data: + atlas-gpu.json: | + { + "uid": "atlas-gpu", + "title": "Atlas GPU", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "piechart", + "title": "Namespace GPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "100 * ( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by 
(namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 2, + "type": "timeseries", + "title": "GPU Util by Namespace", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}) by (namespace)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 3, + "type": "timeseries", + "title": "GPU Util by Node", + 
"datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "targets": [ + { + "expr": "sum by (Hostname) (DCGM_FI_DEV_GPU_UTIL{pod!=\"\"})", + "refId": "A", + "legendFormat": "{{Hostname}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 4, + "type": "table", + "title": "Top Pods by GPU Util", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "targets": [ + { + "expr": "topk(10, sum(DCGM_FI_DEV_GPU_UTIL{pod!=\"\"}) by (namespace,pod,Hostname))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "gpu" + ] + } diff --git a/services/monitoring/grafana-dashboard-network.yaml b/services/monitoring/grafana-dashboard-network.yaml new file mode 100644 index 0000000..fd1f5d6 --- /dev/null +++ b/services/monitoring/grafana-dashboard-network.yaml @@ -0,0 +1,454 @@ +# services/monitoring/grafana-dashboard-network.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-network + labels: + grafana_dashboard: "1" +data: + atlas-network.json: | + { + "uid": "atlas-network", + "title": "Atlas Network", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Ingress Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": 
"sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "Egress Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 0 + }, + "targets": [ + { + "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Intra-Cluster Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 0 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + 
rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Top Router req/s", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))", + "refId": "A", + "legendFormat": "{{router}}" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "req/s", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Per-Node Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 8 + }, + "targets": [ + { + "expr": "avg by (node) ((sum by (instance) (rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m]) +
rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 6, + "type": "table", + "title": "Top Namespaces", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 16 + }, + "targets": [ + { + "expr": "topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=\"\"}[5m]) + rate(container_network_receive_bytes_total{namespace!=\"\"}[5m])) by (namespace))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 7, + "type": "table", + "title": "Top Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 16 + }, + "targets": [ + { + "expr": "topk(10, sum(rate(container_network_transmit_bytes_total{pod!=\"\"}[5m]) + rate(container_network_receive_bytes_total{pod!=\"\"}[5m])) by (namespace,pod))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 8, + "type": "timeseries", + "title": "Traefik Routers (req/s)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 25 + }, + "targets": [ + { + "expr": "topk(10, sum by (router)
(rate(traefik_router_requests_total[5m])))", + "refId": "A", + "legendFormat": "{{router}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "req/s" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 9, + "type": "timeseries", + "title": "Traefik Entrypoints (req/s)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 25 + }, + "targets": [ + { + "expr": "sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))", + "refId": "A", + "legendFormat": "{{entrypoint}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "req/s" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "network" + ] + } diff --git a/services/monitoring/grafana-dashboard-nodes.yaml b/services/monitoring/grafana-dashboard-nodes.yaml new file mode 100644 index 0000000..2facfed --- /dev/null +++ b/services/monitoring/grafana-dashboard-nodes.yaml @@ -0,0 +1,404 @@ +# services/monitoring/grafana-dashboard-nodes.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-nodes + labels: + grafana_dashboard: "1" +data: + atlas-nodes.json: | + { + "uid": "atlas-nodes", + "title": "Atlas Nodes", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Worker Nodes Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": 
"sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto", + "valueSuffix": "/18" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "Control Plane Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto", + "valueSuffix": "/3" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 0 + }, 
+ "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"}) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "timeseries", + "title": "Node CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Node RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 13 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A",
+ "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 6, + "type": "timeseries", + "title": "Control Plane (incl. titan-db) CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 22 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 7, + "type": "timeseries", + "title": "Control Plane (incl. 
titan-db) RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 22 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 8, + "type": "timeseries", + "title": "Root Filesystem Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 31 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "nodes" + ] + } diff --git a/services/monitoring/grafana-dashboard-overview.yaml b/services/monitoring/grafana-dashboard-overview.yaml new file mode 
100644 index 0000000..928098e --- /dev/null +++ b/services/monitoring/grafana-dashboard-overview.yaml @@ -0,0 +1,1541 @@ +# services/monitoring/grafana-dashboard-overview.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-overview + labels: + grafana_dashboard: "1" +data: + atlas-overview.json: | + { + "uid": "atlas-overview", + "title": "Atlas Overview", + "folderUid": "overview", + "editable": false, + "annotations": { + "list": [] + }, + "panels": [ + { + "id": 1, + "type": "gauge", + "title": "Workers Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 18, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 16 + }, + { + "color": "yellow", + "value": 17 + }, + { + "color": "green", + "value": 18 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, + { + "id": 2, + "type": "gauge", + "title": "Control Plane Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 5, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 3, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": 
null + }, + { + "color": "green", + "value": 3 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, + { + "id": 3, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 10, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"}) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 4, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 5, + "x": 15, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + 
"color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 5, + "type": "stat", + "title": "Stuck Terminating", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 7, + "type": "stat", + "title": "Hottest node: CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) (((1 - avg by (instance) 
(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 8, + "type": "stat", + "title": "Hottest node: RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 6, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, 
+ "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 9, + "type": "stat", + "title": "Hottest node: NET (rx+tx)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 12, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((sum by (instance) (rate(node_network_receive_bytes_total{device!~\"lo\"}[5m]) + rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 10, + "type": "stat", + "title": "Hottest node: I/O (r+w)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 18, + "y": 5 + }, + "targets": [ + { + "expr": "label_replace(topk(1, avg by (node) ((sum by (instance) 
(rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m]))) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))), \"__name__\", \"$1\", \"node\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "name_and_value" + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 23, + "type": "stat", + "title": "Astreae Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 10 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + 
"values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 24, + "type": "stat", + "title": "Asteria Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 10 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 25, + "type": "stat", + "title": "Astreae Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 12, + "y": 10 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + 
"overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 26, + "type": "stat", + "title": "Asteria Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 18, + "y": 10 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 11, + "type": "piechart", + "title": "Namespace CPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 0, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + 
((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 12, + "type": "piechart", + "title": "Namespace GPU Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 8, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) 
(sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[1h]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { 
+ "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 13, + "type": "piechart", + "title": "Namespace RAM Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 16, + "y": 16 + }, + "targets": [ + { + "expr": "100 * ( ( sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" 
+ }, + "pieType": "pie", + "displayLabels": [ + "percent" + ], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 14, + "type": "timeseries", + "title": "Worker Node CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 12, + "w": 12, + "x": 0, + "y": 32 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 15, + "type": "timeseries", + "title": "Worker Node RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 12, + "w": 12, + "x": 12, + "y": 32 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() 
label_replace(node_uname_info{nodename=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-nodes dashboard", + "url": "/d/atlas-nodes", + "targetBlank": true + } + ] + }, + { + "id": 16, + "type": "timeseries", + "title": "Control plane CPU", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 44 + }, + "targets": [ + { + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 17, + "type": "timeseries", + "title": "Control plane RAM", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 44 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", 
\"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + } + }, + { + "id": 18, + "type": "timeseries", + "title": "Cluster Ingress Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 0, + "y": 25 + }, + "targets": [ + { + "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Ingress (Traefik)" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 19, + "type": "timeseries", + "title": "Cluster Egress Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 8, + "y": 25 + }, + "targets": [ + { + "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Egress (Traefik)" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 20, + "type": "timeseries", + 
"title": "Intra-Cluster Throughput", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 7, + "w": 8, + "x": 16, + "y": 25 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A", + "legendFormat": "Internal traffic" + } + ], + "fieldConfig": { + "defaults": { + "unit": "Bps" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "links": [ + { + "title": "Open atlas-network dashboard", + "url": "/d/atlas-network", + "targetBlank": true + } + ] + }, + { + "id": 21, + "type": "timeseries", + "title": "Root Filesystem Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 16, + "w": 12, + "x": 0, + "y": 54 + }, + "targets": [ + { + "expr": "avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right", + "calcs": [ + "last" + ] + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d", + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + }, + { + "id": 22, + "type": "bargauge", + "title": "Nodes Closest to Full Root Disks", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 16, + "w": 12, + "x": 12, + "y": 54 + }, + "targets": [ + { 
+ "expr": "topk(12, avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")))", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "min": 0, + "max": 100, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + } + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "links": [ + { + "title": "Open atlas-storage dashboard", + "url": "/d/atlas-storage", + "targetBlank": true + } + ] + } + ], + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "overview" + ], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "refresh": "1m", + "links": [ + { + "title": "Atlas Pods", + "type": "dashboard", + "dashboardUid": "atlas-pods", + "keepTime": false + }, + { + "title": "Atlas Nodes", + "type": "dashboard", + "dashboardUid": "atlas-nodes", + "keepTime": false + }, + { + "title": "Atlas Storage", + "type": "dashboard", + "dashboardUid": "atlas-storage", + "keepTime": false + }, + { + "title": "Atlas Network", + "type": "dashboard", + "dashboardUid": "atlas-network", + "keepTime": false + }, + { + "title": "Atlas GPU", + "type": "dashboard", + "dashboardUid": "atlas-gpu", + "keepTime": false + } + ] + } diff --git a/services/monitoring/grafana-dashboard-pods.yaml b/services/monitoring/grafana-dashboard-pods.yaml new file mode 100644 index 0000000..f92adf1 --- /dev/null +++ 
b/services/monitoring/grafana-dashboard-pods.yaml @@ -0,0 +1,386 @@ +# services/monitoring/grafana-dashboard-pods.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-pods + labels: + grafana_dashboard: "1" +data: + atlas-pods.json: | + { + "uid": "atlas-pods", + "title": "Atlas Pods", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 2, + "type": "stat", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 6, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + 
"graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Stuck Terminating (>10m)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Control Plane Workloads", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": 
[ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "table", + "title": "Pods Not Running", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod) group_left(phase) max by (namespace,pod,phase) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 6, + "type": "table", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 14 + }, + "targets": [ + { + "expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod,container) group_left(reason) max by (namespace,pod,container,reason) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + "showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + } + ] + }, + { + "id": 7, + "type": "table", + "title": "Terminating >10m", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 24 + }, + "targets": [ + { + "expr": "(((time() - kube_pod_deletion_timestamp{pod!=\"\"}) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)) * on(namespace,pod) group_left(node) kube_pod_info)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s" + }, + "overrides": [] + }, + "options": { + 
"showHeader": true + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + }, + { + "id": "filterByValue", + "options": { + "match": "Value", + "operator": "gt", + "value": 600 + } + } + ] + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "pods" + ] + } diff --git a/services/monitoring/grafana-dashboard-storage.yaml b/services/monitoring/grafana-dashboard-storage.yaml new file mode 100644 index 0000000..0a534f2 --- /dev/null +++ b/services/monitoring/grafana-dashboard-storage.yaml @@ -0,0 +1,428 @@ +# services/monitoring/grafana-dashboard-storage.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-dashboard-storage + labels: + grafana_dashboard: "1" +data: + atlas-storage.json: | + { + "uid": "atlas-storage", + "title": "Atlas Storage", + "folderUid": "atlas-internal", + "editable": true, + "panels": [ + { + "id": 1, + "type": "stat", + "title": "Astreae Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 0, + "y": 0 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + 
"id": 2, + "type": "stat", + "title": "Asteria Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 6, + "y": 0 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "percent", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 3, + "type": "stat", + "title": "Astreae Free", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Asteria Free", + 
"datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "decbytes", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "timeseries", + "title": "Astreae Per-Node Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 5 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + }, + { + "id": 6, + "type": "timeseries", + "title": "Asteria Per-Node Usage", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + 
"h": 9, + "w": 12, + "x": 12, + "y": 5 + }, + "targets": [ + { + "expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "refId": "A", + "legendFormat": "{{node}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "right" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "30d" + }, + { + "id": 7, + "type": "timeseries", + "title": "Astreae Usage History", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 14 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "90d" + }, + { + "id": 8, + "type": "timeseries", + "title": "Asteria Usage History", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 14 + }, + "targets": [ + { + "expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)", + "refId": "A" + } + ], + "fieldConfig": { + 
"defaults": { + "unit": "percent" + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "table", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi" + } + }, + "timeFrom": "90d" + } + ], + "time": { + "from": "now-12h", + "to": "now" + }, + "annotations": { + "list": [] + }, + "schemaVersion": 39, + "style": "dark", + "tags": [ + "atlas", + "storage" + ] + } diff --git a/services/monitoring/grafana-folders.yaml b/services/monitoring/grafana-folders.yaml new file mode 100644 index 0000000..54b278f --- /dev/null +++ b/services/monitoring/grafana-folders.yaml @@ -0,0 +1,35 @@ +# services/monitoring/grafana-folders.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: grafana-folders + labels: + app.kubernetes.io/name: grafana + app.kubernetes.io/component: folders +data: + folders.yaml: | + apiVersion: 1 + folders: + - uid: overview + title: Overview + permissions: + - role: Viewer + permission: View + - role: Editor + permission: Edit + - role: Admin + permission: Admin + - uid: atlas-internal + title: Atlas Internal + permissions: + - role: Editor + permission: View + - role: Admin + permission: Admin + - uid: oceanus-internal + title: Oceanus Internal + permissions: + - role: Editor + permission: View + - role: Admin + permission: Admin diff --git a/services/monitoring/helmrelease.yaml b/services/monitoring/helmrelease.yaml index 22bc2b1..2546dc1 100644 --- a/services/monitoring/helmrelease.yaml +++ b/services/monitoring/helmrelease.yaml @@ -71,8 +71,7 @@ spec: persistentVolume: enabled: true - size: 100Gi # adjust; uses default StorageClass (Longhorn) - # storageClassName: "" # set if you want a specific class + size: 100Gi # Enable built-in Kubernetes scraping scrape: @@ -210,3 +209,187 @@ spec: - action: keep source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_part_of] regex: flux-system;flux + - job_name: "titan-db" + static_configs: + - targets: ["titan-db:9100"] + relabel_configs: + - 
source_labels: [__address__] + target_label: instance + metric_relabel_configs: + - source_labels: [instance] + target_label: node + replacement: titan-db + +--- + +apiVersion: helm.toolkit.fluxcd.io/v2 +kind: HelmRelease +metadata: + name: grafana + namespace: monitoring +spec: + interval: 15m + chart: + spec: + chart: grafana + version: "~8.5.0" + sourceRef: + kind: HelmRepository + name: grafana + namespace: flux-system + values: + admin: + existingSecret: grafana-admin + userKey: admin-user + passwordKey: admin-password + persistence: + enabled: true + size: 20Gi + storageClassName: astreae + service: + type: ClusterIP + env: + GF_AUTH_ANONYMOUS_ENABLED: "true" + GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer + GF_SECURITY_ALLOW_EMBEDDING: "true" + grafana.ini: + server: + domain: metrics.bstein.dev + root_url: https://metrics.bstein.dev/ + dashboards: + default_home_dashboard_path: /var/lib/grafana/dashboards/overview/atlas-overview.json + auth.anonymous: + hide_version: true + users: + default_theme: dark + ingress: + enabled: true + ingressClassName: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt + hosts: + - metrics.bstein.dev + path: / + tls: + - secretName: grafana-metrics-tls + hosts: + - metrics.bstein.dev + datasources: + datasources.yaml: + apiVersion: 1 + datasources: + - name: VictoriaMetrics + type: prometheus + access: proxy + url: http://victoria-metrics-single-server:8428 + isDefault: true + jsonData: + timeInterval: "15s" + uid: atlas-vm + dashboardProviders: + dashboardproviders.yaml: + apiVersion: 1 + providers: + - name: overview + orgId: 1 + folder: Overview + type: file + disableDeletion: false + editable: false + options: + path: /var/lib/grafana/dashboards/overview + - name: pods + orgId: 1 + folder: Atlas Internal + type: file + disableDeletion: false + editable: true + options: + path: /var/lib/grafana/dashboards/pods + - name: nodes + orgId: 1 + folder: Atlas Internal + type: file + disableDeletion: false + editable: true + 
options: + path: /var/lib/grafana/dashboards/nodes + - name: storage + orgId: 1 + folder: Atlas Internal + type: file + disableDeletion: false + editable: true + options: + path: /var/lib/grafana/dashboards/storage + - name: gpu + orgId: 1 + folder: Atlas Internal + type: file + disableDeletion: false + editable: true + options: + path: /var/lib/grafana/dashboards/gpu + - name: network + orgId: 1 + folder: Atlas Internal + type: file + disableDeletion: false + editable: true + options: + path: /var/lib/grafana/dashboards/network + dashboardsConfigMaps: + overview: grafana-dashboard-overview + pods: grafana-dashboard-pods + nodes: grafana-dashboard-nodes + storage: grafana-dashboard-storage + gpu: grafana-dashboard-gpu + network: grafana-dashboard-network + extraConfigmapMounts: + - name: grafana-folders + mountPath: /etc/grafana/provisioning/folders + configMap: grafana-folders + readOnly: true + +--- + +apiVersion: helm.toolkit.fluxcd.io/v2 +kind: HelmRelease +metadata: + name: alertmanager + namespace: monitoring +spec: + interval: 15m + chart: + spec: + chart: alertmanager + version: "~1.9.0" + sourceRef: + kind: HelmRepository + name: prometheus + namespace: flux-system + values: + ingress: + enabled: true + ingressClassName: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt + hosts: + - host: alerts.bstein.dev + paths: + - path: / + pathType: Prefix + tls: + - secretName: alerts-bstein-dev-tls + hosts: + - alerts.bstein.dev + config: + global: + resolve_timeout: 5m + route: + receiver: default + group_wait: 30s + group_interval: 5m + repeat_interval: 2h + receivers: + - name: default diff --git a/services/monitoring/kustomization.yaml b/services/monitoring/kustomization.yaml index 036afa3..a50a1c1 100644 --- a/services/monitoring/kustomization.yaml +++ b/services/monitoring/kustomization.yaml @@ -5,4 +5,12 @@ namespace: monitoring resources: - namespace.yaml - rbac.yaml + - grafana-dashboard-overview.yaml + - grafana-dashboard-pods.yaml + - 
grafana-dashboard-nodes.yaml + - grafana-dashboard-storage.yaml + - grafana-dashboard-network.yaml + - grafana-dashboard-gpu.yaml + - dcgm-exporter.yaml + - grafana-folders.yaml - helmrelease.yaml