diff --git a/.gitignore b/.gitignore
index c317064..9bdfcf6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,5 @@
-AGENTS.md
+# Ignore markdown by default, but keep top-level docs
+*.md
+!/README.md
+!/AGENTS.md
+!**/NOTES.md
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..9dc36ac
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,81 @@
+
+
+# Repository Guidelines
+
+> Local-only note: apply changes through Flux-tracked manifests, not by manual kubectl edits in-cluster—manual tweaks will be reverted by Flux.
+
+## Project Structure & Module Organization
+- `infrastructure/`: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout.
+- `services/`: workload manifests per app (`services/gitea/`, etc.) with `kustomization.yaml` plus one file per kind; keep diffs small and focused.
+- `dockerfiles/` hosts bespoke images, while `scripts/` stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands.
+
+## Build, Test, and Development Commands
+- `kustomize build services/<app>` (or `kubectl kustomize ...`) renders manifests exactly as Flux will.
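+- `kubectl apply --dry-run=client -k services/<app>` renders and validates the manifests without changing the cluster; for API-server validation use `--server-side --dry-run=server`, because kubectl rejects `--server-side` combined with `--dry-run=client`.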
+- `flux reconcile kustomization <name> --namespace flux-system --with-source` pulls the latest Git state after merges or hotfixes.
+- `fish scripts/flux_hammer.fish --help` explains the recovery tool; read it before running against production workloads.
+
+## Coding Style & Naming Conventions
+- YAML uses two-space indents; retain the leading path comment (e.g. `# services/gitea/deployment.yaml`) to speed code review.
+- Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names.
+- List resources in `kustomization.yaml` from namespace/config, through storage, then workloads and networking for predictable diffs.
+- Scripts start with `#!/usr/bin/env fish` or bash, stay executable, and follow snake_case names such as `flux_hammer.fish`.
+
+## Testing Guidelines
+- Run `kustomize build` and the dry-run apply for every service you touch; capture failures before opening a PR.
+- `flux diff kustomization <name> --path services/<app>` previews reconciliations—link notable output when behavior shifts.
+- Docker edits: `docker build -f dockerfiles/Dockerfile.monerod .` (swap the file you changed) to verify image builds.
+
+## Commit & Pull Request Guidelines
+- Keep commit subjects short, present-tense, and optionally scoped (`gpu(titan-24): add RuntimeClass`); squash fixups before review.
+- Describe linked issues, affected services, and required operator steps (e.g. `flux reconcile kustomization services-gitea`) in the PR body.
+- Focus each PR on one kustomization or service and update `infrastructure/flux-system` when Flux must track new folders.
+- Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes.
+
+## Security & Configuration Tips
+- Never commit credentials; use Vault workflows (`services/vault/`) or SOPS-encrypted manifests wired through `infrastructure/flux-system`.
+- Node selectors and tolerations gate workloads to hardware like `hardware: rpi4`; confirm labels before scaling or renaming nodes.
+- Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift.
+
+## Dashboard roadmap / context (2025-12-02)
+- Atlas dashboards are generated via `scripts/dashboards_render_atlas.py --build`, which writes JSON under `services/monitoring/dashboards/` and ConfigMaps under `services/monitoring/`. Keep the Grafana manifests in sync by regenerating after edits.
+- Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie.
+- Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned.
+- Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview.
+
+## Monitoring state (2025-12-03)
+- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady.
+- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub).
+- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity.
+- Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed.
+- GPU share panel updated (feature/sso) to use `max_over_time(…[$__range])`, so longer ranges (e.g., 12h) keep recent activity visible. Flux tracking `feature/sso`.
+
+## Upcoming priorities (SSO/storage/mail)
+- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe.
+- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress.
+- Plan mail: mostly self-hosted, relay through trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows.
+
+## SSO plan sketch (2025-12-03)
+- IdP: use Keycloak (preferred) in a new `sso` namespace, Bitnami or codecentric chart with Postgres backing store (single PVC), ingress `sso.bstein.dev`, admin user bound to brad@bstein.dev; stick with local DB initially (no external IdP).
+- Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via Jellyfin OAuth/OpenID plugin (map existing usernames; run migration to pre-create users in Keycloak with same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens.
+- Steps to implement:
+  1) Add service folder `services/keycloak/` (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile.
+  2) Seed realm `atlas` with users (import CSV/realm). Create client for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set email for brad@bstein.dev as admin.
+  3) Reconfigure Grafana to OIDC (disable anonymous to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini).
+  4) Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth.
+  5) Deploy Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate password reset path.
+- Migration caution: do not delete existing local creds until SSO validated; keep Pegasus working via Jellyfin tokens during transition.
+
+## Postgres centralization (2025-12-03)
+- Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes. Use it for services that can easily point at an external DB.
+- Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported.
+
+## SSO integration snapshot (2025-12-08)
+- Current blockers: Zot still prompts for basic auth/double-login; Vault still wants the token UI after Keycloak (previously 502/404 when vault-0 sealed). Forward-auth middleware on Zot Ingress likely still causing the 401/Found hop; Vault OIDC mount not completing UI flow unless unsealed and preferred login is set.
+- Flux-only changes required: remove zot forward-auth middleware from Ingress (let oauth2-proxy handle redirect), ensure Vault OIDC mount is preferred UI login and bound to admin group; keep all edits in repo so Flux enforces them.
+- Secrets present (per user): `zot-oidc-client` (client_secret only), `oauth2-proxy-zot-oidc`, `oauth2-proxy-vault-oidc`, `vault-oidc-admin-token`. Zot needs its regcred in the zot namespace if image pulls fail.
+- Cluster validation blocked here: `kubectl get nodes` fails (403/permission) and DNS to `*.bstein.dev` fails in this session, so no live curl verification could be run. Re-test on a host with cluster/DNS access after Flux applies fixes.
+
+## Docs hygiene
+- Do not add per-service `README.md` files; use `NOTES.md` if documentation is needed inside service folders. Keep only the top-level repo README.
+- Keep comments succinct and in a human voice—no AI-sounding notes. Use `NOTES.md` for scratch notes instead of sprinkling reminders into code or extra READMEs.
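The Testing Guidelines in `AGENTS.md` above ask for a `kustomize build` on every service a change touches. A minimal sketch of how that list could be derived from changed paths is shown here. The helper names `touched_services` and `validation_commands`, and the example file names, are hypothetical and not part of the repository; the sketch only builds the commands and does not run them.

```python
from pathlib import PurePosixPath


def touched_services(changed_paths):
    """Return the sorted set of services/<app> directories touched by a change."""
    services = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only paths shaped like services/<app>/<file> count as a service change.
        if len(parts) >= 3 and parts[0] == "services":
            services.add(f"services/{parts[1]}")
    return sorted(services)


def validation_commands(service_dir):
    """Build (but do not execute) the render command named in the guidelines."""
    return [["kustomize", "build", service_dir]]


if __name__ == "__main__":
    # Hypothetical change set, for illustration only.
    changed = [
        "services/mailu/helmrelease.yaml",
        "services/gitops-ui/kustomization.yaml",
        "docs/topology.md",
    ]
    for service in touched_services(changed):
        for argv in validation_commands(service):
            print(" ".join(argv))
```

The output could be fed to a shell loop or a pre-PR check; paths outside `services/` are ignored on purpose.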
diff --git a/NOTES.md b/NOTES.md
new file mode 100644
index 0000000..8b8b8d2
--- /dev/null
+++ b/NOTES.md
@@ -0,0 +1,3 @@
+# Rotation reminders (temporary secrets set by automation)
+
+- Weave GitOps UI (`cd.bstein.dev`) admin: `admin` / `G1tOps!2025` — rotate immediately after first login.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..15dc377
--- /dev/null
+++ b/README.md
@@ -0,0 +1,3 @@
+# titan-iac
+
+Flux-managed Kubernetes cluster for bstein.dev services.
diff --git a/clusters/atlas/flux-system/applications/kustomization.yaml b/clusters/atlas/flux-system/applications/kustomization.yaml
index 1bc2700..daf1c42 100644
--- a/clusters/atlas/flux-system/applications/kustomization.yaml
+++ b/clusters/atlas/flux-system/applications/kustomization.yaml
@@ -15,3 +15,4 @@ resources:
   - sui-metrics/kustomization.yaml
   - keycloak/kustomization.yaml
   - oauth2-proxy/kustomization.yaml
+  - mailu/kustomization.yaml
diff --git a/clusters/atlas/flux-system/applications/mailu/kustomization.yaml b/clusters/atlas/flux-system/applications/mailu/kustomization.yaml
new file mode 100644
index 0000000..09db2fd
--- /dev/null
+++ b/clusters/atlas/flux-system/applications/mailu/kustomization.yaml
@@ -0,0 +1,18 @@
+# clusters/atlas/flux-system/applications/mailu/kustomization.yaml
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: mailu
+  namespace: flux-system
+spec:
+  interval: 10m
+  sourceRef:
+    kind: GitRepository
+    name: flux-system
+    namespace: flux-system
+  path: ./services/mailu
+  targetNamespace: mailu-mailserver
+  prune: true
+  wait: true
+  dependsOn:
+    - name: helm
diff --git a/clusters/atlas/flux-system/gotk-sync.yaml b/clusters/atlas/flux-system/gotk-sync.yaml
index 4076ef6..26dc23f 100644
--- a/clusters/atlas/flux-system/gotk-sync.yaml
+++ b/clusters/atlas/flux-system/gotk-sync.yaml
@@ -8,7 +8,7 @@ metadata:
 spec:
   interval: 1m0s
   ref:
-    branch: feature/sso
+    branch: feature/mailu
   secretRef:
     name: flux-system-gitea
   url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git
diff --git a/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml b/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml
new file mode 100644
index 0000000..9241a4b
--- /dev/null
+++ b/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml
@@ -0,0 +1,20 @@
+# clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: gitops-ui
+  namespace: flux-system
+spec:
+  interval: 10m
+  timeout: 10m
+  path: ./services/gitops-ui
+  prune: true
+  sourceRef:
+    kind: GitRepository
+    name: flux-system
+    namespace: flux-system
+  targetNamespace: flux-system
+  dependsOn:
+    - name: helm
+    - name: traefik
+  wait: true
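The two Flux Kustomizations added above declare `dependsOn` entries (`mailu` on `helm`; `gitops-ui` on `helm` and `traefik`). As a toy model only, the ordering constraint can be pictured as a topological sort. Flux itself waits for dependencies to become ready rather than computing a static order, and the assumption that `helm` and `traefik` have no dependencies of their own is made purely for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges copied from the dependsOn lists in this change.
# Assumption: "helm" and "traefik" depend on nothing (not confirmed by the diff).
DEPENDS_ON = {
    "mailu": {"helm"},
    "gitops-ui": {"helm", "traefik"},
    "helm": set(),
    "traefik": set(),
}


def reconcile_order(graph):
    """Return one valid order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(reconcile_order(DEPENDS_ON))
```

A cycle in the graph would raise `graphlib.CycleError`, which mirrors why circular `dependsOn` chains need to be avoided.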
diff --git a/clusters/atlas/flux-system/platform/kustomization.yaml b/clusters/atlas/flux-system/platform/kustomization.yaml
index 59d2032..040e478 100644
--- a/clusters/atlas/flux-system/platform/kustomization.yaml
+++ b/clusters/atlas/flux-system/platform/kustomization.yaml
@@ -5,5 +5,6 @@ resources:
   - core/kustomization.yaml
   - helm/kustomization.yaml
   - traefik/kustomization.yaml
+  - gitops-ui/kustomization.yaml
   - monitoring/kustomization.yaml
   - longhorn-ui/kustomization.yaml
diff --git a/clusters/oceanus/README.md b/clusters/oceanus/README.md
deleted file mode 100644
index d91b52f..0000000
--- a/clusters/oceanus/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Oceanus Cluster Scaffold
-
-This directory prepares the Flux and Kustomize layout for a future Oceanus-managed cluster.
-Populate `flux-system/` with `gotk-components.yaml` and related manifests after running `flux bootstrap`.
-Define node-specific resources under `infrastructure/modules/profiles/oceanus-validator/` and reference workloads in `applications/` as they come online.
diff --git a/docs/topology.md b/docs/topology.md
index 27b06f5..1e37235 100644
--- a/docs/topology.md
+++ b/docs/topology.md
@@ -2,15 +2,14 @@
 
 | Hostname   | Role / Function                | Managed By          | Notes |
 |------------|--------------------------------|---------------------|-------|
+| titan-db   | HA control plane database      | Ansible             | PostgreSQL / etcd backing services |
 | titan-0a   | Kubernetes control-plane       | Flux (atlas cluster)| HA leader, tainted for control only |
 | titan-0b   | Kubernetes control-plane       | Flux (atlas cluster)| Standby control node |
 | titan-0c   | Kubernetes control-plane       | Flux (atlas cluster)| Standby control node |
 | titan-04-19| Raspberry Pi workers           | Flux (atlas cluster)| Workload nodes, labelled per hardware |
+| titan-20&21| NVIDIA Jetson workers          | Flux (atlas cluster)| Workload nodes, labelled per hardware |
 | titan-22   | GPU mini-PC (Jellyfin)         | Flux + Ansible      | NVIDIA runtime managed via `modules/profiles/atlas-ha` |
+| titan-23   | Dedicated SUI validator Oceanus| Manual + Ansible    | Baremetal validator workloads, exposes metrics to atlas |
 | titan-24   | Tethys hybrid node             | Flux + Ansible      | Runs SUI metrics via K8s, validator via Ansible |
-| titan-db   | HA control plane database      | Ansible             | PostgreSQL / etcd backing services |
-| titan-jh   | Jumphost & bastion             | Ansible             | Entry point / future KVM services |
-| oceanus    | Dedicated SUI validator host   | Ansible / Flux prep | Baremetal validator workloads, exposes metrics to atlas; Kustomize scaffold under `clusters/oceanus/` |
+| titan-jh   | Jumphost & bastion & lesavka   | Ansible             | Entry point / future KVM services / custom KVM - lesavka |
 | styx       | Air-gapped workstation         | Manual / Scripts    | Remains isolated, scripts tracked in `hosts/styx` |
-
-Use the `clusters/` directory for cluster-scoped state and the `hosts/` directory for baremetal orchestration.
diff --git a/hosts/styx/README.md b/hosts/styx/NOTES.md
similarity index 100%
rename from hosts/styx/README.md
rename to hosts/styx/NOTES.md
diff --git a/infrastructure/core/kustomization.yaml b/infrastructure/core/kustomization.yaml
index 1f56f6d..14d6a02 100644
--- a/infrastructure/core/kustomization.yaml
+++ b/infrastructure/core/kustomization.yaml
@@ -5,3 +5,4 @@ resources:
   - ../modules/base
   - ../modules/profiles/atlas-ha
   - ../sources/cert-manager/letsencrypt.yaml
+  - ../sources/cert-manager/letsencrypt-prod.yaml
diff --git a/infrastructure/sources/cert-manager/letsencrypt-prod.yaml b/infrastructure/sources/cert-manager/letsencrypt-prod.yaml
new file mode 100644
index 0000000..65bf316
--- /dev/null
+++ b/infrastructure/sources/cert-manager/letsencrypt-prod.yaml
@@ -0,0 +1,14 @@
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-prod
+spec:
+  acme:
+    email: brad.stein@gmail.com
+    server: https://acme-v02.api.letsencrypt.org/directory
+    privateKeySecretRef:
+      name: letsencrypt-prod-account-key
+    solvers:
+      - http01:
+          ingress:
+            class: traefik
diff --git a/infrastructure/sources/helm/kustomization.yaml b/infrastructure/sources/helm/kustomization.yaml
new file mode 100644
index 0000000..a0f55b0
--- /dev/null
+++ b/infrastructure/sources/helm/kustomization.yaml
@@ -0,0 +1,10 @@
+# infrastructure/sources/helm/kustomization.yaml
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+  - grafana.yaml
+  - hashicorp.yaml
+  - jetstack.yaml
+  - mailu.yaml
+  - prometheus.yaml
+  - victoria-metrics.yaml
diff --git a/infrastructure/sources/helm/mailu.yaml b/infrastructure/sources/helm/mailu.yaml
new file mode 100644
index 0000000..5cd56e2
--- /dev/null
+++ b/infrastructure/sources/helm/mailu.yaml
@@ -0,0 +1,9 @@
+# infrastructure/sources/helm/mailu.yaml
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: HelmRepository
+metadata:
+  name: mailu
+  namespace: flux-system
+spec:
+  interval: 1h
+  url: https://mailu.github.io/helm-charts
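The dashboard generator diff that follows builds two PromQL formulas as strings: `UPTIME_NINES_EXPR`, which is `-log10(1 - clamp_max(avg, 0.999999999))`, and `traefik_burn`, which is `(1 - SLI) / (1 - SLO)` with `SLO_AVAILABILITY = 0.999`. The sketch below evaluates the same arithmetic in plain Python to show what the panels would display. The function names `nines` and `burn_rate` are hypothetical; only the formulas and the two constants come from the script.

```python
import math

SLO_AVAILABILITY = 0.999  # same target as the generator constant
NINES_CAP = 0.999999999   # same clamp as in UPTIME_NINES_EXPR


def nines(availability, cap=NINES_CAP):
    """Number of nines for an availability fraction (0.999 gives about 3)."""
    # Clamp like PromQL clamp_max so 100% availability never reaches log10(0).
    clamped = min(availability, cap)
    return -math.log10(1 - clamped)


def burn_rate(sli, slo=SLO_AVAILABILITY):
    """Error-budget burn rate: observed error ratio over allowed error ratio."""
    return (1 - sli) / (1 - slo)


if __name__ == "__main__":
    for availability in (0.99, 0.999, 0.9999, 1.0):
        print(f"{availability}: {nines(availability):.3f} nines")
    for sli in (1.0, 0.999, 0.99):
        print(f"SLI {sli}: burn rate {burn_rate(sli):.2f}")
```

A burn rate of 1 means the error budget is being spent exactly at the pace the SLO allows; values above 1 exhaust it early. The clamp caps the nines value at roughly 9.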
diff --git a/scripts/dashboards_render_atlas.py b/scripts/dashboards_render_atlas.py
index f577eab..894edb9 100644
--- a/scripts/dashboards_render_atlas.py
+++ b/scripts/dashboards_render_atlas.py
@@ -36,11 +36,12 @@ PUBLIC_FOLDER = "overview"
 PRIVATE_FOLDER = "atlas-internal"
 
 PERCENT_THRESHOLDS = {
-    "mode": "percentage",
+    "mode": "absolute",
     "steps": [
         {"color": "green", "value": None},
-        {"color": "yellow", "value": 70},
-        {"color": "red", "value": 85},
+        {"color": "yellow", "value": 50},
+        {"color": "orange", "value": 75},
+        {"color": "red", "value": 91.5},
     ],
 }
 
@@ -81,7 +82,7 @@ CONTROL_SUFFIX = f"/{CONTROL_TOTAL}"
 WORKER_SUFFIX = f"/{WORKER_TOTAL}"
 CP_ALLOWED_NS = "kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system"
 LONGHORN_NODE_REGEX = "titan-1[2-9]|titan-2[24]"
-GAUGE_WIDTHS = [5, 5, 5, 5, 4]
+GAUGE_WIDTHS = [4, 3, 3, 4, 3, 3, 4]
 CONTROL_WORKLOADS_EXPR = (
     f'sum(kube_pod_info{{node=~"{CONTROL_REGEX}",namespace!~"{CP_ALLOWED_NS}"}}) or on() vector(0)'
 )
@@ -187,17 +188,64 @@ def namespace_gpu_share_expr():
     return namespace_share_expr(NAMESPACE_GPU_RAW)
 
 
-PROBLEM_PODS_EXPR = 'sum(max by (namespace,pod) (kube_pod_status_phase{phase!~"Running|Succeeded"}))'
+PROBLEM_PODS_EXPR = (
+    'sum(max by (namespace,pod) (kube_pod_status_phase{phase!~"Running|Succeeded"})) '
+    "or on() vector(0)"
+)
 CRASHLOOP_EXPR = (
     'sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason'
-    '{reason=~"CrashLoopBackOff|ImagePullBackOff"}))'
+    '{reason=~"CrashLoopBackOff|ImagePullBackOff"})) '
+    "or on() vector(0)"
 )
 STUCK_TERMINATING_EXPR = (
     'sum(max by (namespace,pod) ('
     '((time() - kube_pod_deletion_timestamp{pod!=""}) > bool 600)'
     ' and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=""} > bool 0)'
-    '))'
+    ')) '
+    "or on() vector(0)"
 )
+UPTIME_WINDOW = "30d"
+TRAEFIK_READY_EXPR = (
+    "("
+    'sum(kube_deployment_status_replicas_available{namespace=~"traefik|kube-system",deployment="traefik"})'
+    " / clamp_min("
+    'sum(kube_deployment_spec_replicas{namespace=~"traefik|kube-system",deployment="traefik"}), 1)'
+    ")"
+)
+CONTROL_READY_FRACTION_EXPR = (
+    f"(sum(kube_node_status_condition{{condition=\"Ready\",status=\"true\",node=~\"{CONTROL_REGEX}\"}})"
+    f" / {CONTROL_TOTAL})"
+)
+UPTIME_AVAIL_EXPR = (
+    f"min(({CONTROL_READY_FRACTION_EXPR}), ({TRAEFIK_READY_EXPR}))"
+)
+
+# Tie-breaker to deterministically pick one node per namespace when shares tie.
+NODE_TIEBREAKER = " + ".join(
+    f"({node_filter(node)}) * 1e-6 * {idx}"
+    for idx, node in enumerate(CONTROL_ALL + WORKER_NODES, start=1)
+)
+UPTIME_AVG_EXPR = f"avg_over_time(({UPTIME_AVAIL_EXPR})[{UPTIME_WINDOW}:5m])"
+UPTIME_PERCENT_EXPR = UPTIME_AVG_EXPR
+UPTIME_NINES_EXPR = f"-log10(1 - clamp_max({UPTIME_AVG_EXPR}, 0.999999999))"
+UPTIME_THRESHOLDS = {
+    "mode": "absolute",
+    "steps": [
+        {"color": "red", "value": None},
+        {"color": "orange", "value": 2},
+        {"color": "yellow", "value": 3},
+        {"color": "green", "value": 3.5},
+    ],
+}
+UPTIME_PERCENT_THRESHOLDS = {
+    "mode": "absolute",
+    "steps": [
+        {"color": "red", "value": None},
+        {"color": "orange", "value": 0.999},
+        {"color": "yellow", "value": 0.9999},
+        {"color": "green", "value": 0.99999},
+    ],
+}
 PROBLEM_TABLE_EXPR = (
     "(time() - kube_pod_created{pod!=\"\"}) "
     "* on(namespace,pod) group_left(node) kube_pod_info "
@@ -291,6 +339,34 @@ NET_INTERNAL_EXPR = (
     '+ rate(container_network_transmit_bytes_total{namespace!="traefik",pod!=""}[5m]))'
     ' or on() vector(0)'
 )
+APISERVER_5XX_RATE = 'sum(rate(apiserver_request_total{code=~"5.."}[5m]))'
+APISERVER_P99_LATENCY_MS = (
+    "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000"
+)
+ETCD_P99_LATENCY_MS = (
+    "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000"
+)
+TRAEFIK_TOTAL_5M = "sum(rate(traefik_entrypoint_requests_total[5m]))"
+TRAEFIK_SUCCESS_5M = 'sum(rate(traefik_entrypoint_requests_total{code!~"5.."}[5m]))'
+TRAEFIK_SLI_5M = f"({TRAEFIK_SUCCESS_5M}) / clamp_min({TRAEFIK_TOTAL_5M}, 1)"
+TRAEFIK_P99_LATENCY_MS = (
+    "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000"
+)
+TRAEFIK_P95_LATENCY_MS = (
+    "histogram_quantile(0.95, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000"
+)
+SLO_AVAILABILITY = 0.999
+
+
+def traefik_sli(window):
+    total = f'sum(rate(traefik_entrypoint_requests_total[{window}]))'
+    success = f'sum(rate(traefik_entrypoint_requests_total{{code!~"5.."}}[{window}]))'
+    return f"({success}) / clamp_min({total}, 1)"
+
+
+def traefik_burn(window):
+    sli = traefik_sli(window)
+    return f"(1 - ({sli})) / {1 - SLO_AVAILABILITY}"
 
 # ---------------------------------------------------------------------------
 # Panel factories
@@ -304,6 +380,7 @@ def stat_panel(
     grid,
     *,
     unit="none",
+    decimals=None,
     thresholds=None,
     text_mode="value",
     legend=None,
@@ -313,7 +390,7 @@ def stat_panel(
 ):
     """Return a Grafana stat panel definition."""
     defaults = {
-        "color": {"mode": "palette-classic"},
+        "color": {"mode": "thresholds"},
         "mappings": [],
         "thresholds": thresholds
         or {
@@ -328,6 +405,8 @@ def stat_panel(
     }
     if value_suffix:
         defaults["custom"]["valueSuffix"] = value_suffix
+    if decimals is not None:
+        defaults["decimals"] = decimals
     panel = {
         "id": panel_id,
         "type": "stat",
@@ -446,17 +525,32 @@ def table_panel(
     *,
     unit="none",
     transformations=None,
+    instant=False,
+    options=None,
+    filterable=True,
+    footer=None,
+    format=None,
 ):
     """Return a Grafana table panel definition."""
+    # Optional PromQL subquery helpers in expr: share(), etc.
+    panel_options = {"showHeader": True, "columnFilters": False}
+    if options:
+        panel_options.update(options)
+    if footer is not None:
+        panel_options["footer"] = footer
+    field_defaults = {"unit": unit, "custom": {"filterable": filterable}}
+    target = {"expr": expr, "refId": "A", **({"instant": True} if instant else {})}
+    if format:
+        target["format"] = format
     panel = {
         "id": panel_id,
         "type": "table",
         "title": title,
         "datasource": PROM_DS,
         "gridPos": grid,
-        "targets": [{"expr": expr, "refId": "A"}],
-        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
-        "options": {"showHeader": True},
+        "targets": [target],
+        "fieldConfig": {"defaults": field_defaults, "overrides": []},
+        "options": panel_options,
     }
     if transformations:
         panel["transformations"] = transformations
@@ -482,7 +576,7 @@ def pie_panel(panel_id, title, expr, grid):
         "options": {
             "legend": {"displayMode": "list", "placement": "right"},
             "pieType": "pie",
-            "displayLabels": ["percent"],
+            "displayLabels": [],
             "tooltip": {"mode": "single"},
             "colorScheme": "interpolateSpectral",
             "colorBy": "value",
@@ -491,7 +585,19 @@ def pie_panel(panel_id, title, expr, grid):
     }
 
 
-def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None):
+def bargauge_panel(
+    panel_id,
+    title,
+    expr,
+    grid,
+    *,
+    unit="none",
+    links=None,
+    limit=None,
+    thresholds=None,
+    decimals=None,
+    instant=False,
+):
     """Return a bar gauge panel with label-aware reduction."""
     panel = {
         "id": panel_id,
@@ -499,13 +605,16 @@ def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None):
         "title": title,
         "datasource": PROM_DS,
         "gridPos": grid,
-        "targets": [{"expr": expr, "refId": "A", "legendFormat": "{{node}}"}],
+        "targets": [
+            {"expr": expr, "refId": "A", "legendFormat": "{{node}}", **({"instant": True} if instant else {})}
+        ],
         "fieldConfig": {
             "defaults": {
                 "unit": unit,
                 "min": 0,
                 "max": 100 if unit == "percent" else None,
-                "thresholds": {
+                "thresholds": thresholds
+                or {
                     "mode": "absolute",
                     "steps": [
                         {"color": "green", "value": None},
@@ -527,8 +636,19 @@ def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None):
             },
         },
     }
+    if decimals is not None:
+        panel["fieldConfig"]["defaults"]["decimals"] = decimals
     if links:
         panel["links"] = links
+    # Keep bars ordered by value descending for readability.
+    panel["transformations"] = [
+        {
+            "id": "sortBy",
+            "options": {"fields": ["Value"], "order": "desc"},
+        }
+    ]
+    if limit:
+        panel["transformations"].append({"id": "limit", "options": {"limit": limit}})
     return panel
 
 
@@ -555,81 +675,37 @@ def link_to(uid):
 def build_overview():
     panels = []
 
+    count_thresholds = {
+        "mode": "absolute",
+        "steps": [
+            {"color": "green", "value": None},
+            {"color": "yellow", "value": 1},
+            {"color": "orange", "value": 2},
+            {"color": "red", "value": 3},
+        ],
+    }
+
     row1_stats = [
-        (
-            1,
-            "Workers Ready",
-            f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})',
-            WORKER_SUFFIX,
-            WORKER_TOTAL,
-            None,
-        ),
-        (
-            2,
-            "Control Plane Ready",
-            f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})',
-            CONTROL_SUFFIX,
-            CONTROL_TOTAL,
-            None,
-        ),
-        (
-            3,
-            "Control Plane Workloads",
-            CONTROL_WORKLOADS_EXPR,
-            None,
-            4,
-            link_to("atlas-pods"),
-        ),
-        (
-            4,
-            "Problem Pods",
-            PROBLEM_PODS_EXPR,
-            None,
-            1,
-            link_to("atlas-pods"),
-        ),
-        (
-            5,
-            "Stuck Terminating",
-            STUCK_TERMINATING_EXPR,
-            None,
-            1,
-            link_to("atlas-pods"),
-        ),
-    ]
-
-    def gauge_grid(idx):
-        width = GAUGE_WIDTHS[idx] if idx < len(GAUGE_WIDTHS) else 4
-        x = sum(GAUGE_WIDTHS[:idx])
-        return width, x
-
-    for idx, (panel_id, title, expr, suffix, ok_value, links) in enumerate(row1_stats):
-        thresholds = None
-        min_value = 0
-        max_value = ok_value or 5
-        if panel_id == 1:
-            max_value = WORKER_TOTAL
-            thresholds = {
-                "mode": "absolute",
-                "steps": [
-                    {"color": "red", "value": None},
-                    {"color": "orange", "value": WORKER_TOTAL - 2},
-                    {"color": "yellow", "value": WORKER_TOTAL - 1},
-                    {"color": "green", "value": WORKER_TOTAL},
-                ],
-            }
-        elif panel_id == 2:
-            max_value = CONTROL_TOTAL
-            thresholds = {
+        {
+            "id": 2,
+            "title": "Control Plane Ready",
+            "expr": f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})',
+            "kind": "gauge",
+            "max_value": CONTROL_TOTAL,
+            "thresholds": {
                 "mode": "absolute",
                 "steps": [
                     {"color": "red", "value": None},
                     {"color": "green", "value": CONTROL_TOTAL},
                 ],
-            }
-        elif panel_id in (3, 4, 5):
-            max_value = 4
-            thresholds = {
+            },
+        },
+        {
+            "id": 3,
+            "title": "Control Plane Workloads",
+            "expr": CONTROL_WORKLOADS_EXPR,
+            "kind": "stat",
+            "thresholds": {
                 "mode": "absolute",
                 "steps": [
                     {"color": "green", "value": None},
@@ -637,40 +713,122 @@ def build_overview():
                     {"color": "orange", "value": 2},
                     {"color": "red", "value": 3},
                 ],
-            }
-        else:
-            thresholds = {
+            },
+            "links": link_to("atlas-pods"),
+        },
+        {
+            "id": 5,
+            "title": "Stuck Terminating",
+            "expr": STUCK_TERMINATING_EXPR,
+            "kind": "stat",
+            "thresholds": {
                 "mode": "absolute",
                 "steps": [
                     {"color": "green", "value": None},
-                    {"color": "red", "value": max_value},
+                    {"color": "yellow", "value": 1},
+                    {"color": "orange", "value": 2},
+                    {"color": "red", "value": 3},
                 ],
-            }
+            },
+            "links": link_to("atlas-pods"),
+        },
+        {
+            "id": 27,
+            "title": "Atlas Availability (30d)",
+            "expr": UPTIME_PERCENT_EXPR,
+            "kind": "stat",
+            "thresholds": UPTIME_PERCENT_THRESHOLDS,
+            "unit": "percentunit",
+            "decimals": 3,
+            "text_mode": "value",
+        },
+        {
+            "id": 4,
+            "title": "Problem Pods",
+            "expr": PROBLEM_PODS_EXPR,
+            "kind": "stat",
+            "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 1},
+                    {"color": "orange", "value": 2},
+                    {"color": "red", "value": 3},
+                ],
+            },
+            "links": link_to("atlas-pods"),
+        },
+        {
+            "id": 6,
+            "title": "CrashLoop / ImagePull",
+            "expr": CRASHLOOP_EXPR,
+            "kind": "stat",
+            "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 1},
+                    {"color": "orange", "value": 2},
+                    {"color": "red", "value": 3},
+                ],
+            },
+            "links": link_to("atlas-pods"),
+        },
+        {
+            "id": 1,
+            "title": "Workers Ready",
+            "expr": f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})',
+            "kind": "gauge",
+            "max_value": WORKER_TOTAL,
+            "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                    {"color": "red", "value": None},
+                    {"color": "orange", "value": WORKER_TOTAL - 2},
+                    {"color": "yellow", "value": WORKER_TOTAL - 1},
+                    {"color": "green", "value": WORKER_TOTAL},
+                ],
+            },
+        },
+    ]
+
+    def gauge_grid(idx):
+        width = GAUGE_WIDTHS[idx] if idx < len(GAUGE_WIDTHS) else 4
+        x = sum(GAUGE_WIDTHS[:idx])
+        return width, x
+
+    for idx, item in enumerate(row1_stats):
+        panel_id = item["id"]
         width, x = gauge_grid(idx)
-        if panel_id in (3, 4, 5):
+        grid = {"h": 5, "w": width, "x": x, "y": 0}
+        kind = item.get("kind", "gauge")
+        if kind == "stat":
             panels.append(
                 stat_panel(
                     panel_id,
-                    title,
-                    expr,
-                    {"h": 5, "w": width, "x": x, "y": 0},
-                    thresholds=thresholds,
-                    legend=None,
-                    links=links,
-                    text_mode="value",
-                )
-            )
+                    item["title"],
+                    item["expr"],
+                    grid,
+                    thresholds=item.get("thresholds"),
+                    legend=None,
+                    links=item.get("links"),
+                    text_mode=item.get("text_mode", "value"),
+                    value_suffix=item.get("value_suffix"),
+                    unit=item.get("unit", "none"),
+                    decimals=item.get("decimals"),
+                )
+            )
         else:
             panels.append(
                 gauge_panel(
                     panel_id,
-                    title,
-                    expr,
-                    {"h": 5, "w": width, "x": x, "y": 0},
-                    min_value=min_value,
-                    max_value=max_value,
-                    thresholds=thresholds,
-                    links=links,
+                    item["title"],
+                    item["expr"],
+                    grid,
+                    min_value=0,
+                    max_value=item.get("max_value", 5),
+                    thresholds=item.get("thresholds"),
+                    links=item.get("links"),
                 )
             )
 
@@ -774,7 +932,7 @@ def build_overview():
         timeseries_panel(
             16,
             "Control plane CPU",
-            node_cpu_expr(CONTROL_REGEX),
+            node_cpu_expr(CONTROL_ALL_REGEX),
             {"h": 10, "w": 12, "x": 0, "y": 44},
             unit="percent",
             legend="{{node}}",
@@ -786,7 +944,7 @@ def build_overview():
         timeseries_panel(
             17,
             "Control plane RAM",
-            node_mem_expr(CONTROL_REGEX),
+            node_mem_expr(CONTROL_ALL_REGEX),
             {"h": 10, "w": 12, "x": 12, "y": 44},
             unit="percent",
             legend="{{node}}",
@@ -795,6 +953,36 @@ def build_overview():
         )
     )
 
+    panels.append(
+        pie_panel(
+            28,
+            "Node Pod Share",
+            '(sum(kube_pod_info{pod!="" , node!=""}) by (node) / clamp_min(sum(kube_pod_info{pod!="" , node!=""}), 1)) * 100',
+            {"h": 10, "w": 12, "x": 0, "y": 54},
+        )
+    )
+    panels.append(
+        bargauge_panel(
+            29,
+            "Top Nodes by Pod Count",
+            'topk(12, sum(kube_pod_info{pod!="" , node!=""}) by (node))',
+            {"h": 10, "w": 12, "x": 12, "y": 54},
+            unit="none",
+            limit=12,
+            decimals=0,
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 50},
+                    {"color": "orange", "value": 75},
+                    {"color": "red", "value": 100},
+                ],
+            },
+            instant=True,
+        )
+    )
+
     panels.append(
         timeseries_panel(
             18,
@@ -840,7 +1028,7 @@ def build_overview():
             21,
             "Root Filesystem Usage",
             root_usage_expr(),
-            {"h": 16, "w": 12, "x": 0, "y": 54},
+            {"h": 16, "w": 12, "x": 0, "y": 64},
             unit="percent",
             legend="{{node}}",
             legend_calcs=["last"],
@@ -855,8 +1043,9 @@ def build_overview():
             22,
             "Nodes Closest to Full Root Disks",
             f"topk(12, {root_usage_expr()})",
-            {"h": 16, "w": 12, "x": 12, "y": 54},
+            {"h": 16, "w": 12, "x": 12, "y": 64},
             unit="percent",
+            thresholds=PERCENT_THRESHOLDS,
             links=link_to("atlas-storage"),
         )
     )
@@ -874,13 +1063,7 @@ def build_overview():
         "templating": {"list": []},
         "time": {"from": "now-1h", "to": "now"},
         "refresh": "1m",
-        "links": [
-            {"title": "Atlas Pods", "type": "dashboard", "dashboardUid": "atlas-pods", "keepTime": False},
-            {"title": "Atlas Nodes", "type": "dashboard", "dashboardUid": "atlas-nodes", "keepTime": False},
-            {"title": "Atlas Storage", "type": "dashboard", "dashboardUid": "atlas-storage", "keepTime": False},
-            {"title": "Atlas Network", "type": "dashboard", "dashboardUid": "atlas-network", "keepTime": False},
-            {"title": "Atlas GPU", "type": "dashboard", "dashboardUid": "atlas-gpu", "keepTime": False},
-        ],
+        "links": [],
     }
 
 
@@ -980,6 +1163,91 @@ def build_pods_dashboard():
             ],
         )
     )
+    panels.append(
+        pie_panel(
+            8,
+            "Node Pod Share",
+            '(sum(kube_pod_info{pod!="" , node!=""}) by (node) / clamp_min(sum(kube_pod_info{pod!="" , node!=""}), 1)) * 100',
+            {"h": 8, "w": 12, "x": 12, "y": 34},
+        )
+    )
+    panels.append(
+        bargauge_panel(
+            9,
+            "Top Nodes by Pod Count",
+            'topk(12, sum(kube_pod_info{pod!="" , node!=""}) by (node))',
+            {"h": 8, "w": 12, "x": 0, "y": 34},
+            unit="none",
+            limit=12,
+            decimals=0,
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 50},
+                    {"color": "orange", "value": 75},
+                    {"color": "red", "value": 100},
+                ],
+            },
+            instant=True,
+        )
+    )
+
+    share_expr = (
+        '(sum by (namespace,node) (kube_pod_info{pod!="" , node!=""}) '
+        '/ on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=""}), 1) * 100)'
+    )
+    rank_terms = [
+        f"(sum by (node) (kube_node_info{{node=\"{node}\"}}) * 0 + {idx * 1e-3})"
+        for idx, node in enumerate(CONTROL_ALL + WORKER_NODES, start=1)
+    ]
+    rank_expr = " or ".join(rank_terms)
+    score_expr = f"{share_expr} + on(node) group_left() ({rank_expr})"
+    mask_expr = (
+        f"{score_expr} == bool on(namespace) group_left() "
+        f"(max by (namespace) ({score_expr}))"
+    )
+    panels.append(
+        table_panel(
+            10,
+            "Namespace Plurality by Node v27",
+            (
+                f"{share_expr} * on(namespace,node) group_left() "
+                f"({mask_expr})"
+            ),
+            {"h": 8, "w": 24, "x": 0, "y": 42},
+            unit="percent",
+            transformations=[
+                {"id": "labelsToFields", "options": {}},
+                {"id": "organize", "options": {"excludeByName": {"Time": True}}},
+                {"id": "filterByValue", "options": {"match": "Value", "operator": "gt", "value": 0}},
+                {
+                    "id": "sortBy",
+                    "options": {"fields": ["Value"], "order": "desc"},
+                },
+                {
+                    "id": "groupBy",
+                    "options": {
+                        "fields": {
+                            "namespace": {
+                                "aggregations": [
+                                    {"field": "Value", "operation": "max"},
+                                    {"field": "node", "operation": "first"},
+                                ]
+                            }
+                        },
+                        "rowBy": ["namespace"],
+                    },
+                },
+            ],
+            instant=True,
+            options={"showColumnFilters": False},
+            filterable=False,
+            footer={"show": False, "fields": "", "calcs": []},
+            format="table",
+        )
+    )
+
     return {
         "uid": "atlas-pods",
         "title": "Atlas Pods",
@@ -1022,12 +1290,69 @@ def build_nodes_dashboard():
             {"h": 4, "w": 8, "x": 16, "y": 0},
         )
     )
+    panels.append(
+        stat_panel(
+            9,
+            "API Server 5xx rate",
+            APISERVER_5XX_RATE,
+            {"h": 4, "w": 8, "x": 0, "y": 4},
+            unit="req/s",
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 0.05},
+                    {"color": "orange", "value": 0.2},
+                    {"color": "red", "value": 0.5},
+                ],
+            },
+            decimals=3,
+        )
+    )
+    panels.append(
+        stat_panel(
+            10,
+            "API Server P99 latency",
+            APISERVER_P99_LATENCY_MS,
+            {"h": 4, "w": 8, "x": 8, "y": 4},
+            unit="ms",
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 250},
+                    {"color": "orange", "value": 400},
+                    {"color": "red", "value": 600},
+                ],
+            },
+            decimals=1,
+        )
+    )
+    panels.append(
+        stat_panel(
+            11,
+            "etcd P99 latency",
+            ETCD_P99_LATENCY_MS,
+            {"h": 4, "w": 8, "x": 16, "y": 4},
+            unit="ms",
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 50},
+                    {"color": "orange", "value": 100},
+                    {"color": "red", "value": 200},
+                ],
+            },
+            decimals=1,
+        )
+    )
     panels.append(
         timeseries_panel(
             4,
             "Node CPU",
             node_cpu_expr(),
-            {"h": 9, "w": 24, "x": 0, "y": 4},
+            {"h": 9, "w": 24, "x": 0, "y": 8},
             unit="percent",
             legend="{{node}}",
             legend_calcs=["last"],
@@ -1040,7 +1365,7 @@ def build_nodes_dashboard():
             5,
             "Node RAM",
             node_mem_expr(),
-            {"h": 9, "w": 24, "x": 0, "y": 13},
+            {"h": 9, "w": 24, "x": 0, "y": 17},
             unit="percent",
             legend="{{node}}",
             legend_calcs=["last"],
@@ -1053,7 +1378,7 @@ def build_nodes_dashboard():
             6,
             "Control Plane (incl. titan-db) CPU",
             node_cpu_expr(CONTROL_ALL_REGEX),
-            {"h": 9, "w": 12, "x": 0, "y": 22},
+            {"h": 9, "w": 12, "x": 0, "y": 26},
             unit="percent",
             legend="{{node}}",
             legend_display="table",
@@ -1065,7 +1390,7 @@ def build_nodes_dashboard():
             7,
             "Control Plane (incl. titan-db) RAM",
             node_mem_expr(CONTROL_ALL_REGEX),
-            {"h": 9, "w": 12, "x": 12, "y": 22},
+            {"h": 9, "w": 12, "x": 12, "y": 26},
             unit="percent",
             legend="{{node}}",
             legend_display="table",
@@ -1077,7 +1402,7 @@ def build_nodes_dashboard():
             8,
             "Root Filesystem Usage",
             root_usage_expr(),
-            {"h": 9, "w": 24, "x": 0, "y": 31},
+            {"h": 9, "w": 24, "x": 0, "y": 35},
             unit="percent",
             legend="{{node}}",
             legend_display="table",
@@ -1204,43 +1529,107 @@ def build_network_dashboard():
     panels.append(
         stat_panel(
             1,
-            "Ingress Traffic",
-            NET_INGRESS_EXPR,
-            {"h": 4, "w": 8, "x": 0, "y": 0},
-            unit="Bps",
+            "Ingress Success Rate (5m)",
+            TRAEFIK_SLI_5M,
+            {"h": 4, "w": 6, "x": 0, "y": 0},
+            unit="percentunit",
+            decimals=2,
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "red", "value": None},
+                    {"color": "orange", "value": 0.995},
+                    {"color": "yellow", "value": 0.999},
+                    {"color": "green", "value": 0.9995},
+                ],
+            },
         )
     )
     panels.append(
         stat_panel(
             2,
-            "Egress Traffic",
-            NET_EGRESS_EXPR,
-            {"h": 4, "w": 8, "x": 8, "y": 0},
-            unit="Bps",
+            "Error Budget Burn (1h)",
+            traefik_burn("1h"),
+            {"h": 4, "w": 6, "x": 6, "y": 0},
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 1},
+                    {"color": "orange", "value": 2},
+                    {"color": "red", "value": 4},
+                ],
+            },
+            decimals=2,
         )
     )
     panels.append(
         stat_panel(
             3,
-            "Intra-Cluster Traffic",
-            NET_INTERNAL_EXPR,
-            {"h": 4, "w": 8, "x": 16, "y": 0},
-            unit="Bps",
+            "Error Budget Burn (6h)",
+            traefik_burn("6h"),
+            {"h": 4, "w": 6, "x": 12, "y": 0},
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 1},
+                    {"color": "orange", "value": 2},
+                    {"color": "red", "value": 4},
+                ],
+            },
+            decimals=2,
         )
     )
     panels.append(
         stat_panel(
             4,
-            "Top Router req/s",
-            f"topk(1, {TRAEFIK_ROUTER_EXPR})",
+            "Edge P99 Latency (ms)",
+            TRAEFIK_P99_LATENCY_MS,
+            {"h": 4, "w": 6, "x": 18, "y": 0},
+            unit="ms",
+            thresholds={
+                "mode": "absolute",
+                "steps": [
+                    {"color": "green", "value": None},
+                    {"color": "yellow", "value": 200},
+                    {"color": "orange", "value": 350},
+                    {"color": "red", "value": 500},
+                ],
+            },
+            decimals=1,
+        )
+    )
+    panels.append(
+        stat_panel(
+            5,
+            "Ingress Traffic",
+            NET_INGRESS_EXPR,
             {"h": 4, "w": 8, "x": 0, "y": 4},
-            unit="req/s",
-            legend="{{router}}",
+            unit="Bps",
+        )
+    )
+    panels.append(
+        stat_panel(
+            6,
+            "Egress Traffic",
+            NET_EGRESS_EXPR,
+            {"h": 4, "w": 8, "x": 8, "y": 4},
+            unit="Bps",
+        )
+    )
+    panels.append(
+        stat_panel(
+            7,
+            "Intra-Cluster Traffic",
+            NET_INTERNAL_EXPR,
+            {"h": 4, "w": 8, "x": 16, "y": 4},
+            unit="Bps",
         )
     )
     panels.append(
         timeseries_panel(
-            5,
+            8,
             "Per-Node Throughput",
             f'avg by (node) (({NET_NODE_TX_PHYS} + {NET_NODE_RX_PHYS}) * on(instance) group_left(node) {NODE_INFO})',
             {"h": 8, "w": 24, "x": 0, "y": 8},
@@ -1252,7 +1641,7 @@ def build_network_dashboard():
     )
     panels.append(
         table_panel(
-            6,
+            9,
             "Top Namespaces",
             'topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=""}[5m]) '
             '+ rate(container_network_receive_bytes_total{namespace!=""}[5m])) by (namespace))',
@@ -1263,7 +1652,7 @@ def build_network_dashboard():
     )
     panels.append(
         table_panel(
-            7,
+            10,
             "Top Pods",
             'topk(10, sum(rate(container_network_transmit_bytes_total{pod!=""}[5m]) '
             '+ rate(container_network_receive_bytes_total{pod!=""}[5m])) by (namespace,pod))',
@@ -1274,7 +1663,7 @@ def build_network_dashboard():
     )
     panels.append(
         timeseries_panel(
-            8,
+            11,
             "Traefik Routers (req/s)",
             f"topk(10, {TRAEFIK_ROUTER_EXPR})",
             {"h": 9, "w": 12, "x": 0, "y": 25},
@@ -1286,7 +1675,7 @@ def build_network_dashboard():
     )
     panels.append(
         timeseries_panel(
-            9,
+            12,
             "Traefik Entrypoints (req/s)",
             'sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))',
             {"h": 9, "w": 12, "x": 12, "y": 25},
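Review note on the `rank_terms` tie-break in `build_pods_dashboard`: each node's share gets a small, position-dependent offset (`idx * 1e-3`) so that `max by (namespace)` selects exactly one node even when two nodes hold the same share. The sketch below reproduces that idea in plain Python. The names `plurality_node`, `node_order`, and `step` are hypothetical and not part of the script, and the sketch assumes shares are percentages that differ by more than the largest offset whenever they are not tied.

```python
def plurality_node(shares, node_order, step=1e-3):
    """Return the node with the highest share for one namespace.

    shares: mapping of node name -> share in percent (hypothetical input).
    node_order: node names in ranking order; position sets the offset.
    step: offset added per rank position, mirroring idx * 1e-3.
    """
    best_node = None
    best_score = float("-inf")
    for idx, node in enumerate(node_order, start=1):
        if node not in shares:
            continue  # node runs no pods for this namespace
        # Later positions get a slightly larger offset, so they win exact ties.
        score = shares[node] + idx * step
        if score > best_score:
            best_node, best_score = node, score
    return best_node


if __name__ == "__main__":
    order = ["node-a", "node-b", "node-c"]
    print(plurality_node({"node-a": 50.0, "node-b": 50.0}, order))
    print(plurality_node({"node-a": 60.0, "node-b": 40.0}, order))
```

In this sketch the later-listed node wins an exact tie, which matches the effect of enumerating `CONTROL_ALL + WORKER_NODES` in order. If two shares could differ by less than the largest offset, the offset might change the winner, so `step` may need to shrink as the node list grows.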
diff --git a/scripts/mailu_sync.py b/scripts/mailu_sync.py
new file mode 100644
index 0000000..ee8aa18
--- /dev/null
+++ b/scripts/mailu_sync.py
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+"""
+Sync Keycloak users to Mailu mailboxes.
+ - Generates/stores a mailu_app_password attribute in Keycloak (admin-only)
+ - Upserts the mailbox in Mailu Postgres using that password
+"""
+
+import os
+import sys
+import json
+import time
+import secrets
+import string
+import datetime
+import requests
+import psycopg2
+from psycopg2.extras import RealDictCursor
+from passlib.hash import bcrypt_sha256
+
+
+KC_BASE = os.environ["KEYCLOAK_BASE_URL"].rstrip("/")
+KC_REALM = os.environ["KEYCLOAK_REALM"]
+KC_CLIENT_ID = os.environ["KEYCLOAK_CLIENT_ID"]
+KC_CLIENT_SECRET = os.environ["KEYCLOAK_CLIENT_SECRET"]
+
+MAILU_DOMAIN = os.environ["MAILU_DOMAIN"]
+MAILU_DEFAULT_QUOTA = int(os.environ.get("MAILU_DEFAULT_QUOTA", "20000000000"))
+
+DB_CONFIG = {
+    "host": os.environ["MAILU_DB_HOST"],
+    "port": int(os.environ.get("MAILU_DB_PORT", "5432")),
+    "dbname": os.environ["MAILU_DB_NAME"],
+    "user": os.environ["MAILU_DB_USER"],
+    "password": os.environ["MAILU_DB_PASSWORD"],
+}
+
+SESSION = requests.Session()
+
+
+def log(msg):
+    sys.stdout.write(f"{msg}\n")
+    sys.stdout.flush()
+
+
+def get_kc_token():
+    resp = SESSION.post(
+        f"{KC_BASE}/realms/{KC_REALM}/protocol/openid-connect/token",
+        data={
+            "grant_type": "client_credentials",
+            "client_id": KC_CLIENT_ID,
+            "client_secret": KC_CLIENT_SECRET,
+        },
+        timeout=15,
+    )
+    resp.raise_for_status()
+    return resp.json()["access_token"]
+
+
+def kc_get_users(token):
+    users = []
+    first = 0
+    max_results = 200
+    headers = {"Authorization": f"Bearer {token}"}
+    while True:
+        resp = SESSION.get(
+            f"{KC_BASE}/admin/realms/{KC_REALM}/users",
+            params={"first": first, "max": max_results, "enabled": "true"},
+            headers=headers,
+            timeout=20,
+        )
+        resp.raise_for_status()
+        batch = resp.json()
+        users.extend(batch)
+        if len(batch) < max_results:
+            break
+        first += max_results
+    return users
+
+
+def kc_update_attributes(token, user, attributes):
+    headers = {
+        "Authorization": f"Bearer {token}",
+        "Content-Type": "application/json",
+    }
+    payload = {
+        "firstName": user.get("firstName"),
+        "lastName": user.get("lastName"),
+        "email": user.get("email"),
+        "enabled": user.get("enabled", True),
+        "username": user["username"],
+        "emailVerified": user.get("emailVerified", False),
+        "attributes": attributes,
+    }
+    user_url = f"{KC_BASE}/admin/realms/{KC_REALM}/users/{user['id']}"
+    resp = SESSION.put(user_url, headers=headers, json=payload, timeout=20)
+    resp.raise_for_status()
+    verify = SESSION.get(
+        user_url,
+        headers={"Authorization": f"Bearer {token}"},
+        params={"briefRepresentation": "false"},
+        timeout=15,
+    )
+    verify.raise_for_status()
+    attrs = verify.json().get("attributes") or {}
+    if not attrs.get("mailu_app_password"):
+        raise RuntimeError(f"attribute not persisted for {user.get('email') or user['username']}")
+
+
+def random_password():
+    alphabet = string.ascii_letters + string.digits
+    return "".join(secrets.choice(alphabet) for _ in range(24))
+
+
+def ensure_mailu_user(cursor, email, password, display_name):
+    localpart, domain = email.split("@", 1)
+    if domain.lower() != MAILU_DOMAIN.lower():
+        return
+    hashed = bcrypt_sha256.hash(password)
+    now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
+    cursor.execute(
+        """
+        INSERT INTO "user" (
+          email, localpart, domain_name, password,
+          quota_bytes, quota_bytes_used,
+          global_admin, enabled, enable_imap, enable_pop, allow_spoofing,
+          forward_enabled, forward_destination, forward_keep,
+          reply_enabled, reply_subject, reply_body, reply_startdate, reply_enddate,
+          displayed_name, spam_enabled, spam_mark_as_read, spam_threshold,
+          change_pw_next_login, created_at, updated_at, comment
+        )
+        VALUES (
+          %(email)s, %(localpart)s, %(domain)s, %(password)s,
+          %(quota)s, 0,
+          false, true, true, true, false,
+          false, '', true,
+          false, NULL, NULL, DATE '1900-01-01', DATE '2999-12-31',
+          %(display)s, true, true, 80,
+          false, CURRENT_DATE, %(now)s, ''
+        )
+        ON CONFLICT (email) DO UPDATE
+        SET password = EXCLUDED.password,
+            enabled = true,
+            updated_at = EXCLUDED.updated_at
+        """,
+        {
+            "email": email,
+            "localpart": localpart,
+            "domain": domain,
+            "password": hashed,
+            "quota": MAILU_DEFAULT_QUOTA,
+            "display": display_name or localpart,
+            "now": now,
+        },
+    )
+
+
+def main():
+    token = get_kc_token()
+    users = kc_get_users(token)
+    if not users:
+        log("No users found; exiting.")
+        return
+
+    conn = psycopg2.connect(**DB_CONFIG)
+    conn.autocommit = True
+    cursor = conn.cursor(cursor_factory=RealDictCursor)
+
+    for user in users:
+        attrs = user.get("attributes", {}) or {}
+        app_pw_value = attrs.get("mailu_app_password")
+        if isinstance(app_pw_value, list):
+            app_pw = app_pw_value[0] if app_pw_value else None
+        elif isinstance(app_pw_value, str):
+            app_pw = app_pw_value
+        else:
+            app_pw = None
+
+        email = user.get("email")
+        if not email:
+            email = f"{user['username']}@{MAILU_DOMAIN}"
+
+        if not app_pw:
+            app_pw = random_password()
+            attrs["mailu_app_password"] = app_pw
+            kc_update_attributes(token, user, attrs)
+            log(f"Set mailu_app_password for {email}")
+
+        display_name = " ".join(
+            part for part in [user.get("firstName"), user.get("lastName")] if part
+        ).strip()
+
+        ensure_mailu_user(cursor, email, app_pw, display_name)
+        log(f"Synced mailbox for {email}")
+
+    cursor.close()
+    conn.close()
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as exc:
+        log(f"ERROR: {exc}")
+        sys.exit(1)
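Review note on the attribute handling in `main()`: the Keycloak user representation may return `mailu_app_password` as a list of strings, while the script also tolerates a bare string. A minimal sketch of that normalization as a standalone helper follows. The name `first_attribute_value` is hypothetical and does not exist in `mailu_sync.py`; it only illustrates how the branch could be isolated for unit testing.

```python
def first_attribute_value(attributes, name):
    """Return the first value stored under `name`, or None if absent.

    attributes: dict as found under "attributes" in a user payload;
                may be None when the user has no attributes at all.
    """
    value = (attributes or {}).get(name)
    if isinstance(value, list):
        # List-valued attribute: take the first entry if there is one.
        return value[0] if value else None
    if isinstance(value, str):
        return value
    # Anything else (missing key, unexpected type) counts as "not set".
    return None


if __name__ == "__main__":
    user = {"attributes": {"mailu_app_password": ["abc"]}}
    print(first_attribute_value(user.get("attributes"), "mailu_app_password"))
```

With a helper of this shape, the loop in `main()` would only need a single call followed by the existing `if not app_pw:` check.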
diff --git a/scripts/nextcloud-mail-sync.sh b/scripts/nextcloud-mail-sync.sh
new file mode 100755
index 0000000..7feeec6
--- /dev/null
+++ b/scripts/nextcloud-mail-sync.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+set -euo pipefail
+
+KC_BASE="${KC_BASE:?}"
+KC_REALM="${KC_REALM:?}"
+KC_ADMIN_USER="${KC_ADMIN_USER:?}"
+KC_ADMIN_PASS="${KC_ADMIN_PASS:?}"
+
+if ! command -v jq >/dev/null 2>&1; then
+  apt-get update && apt-get install -y jq curl >/dev/null
+fi
+
+account_exists() {
+  # Skip if the account email is already present in the mail app.
+  runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq " ${1}" || \
+    runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq "${1} "
+}
+
+token=$(
+  curl -s -d "grant_type=password" \
+    -d "client_id=admin-cli" \
+    --data-urlencode "username=${KC_ADMIN_USER}" \
+    --data-urlencode "password=${KC_ADMIN_PASS}" \
+    "${KC_BASE}/realms/master/protocol/openid-connect/token" | jq -r '.access_token'
+)
+
+if [[ -z "${token}" || "${token}" == "null" ]]; then
+  echo "Failed to obtain admin token"
+  exit 1
+fi
+
+users=$(curl -s -H "Authorization: Bearer ${token}" \
+  "${KC_BASE}/admin/realms/${KC_REALM}/users?max=2000")
+
+echo "${users}" | jq -c '.[]' | while read -r user; do
+  username=$(echo "${user}" | jq -r '.username')
+  email=$(echo "${user}" | jq -r '.email // empty')
+  app_pw=$(echo "${user}" | jq -r '.attributes.mailu_app_password[0] // empty')
+  [[ -z "${email}" || -z "${app_pw}" ]] && continue
+  if account_exists "${email}"; then
+    echo "Skipping ${email}, already exists"
+    continue
+  fi
+  echo "Syncing ${email}"
+  runuser -u www-data -- php occ mail:account:create \
+    "${username}" "${username}" "${email}" \
+    mail.bstein.dev 993 ssl "${email}" "${app_pw}" \
+    mail.bstein.dev 587 tls "${email}" "${app_pw}" login || true
+done
diff --git a/scripts/nextcloud-maintenance.sh b/scripts/nextcloud-maintenance.sh
new file mode 100755
index 0000000..e8ea18c
--- /dev/null
+++ b/scripts/nextcloud-maintenance.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+set -euo pipefail
+
+NC_URL="${NC_URL:-https://cloud.bstein.dev}"
+ADMIN_USER="${ADMIN_USER:?}"
+ADMIN_PASS="${ADMIN_PASS:?}"
+
+export DEBIAN_FRONTEND=noninteractive
+apt-get update -qq
+apt-get install -y -qq curl jq >/dev/null
+
+run_occ() {
+  runuser -u www-data -- php occ "$@"
+}
+
+log() { echo "[$(date -Is)] $*"; }
+
+log "Applying Atlas theming"
+run_occ theming:config name "Atlas Cloud"
+run_occ theming:config slogan "Unified access to Atlas services"
+run_occ theming:config url "https://cloud.bstein.dev"
+run_occ theming:config color "#0f172a"
+run_occ theming:config disable-user-theming yes
+
+log "Setting default quota to 200 GB"
+run_occ config:app:set files default_quota --value "200 GB"
+
+API_BASE="${NC_URL}/ocs/v2.php/apps/external/api/v1"
+AUTH=(-u "${ADMIN_USER}:${ADMIN_PASS}" -H "OCS-APIRequest: true")
+
+log "Removing existing external links"
+existing=$(curl -sf "${AUTH[@]}" "${API_BASE}?format=json" | jq -r '.ocs.data[].id // empty')
+for id in ${existing}; do
+  curl -sf "${AUTH[@]}" -X DELETE "${API_BASE}/sites/${id}?format=json" >/dev/null || true
+done
+
+SITES=(
+  "Vaultwarden|https://vault.bstein.dev"
+  "Jellyfin|https://stream.bstein.dev"
+  "Gitea|https://scm.bstein.dev"
+  "Jenkins|https://ci.bstein.dev"
+  "Zot|https://registry.bstein.dev"
+  "Vault|https://secret.bstein.dev"
+  "Jitsi|https://meet.bstein.dev"
+  "Grafana|https://metrics.bstein.dev"
+  "Chat LLM|https://chat.ai.bstein.dev"
+  "Vision|https://draw.ai.bstein.dev"
+  "STT/TTS|https://talk.ai.bstein.dev"
+)
+
+log "Seeding external links"
+for entry in "${SITES[@]}"; do
+  IFS="|" read -r name url <<<"${entry}"
+  curl -sf "${AUTH[@]}" -X POST "${API_BASE}/sites?format=json" \
+    --data-urlencode "name=${name}" \
+    --data-urlencode "url=${url}" \
+    -d "lang=" \
+    -d "type=link" \
+    -d "device=" \
+    -d "icon=" \
+    -d "groups[]=" \
+    -d "redirect=1" >/dev/null
+done
+
+log "Maintenance run completed"
diff --git a/scripts/tests/test_dashboards_render_atlas.py b/scripts/tests/test_dashboards_render_atlas.py
new file mode 100644
index 0000000..865aa68
--- /dev/null
+++ b/scripts/tests/test_dashboards_render_atlas.py
@@ -0,0 +1,58 @@
+import importlib.util
+import pathlib
+
+
+def load_module():
+    path = pathlib.Path(__file__).resolve().parents[1] / "dashboards_render_atlas.py"
+    spec = importlib.util.spec_from_file_location("dashboards_render_atlas", path)
+    module = importlib.util.module_from_spec(spec)
+    assert spec.loader is not None
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_table_panel_options_and_filterable():
+    mod = load_module()
+    panel = mod.table_panel(
+        1,
+        "test",
+        "metric",
+        {"h": 1, "w": 1, "x": 0, "y": 0},
+        unit="percent",
+        transformations=[{"id": "labelsToFields", "options": {}}],
+        instant=True,
+        options={"showColumnFilters": False},
+        filterable=False,
+        footer={"show": False, "fields": "", "calcs": []},
+        format="table",
+    )
+    assert panel["fieldConfig"]["defaults"]["unit"] == "percent"
+    assert panel["fieldConfig"]["defaults"]["custom"]["filterable"] is False
+    assert panel["options"]["showHeader"] is True
+    assert panel["targets"][0]["format"] == "table"
+
+
+def test_node_filter_and_expr_helpers():
+    mod = load_module()
+    expr = mod.node_filter("titan-.*")
+    assert "label_replace" in expr
+    cpu_expr = mod.node_cpu_expr("titan-.*")
+    mem_expr = mod.node_mem_expr("titan-.*")
+    assert "node_cpu_seconds_total" in cpu_expr
+    assert "node_memory_MemAvailable_bytes" in mem_expr
+
+
+def test_render_configmap_writes(tmp_path):
+    mod = load_module()
+    mod.DASHBOARD_DIR = tmp_path / "dash"
+    mod.ROOT = tmp_path
+    uid = "atlas-test"
+    info = {"configmap": tmp_path / "cm.yaml"}
+    data = {"title": "Atlas Test"}
+    mod.write_json(uid, data)
+    mod.render_configmap(uid, info)
+    json_path = mod.DASHBOARD_DIR / f"{uid}.json"
+    assert json_path.exists()
+    content = (tmp_path / "cm.yaml").read_text()
+    assert "kind: ConfigMap" in content
+    assert f"{uid}.json" in content
diff --git a/scripts/tests/test_mailu_sync.py b/scripts/tests/test_mailu_sync.py
new file mode 100644
index 0000000..41616b2
--- /dev/null
+++ b/scripts/tests/test_mailu_sync.py
@@ -0,0 +1,181 @@
+import importlib.util
+import pathlib
+
+import pytest
+
+
+def load_sync_module(monkeypatch):
+    # Minimal env required by module import
+    env = {
+        "KEYCLOAK_BASE_URL": "http://keycloak",
+        "KEYCLOAK_REALM": "atlas",
+        "KEYCLOAK_CLIENT_ID": "mailu-sync",
+        "KEYCLOAK_CLIENT_SECRET": "secret",
+        "MAILU_DOMAIN": "example.com",
+        "MAILU_DB_HOST": "localhost",
+        "MAILU_DB_PORT": "5432",
+        "MAILU_DB_NAME": "mailu",
+        "MAILU_DB_USER": "mailu",
+        "MAILU_DB_PASSWORD": "pw",
+    }
+    for k, v in env.items():
+        monkeypatch.setenv(k, v)
+    module_path = pathlib.Path(__file__).resolve().parents[1] / "mailu_sync.py"
+    spec = importlib.util.spec_from_file_location("mailu_sync_testmod", module_path)
+    module = importlib.util.module_from_spec(spec)
+    assert spec.loader is not None
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_random_password_length_and_charset(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    pw = sync.random_password()
+    assert len(pw) == 24
+    assert all(ch.isalnum() for ch in pw)
+
+
+class _FakeResponse:
+    def __init__(self, json_data=None, status=200):
+        self._json_data = {} if json_data is None else json_data
+        self.status_code = status
+
+    def raise_for_status(self):
+        if self.status_code >= 400:
+            raise AssertionError(f"status {self.status_code}")
+
+    def json(self):
+        return self._json_data
+
+
+class _FakeSession:
+    def __init__(self, put_resp, get_resp):
+        self.put_resp = put_resp
+        self.get_resp = get_resp
+        self.put_called = False
+        self.get_called = False
+
+    def post(self, *args, **kwargs):
+        return _FakeResponse({"access_token": "dummy"})
+
+    def put(self, *args, **kwargs):
+        self.put_called = True
+        return self.put_resp
+
+    def get(self, *args, **kwargs):
+        self.get_called = True
+        return self.get_resp
+
+
+def test_kc_update_attributes_succeeds(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    ok_resp = _FakeResponse({"attributes": {"mailu_app_password": ["abc"]}})
+    sync.SESSION = _FakeSession(_FakeResponse({}), ok_resp)
+    sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"})
+    assert sync.SESSION.put_called and sync.SESSION.get_called
+
+
+def test_kc_update_attributes_raises_without_attribute(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    missing_attr_resp = _FakeResponse({"attributes": {}}, status=200)
+    sync.SESSION = _FakeSession(_FakeResponse({}), missing_attr_resp)
+    with pytest.raises(Exception):
+        sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"})
+
+
+def test_kc_get_users_paginates(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+
+    class _PagedSession:
+        def __init__(self):
+            self.calls = 0
+
+        def post(self, *_, **__):
+            return _FakeResponse({"access_token": "tok"})
+
+        def get(self, *_, **__):
+            self.calls += 1
+            if self.calls == 1:
+                return _FakeResponse([{"id": "u1"}, {"id": "u2"}])
+            return _FakeResponse([])  # stop pagination
+
+    sync.SESSION = _PagedSession()
+    users = sync.kc_get_users("tok")
+    assert [u["id"] for u in users] == ["u1", "u2"]
+    assert sync.SESSION.calls == 2
+
+
+def test_ensure_mailu_user_skips_foreign_domain(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    executed = []
+
+    class _Cursor:
+        def execute(self, sql, params):
+            executed.append((sql, params))
+
+    sync.ensure_mailu_user(_Cursor(), "user@other.com", "pw", "User")
+    assert not executed
+
+
+def test_ensure_mailu_user_upserts(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    captured = {}
+
+    class _Cursor:
+        def execute(self, sql, params):
+            captured.update(params)
+
+    sync.ensure_mailu_user(_Cursor(), "user@example.com", "pw", "User Example")
+    assert captured["email"] == "user@example.com"
+    assert captured["localpart"] == "user"
+    # password should be hashed, not the raw string
+    assert captured["password"] != "pw"
+
+
+def test_main_generates_password_and_upserts(monkeypatch):
+    sync = load_sync_module(monkeypatch)
+    users = [
+        {"id": "u1", "username": "user1", "email": "user1@example.com", "attributes": {}},
+        {"id": "u2", "username": "user2", "email": "user2@example.com", "attributes": {"mailu_app_password": ["keepme"]}},
+        {"id": "u3", "username": "user3", "email": "user3@other.com", "attributes": {}},
+    ]
+    updated = []
+
+    class _Cursor:
+        def __init__(self):
+            self.executions = []
+
+        def execute(self, sql, params):
+            self.executions.append(params)
+
+        def close(self):
+            return None
+
+    class _Conn:
+        def __init__(self):
+            self.autocommit = False
+            self._cursor = _Cursor()
+
+        def cursor(self, cursor_factory=None):
+            return self._cursor
+
+        def close(self):
+            return None
+
+    monkeypatch.setattr(sync, "get_kc_token", lambda: "tok")
+    monkeypatch.setattr(sync, "kc_get_users", lambda token: users)
+    monkeypatch.setattr(sync, "kc_update_attributes", lambda token, user, attrs: updated.append((user["id"], attrs["mailu_app_password"])))
+    conns = []
+
+    def _connect(**kwargs):
+        conn = _Conn()
+        conns.append(conn)
+        return conn
+
+    monkeypatch.setattr(sync.psycopg2, "connect", _connect)
+
+    sync.main()
+
+    assert len(updated) == 1  # only one missing attr was backfilled
+    # Should attempt two upserts (third user skipped due to domain mismatch)
+    assert conns and len(conns[0]._cursor.executions) == 2
diff --git a/services/gitea/ingress.yaml b/services/gitea/ingress.yaml
index 375dba3..0077ba4 100644
--- a/services/gitea/ingress.yaml
+++ b/services/gitea/ingress.yaml
@@ -5,7 +5,7 @@ metadata:
   name: gitea-ingress
   namespace: gitea
   annotations:
-    cert-manager.io/cluster-issuer: "letsencrypt-prod"
+    cert-manager.io/cluster-issuer: letsencrypt
     nginx.ingress.kubernetes.io/ssl-redirect: "true"
 spec:
   tls:
diff --git a/services/gitops-ui/helmrelease.yaml b/services/gitops-ui/helmrelease.yaml
new file mode 100644
index 0000000..27b610d
--- /dev/null
+++ b/services/gitops-ui/helmrelease.yaml
@@ -0,0 +1,49 @@
+# services/gitops-ui/helmrelease.yaml
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: weave-gitops
+  namespace: flux-system
+spec:
+  interval: 30m
+  chart:
+    spec:
+      chart: ./charts/gitops-server
+      sourceRef:
+        kind: GitRepository
+        name: weave-gitops-upstream
+        namespace: flux-system
+      # track upstream tag; see source object for version pin
+  install:
+    remediation:
+      retries: 3
+  upgrade:
+    remediation:
+      retries: 3
+      remediateLastFailure: true
+    cleanupOnFail: true
+  values:
+    adminUser:
+      create: true
+      createClusterRole: true
+      createSecret: true
+      username: admin
+      # bcrypt hash for temporary password "G1tOps!2025" (rotate after login)
+      passwordHash: "$2y$12$wDEOzR1Gc2dbvNSJ3ZXNdOBVFEjC6YASIxnZmHIbO.W1m0fie/QVi"
+    ingress:
+      enabled: true
+      className: traefik
+      annotations:
+        cert-manager.io/cluster-issuer: letsencrypt-prod
+        traefik.ingress.kubernetes.io/router.entrypoints: websecure
+      hosts:
+        - host: cd.bstein.dev
+          paths:
+            - path: /
+              pathType: Prefix
+      tls:
+        - secretName: gitops-ui-tls
+          hosts:
+            - cd.bstein.dev
+    metrics:
+      enabled: true
diff --git a/services/gitops-ui/kustomization.yaml b/services/gitops-ui/kustomization.yaml
new file mode 100644
index 0000000..53a903e
--- /dev/null
+++ b/services/gitops-ui/kustomization.yaml
@@ -0,0 +1,7 @@
+# services/gitops-ui/kustomization.yaml
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: flux-system
+resources:
+  - source.yaml
+  - helmrelease.yaml
diff --git a/services/gitops-ui/source.yaml b/services/gitops-ui/source.yaml
new file mode 100644
index 0000000..0e87524
--- /dev/null
+++ b/services/gitops-ui/source.yaml
@@ -0,0 +1,11 @@
+# services/gitops-ui/source.yaml
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+  name: weave-gitops-upstream
+  namespace: flux-system
+spec:
+  interval: 1h
+  url: https://github.com/weaveworks/weave-gitops.git
+  ref:
+    tag: v0.38.0
diff --git a/services/jitsi/ingress.yaml b/services/jitsi/ingress.yaml
index c09b669..3336c37 100644
--- a/services/jitsi/ingress.yaml
+++ b/services/jitsi/ingress.yaml
@@ -5,7 +5,7 @@ metadata:
   name: jitsi
   namespace: jitsi
   annotations:
-    cert-manager.io/cluster-issuer: "letsencrypt-prod"
+    cert-manager.io/cluster-issuer: letsencrypt
 spec:
   ingressClassName: traefik
   tls:
diff --git a/services/keycloak/README.md b/services/keycloak/NOTES.md
similarity index 100%
rename from services/keycloak/README.md
rename to services/keycloak/NOTES.md
diff --git a/services/keycloak/deployment.yaml b/services/keycloak/deployment.yaml
index af7839f..c4ffcda 100644
--- a/services/keycloak/deployment.yaml
+++ b/services/keycloak/deployment.yaml
@@ -48,6 +48,20 @@ spec:
         runAsGroup: 0
         fsGroup: 1000
         fsGroupChangePolicy: OnRootMismatch
+      imagePullSecrets:
+        - name: zot-regcred
+      initContainers:
+        - name: mailu-http-listener
+          image: registry.bstein.dev/sso/mailu-http-listener:0.1.0
+          imagePullPolicy: IfNotPresent
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              cp /plugin/mailu-http-listener-0.1.0.jar /providers/
+              cp -r /plugin/src /providers/src
+          volumeMounts:
+            - name: providers
+              mountPath: /providers
       containers:
         - name: keycloak
           image: quay.io/keycloak/keycloak:26.0.7
@@ -104,6 +118,10 @@ spec:
                 secretKeyRef:
                   name: keycloak-admin
                   key: password
+            - name: KC_EVENTS_LISTENERS
+              value: jboss-logging,mailu-http
+            - name: KC_SPI_EVENTS_LISTENER_MAILU-HTTP_ENDPOINT
+              value: http://mailu-sync-listener.mailu-mailserver.svc.cluster.local:8080/events
           ports:
             - containerPort: 8080
               name: http
@@ -126,7 +144,11 @@ spec:
           volumeMounts:
             - name: data
               mountPath: /opt/keycloak/data
+            - name: providers
+              mountPath: /opt/keycloak/providers
       volumes:
         - name: data
           persistentVolumeClaim:
             claimName: keycloak-data
+        - name: providers
+          emptyDir: {}
diff --git a/services/mailu/certificate.yaml b/services/mailu/certificate.yaml
new file mode 100644
index 0000000..83cc17c
--- /dev/null
+++ b/services/mailu/certificate.yaml
@@ -0,0 +1,13 @@
+# services/mailu/certificate.yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: mailu-tls
+  namespace: mailu-mailserver
+spec:
+  secretName: mailu-certificates
+  issuerRef:
+    kind: ClusterIssuer
+    name: letsencrypt-prod
+  dnsNames:
+    - mail.bstein.dev
diff --git a/services/mailu/helmrelease.yaml b/services/mailu/helmrelease.yaml
new file mode 100644
index 0000000..c8b0975
--- /dev/null
+++ b/services/mailu/helmrelease.yaml
@@ -0,0 +1,287 @@
+# services/mailu/helmrelease.yaml
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: mailu
+  namespace: mailu-mailserver
+spec:
+  interval: 30m
+  chart:
+    spec:
+      chart: mailu
+      version: 2.1.2
+      sourceRef:
+        kind: HelmRepository
+        name: mailu
+        namespace: flux-system
+  install:
+    remediation: { retries: 3 }
+    timeout: 10m
+  upgrade:
+    remediation:
+      retries: 3
+      remediateLastFailure: true
+    cleanupOnFail: true
+    timeout: 10m
+  values:
+    mailuVersion: "2024.06"
+    domain: bstein.dev
+    hostnames: [mail.bstein.dev]
+    domains:
+      - name: bstein.dev
+        enabled: true
+        dkim:
+          enabled: true
+    externalRelay:
+      host: "[email-smtp.us-east-2.amazonaws.com]:587"
+      existingSecret: mailu-ses-relay
+      usernameKey: relay-username
+      passwordKey: relay-password
+    timezone: Etc/UTC
+    subnet: 10.42.0.0/16
+    existingSecret: mailu-secret
+    tls:
+      outboundLevel: encrypt
+    externalDatabase:
+      enabled: true
+      type: postgresql
+      host: postgres-service.postgres.svc.cluster.local
+      port: 5432
+      database: mailu
+      username: mailu
+      existingSecret: mailu-db-secret
+      existingSecretUsernameKey: username
+      existingSecretPasswordKey: password
+      existingSecretDatabaseKey: database
+    initialAccount:
+      enabled: true
+      username: test
+      domain: bstein.dev
+      existingSecret: mailu-initial-account-secret
+      existingSecretPasswordKey: password
+    persistence:
+      accessModes: [ReadWriteMany]
+      size: 100Gi
+      storageClass: astreae
+      single_pvc: true
+    front:
+      hostnames: [mail.bstein.dev]
+      proxied: true
+      hostPort:
+        enabled: false
+      https:
+        enabled: false
+        external: false
+        forceHttps: false
+      externalService:
+        enabled: true
+        type: LoadBalancer
+        externalTrafficPolicy: Cluster
+        ports:
+          submission: true
+        nodePorts:
+          pop3: 30010
+          pop3s: 30011
+          imap: 30143
+          imaps: 30993
+          manageSieve: 30419
+          smtp: 30025
+          smtps: 30465
+          submission: 30587
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+    admin:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+      podLivenessProbe:
+        enabled: true
+        initialDelaySeconds: 30
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 6
+        successThreshold: 1
+      podReadinessProbe:
+        enabled: true
+        initialDelaySeconds: 20
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 6
+        successThreshold: 1
+      extraEnvVars:
+        - name: FLASK_DEBUG
+          value: "1"
+        - name: ACCESSLOG
+          value: /dev/stdout
+        - name: ERRORLOG
+          value: /dev/stderr
+        - name: WEBROOT_REDIRECT
+          value: ""
+        - name: FORWARDED_ALLOW_IPS
+          value: 127.0.0.1,10.42.0.0/16
+        - name: DNS_RESOLVERS
+          value: 1.1.1.1,9.9.9.9
+      extraVolumes:
+        - name: unbound-config
+          configMap:
+            name: mailu-unbound
+        - name: unbound-run
+          emptyDir: {}
+      extraVolumeMounts:
+        - name: unbound-run
+          mountPath: /var/lib/unbound
+      extraContainers:
+        - name: unbound
+          image: docker.io/alpine:3.20
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              while :; do
+                printf "nameserver 10.43.0.10\n" > /etc/resolv.conf
+                if apk add --no-cache unbound bind-tools; then
+                  break
+                fi
+                echo "apk failed, retrying" >&2
+                sleep 10
+              done
+              cat >/etc/resolv.conf <<'EOF'
+              search mailu-mailserver.svc.cluster.local svc.cluster.local cluster.local
+              nameserver 127.0.0.1
+              EOF
+              unbound-anchor -a /var/lib/unbound/root.key || true
+              exec unbound -d -c /opt/unbound/etc/unbound/unbound.conf
+          ports:
+            - containerPort: 53
+              protocol: UDP
+            - containerPort: 53
+              protocol: TCP
+          volumeMounts:
+            - name: unbound-config
+              mountPath: /opt/unbound/etc/unbound
+            - name: unbound-run
+              mountPath: /var/lib/unbound
+      dnsPolicy: None
+      dnsConfig:
+        nameservers:
+          - 127.0.0.1
+        searches:
+          - mailu-mailserver.svc.cluster.local
+          - svc.cluster.local
+          - cluster.local
+    clamav:
+      image:
+        repository: clamav/clamav-debian
+        tag: "1.4"
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi5
+      resources:
+        requests:
+          cpu: 200m
+          memory: 1Gi
+        limits:
+          cpu: 500m
+          memory: 3Gi
+      livenessProbe:
+        enabled: false
+        initialDelaySeconds: 300
+        periodSeconds: 30
+        timeoutSeconds: 5
+        failureThreshold: 6
+        successThreshold: 1
+      startupProbe:
+        enabled: false
+        initialDelaySeconds: 60
+        periodSeconds: 30
+        timeoutSeconds: 5
+        failureThreshold: 20
+        successThreshold: 1
+      readinessProbe:
+        enabled: false
+        initialDelaySeconds: 300
+        periodSeconds: 30
+        timeoutSeconds: 5
+        failureThreshold: 6
+        successThreshold: 1
+    dovecot:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+    oletools:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+    postfix:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+      overrides:
+        smtp_use_tls: "yes"
+        smtp_tls_security_level: "encrypt"
+        smtp_sasl_security_options: "noanonymous"
+    redis:
+      enabled: true
+      architecture: standalone
+      logLevel: DEBUG
+      image:
+        repository: bitnamilegacy/redis
+        tag: 8.0.3-debian-12-r3
+      master:
+        nodeSelector:
+          hardware: rpi4
+        persistence:
+          enabled: true
+          accessModes: [ReadWriteMany]
+          size: 8Gi
+          storageClass: astreae
+    rspamd:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+      persistence:
+        accessModes: [ReadWriteOnce]
+        size: 8Gi
+        storageClass: astreae
+    tika:
+      logLevel: DEBUG
+      nodeSelector:
+        hardware: rpi4
+    global:
+      logLevel: DEBUG
+      storageClass: astreae
+    webmail:
+      enabled: false
+      nodeSelector:
+        hardware: rpi4
+    ingress:
+      enabled: false
+      ingressClassName: traefik
+      tls: true
+      existingSecret: mailu-certificates
+      annotations:
+        traefik.ingress.kubernetes.io/router.entrypoints: websecure
+        traefik.ingress.kubernetes.io/service.serversscheme: https
+        traefik.ingress.kubernetes.io/service.serverstransport: mailu-transport@kubernetescrd
+      extraRules:
+        - host: mail.bstein.dev
+          http:
+            paths:
+              - path: /
+                pathType: Prefix
+                backend:
+                  service:
+                    name: mailu-front
+                    port:
+                      number: 443
+    service:
+      ports:
+        smtp:
+          port: 25
+          targetPort: 25
+        smtps:
+          port: 465
+          targetPort: 465
+        submission:
+          port: 587
+          targetPort: 587
diff --git a/services/mailu/ingressroute.yaml b/services/mailu/ingressroute.yaml
new file mode 100644
index 0000000..d4bc4f6
--- /dev/null
+++ b/services/mailu/ingressroute.yaml
@@ -0,0 +1,19 @@
+# services/mailu/ingressroute.yaml
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: mailu
+  namespace: mailu-mailserver
+spec:
+  entryPoints:
+    - websecure
+  routes:
+    - match: Host(`mail.bstein.dev`)
+      kind: Rule
+      services:
+        - name: mailu-front
+          port: 443
+          scheme: https
+          serversTransport: mailu-transport
+  tls:
+    secretName: mailu-certificates
diff --git a/services/mailu/kustomization.yaml b/services/mailu/kustomization.yaml
new file mode 100644
index 0000000..2df7440
--- /dev/null
+++ b/services/mailu/kustomization.yaml
@@ -0,0 +1,23 @@
+# services/mailu/kustomization.yaml
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mailu-mailserver
+resources:
+  - namespace.yaml
+  - helmrelease.yaml
+  - certificate.yaml
+  - vip-controller.yaml
+  - unbound-configmap.yaml
+  - serverstransport.yaml
+  - ingressroute.yaml
+  - mailu-sync-job.yaml
+  - mailu-sync-cronjob.yaml
+  - mailu-sync-listener.yaml
+
+configMapGenerator:
+  - name: mailu-sync-script
+    namespace: mailu-mailserver
+    files:
+      - sync.py=../../scripts/mailu_sync.py
+    options:
+      disableNameSuffixHash: true
diff --git a/services/mailu/mailu-sync-cronjob.yaml b/services/mailu/mailu-sync-cronjob.yaml
new file mode 100644
index 0000000..268680f
--- /dev/null
+++ b/services/mailu/mailu-sync-cronjob.yaml
@@ -0,0 +1,77 @@
+# services/mailu/mailu-sync-cronjob.yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: mailu-sync-nightly
+  namespace: mailu-mailserver
+spec:
+  schedule: "30 4 * * *"
+  concurrencyPolicy: Forbid
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: OnFailure
+          containers:
+            - name: mailu-sync
+              image: python:3.11-alpine
+              imagePullPolicy: IfNotPresent
+              command: ["/bin/sh", "-c"]
+              args:
+                - |
+                  pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
+                  && python /app/sync.py
+              env:
+                - name: KEYCLOAK_BASE_URL
+                  value: http://keycloak.sso.svc.cluster.local
+                - name: KEYCLOAK_REALM
+                  value: atlas
+                - name: MAILU_DOMAIN
+                  value: bstein.dev
+                - name: MAILU_DEFAULT_QUOTA
+                  value: "20000000000"
+                - name: MAILU_DB_HOST
+                  value: postgres-service.postgres.svc.cluster.local
+                - name: MAILU_DB_PORT
+                  value: "5432"
+                - name: MAILU_DB_NAME
+                  valueFrom:
+                    secretKeyRef:
+                      name: mailu-db-secret
+                      key: database
+                - name: MAILU_DB_USER
+                  valueFrom:
+                    secretKeyRef:
+                      name: mailu-db-secret
+                      key: username
+                - name: MAILU_DB_PASSWORD
+                  valueFrom:
+                    secretKeyRef:
+                      name: mailu-db-secret
+                      key: password
+                - name: KEYCLOAK_CLIENT_ID
+                  valueFrom:
+                    secretKeyRef:
+                      name: mailu-sync-credentials
+                      key: client-id
+                - name: KEYCLOAK_CLIENT_SECRET
+                  valueFrom:
+                    secretKeyRef:
+                      name: mailu-sync-credentials
+                      key: client-secret
+              volumeMounts:
+                - name: sync-script
+                  mountPath: /app/sync.py
+                  subPath: sync.py
+              resources:
+                requests:
+                  cpu: 50m
+                  memory: 128Mi
+                limits:
+                  cpu: 200m
+                  memory: 256Mi
+          volumes:
+            - name: sync-script
+              configMap:
+                name: mailu-sync-script
+                defaultMode: 0444
diff --git a/services/mailu/mailu-sync-job.yaml b/services/mailu/mailu-sync-job.yaml
new file mode 100644
index 0000000..7230c1d
--- /dev/null
+++ b/services/mailu/mailu-sync-job.yaml
@@ -0,0 +1,73 @@
+# services/mailu/mailu-sync-job.yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: mailu-sync
+  namespace: mailu-mailserver
+spec:
+  template:
+    spec:
+      restartPolicy: OnFailure
+      containers:
+        - name: mailu-sync
+          image: python:3.11-alpine
+          imagePullPolicy: IfNotPresent
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
+              && python /app/sync.py
+          env:
+            - name: KEYCLOAK_BASE_URL
+              value: http://keycloak.sso.svc.cluster.local
+            - name: KEYCLOAK_REALM
+              value: atlas
+            - name: MAILU_DOMAIN
+              value: bstein.dev
+            - name: MAILU_DEFAULT_QUOTA
+              value: "20000000000"
+            - name: MAILU_DB_HOST
+              value: postgres-service.postgres.svc.cluster.local
+            - name: MAILU_DB_PORT
+              value: "5432"
+            - name: MAILU_DB_NAME
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: database
+            - name: MAILU_DB_USER
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: username
+            - name: MAILU_DB_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: password
+            - name: KEYCLOAK_CLIENT_ID
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-sync-credentials
+                  key: client-id
+            - name: KEYCLOAK_CLIENT_SECRET
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-sync-credentials
+                  key: client-secret
+          volumeMounts:
+            - name: sync-script
+              mountPath: /app/sync.py
+              subPath: sync.py
+          resources:
+            requests:
+              cpu: 50m
+              memory: 128Mi
+            limits:
+              cpu: 200m
+              memory: 256Mi
+      volumes:
+        - name: sync-script
+          configMap:
+            name: mailu-sync-script
+            defaultMode: 0444
diff --git a/services/mailu/mailu-sync-listener.yaml b/services/mailu/mailu-sync-listener.yaml
new file mode 100644
index 0000000..04e8070
--- /dev/null
+++ b/services/mailu/mailu-sync-listener.yaml
@@ -0,0 +1,154 @@
+# services/mailu/mailu-sync-listener.yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: mailu-sync-listener
+  namespace: mailu-mailserver
+spec:
+  selector:
+    app: mailu-sync-listener
+  ports:
+    - name: http
+      port: 8080
+      targetPort: 8080
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: mailu-sync-listener
+  namespace: mailu-mailserver
+  labels:
+    app: mailu-sync-listener
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: mailu-sync-listener
+  template:
+    metadata:
+      labels:
+        app: mailu-sync-listener
+    spec:
+      restartPolicy: Always
+      containers:
+        - name: listener
+          image: python:3.11-alpine
+          imagePullPolicy: IfNotPresent
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
+              && python /app/listener.py
+          env:
+            - name: KEYCLOAK_BASE_URL
+              value: http://keycloak.sso.svc.cluster.local
+            - name: KEYCLOAK_REALM
+              value: atlas
+            - name: MAILU_DOMAIN
+              value: bstein.dev
+            - name: MAILU_DEFAULT_QUOTA
+              value: "20000000000"
+            - name: MAILU_DB_HOST
+              value: postgres-service.postgres.svc.cluster.local
+            - name: MAILU_DB_PORT
+              value: "5432"
+            - name: MAILU_DB_NAME
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: database
+            - name: MAILU_DB_USER
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: username
+            - name: MAILU_DB_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-db-secret
+                  key: password
+            - name: KEYCLOAK_CLIENT_ID
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-sync-credentials
+                  key: client-id
+            - name: KEYCLOAK_CLIENT_SECRET
+              valueFrom:
+                secretKeyRef:
+                  name: mailu-sync-credentials
+                  key: client-secret
+          volumeMounts:
+            - name: sync-script
+              mountPath: /app/sync.py
+              subPath: sync.py
+            - name: listener-script
+              mountPath: /app/listener.py
+              subPath: listener.py
+          resources:
+            requests:
+              cpu: 50m
+              memory: 128Mi
+            limits:
+              cpu: 200m
+              memory: 256Mi
+      volumes:
+        - name: sync-script
+          configMap:
+            name: mailu-sync-script
+            defaultMode: 0444
+        - name: listener-script
+          configMap:
+            name: mailu-sync-listener
+            defaultMode: 0444
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: mailu-sync-listener
+  namespace: mailu-mailserver
+data:
+  listener.py: |
+    import http.server
+    import json
+    import os
+    import subprocess
+    import threading
+
+    from time import time
+
+    # Simple debounce to avoid hammering on bursts
+    MIN_INTERVAL_SECONDS = 10
+    last_run = 0.0
+    lock = threading.Lock()
+
+    def trigger_sync():
+        global last_run
+        with lock:
+            now = time()
+            if now - last_run < MIN_INTERVAL_SECONDS:
+                return
+            last_run = now
+        # Fire and forget; the child inherits our stdout (an unread PIPE would hide output and can block the child)
+        subprocess.Popen(["python", "/app/sync.py"], stderr=subprocess.STDOUT)
+
+    class Handler(http.server.BaseHTTPRequestHandler):
+        def do_POST(self):
+            length = int(self.headers.get("Content-Length", 0))
+            body = self.rfile.read(length) if length else b""
+            try:
+                json.loads(body or b"{}")
+            except json.JSONDecodeError:
+                self.send_response(400)
+                self.end_headers()
+                return
+            trigger_sync()
+            self.send_response(202)
+            self.end_headers()
+
+        def log_message(self, fmt, *args):
+            # Quiet logging
+            return
+
+    if __name__ == "__main__":
+        server = http.server.ThreadingHTTPServer(("", 8080), Handler)
+        server.serve_forever()
diff --git a/services/mailu/namespace.yaml b/services/mailu/namespace.yaml
new file mode 100644
index 0000000..1f3831b
--- /dev/null
+++ b/services/mailu/namespace.yaml
@@ -0,0 +1,5 @@
+# services/mailu/namespace.yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: mailu-mailserver
diff --git a/services/mailu/serverstransport.yaml b/services/mailu/serverstransport.yaml
new file mode 100644
index 0000000..ace7b31
--- /dev/null
+++ b/services/mailu/serverstransport.yaml
@@ -0,0 +1,10 @@
+# services/mailu/serverstransport.yaml
+apiVersion: traefik.io/v1alpha1
+kind: ServersTransport
+metadata:
+  name: mailu-transport
+  namespace: mailu-mailserver
+spec:
+  # Force SNI to mail.bstein.dev and skip backend cert verification (backend cert is for the host, not the pod IP).
+  serverName: mail.bstein.dev
+  insecureSkipVerify: true
diff --git a/services/mailu/unbound-configmap.yaml b/services/mailu/unbound-configmap.yaml
new file mode 100644
index 0000000..219cafb
--- /dev/null
+++ b/services/mailu/unbound-configmap.yaml
@@ -0,0 +1,49 @@
+# services/mailu/unbound-configmap.yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: mailu-unbound
+  namespace: mailu-mailserver
+data:
+  unbound.conf: |
+    server:
+      verbosity: 1
+      interface: 0.0.0.0
+      do-ip4: yes
+      do-ip6: no
+      do-udp: yes
+      do-tcp: yes
+      auto-trust-anchor-file: "/var/lib/unbound/root.key"
+      prefetch: yes
+      qname-minimisation: yes
+      harden-dnssec-stripped: yes
+      val-clean-additional: yes
+      domain-insecure: "mailu-mailserver.svc.cluster.local."
+      domain-insecure: "svc.cluster.local."
+      domain-insecure: "cluster.local."
+      cache-min-ttl: 120
+      cache-max-ttl: 86400
+      access-control: 0.0.0.0/0 allow
+
+    forward-zone:
+      name: "mailu-mailserver.svc.cluster.local."
+      forward-addr: 10.43.0.10
+      forward-no-cache: yes
+      forward-first: yes
+
+    forward-zone:
+      name: "svc.cluster.local."
+      forward-addr: 10.43.0.10
+      forward-no-cache: yes
+      forward-first: yes
+
+    forward-zone:
+      name: "cluster.local."
+      forward-addr: 10.43.0.10
+      forward-no-cache: yes
+      forward-first: yes
+
+    forward-zone:
+      name: "."
+      forward-addr: 9.9.9.9
+      forward-addr: 1.1.1.1
diff --git a/services/mailu/vip-controller.yaml b/services/mailu/vip-controller.yaml
new file mode 100644
index 0000000..a6d8c1f
--- /dev/null
+++ b/services/mailu/vip-controller.yaml
@@ -0,0 +1,71 @@
+# services/mailu/vip-controller.yaml
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: vip-controller
+  namespace: mailu-mailserver
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: vip-controller-role
+  namespace: mailu-mailserver
+rules:
+  - apiGroups: ["apps"]
+    resources: ["deployments"]
+    verbs: ["get", "list", "patch", "update"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: vip-controller-binding
+  namespace: mailu-mailserver
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: vip-controller-role
+subjects:
+  - kind: ServiceAccount
+    name: vip-controller
+    namespace: mailu-mailserver
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: vip-controller
+  namespace: mailu-mailserver
+spec:
+  selector:
+    matchLabels:
+      app: vip-controller
+  template:
+    metadata:
+      labels:
+        app: vip-controller
+    spec:
+      serviceAccountName: vip-controller
+      hostNetwork: true
+      nodeSelector:
+        mailu.bstein.dev/vip: "true"
+      containers:
+        - name: vip-controller
+          image: lachlanevenson/k8s-kubectl:latest
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - |
+              set -e
+              while true; do
+                if ip addr show end0 | grep -q 'inet 192\.168\.22\.9/32'; then
+                  NODE=$(hostname)
+                  echo "VIP found on node ${NODE}."
+                  kubectl patch deployment mailu-front -n mailu-mailserver --type='merge' \
+                    -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"${NODE}\"}}}}}"
+                else
+                  echo "No VIP on node $(hostname)."
+                fi
+                sleep 60
+              done
diff --git a/services/monitoring/README.md b/services/monitoring/README.md
deleted file mode 100644
index 835ae1d..0000000
--- a/services/monitoring/README.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# services/monitoring
-
-## Grafana admin secret
-
-The Grafana Helm release expects a pre-existing secret named `grafana-admin`
-in the `monitoring` namespace. Create or rotate it with:
-
-```bash
-kubectl create secret generic grafana-admin \
-  --namespace monitoring \
-  --from-literal=admin-user=admin \
-  --from-literal=admin-password='REPLACE_ME'
-```
-
-Update the password whenever you rotate credentials.
-
-## DCGM exporter image
-
-The NVIDIA GPU metrics DaemonSet expects `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`, mirrored from `docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`. Refresh it in Zot when bumping versions:
-
-```bash
-skopeo copy \
-  --all \
-  docker://docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 \
-  docker://registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
-```
-
-When finished mirroring from the control-plane, you can remove temporary tooling with `sudo apt-get purge -y skopeo && sudo apt-get autoremove -y` and clear `~/.config/containers/auth.json`.
diff --git a/services/monitoring/dashboards/atlas-gpu.json b/services/monitoring/dashboards/atlas-gpu.json
index 9071b0a..572c2c6 100644
--- a/services/monitoring/dashboards/atlas-gpu.json
+++ b/services/monitoring/dashboards/atlas-gpu.json
@@ -40,9 +40,7 @@
           "placement": "right"
         },
         "pieType": "pie",
-        "displayLabels": [
-          "percent"
-        ],
+        "displayLabels": [],
         "tooltip": {
           "mode": "single"
         },
@@ -153,12 +151,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "percent"
+          "unit": "percent",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
diff --git a/services/monitoring/dashboards/atlas-network.json b/services/monitoring/dashboards/atlas-network.json
index ff0af9b..09e9383 100644
--- a/services/monitoring/dashboards/atlas-network.json
+++ b/services/monitoring/dashboards/atlas-network.json
@@ -7,46 +7,55 @@
     {
       "id": 1,
       "type": "stat",
-      "title": "Ingress Traffic",
+      "title": "Ingress Success Rate (5m)",
       "datasource": {
         "type": "prometheus",
         "uid": "atlas-vm"
       },
       "gridPos": {
         "h": 4,
-        "w": 8,
+        "w": 6,
         "x": 0,
         "y": 0
       },
       "targets": [
         {
-          "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+          "expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
             "mode": "absolute",
             "steps": [
               {
-                "color": "rgba(115, 115, 115, 1)",
+                "color": "red",
                 "value": null
               },
+              {
+                "color": "orange",
+                "value": 0.995
+              },
+              {
+                "color": "yellow",
+                "value": 0.999
+              },
               {
                 "color": "green",
-                "value": 1
+                "value": 0.9995
               }
             ]
           },
-          "unit": "Bps",
+          "unit": "percentunit",
           "custom": {
             "displayMode": "auto"
-          }
+          },
+          "decimals": 2
         },
         "overrides": []
       },
@@ -67,46 +76,55 @@
     {
       "id": 2,
       "type": "stat",
-      "title": "Egress Traffic",
+      "title": "Error Budget Burn (1h)",
       "datasource": {
         "type": "prometheus",
         "uid": "atlas-vm"
       },
       "gridPos": {
         "h": 4,
-        "w": 8,
-        "x": 8,
+        "w": 6,
+        "x": 6,
         "y": 0
       },
       "targets": [
         {
-          "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+          "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.001",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
             "mode": "absolute",
             "steps": [
               {
-                "color": "rgba(115, 115, 115, 1)",
+                "color": "green",
                 "value": null
               },
               {
-                "color": "green",
+                "color": "yellow",
                 "value": 1
+              },
+              {
+                "color": "orange",
+                "value": 2
+              },
+              {
+                "color": "red",
+                "value": 4
               }
             ]
           },
-          "unit": "Bps",
+          "unit": "none",
           "custom": {
             "displayMode": "auto"
-          }
+          },
+          "decimals": 2
         },
         "overrides": []
       },
@@ -127,7 +145,145 @@
     {
       "id": 3,
       "type": "stat",
-      "title": "Intra-Cluster Traffic",
+      "title": "Error Budget Burn (6h)",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 12,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.001",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 1
+              },
+              {
+                "color": "orange",
+                "value": 2
+              },
+              {
+                "color": "red",
+                "value": 4
+              }
+            ]
+          },
+          "unit": "none",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 2
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 4,
+      "type": "stat",
+      "title": "Edge P99 Latency (ms)",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 18,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 200
+              },
+              {
+                "color": "orange",
+                "value": 350
+              },
+              {
+                "color": "red",
+                "value": 500
+              }
+            ]
+          },
+          "unit": "ms",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 1
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 5,
+      "type": "stat",
+      "title": "Ingress Traffic",
       "datasource": {
         "type": "prometheus",
         "uid": "atlas-vm"
@@ -135,19 +291,19 @@
       "gridPos": {
         "h": 4,
         "w": 8,
-        "x": 16,
-        "y": 0
+        "x": 0,
+        "y": 4
       },
       "targets": [
         {
-          "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
+          "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -185,9 +341,9 @@
       }
     },
     {
-      "id": 4,
+      "id": 6,
       "type": "stat",
-      "title": "Top Router req/s",
+      "title": "Egress Traffic",
       "datasource": {
         "type": "prometheus",
         "uid": "atlas-vm"
@@ -195,20 +351,19 @@
       "gridPos": {
         "h": 4,
         "w": 8,
-        "x": 0,
+        "x": 8,
         "y": 4
       },
       "targets": [
         {
-          "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))",
-          "refId": "A",
-          "legendFormat": "{{router}}"
+          "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+          "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -224,7 +379,7 @@
               }
             ]
           },
-          "unit": "req/s",
+          "unit": "Bps",
           "custom": {
             "displayMode": "auto"
           }
@@ -246,7 +401,67 @@
       }
     },
     {
-      "id": 5,
+      "id": 7,
+      "type": "stat",
+      "title": "Intra-Cluster Traffic",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 8,
+        "x": 16,
+        "y": 4
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "rgba(115, 115, 115, 1)",
+                "value": null
+              },
+              {
+                "color": "green",
+                "value": 1
+              }
+            ]
+          },
+          "unit": "Bps",
+          "custom": {
+            "displayMode": "auto"
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 8,
       "type": "timeseries",
       "title": "Per-Node Throughput",
       "datasource": {
@@ -283,7 +498,7 @@
       }
     },
     {
-      "id": 6,
+      "id": 9,
       "type": "table",
       "title": "Top Namespaces",
       "datasource": {
@@ -304,12 +519,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "Bps"
+          "unit": "Bps",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
@@ -319,7 +538,7 @@
       ]
     },
     {
-      "id": 7,
+      "id": 10,
       "type": "table",
       "title": "Top Pods",
       "datasource": {
@@ -340,12 +559,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "Bps"
+          "unit": "Bps",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
@@ -355,7 +578,7 @@
       ]
     },
     {
-      "id": 8,
+      "id": 11,
       "type": "timeseries",
       "title": "Traefik Routers (req/s)",
       "datasource": {
@@ -392,7 +615,7 @@
       }
     },
     {
-      "id": 9,
+      "id": 12,
       "type": "timeseries",
       "title": "Traefik Entrypoints (req/s)",
       "datasource": {
diff --git a/services/monitoring/dashboards/atlas-nodes.json b/services/monitoring/dashboards/atlas-nodes.json
index 802fe5a..495c622 100644
--- a/services/monitoring/dashboards/atlas-nodes.json
+++ b/services/monitoring/dashboards/atlas-nodes.json
@@ -27,7 +27,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -88,7 +88,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -149,7 +149,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -186,6 +186,213 @@
         "textMode": "value"
       }
     },
+    {
+      "id": 9,
+      "type": "stat",
+      "title": "API Server 5xx rate",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 8,
+        "x": 0,
+        "y": 4
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 0.05
+              },
+              {
+                "color": "orange",
+                "value": 0.2
+              },
+              {
+                "color": "red",
+                "value": 0.5
+              }
+            ]
+          },
+          "unit": "req/s",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 3
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 10,
+      "type": "stat",
+      "title": "API Server P99 latency",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 8,
+        "x": 8,
+        "y": 4
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 250
+              },
+              {
+                "color": "orange",
+                "value": 400
+              },
+              {
+                "color": "red",
+                "value": 600
+              }
+            ]
+          },
+          "unit": "ms",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 1
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 11,
+      "type": "stat",
+      "title": "etcd P99 latency",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 8,
+        "x": 16,
+        "y": 4
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 100
+              },
+              {
+                "color": "red",
+                "value": 200
+              }
+            ]
+          },
+          "unit": "ms",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 1
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
     {
       "id": 4,
       "type": "timeseries",
@@ -198,7 +405,7 @@
         "h": 9,
         "w": 24,
         "x": 0,
-        "y": 4
+        "y": 8
       },
       "targets": [
         {
@@ -238,7 +445,7 @@
         "h": 9,
         "w": 24,
         "x": 0,
-        "y": 13
+        "y": 17
       },
       "targets": [
         {
@@ -278,7 +485,7 @@
         "h": 9,
         "w": 12,
         "x": 0,
-        "y": 22
+        "y": 26
       },
       "targets": [
         {
@@ -315,7 +522,7 @@
         "h": 9,
         "w": 12,
         "x": 12,
-        "y": 22
+        "y": 26
       },
       "targets": [
         {
@@ -352,7 +559,7 @@
         "h": 9,
         "w": 24,
         "x": 0,
-        "y": 31
+        "y": 35
       },
       "targets": [
         {
diff --git a/services/monitoring/dashboards/atlas-overview.json b/services/monitoring/dashboards/atlas-overview.json
index beb676e..93ee927 100644
--- a/services/monitoring/dashboards/atlas-overview.json
+++ b/services/monitoring/dashboards/atlas-overview.json
@@ -7,67 +7,6 @@
     "list": []
   },
   "panels": [
-    {
-      "id": 1,
-      "type": "gauge",
-      "title": "Workers Ready",
-      "datasource": {
-        "type": "prometheus",
-        "uid": "atlas-vm"
-      },
-      "gridPos": {
-        "h": 5,
-        "w": 5,
-        "x": 0,
-        "y": 0
-      },
-      "targets": [
-        {
-          "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
-          "refId": "A"
-        }
-      ],
-      "fieldConfig": {
-        "defaults": {
-          "min": 0,
-          "max": 18,
-          "thresholds": {
-            "mode": "absolute",
-            "steps": [
-              {
-                "color": "red",
-                "value": null
-              },
-              {
-                "color": "orange",
-                "value": 16
-              },
-              {
-                "color": "yellow",
-                "value": 17
-              },
-              {
-                "color": "green",
-                "value": 18
-              }
-            ]
-          }
-        },
-        "overrides": []
-      },
-      "options": {
-        "reduceOptions": {
-          "calcs": [
-            "lastNotNull"
-          ],
-          "fields": "",
-          "values": false
-        },
-        "orientation": "auto",
-        "showThresholdMarkers": false,
-        "showThresholdLabels": false
-      }
-    },
     {
       "id": 2,
       "type": "gauge",
@@ -78,8 +17,8 @@
       },
       "gridPos": {
         "h": 5,
-        "w": 5,
-        "x": 5,
+        "w": 4,
+        "x": 0,
         "y": 0
       },
       "targets": [
@@ -131,8 +70,8 @@
       },
       "gridPos": {
         "h": 5,
-        "w": 5,
-        "x": 10,
+        "w": 3,
+        "x": 4,
         "y": 0
       },
       "targets": [
@@ -144,82 +83,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
-          },
-          "mappings": [],
-          "thresholds": {
-            "mode": "absolute",
-            "steps": [
-              {
-                "color": "green",
-                "value": null
-              },
-              {
-                "color": "yellow",
-                "value": 1
-              },
-              {
-                "color": "orange",
-                "value": 2
-              },
-              {
-                "color": "red",
-                "value": 3
-              }
-            ]
-          },
-          "unit": "none",
-          "custom": {
-            "displayMode": "auto"
-          }
-        },
-        "overrides": []
-      },
-      "options": {
-        "colorMode": "value",
-        "graphMode": "area",
-        "justifyMode": "center",
-        "reduceOptions": {
-          "calcs": [
-            "lastNotNull"
-          ],
-          "fields": "",
-          "values": false
-        },
-        "textMode": "value"
-      },
-      "links": [
-        {
-          "title": "Open atlas-pods dashboard",
-          "url": "/d/atlas-pods",
-          "targetBlank": true
-        }
-      ]
-    },
-    {
-      "id": 4,
-      "type": "stat",
-      "title": "Problem Pods",
-      "datasource": {
-        "type": "prometheus",
-        "uid": "atlas-vm"
-      },
-      "gridPos": {
-        "h": 5,
-        "w": 5,
-        "x": 15,
-        "y": 0
-      },
-      "targets": [
-        {
-          "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))",
-          "refId": "A"
-        }
-      ],
-      "fieldConfig": {
-        "defaults": {
-          "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -281,20 +145,20 @@
       },
       "gridPos": {
         "h": 5,
-        "w": 4,
-        "x": 20,
+        "w": 3,
+        "x": 7,
         "y": 0
       },
       "targets": [
         {
-          "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))",
+          "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -346,6 +210,286 @@
         }
       ]
     },
+    {
+      "id": 27,
+      "type": "stat",
+      "title": "Atlas Availability (30d)",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 5,
+        "w": 4,
+        "x": 10,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "avg_over_time((min(((sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"}) / 3)), ((sum(kube_deployment_status_replicas_available{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}) / clamp_min(sum(kube_deployment_spec_replicas{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}), 1)))))[30d:5m])",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "orange",
+                "value": 0.999
+              },
+              {
+                "color": "yellow",
+                "value": 0.9999
+              },
+              {
+                "color": "green",
+                "value": 0.99999
+              }
+            ]
+          },
+          "unit": "percentunit",
+          "custom": {
+            "displayMode": "auto"
+          },
+          "decimals": 3
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 4,
+      "type": "stat",
+      "title": "Problem Pods",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 5,
+        "w": 3,
+        "x": 14,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 1
+              },
+              {
+                "color": "orange",
+                "value": 2
+              },
+              {
+                "color": "red",
+                "value": 3
+              }
+            ]
+          },
+          "unit": "none",
+          "custom": {
+            "displayMode": "auto"
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      },
+      "links": [
+        {
+          "title": "Open atlas-pods dashboard",
+          "url": "/d/atlas-pods",
+          "targetBlank": true
+        }
+      ]
+    },
+    {
+      "id": 6,
+      "type": "stat",
+      "title": "CrashLoop / ImagePull",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 5,
+        "w": 3,
+        "x": 17,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 1
+              },
+              {
+                "color": "orange",
+                "value": 2
+              },
+              {
+                "color": "red",
+                "value": 3
+              }
+            ]
+          },
+          "unit": "none",
+          "custom": {
+            "displayMode": "auto"
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "center",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "value"
+      },
+      "links": [
+        {
+          "title": "Open atlas-pods dashboard",
+          "url": "/d/atlas-pods",
+          "targetBlank": true
+        }
+      ]
+    },
+    {
+      "id": 1,
+      "type": "gauge",
+      "title": "Workers Ready",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 5,
+        "w": 4,
+        "x": 20,
+        "y": 0
+      },
+      "targets": [
+        {
+          "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "min": 0,
+          "max": 18,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "orange",
+                "value": 16
+              },
+              {
+                "color": "yellow",
+                "value": 17
+              },
+              {
+                "color": "green",
+                "value": 18
+              }
+            ]
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "orientation": "auto",
+        "showThresholdMarkers": false,
+        "showThresholdLabels": false
+      }
+    },
     {
       "id": 7,
       "type": "stat",
@@ -371,11 +515,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -383,11 +527,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -444,11 +592,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -456,11 +604,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -517,7 +669,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -586,7 +738,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -653,11 +805,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -665,11 +817,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -724,11 +880,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -736,11 +892,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -795,7 +955,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -862,7 +1022,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -942,9 +1102,7 @@
           "placement": "right"
         },
         "pieType": "pie",
-        "displayLabels": [
-          "percent"
-        ],
+        "displayLabels": [],
         "tooltip": {
           "mode": "single"
         },
@@ -995,9 +1153,7 @@
           "placement": "right"
         },
         "pieType": "pie",
-        "displayLabels": [
-          "percent"
-        ],
+        "displayLabels": [],
         "tooltip": {
           "mode": "single"
         },
@@ -1048,9 +1204,7 @@
           "placement": "right"
         },
         "pieType": "pie",
-        "displayLabels": [
-          "percent"
-        ],
+        "displayLabels": [],
         "tooltip": {
           "mode": "single"
         },
@@ -1175,7 +1329,7 @@
       },
       "targets": [
         {
-          "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
+          "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
           "refId": "A",
           "legendFormat": "{{node}}"
         }
@@ -1212,7 +1366,7 @@
       },
       "targets": [
         {
-          "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
+          "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
           "refId": "A",
           "legendFormat": "{{node}}"
         }
@@ -1233,6 +1387,138 @@
         }
       }
     },
+    {
+      "id": 28,
+      "type": "piechart",
+      "title": "Node Pod Share",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 10,
+        "w": 12,
+        "x": 0,
+        "y": 54
+      },
+      "targets": [
+        {
+          "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
+          "refId": "A",
+          "legendFormat": "{{node}}"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "right"
+        },
+        "pieType": "pie",
+        "displayLabels": [],
+        "tooltip": {
+          "mode": "single"
+        },
+        "colorScheme": "interpolateSpectral",
+        "colorBy": "value",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        }
+      }
+    },
+    {
+      "id": 29,
+      "type": "bargauge",
+      "title": "Top Nodes by Pod Count",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 10,
+        "w": 12,
+        "x": 12,
+        "y": 54
+      },
+      "targets": [
+        {
+          "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
+          "refId": "A",
+          "legendFormat": "{{node}}",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "none",
+          "min": 0,
+          "max": null,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
+              },
+              {
+                "color": "red",
+                "value": 100
+              }
+            ]
+          },
+          "decimals": 0
+        },
+        "overrides": []
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        }
+      },
+      "transformations": [
+        {
+          "id": "sortBy",
+          "options": {
+            "fields": [
+              "Value"
+            ],
+            "order": "desc"
+          }
+        },
+        {
+          "id": "limit",
+          "options": {
+            "limitField": 12
+          }
+        }
+      ]
+    },
     {
       "id": 18,
       "type": "timeseries",
@@ -1377,7 +1663,7 @@
         "h": 16,
         "w": 12,
         "x": 0,
-        "y": 54
+        "y": 64
       },
       "targets": [
         {
@@ -1425,7 +1711,7 @@
         "h": 16,
         "w": 12,
         "x": 12,
-        "y": 54
+        "y": 64
       },
       "targets": [
         {
@@ -1452,11 +1738,11 @@
               },
               {
                 "color": "orange",
-                "value": 70
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           }
@@ -1480,6 +1766,17 @@
           "url": "/d/atlas-storage",
           "targetBlank": true
         }
+      ],
+      "transformations": [
+        {
+          "id": "sortBy",
+          "options": {
+            "fields": [
+              "Value"
+            ],
+            "order": "desc"
+          }
+        }
       ]
     }
   ],
@@ -1497,36 +1794,5 @@
     "to": "now"
   },
   "refresh": "1m",
-  "links": [
-    {
-      "title": "Atlas Pods",
-      "type": "dashboard",
-      "dashboardUid": "atlas-pods",
-      "keepTime": false
-    },
-    {
-      "title": "Atlas Nodes",
-      "type": "dashboard",
-      "dashboardUid": "atlas-nodes",
-      "keepTime": false
-    },
-    {
-      "title": "Atlas Storage",
-      "type": "dashboard",
-      "dashboardUid": "atlas-storage",
-      "keepTime": false
-    },
-    {
-      "title": "Atlas Network",
-      "type": "dashboard",
-      "dashboardUid": "atlas-network",
-      "keepTime": false
-    },
-    {
-      "title": "Atlas GPU",
-      "type": "dashboard",
-      "dashboardUid": "atlas-gpu",
-      "keepTime": false
-    }
-  ]
+  "links": []
 }
diff --git a/services/monitoring/dashboards/atlas-pods.json b/services/monitoring/dashboards/atlas-pods.json
index ef616e0..4b2a54a 100644
--- a/services/monitoring/dashboards/atlas-pods.json
+++ b/services/monitoring/dashboards/atlas-pods.json
@@ -20,14 +20,14 @@
       },
       "targets": [
         {
-          "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))",
+          "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -80,14 +80,14 @@
       },
       "targets": [
         {
-          "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))",
+          "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -140,14 +140,14 @@
       },
       "targets": [
         {
-          "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))",
+          "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
           "refId": "A"
         }
       ],
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -207,7 +207,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -266,12 +266,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "s"
+          "unit": "s",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
@@ -302,12 +306,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "s"
+          "unit": "s",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
@@ -338,12 +346,16 @@
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "s"
+          "unit": "s",
+          "custom": {
+            "filterable": true
+          }
         },
         "overrides": []
       },
       "options": {
-        "showHeader": true
+        "showHeader": true,
+        "columnFilters": false
       },
       "transformations": [
         {
@@ -359,6 +371,233 @@
           }
         }
       ]
+    },
+    {
+      "id": 8,
+      "type": "piechart",
+      "title": "Node Pod Share",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 34
+      },
+      "targets": [
+        {
+          "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
+          "refId": "A",
+          "legendFormat": "{{node}}"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "color": {
+            "mode": "palette-classic"
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "right"
+        },
+        "pieType": "pie",
+        "displayLabels": [],
+        "tooltip": {
+          "mode": "single"
+        },
+        "colorScheme": "interpolateSpectral",
+        "colorBy": "value",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        }
+      }
+    },
+    {
+      "id": 9,
+      "type": "bargauge",
+      "title": "Top Nodes by Pod Count",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 34
+      },
+      "targets": [
+        {
+          "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
+          "refId": "A",
+          "legendFormat": "{{node}}",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "none",
+          "min": 0,
+          "max": null,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
+              },
+              {
+                "color": "red",
+                "value": 100
+              }
+            ]
+          },
+          "decimals": 0
+        },
+        "overrides": []
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        }
+      },
+      "transformations": [
+        {
+          "id": "sortBy",
+          "options": {
+            "fields": [
+              "Value"
+            ],
+            "order": "desc"
+          }
+        },
+        {
+          "id": "limit",
+          "options": {
+            "limitField": 12
+          }
+        }
+      ]
+    },
+    {
+      "id": 10,
+      "type": "table",
+      "title": "Namespace Plurality by Node v27",
+      "datasource": {
+        "type": "prometheus",
+        "uid": "atlas-vm"
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 24,
+        "x": 0,
+        "y": 42
+      },
+      "targets": [
+        {
+          "expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))",
+          "refId": "A",
+          "instant": true,
+          "format": "table"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "custom": {
+            "filterable": false
+          }
+        },
+        "overrides": []
+      },
+      "options": {
+        "showHeader": true,
+        "columnFilters": false,
+        "showColumnFilters": false,
+        "footer": {
+          "show": false,
+          "fields": "",
+          "calcs": []
+        }
+      },
+      "transformations": [
+        {
+          "id": "labelsToFields",
+          "options": {}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true
+            }
+          }
+        },
+        {
+          "id": "filterByValue",
+          "options": {
+            "match": "Value",
+            "operator": "gt",
+            "value": 0
+          }
+        },
+        {
+          "id": "sortBy",
+          "options": {
+            "fields": [
+              "Value"
+            ],
+            "order": "desc"
+          }
+        },
+        {
+          "id": "groupBy",
+          "options": {
+            "fields": {
+              "namespace": {
+                "aggregations": [
+                  {
+                    "field": "Value",
+                    "operation": "max"
+                  },
+                  {
+                    "field": "node",
+                    "operation": "first"
+                  }
+                ]
+              }
+            },
+            "rowBy": [
+              "namespace"
+            ]
+          }
+        }
+      ]
     }
   ],
   "time": {
diff --git a/services/monitoring/dashboards/atlas-storage.json b/services/monitoring/dashboards/atlas-storage.json
index 1d07040..2e548b2 100644
--- a/services/monitoring/dashboards/atlas-storage.json
+++ b/services/monitoring/dashboards/atlas-storage.json
@@ -27,11 +27,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -39,11 +39,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -91,11 +95,11 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
-            "mode": "percentage",
+            "mode": "absolute",
             "steps": [
               {
                 "color": "green",
@@ -103,11 +107,15 @@
               },
               {
                 "color": "yellow",
-                "value": 70
+                "value": 50
+              },
+              {
+                "color": "orange",
+                "value": 75
               },
               {
                 "color": "red",
-                "value": 85
+                "value": 91.5
               }
             ]
           },
@@ -155,7 +163,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
@@ -215,7 +223,7 @@
       "fieldConfig": {
         "defaults": {
           "color": {
-            "mode": "palette-classic"
+            "mode": "thresholds"
           },
           "mappings": [],
           "thresholds": {
diff --git a/services/monitoring/grafana-dashboard-gpu.yaml b/services/monitoring/grafana-dashboard-gpu.yaml
index b5c2c18..48725de 100644
--- a/services/monitoring/grafana-dashboard-gpu.yaml
+++ b/services/monitoring/grafana-dashboard-gpu.yaml
@@ -49,9 +49,7 @@ data:
               "placement": "right"
             },
             "pieType": "pie",
-            "displayLabels": [
-              "percent"
-            ],
+            "displayLabels": [],
             "tooltip": {
               "mode": "single"
             },
@@ -162,12 +160,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "percent"
+              "unit": "percent",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
diff --git a/services/monitoring/grafana-dashboard-network.yaml b/services/monitoring/grafana-dashboard-network.yaml
index fd1f5d6..a87600f 100644
--- a/services/monitoring/grafana-dashboard-network.yaml
+++ b/services/monitoring/grafana-dashboard-network.yaml
@@ -16,46 +16,55 @@ data:
         {
           "id": 1,
           "type": "stat",
-          "title": "Ingress Traffic",
+          "title": "Ingress Success Rate (5m)",
           "datasource": {
             "type": "prometheus",
             "uid": "atlas-vm"
           },
           "gridPos": {
             "h": 4,
-            "w": 8,
+            "w": 6,
             "x": 0,
             "y": 0
           },
           "targets": [
             {
-              "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+              "expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
                 "mode": "absolute",
                 "steps": [
                   {
-                    "color": "rgba(115, 115, 115, 1)",
+                    "color": "red",
                     "value": null
                   },
+                  {
+                    "color": "orange",
+                    "value": 0.995
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 0.999
+                  },
                   {
                     "color": "green",
-                    "value": 1
+                    "value": 0.9995
                   }
                 ]
               },
-              "unit": "Bps",
+              "unit": "percentunit",
               "custom": {
                 "displayMode": "auto"
-              }
+              },
+              "decimals": 2
             },
             "overrides": []
           },
@@ -76,46 +85,55 @@ data:
         {
           "id": 2,
           "type": "stat",
-          "title": "Egress Traffic",
+          "title": "Error Budget Burn (1h)",
           "datasource": {
             "type": "prometheus",
             "uid": "atlas-vm"
           },
           "gridPos": {
             "h": 4,
-            "w": 8,
-            "x": 8,
+            "w": 6,
+            "x": 6,
             "y": 0
           },
           "targets": [
             {
-              "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+              "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.001",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
                 "mode": "absolute",
                 "steps": [
                   {
-                    "color": "rgba(115, 115, 115, 1)",
+                    "color": "green",
                     "value": null
                   },
                   {
-                    "color": "green",
+                    "color": "yellow",
                     "value": 1
+                  },
+                  {
+                    "color": "orange",
+                    "value": 2
+                  },
+                  {
+                    "color": "red",
+                    "value": 4
                   }
                 ]
               },
-              "unit": "Bps",
+              "unit": "none",
               "custom": {
                 "displayMode": "auto"
-              }
+              },
+              "decimals": 2
             },
             "overrides": []
           },
@@ -136,7 +154,145 @@ data:
         {
           "id": 3,
           "type": "stat",
-          "title": "Intra-Cluster Traffic",
+          "title": "Error Budget Burn (6h)",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 6,
+            "x": 12,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.001",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 1
+                  },
+                  {
+                    "color": "orange",
+                    "value": 2
+                  },
+                  {
+                    "color": "red",
+                    "value": 4
+                  }
+                ]
+              },
+              "unit": "none",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 2
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 4,
+          "type": "stat",
+          "title": "Edge P99 Latency (ms)",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 6,
+            "x": 18,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 200
+                  },
+                  {
+                    "color": "orange",
+                    "value": 350
+                  },
+                  {
+                    "color": "red",
+                    "value": 500
+                  }
+                ]
+              },
+              "unit": "ms",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 1
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 5,
+          "type": "stat",
+          "title": "Ingress Traffic",
           "datasource": {
             "type": "prometheus",
             "uid": "atlas-vm"
@@ -144,19 +300,19 @@ data:
           "gridPos": {
             "h": 4,
             "w": 8,
-            "x": 16,
-            "y": 0
+            "x": 0,
+            "y": 4
           },
           "targets": [
             {
-              "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
+              "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -194,9 +350,9 @@ data:
           }
         },
         {
-          "id": 4,
+          "id": 6,
           "type": "stat",
-          "title": "Top Router req/s",
+          "title": "Egress Traffic",
           "datasource": {
             "type": "prometheus",
             "uid": "atlas-vm"
@@ -204,20 +360,19 @@ data:
           "gridPos": {
             "h": 4,
             "w": 8,
-            "x": 0,
+            "x": 8,
             "y": 4
           },
           "targets": [
             {
-              "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))",
-              "refId": "A",
-              "legendFormat": "{{router}}"
+              "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
+              "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -233,7 +388,7 @@ data:
                   }
                 ]
               },
-              "unit": "req/s",
+              "unit": "Bps",
               "custom": {
                 "displayMode": "auto"
               }
@@ -255,7 +410,67 @@ data:
           }
         },
         {
-          "id": 5,
+          "id": 7,
+          "type": "stat",
+          "title": "Intra-Cluster Traffic",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 8,
+            "x": 16,
+            "y": 4
+          },
+          "targets": [
+            {
+              "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "rgba(115, 115, 115, 1)",
+                    "value": null
+                  },
+                  {
+                    "color": "green",
+                    "value": 1
+                  }
+                ]
+              },
+              "unit": "Bps",
+              "custom": {
+                "displayMode": "auto"
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 8,
           "type": "timeseries",
           "title": "Per-Node Throughput",
           "datasource": {
@@ -292,7 +507,7 @@ data:
           }
         },
         {
-          "id": 6,
+          "id": 9,
           "type": "table",
           "title": "Top Namespaces",
           "datasource": {
@@ -313,12 +528,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "Bps"
+              "unit": "Bps",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
@@ -328,7 +547,7 @@ data:
           ]
         },
         {
-          "id": 7,
+          "id": 10,
           "type": "table",
           "title": "Top Pods",
           "datasource": {
@@ -349,12 +568,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "Bps"
+              "unit": "Bps",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
@@ -364,7 +587,7 @@ data:
           ]
         },
         {
-          "id": 8,
+          "id": 11,
           "type": "timeseries",
           "title": "Traefik Routers (req/s)",
           "datasource": {
@@ -401,7 +624,7 @@ data:
           }
         },
         {
-          "id": 9,
+          "id": 12,
           "type": "timeseries",
           "title": "Traefik Entrypoints (req/s)",
           "datasource": {
diff --git a/services/monitoring/grafana-dashboard-nodes.yaml b/services/monitoring/grafana-dashboard-nodes.yaml
index 2facfed..542daca 100644
--- a/services/monitoring/grafana-dashboard-nodes.yaml
+++ b/services/monitoring/grafana-dashboard-nodes.yaml
@@ -36,7 +36,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -97,7 +97,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -158,7 +158,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -195,6 +195,213 @@ data:
             "textMode": "value"
           }
         },
+        {
+          "id": 9,
+          "type": "stat",
+          "title": "API Server 5xx rate",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 8,
+            "x": 0,
+            "y": 4
+          },
+          "targets": [
+            {
+              "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 0.05
+                  },
+                  {
+                    "color": "orange",
+                    "value": 0.2
+                  },
+                  {
+                    "color": "red",
+                    "value": 0.5
+                  }
+                ]
+              },
+              "unit": "req/s",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 3
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 10,
+          "type": "stat",
+          "title": "API Server P99 latency",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 8,
+            "x": 8,
+            "y": 4
+          },
+          "targets": [
+            {
+              "expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 250
+                  },
+                  {
+                    "color": "orange",
+                    "value": 400
+                  },
+                  {
+                    "color": "red",
+                    "value": 600
+                  }
+                ]
+              },
+              "unit": "ms",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 1
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 11,
+          "type": "stat",
+          "title": "etcd P99 latency",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 4,
+            "w": 8,
+            "x": 16,
+            "y": 4
+          },
+          "targets": [
+            {
+              "expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 100
+                  },
+                  {
+                    "color": "red",
+                    "value": 200
+                  }
+                ]
+              },
+              "unit": "ms",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 1
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
         {
           "id": 4,
           "type": "timeseries",
@@ -207,7 +414,7 @@ data:
             "h": 9,
             "w": 24,
             "x": 0,
-            "y": 4
+            "y": 8
           },
           "targets": [
             {
@@ -247,7 +454,7 @@ data:
             "h": 9,
             "w": 24,
             "x": 0,
-            "y": 13
+            "y": 17
           },
           "targets": [
             {
@@ -287,7 +494,7 @@ data:
             "h": 9,
             "w": 12,
             "x": 0,
-            "y": 22
+            "y": 26
           },
           "targets": [
             {
@@ -324,7 +531,7 @@ data:
             "h": 9,
             "w": 12,
             "x": 12,
-            "y": 22
+            "y": 26
           },
           "targets": [
             {
@@ -361,7 +568,7 @@ data:
             "h": 9,
             "w": 24,
             "x": 0,
-            "y": 31
+            "y": 35
           },
           "targets": [
             {
diff --git a/services/monitoring/grafana-dashboard-overview.yaml b/services/monitoring/grafana-dashboard-overview.yaml
index ef17ebf..f4165bd 100644
--- a/services/monitoring/grafana-dashboard-overview.yaml
+++ b/services/monitoring/grafana-dashboard-overview.yaml
@@ -16,67 +16,6 @@ data:
         "list": []
       },
       "panels": [
-        {
-          "id": 1,
-          "type": "gauge",
-          "title": "Workers Ready",
-          "datasource": {
-            "type": "prometheus",
-            "uid": "atlas-vm"
-          },
-          "gridPos": {
-            "h": 5,
-            "w": 5,
-            "x": 0,
-            "y": 0
-          },
-          "targets": [
-            {
-              "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
-              "refId": "A"
-            }
-          ],
-          "fieldConfig": {
-            "defaults": {
-              "min": 0,
-              "max": 18,
-              "thresholds": {
-                "mode": "absolute",
-                "steps": [
-                  {
-                    "color": "red",
-                    "value": null
-                  },
-                  {
-                    "color": "orange",
-                    "value": 16
-                  },
-                  {
-                    "color": "yellow",
-                    "value": 17
-                  },
-                  {
-                    "color": "green",
-                    "value": 18
-                  }
-                ]
-              }
-            },
-            "overrides": []
-          },
-          "options": {
-            "reduceOptions": {
-              "calcs": [
-                "lastNotNull"
-              ],
-              "fields": "",
-              "values": false
-            },
-            "orientation": "auto",
-            "showThresholdMarkers": false,
-            "showThresholdLabels": false
-          }
-        },
         {
           "id": 2,
           "type": "gauge",
@@ -87,8 +26,8 @@ data:
           },
           "gridPos": {
             "h": 5,
-            "w": 5,
-            "x": 5,
+            "w": 4,
+            "x": 0,
             "y": 0
           },
           "targets": [
@@ -140,8 +79,8 @@ data:
           },
           "gridPos": {
             "h": 5,
-            "w": 5,
-            "x": 10,
+            "w": 3,
+            "x": 4,
             "y": 0
           },
           "targets": [
@@ -153,82 +92,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
-              },
-              "mappings": [],
-              "thresholds": {
-                "mode": "absolute",
-                "steps": [
-                  {
-                    "color": "green",
-                    "value": null
-                  },
-                  {
-                    "color": "yellow",
-                    "value": 1
-                  },
-                  {
-                    "color": "orange",
-                    "value": 2
-                  },
-                  {
-                    "color": "red",
-                    "value": 3
-                  }
-                ]
-              },
-              "unit": "none",
-              "custom": {
-                "displayMode": "auto"
-              }
-            },
-            "overrides": []
-          },
-          "options": {
-            "colorMode": "value",
-            "graphMode": "area",
-            "justifyMode": "center",
-            "reduceOptions": {
-              "calcs": [
-                "lastNotNull"
-              ],
-              "fields": "",
-              "values": false
-            },
-            "textMode": "value"
-          },
-          "links": [
-            {
-              "title": "Open atlas-pods dashboard",
-              "url": "/d/atlas-pods",
-              "targetBlank": true
-            }
-          ]
-        },
-        {
-          "id": 4,
-          "type": "stat",
-          "title": "Problem Pods",
-          "datasource": {
-            "type": "prometheus",
-            "uid": "atlas-vm"
-          },
-          "gridPos": {
-            "h": 5,
-            "w": 5,
-            "x": 15,
-            "y": 0
-          },
-          "targets": [
-            {
-              "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))",
-              "refId": "A"
-            }
-          ],
-          "fieldConfig": {
-            "defaults": {
-              "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -290,20 +154,20 @@ data:
           },
           "gridPos": {
             "h": 5,
-            "w": 4,
-            "x": 20,
+            "w": 3,
+            "x": 7,
             "y": 0
           },
           "targets": [
             {
-              "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))",
+              "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -355,6 +219,286 @@ data:
             }
           ]
         },
+        {
+          "id": 27,
+          "type": "stat",
+          "title": "Atlas Availability (30d)",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 5,
+            "w": 4,
+            "x": 10,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "avg_over_time((min(((sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"}) / 3)), ((sum(kube_deployment_status_replicas_available{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}) / clamp_min(sum(kube_deployment_spec_replicas{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}), 1)))))[30d:5m])",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "red",
+                    "value": null
+                  },
+                  {
+                    "color": "orange",
+                    "value": 0.999
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 0.9999
+                  },
+                  {
+                    "color": "green",
+                    "value": 0.99999
+                  }
+                ]
+              },
+              "unit": "percentunit",
+              "custom": {
+                "displayMode": "auto"
+              },
+              "decimals": 3
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          }
+        },
+        {
+          "id": 4,
+          "type": "stat",
+          "title": "Problem Pods",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 5,
+            "w": 3,
+            "x": 14,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 1
+                  },
+                  {
+                    "color": "orange",
+                    "value": 2
+                  },
+                  {
+                    "color": "red",
+                    "value": 3
+                  }
+                ]
+              },
+              "unit": "none",
+              "custom": {
+                "displayMode": "auto"
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          },
+          "links": [
+            {
+              "title": "Open atlas-pods dashboard",
+              "url": "/d/atlas-pods",
+              "targetBlank": true
+            }
+          ]
+        },
+        {
+          "id": 6,
+          "type": "stat",
+          "title": "CrashLoop / ImagePull",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 5,
+            "w": 3,
+            "x": 17,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 1
+                  },
+                  {
+                    "color": "orange",
+                    "value": 2
+                  },
+                  {
+                    "color": "red",
+                    "value": 3
+                  }
+                ]
+              },
+              "unit": "none",
+              "custom": {
+                "displayMode": "auto"
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "center",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "value"
+          },
+          "links": [
+            {
+              "title": "Open atlas-pods dashboard",
+              "url": "/d/atlas-pods",
+              "targetBlank": true
+            }
+          ]
+        },
+        {
+          "id": 1,
+          "type": "gauge",
+          "title": "Workers Ready",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 5,
+            "w": 4,
+            "x": 20,
+            "y": 0
+          },
+          "targets": [
+            {
+              "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
+              "refId": "A"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "min": 0,
+              "max": 18,
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "red",
+                    "value": null
+                  },
+                  {
+                    "color": "orange",
+                    "value": 16
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 17
+                  },
+                  {
+                    "color": "green",
+                    "value": 18
+                  }
+                ]
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            },
+            "orientation": "auto",
+            "showThresholdMarkers": false,
+            "showThresholdLabels": false
+          }
+        },
         {
           "id": 7,
           "type": "stat",
@@ -380,11 +524,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -392,11 +536,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -453,11 +601,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -465,11 +613,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -526,7 +678,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -595,7 +747,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -662,11 +814,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -674,11 +826,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -733,11 +889,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -745,11 +901,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -804,7 +964,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -871,7 +1031,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -951,9 +1111,7 @@ data:
               "placement": "right"
             },
             "pieType": "pie",
-            "displayLabels": [
-              "percent"
-            ],
+            "displayLabels": [],
             "tooltip": {
               "mode": "single"
             },
@@ -1004,9 +1162,7 @@ data:
               "placement": "right"
             },
             "pieType": "pie",
-            "displayLabels": [
-              "percent"
-            ],
+            "displayLabels": [],
             "tooltip": {
               "mode": "single"
             },
@@ -1057,9 +1213,7 @@ data:
               "placement": "right"
             },
             "pieType": "pie",
-            "displayLabels": [
-              "percent"
-            ],
+            "displayLabels": [],
             "tooltip": {
               "mode": "single"
             },
@@ -1184,7 +1338,7 @@ data:
           },
           "targets": [
             {
-              "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
+              "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
               "refId": "A",
               "legendFormat": "{{node}}"
             }
@@ -1221,7 +1375,7 @@ data:
           },
           "targets": [
             {
-              "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
+              "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
               "refId": "A",
               "legendFormat": "{{node}}"
             }
@@ -1242,6 +1396,138 @@ data:
             }
           }
         },
+        {
+          "id": 28,
+          "type": "piechart",
+          "title": "Node Pod Share",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 10,
+            "w": 12,
+            "x": 0,
+            "y": 54
+          },
+          "targets": [
+            {
+              "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
+              "refId": "A",
+              "legendFormat": "{{node}}"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percent",
+              "color": {
+                "mode": "palette-classic"
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "legend": {
+              "displayMode": "list",
+              "placement": "right"
+            },
+            "pieType": "pie",
+            "displayLabels": [],
+            "tooltip": {
+              "mode": "single"
+            },
+            "colorScheme": "interpolateSpectral",
+            "colorBy": "value",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            }
+          }
+        },
+        {
+          "id": 29,
+          "type": "bargauge",
+          "title": "Top Nodes by Pod Count",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 10,
+            "w": 12,
+            "x": 12,
+            "y": 54
+          },
+          "targets": [
+            {
+              "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
+              "refId": "A",
+              "legendFormat": "{{node}}",
+              "instant": true
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "unit": "none",
+              "min": 0,
+              "max": null,
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
+                  },
+                  {
+                    "color": "red",
+                    "value": 100
+                  }
+                ]
+              },
+              "decimals": 0
+            },
+            "overrides": []
+          },
+          "options": {
+            "displayMode": "gradient",
+            "orientation": "horizontal",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            }
+          },
+          "transformations": [
+            {
+              "id": "sortBy",
+              "options": {
+                "fields": [
+                  "Value"
+                ],
+                "order": "desc"
+              }
+            },
+            {
+              "id": "limit",
+              "options": {
+                "limit": 12
+              }
+            }
+          ]
+        },
         {
           "id": 18,
           "type": "timeseries",
@@ -1386,7 +1672,7 @@ data:
             "h": 16,
             "w": 12,
             "x": 0,
-            "y": 54
+            "y": 64
           },
           "targets": [
             {
@@ -1434,7 +1720,7 @@ data:
             "h": 16,
             "w": 12,
             "x": 12,
-            "y": 54
+            "y": 64
           },
           "targets": [
             {
@@ -1461,11 +1747,11 @@ data:
                   },
                   {
                     "color": "orange",
-                    "value": 70
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               }
@@ -1489,6 +1775,17 @@ data:
               "url": "/d/atlas-storage",
               "targetBlank": true
             }
+          ],
+          "transformations": [
+            {
+              "id": "sortBy",
+              "options": {
+                "fields": [
+                  "Value"
+                ],
+                "order": "desc"
+              }
+            }
           ]
         }
       ],
@@ -1506,36 +1803,5 @@ data:
         "to": "now"
       },
       "refresh": "1m",
-      "links": [
-        {
-          "title": "Atlas Pods",
-          "type": "dashboard",
-          "dashboardUid": "atlas-pods",
-          "keepTime": false
-        },
-        {
-          "title": "Atlas Nodes",
-          "type": "dashboard",
-          "dashboardUid": "atlas-nodes",
-          "keepTime": false
-        },
-        {
-          "title": "Atlas Storage",
-          "type": "dashboard",
-          "dashboardUid": "atlas-storage",
-          "keepTime": false
-        },
-        {
-          "title": "Atlas Network",
-          "type": "dashboard",
-          "dashboardUid": "atlas-network",
-          "keepTime": false
-        },
-        {
-          "title": "Atlas GPU",
-          "type": "dashboard",
-          "dashboardUid": "atlas-gpu",
-          "keepTime": false
-        }
-      ]
+      "links": []
     }
diff --git a/services/monitoring/grafana-dashboard-pods.yaml b/services/monitoring/grafana-dashboard-pods.yaml
index f92adf1..b7c49d5 100644
--- a/services/monitoring/grafana-dashboard-pods.yaml
+++ b/services/monitoring/grafana-dashboard-pods.yaml
@@ -29,14 +29,14 @@ data:
           },
           "targets": [
             {
-              "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))",
+              "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -89,14 +89,14 @@ data:
           },
           "targets": [
             {
-              "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))",
+              "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -149,14 +149,14 @@ data:
           },
           "targets": [
             {
-              "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))",
+              "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -216,7 +216,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -275,12 +275,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "s"
+              "unit": "s",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
@@ -311,12 +315,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "s"
+              "unit": "s",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
@@ -347,12 +355,16 @@ data:
           ],
           "fieldConfig": {
             "defaults": {
-              "unit": "s"
+              "unit": "s",
+              "custom": {
+                "filterable": true
+              }
             },
             "overrides": []
           },
           "options": {
-            "showHeader": true
+            "showHeader": true,
+            "columnFilters": false
           },
           "transformations": [
             {
@@ -368,6 +380,233 @@ data:
               }
             }
           ]
+        },
+        {
+          "id": 8,
+          "type": "piechart",
+          "title": "Node Pod Share",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 8,
+            "w": 12,
+            "x": 12,
+            "y": 34
+          },
+          "targets": [
+            {
+              "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
+              "refId": "A",
+              "legendFormat": "{{node}}"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percent",
+              "color": {
+                "mode": "palette-classic"
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "legend": {
+              "displayMode": "list",
+              "placement": "right"
+            },
+            "pieType": "pie",
+            "displayLabels": [],
+            "tooltip": {
+              "mode": "single"
+            },
+            "colorScheme": "interpolateSpectral",
+            "colorBy": "value",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            }
+          }
+        },
+        {
+          "id": 9,
+          "type": "bargauge",
+          "title": "Top Nodes by Pod Count",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 8,
+            "w": 12,
+            "x": 0,
+            "y": 34
+          },
+          "targets": [
+            {
+              "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
+              "refId": "A",
+              "legendFormat": "{{node}}",
+              "instant": true
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "unit": "none",
+              "min": 0,
+              "max": null,
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {
+                    "color": "green",
+                    "value": null
+                  },
+                  {
+                    "color": "yellow",
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
+                  },
+                  {
+                    "color": "red",
+                    "value": 100
+                  }
+                ]
+              },
+              "decimals": 0
+            },
+            "overrides": []
+          },
+          "options": {
+            "displayMode": "gradient",
+            "orientation": "horizontal",
+            "reduceOptions": {
+              "calcs": [
+                "lastNotNull"
+              ],
+              "fields": "",
+              "values": false
+            }
+          },
+          "transformations": [
+            {
+              "id": "sortBy",
+              "options": {
+                "fields": [
+                  "Value"
+                ],
+                "order": "desc"
+              }
+            },
+            {
+              "id": "limit",
+              "options": {
+                "limit": 12
+              }
+            }
+          ]
+        },
+        {
+          "id": 10,
+          "type": "table",
+          "title": "Namespace Plurality by Node v27",
+          "datasource": {
+            "type": "prometheus",
+            "uid": "atlas-vm"
+          },
+          "gridPos": {
+            "h": 8,
+            "w": 24,
+            "x": 0,
+            "y": 42
+          },
+          "targets": [
+            {
+              "expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))",
+              "refId": "A",
+              "instant": true,
+              "format": "table"
+            }
+          ],
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percent",
+              "custom": {
+                "filterable": false
+              }
+            },
+            "overrides": []
+          },
+          "options": {
+            "showHeader": true,
+            "columnFilters": false,
+            "showColumnFilters": false,
+            "footer": {
+              "show": false,
+              "fields": "",
+              "calcs": []
+            }
+          },
+          "transformations": [
+            {
+              "id": "labelsToFields",
+              "options": {}
+            },
+            {
+              "id": "organize",
+              "options": {
+                "excludeByName": {
+                  "Time": true
+                }
+              }
+            },
+            {
+              "id": "filterByValue",
+              "options": {
+                "match": "Value",
+                "operator": "gt",
+                "value": 0
+              }
+            },
+            {
+              "id": "sortBy",
+              "options": {
+                "fields": [
+                  "Value"
+                ],
+                "order": "desc"
+              }
+            },
+            {
+              "id": "groupBy",
+              "options": {
+                "fields": {
+                  "namespace": {
+                    "aggregations": [
+                      {
+                        "field": "Value",
+                        "operation": "max"
+                      },
+                      {
+                        "field": "node",
+                        "operation": "first"
+                      }
+                    ]
+                  }
+                },
+                "rowBy": [
+                  "namespace"
+                ]
+              }
+            }
+          ]
         }
       ],
       "time": {
diff --git a/services/monitoring/grafana-dashboard-storage.yaml b/services/monitoring/grafana-dashboard-storage.yaml
index 0a534f2..8aef820 100644
--- a/services/monitoring/grafana-dashboard-storage.yaml
+++ b/services/monitoring/grafana-dashboard-storage.yaml
@@ -36,11 +36,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -48,11 +48,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -100,11 +104,11 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
-                "mode": "percentage",
+                "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
@@ -112,11 +116,15 @@ data:
                   },
                   {
                     "color": "yellow",
-                    "value": 70
+                    "value": 50
+                  },
+                  {
+                    "color": "orange",
+                    "value": 75
                   },
                   {
                     "color": "red",
-                    "value": 85
+                    "value": 91.5
                   }
                 ]
               },
@@ -164,7 +172,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
@@ -224,7 +232,7 @@ data:
           "fieldConfig": {
             "defaults": {
               "color": {
-                "mode": "palette-classic"
+                "mode": "thresholds"
               },
               "mappings": [],
               "thresholds": {
diff --git a/services/monitoring/helmrelease.yaml b/services/monitoring/helmrelease.yaml
index d7d7579..3fd76db 100644
--- a/services/monitoring/helmrelease.yaml
+++ b/services/monitoring/helmrelease.yaml
@@ -65,13 +65,13 @@ spec:
         namespace: flux-system
   values:
     server:
-      # keep ~3 months; change as you like (supports "d", "y")
+      # keep 1 year; supports "d", "y"
       extraArgs:
-        retentionPeriod: "90d"   # VM flag -retentionPeriod=90d. :contentReference[oaicite:11]{index=11}
+        retentionPeriod: "1y"   # VM flag -retentionPeriod=1y
 
       persistentVolume:
         enabled: true
-        size: 100Gi
+        size: 250Gi
 
       # Enable built-in Kubernetes scraping
       scrape:
@@ -186,6 +186,15 @@ spec:
                 - targets: ["longhorn-backend.longhorn-system.svc:9500"]
               metrics_path: /metrics
 
+            # --- titan-db node_exporter (external control-plane DB host) ---
+            - job_name: "titan-db"
+              static_configs:
+                - targets: ["192.168.22.10:9100"]
+              relabel_configs:
+                - source_labels: [__address__]
+                  target_label: instance
+                  replacement: titan-db
+
             # --- cert-manager (pods expose on 9402) ---
             - job_name: "cert-manager"
               kubernetes_sd_configs: [{ role: pod }]
@@ -209,16 +218,6 @@ spec:
                 - action: keep
                   source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_part_of]
                   regex: flux-system;flux
-            - job_name: "titan-db"
-              static_configs:
-                - targets: ["titan-db:9100"]
-              relabel_configs:
-                - source_labels: [__address__]
-                  target_label: instance
-              metric_relabel_configs:
-                - source_labels: [instance]
-                  target_label: node
-                  replacement: titan-db
 
 ---
 
diff --git a/services/nextcloud/configmap.yaml b/services/nextcloud/configmap.yaml
new file mode 100644
index 0000000..a6e917c
--- /dev/null
+++ b/services/nextcloud/configmap.yaml
@@ -0,0 +1,48 @@
+# services/nextcloud/configmap.yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nextcloud-config
+  namespace: nextcloud
+data:
+  extra.config.php: |
+    <?php
+    $CONFIG = array (
+      'trusted_domains' =>
+      array (
+        0 => 'cloud.bstein.dev',
+      ),
+      'overwritehost' => 'cloud.bstein.dev',
+      'overwriteprotocol' => 'https',
+      'overwrite.cli.url' => 'https://cloud.bstein.dev',
+      'default_phone_region' => 'US',
+      'mail_smtpmode' => 'smtp',
+      'mail_sendmailmode' => 'smtp',
+      'mail_smtphost' => 'mail.bstein.dev',
+      'mail_smtpport' => '587',
+      'mail_smtpsecure' => 'tls',
+      'mail_smtpauth' => true,
+      'mail_smtpauthtype' => 'LOGIN',
+      'mail_domain' => 'bstein.dev',
+      'mail_from_address' => 'no-reply',
+      'oidc_login_provider_url' => 'https://sso.bstein.dev/realms/atlas',
+      'oidc_login_client_id' => getenv('OIDC_CLIENT_ID'),
+      'oidc_login_client_secret' => getenv('OIDC_CLIENT_SECRET'),
+      'oidc_login_auto_redirect' => false,
+      'oidc_login_end_session_redirect' => true,
+      'oidc_login_button_text' => 'Login with Keycloak',
+      'oidc_login_hide_password_form' => false,
+      'oidc_login_attributes' =>
+        array (
+          'id' => 'preferred_username',
+          'mail' => 'email',
+          'name' => 'name',
+        ),
+      'oidc_login_scope' => 'openid profile email',
+      'oidc_login_unique_id' => 'preferred_username',
+      'oidc_login_use_pkce' => true,
+      'oidc_login_disable_registration' => false,
+      'oidc_login_create_groups' => false,
+      # Configure external storage for user data on Asteria through the External Storage app (admin UI);
+      # the astreae PVC holds app internals only.
+    );
diff --git a/services/nextcloud/cronjob.yaml b/services/nextcloud/cronjob.yaml
new file mode 100644
index 0000000..86c55e1
--- /dev/null
+++ b/services/nextcloud/cronjob.yaml
@@ -0,0 +1,32 @@
+# services/nextcloud/cronjob.yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: nextcloud-cron
+  namespace: nextcloud
+spec:
+  schedule: "*/5 * * * *"
+  concurrencyPolicy: Forbid
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          securityContext:
+            runAsUser: 33
+            runAsGroup: 33
+            fsGroup: 33
+          restartPolicy: OnFailure
+          containers:
+            - name: nextcloud-cron
+              image: nextcloud:29-apache
+              imagePullPolicy: IfNotPresent
+              command: ["/bin/sh", "-c"]
+              args:
+                - "cd /var/www/html && php -f cron.php"
+              volumeMounts:
+                - name: nextcloud-data
+                  mountPath: /var/www/html
+          volumes:
+            - name: nextcloud-data
+              persistentVolumeClaim:
+                claimName: nextcloud-data
diff --git a/services/nextcloud/deployment.yaml b/services/nextcloud/deployment.yaml
new file mode 100644
index 0000000..b2c590f
--- /dev/null
+++ b/services/nextcloud/deployment.yaml
@@ -0,0 +1,143 @@
+# services/nextcloud/deployment.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: nextcloud
+  namespace: nextcloud
+  labels:
+    app: nextcloud
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: nextcloud
+  template:
+    metadata:
+      labels:
+        app: nextcloud
+    spec:
+      nodeSelector:
+        hardware: rpi5
+      securityContext:
+        fsGroup: 33
+        runAsUser: 33
+        runAsGroup: 33
+      initContainers:
+        - name: fix-perms
+          image: alpine:3.20
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              chown -R 33:33 /var/www/html/config || true
+              chown -R 33:33 /var/www/html/data || true
+          securityContext:
+            runAsUser: 0
+            runAsGroup: 0
+          volumeMounts:
+            - name: nextcloud-data
+              mountPath: /var/www/html
+            - name: nextcloud-config
+              mountPath: /var/www/html/config/extra.config.php
+              subPath: extra.config.php
+      containers:
+        - name: nextcloud
+          image: nextcloud:29-apache
+          imagePullPolicy: IfNotPresent
+          env:
+            # DB (external secret required: nextcloud-db with keys database, db-username, db-password)
+            - name: POSTGRES_HOST
+              value: postgres-service.postgres.svc.cluster.local
+            - name: POSTGRES_DB
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-db
+                  key: database
+            - name: POSTGRES_USER
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-db
+                  key: db-username
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-db
+                  key: db-password
+            # Admin bootstrap (external secret: nextcloud-admin with keys admin-user, admin-password)
+            - name: NEXTCLOUD_ADMIN_USER
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-admin
+                  key: admin-user
+            - name: NEXTCLOUD_ADMIN_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-admin
+                  key: admin-password
+            - name: NEXTCLOUD_TRUSTED_DOMAINS
+              value: cloud.bstein.dev
+            - name: OVERWRITEHOST
+              value: cloud.bstein.dev
+            - name: OVERWRITEPROTOCOL
+              value: https
+            - name: OVERWRITECLIURL
+              value: https://cloud.bstein.dev
+            # SMTP (external secret: nextcloud-smtp with keys smtp-username, smtp-password)
+            - name: SMTP_HOST
+              value: mail.bstein.dev
+            - name: SMTP_PORT
+              value: "587"
+            - name: SMTP_SECURE
+              value: tls
+            - name: SMTP_NAME
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-smtp
+                  key: smtp-username
+            - name: SMTP_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-smtp
+                  key: smtp-password
+            - name: MAIL_FROM_ADDRESS
+              value: no-reply
+            - name: MAIL_DOMAIN
+              value: bstein.dev
+            # OIDC (external secret: nextcloud-oidc with keys client-id, client-secret)
+            - name: OIDC_CLIENT_ID
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-oidc
+                  key: client-id
+            - name: OIDC_CLIENT_SECRET
+              valueFrom:
+                secretKeyRef:
+                  name: nextcloud-oidc
+                  key: client-secret
+            - name: NEXTCLOUD_UPDATE
+              value: "1"
+            - name: APP_INSTALL
+              value: "mail,oidc_login,external"
+          ports:
+            - containerPort: 80
+              name: http
+          volumeMounts:
+            - name: nextcloud-data
+              mountPath: /var/www/html
+            - name: nextcloud-config
+              mountPath: /var/www/html/config/extra.config.php
+              subPath: extra.config.php
+          resources:
+            requests:
+              cpu: 250m
+              memory: 1Gi
+            limits:
+              cpu: 1
+              memory: 3Gi
+      volumes:
+        - name: nextcloud-data
+          persistentVolumeClaim:
+            claimName: nextcloud-data
+        - name: nextcloud-config
+          configMap:
+            name: nextcloud-config
+            defaultMode: 0444
diff --git a/services/nextcloud/ingress.yaml b/services/nextcloud/ingress.yaml
new file mode 100644
index 0000000..1c60282
--- /dev/null
+++ b/services/nextcloud/ingress.yaml
@@ -0,0 +1,25 @@
+# services/nextcloud/ingress.yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: nextcloud
+  namespace: nextcloud
+  annotations:
+    cert-manager.io/cluster-issuer: letsencrypt
+    traefik.ingress.kubernetes.io/router.entrypoints: websecure
+spec:
+  tls:
+    - hosts:
+        - cloud.bstein.dev
+      secretName: nextcloud-tls
+  rules:
+    - host: cloud.bstein.dev
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: nextcloud
+                port:
+                  number: 80
diff --git a/services/nextcloud/kustomization.yaml b/services/nextcloud/kustomization.yaml
new file mode 100644
index 0000000..5e3b414
--- /dev/null
+++ b/services/nextcloud/kustomization.yaml
@@ -0,0 +1,25 @@
+# services/nextcloud/kustomization.yaml
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: nextcloud
+resources:
+  - namespace.yaml
+  - configmap.yaml
+  - pvc.yaml
+  - deployment.yaml
+  - service.yaml
+  - ingress.yaml
+  - cronjob.yaml
+  - mail-sync-cronjob.yaml
+  - maintenance-cronjob.yaml
+configMapGenerator:
+  - name: nextcloud-maintenance-script
+    files:
+      - maintenance.sh=../../scripts/nextcloud-maintenance.sh
+    options:
+      disableNameSuffixHash: true
+  - name: nextcloud-mail-sync-script
+    files:
+      - sync.sh=../../scripts/nextcloud-mail-sync.sh
+    options:
+      disableNameSuffixHash: true
diff --git a/services/nextcloud/mail-sync-cronjob.yaml b/services/nextcloud/mail-sync-cronjob.yaml
new file mode 100644
index 0000000..52dc3ea
--- /dev/null
+++ b/services/nextcloud/mail-sync-cronjob.yaml
@@ -0,0 +1,58 @@
+# services/nextcloud/mail-sync-cronjob.yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: nextcloud-mail-sync
+  namespace: nextcloud
+spec:
+  schedule: "0 5 * * *"
+  concurrencyPolicy: Forbid
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: OnFailure
+          securityContext:
+            runAsUser: 0
+            runAsGroup: 0
+          containers:
+            - name: mail-sync
+              image: nextcloud:29-apache
+              imagePullPolicy: IfNotPresent
+              command: ["/bin/bash", "/sync/sync.sh"]
+              env:
+                - name: KC_BASE
+                  value: https://sso.bstein.dev
+                - name: KC_REALM
+                  value: atlas
+                - name: KC_ADMIN_USER
+                  valueFrom:
+                    secretKeyRef:
+                      name: nextcloud-keycloak-admin
+                      key: username
+                - name: KC_ADMIN_PASS
+                  valueFrom:
+                    secretKeyRef:
+                      name: nextcloud-keycloak-admin
+                      key: password
+              volumeMounts:
+                - name: nextcloud-data
+                  mountPath: /var/www/html
+                - name: sync-script
+                  mountPath: /sync/sync.sh
+                  subPath: sync.sh
+              resources:
+                requests:
+                  cpu: 100m
+                  memory: 256Mi
+                limits:
+                  cpu: 500m
+                  memory: 512Mi
+          volumes:
+            - name: nextcloud-data
+              persistentVolumeClaim:
+                claimName: nextcloud-data
+            - name: sync-script
+              configMap:
+                name: nextcloud-mail-sync-script
+                defaultMode: 0755
diff --git a/services/nextcloud/maintenance-cronjob.yaml b/services/nextcloud/maintenance-cronjob.yaml
new file mode 100644
index 0000000..55fcbd1
--- /dev/null
+++ b/services/nextcloud/maintenance-cronjob.yaml
@@ -0,0 +1,56 @@
+# services/nextcloud/maintenance-cronjob.yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: nextcloud-maintenance
+  namespace: nextcloud
+spec:
+  schedule: "30 4 * * *"
+  concurrencyPolicy: Forbid
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: OnFailure
+          securityContext:
+            runAsUser: 0
+            runAsGroup: 0
+          containers:
+            - name: maintenance
+              image: nextcloud:29-apache
+              imagePullPolicy: IfNotPresent
+              command: ["/bin/bash", "/maintenance/maintenance.sh"]
+              env:
+                - name: NC_URL
+                  value: https://cloud.bstein.dev
+                - name: ADMIN_USER
+                  valueFrom:
+                    secretKeyRef:
+                      name: nextcloud-admin
+                      key: admin-user
+                - name: ADMIN_PASS
+                  valueFrom:
+                    secretKeyRef:
+                      name: nextcloud-admin
+                      key: admin-password
+              volumeMounts:
+                - name: nextcloud-data
+                  mountPath: /var/www/html
+                - name: maintenance-script
+                  mountPath: /maintenance/maintenance.sh
+                  subPath: maintenance.sh
+              resources:
+                requests:
+                  cpu: 100m
+                  memory: 256Mi
+                limits:
+                  cpu: 500m
+                  memory: 512Mi
+          volumes:
+            - name: nextcloud-data
+              persistentVolumeClaim:
+                claimName: nextcloud-data
+            - name: maintenance-script
+              configMap:
+                name: nextcloud-maintenance-script
+                defaultMode: 0755
diff --git a/services/nextcloud/namespace.yaml b/services/nextcloud/namespace.yaml
new file mode 100644
index 0000000..fe63672
--- /dev/null
+++ b/services/nextcloud/namespace.yaml
@@ -0,0 +1,5 @@
+# services/nextcloud/namespace.yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: nextcloud
diff --git a/services/nextcloud/pvc.yaml b/services/nextcloud/pvc.yaml
new file mode 100644
index 0000000..dd929b6
--- /dev/null
+++ b/services/nextcloud/pvc.yaml
@@ -0,0 +1,13 @@
+# services/nextcloud/pvc.yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: nextcloud-data
+  namespace: nextcloud
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 200Gi
+  storageClassName: astreae
diff --git a/services/nextcloud/service.yaml b/services/nextcloud/service.yaml
new file mode 100644
index 0000000..ab160fb
--- /dev/null
+++ b/services/nextcloud/service.yaml
@@ -0,0 +1,13 @@
+# services/nextcloud/service.yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: nextcloud
+  namespace: nextcloud
+spec:
+  selector:
+    app: nextcloud
+  ports:
+    - name: http
+      port: 80
+      targetPort: http
diff --git a/services/pegasus/ingress.yaml b/services/pegasus/ingress.yaml
index 48d22c3..2ab7a2e 100644
--- a/services/pegasus/ingress.yaml
+++ b/services/pegasus/ingress.yaml
@@ -8,7 +8,7 @@ metadata:
     kubernetes.io/ingress.class: traefik
     traefik.ingress.kubernetes.io/router.entrypoints: websecure
     traefik.ingress.kubernetes.io/router.tls: "true"
-    cert-manager.io/cluster-issuer: letsencrypt-prod
+    cert-manager.io/cluster-issuer: letsencrypt
 spec:
   tls:
     - hosts: [ "pegasus.bstein.dev" ]
diff --git a/services/vault/certificate.yaml b/services/vault/certificate.yaml
index 983c7fe..2d32f65 100644
--- a/services/vault/certificate.yaml
+++ b/services/vault/certificate.yaml
@@ -8,7 +8,7 @@ spec:
   secretName: vault-server-tls
   issuerRef:
     kind: ClusterIssuer
-    name: letsencrypt-prod
+    name: letsencrypt
   commonName: secret.bstein.dev
   dnsNames:
   - secret.bstein.dev
diff --git a/services/zot/ingress.yaml b/services/zot/ingress.yaml
index 3425535..12f6c60 100644
--- a/services/zot/ingress.yaml
+++ b/services/zot/ingress.yaml
@@ -5,7 +5,7 @@ metadata:
   name: zot
   namespace: zot
   annotations:
-    cert-manager.io/cluster-issuer: letsencrypt-prod
+    cert-manager.io/cluster-issuer: letsencrypt
     traefik.ingress.kubernetes.io/router.entrypoints: websecure
     traefik.ingress.kubernetes.io/router.tls: "true"
     traefik.ingress.kubernetes.io/router.middlewares: zot-zot-resp-headers@kubernetescrd