diff --git a/.gitignore b/.gitignore index c317064..9bdfcf6 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,5 @@ -AGENTS.md +# Ignore markdown by default, but keep top-level docs +*.md +!README.md +!AGENTS.md +!**/NOTES.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..9dc36ac --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,81 @@ + + +Repository Guidelines + +> Local-only note: apply changes through Flux-tracked manifests, not by manual kubectl edits in-cluster—manual tweaks will be reverted by Flux. + +## Project Structure & Module Organization +- `infrastructure/`: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout. +- `services/`: workload manifests per app (`services/gitea/`, etc.) with `kustomization.yaml` plus one file per kind; keep diffs small and focused. +- `dockerfiles/` hosts bespoke images, while `scripts/` stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands. + +## Build, Test, and Development Commands +- `kustomize build services/` (or `kubectl kustomize ...`) renders manifests exactly as Flux will. +- `kubectl apply --server-side --dry-run=client -k services/` checks schema compatibility without touching the cluster. +- `flux reconcile kustomization --namespace flux-system --with-source` pulls the latest Git state after merges or hotfixes. +- `fish scripts/flux_hammer.fish --help` explains the recovery tool; read it before running against production workloads. + +## Coding Style & Naming Conventions +- YAML uses two-space indents; retain the leading path comment (e.g. `# services/gitea/deployment.yaml`) to speed code review. +- Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names. +- List resources in `kustomization.yaml` from namespace/config, through storage, then workloads and networking for predictable diffs. +- Scripts start with `#!/usr/bin/env fish` or bash, stay executable, and follow snake_case names such as `flux_hammer.fish`. + +## Testing Guidelines +- Run `kustomize build` and the dry-run apply for every service you touch; capture failures before opening a PR. +- `flux diff kustomization --path services/` previews reconciliations—link notable output when behavior shifts. +- Docker edits: `docker build -f dockerfiles/Dockerfile.monerod .` (swap the file you changed) to verify image builds. + +## Commit & Pull Request Guidelines +- Keep commit subjects short, present-tense, and optionally scoped (`gpu(titan-24): add RuntimeClass`); squash fixups before review. +- Describe linked issues, affected services, and required operator steps (e.g. `flux reconcile kustomization services-gitea`) in the PR body. +- Focus each PR on one kustomization or service and update `infrastructure/flux-system` when Flux must track new folders. +- Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes. + +## Security & Configuration Tips +- Never commit credentials; use Vault workflows (`services/vault/`) or SOPS-encrypted manifests wired through `infrastructure/flux-system`. +- Node selectors and tolerations gate workloads to hardware like `hardware: rpi4`; confirm labels before scaling or renaming nodes. +- Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift. + +## Dashboard roadmap / context (2025-12-02) +- Atlas dashboards are generated via `scripts/dashboards_render_atlas.py --build`, which writes JSON under `services/monitoring/dashboards/` and ConfigMaps under `services/monitoring/`. Keep the Grafana manifests in sync by regenerating after edits. +- Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie. +- Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned. +- Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview. + +## Monitoring state (2025-12-03) +- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady. +- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub). +- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity. +- Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed. +- GPU share panel updated (feature/sso) to use `max_over_time(…[$__range])`, so longer ranges (e.g., 12h) keep recent activity visible. Flux tracking `feature/sso`. + +## Upcoming priorities (SSO/storage/mail) +- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe. +- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress. +- Plan mail: mostly self-hosted, relay through trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows. + +## SSO plan sketch (2025-12-03) +- IdP: use Keycloak (preferred) in a new `sso` namespace, Bitnami or codecentric chart with Postgres backing store (single PVC), ingress `sso.bstein.dev`, admin user bound to brad@bstein.dev; stick with local DB initially (no external IdP). +- Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via Jellyfin OAuth/OpenID plugin (map existing usernames; run migration to pre-create users in Keycloak with same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens. +- Steps to implement: + 1) Add service folder `services/keycloak/` (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile. + 2) Seed realm `atlas` with users (import CSV/realm). Create client for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set email for brad@bstein.dev as admin. + 3) Reconfigure Grafana to OIDC (disable anonymous to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini). + 4) Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth. + 5) Deploy Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate password reset path. +- Migration caution: do not delete existing local creds until SSO validated; keep Pegasus working via Jellyfin tokens during transition. + +## Postgres centralization (2025-12-03) +- Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes. Use it for services that can easily point at an external DB. +- Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported. + +## SSO integration snapshot (2025-12-08) +- Current blockers: Zot still prompts for basic auth/double-login; Vault still wants the token UI after Keycloak (previously 502/404 when vault-0 sealed). Forward-auth middleware on Zot Ingress likely still causing the 401/Found hop; Vault OIDC mount not completing UI flow unless unsealed and preferred login is set. +- Flux-only changes required: remove zot forward-auth middleware from Ingress (let oauth2-proxy handle redirect), ensure Vault OIDC mount is preferred UI login and bound to admin group; keep all edits in repo so Flux enforces them. +- Secrets present (per user): `zot-oidc-client` (client_secret only), `oauth2-proxy-zot-oidc`, `oauth2-proxy-vault-oidc`, `vault-oidc-admin-token`. Zot needs its regcred in the zot namespace if image pulls fail. +- Cluster validation blocked here: `kubectl get nodes` fails (403/permission) and DNS to `*.bstein.dev` fails in this session, so no live curl verification could be run. Re-test on a host with cluster/DNS access after Flux applies fixes. + +## Docs hygiene +- Do not add per-service `README.md` files; use `NOTES.md` if documentation is needed inside service folders. Keep only the top-level repo README. +- Keep comments succinct and in a human voice—no AI-sounding notes. Use `NOTES.md` for scratch notes instead of sprinkling reminders into code or extra READMEs. diff --git a/NOTES.md b/NOTES.md new file mode 100644 index 0000000..8b8b8d2 --- /dev/null +++ b/NOTES.md @@ -0,0 +1,3 @@ +# Rotation reminders (temporary secrets set by automation) + +- Weave GitOps UI (`cd.bstein.dev`) admin: `admin` / `G1tOps!2025` — rotate immediately after first login. diff --git a/README.md b/README.md new file mode 100644 index 0000000..15dc377 --- /dev/null +++ b/README.md @@ -0,0 +1,3 @@ +# titan-iac + +Flux-managed Kubernetes cluster for bstein.dev services. diff --git a/clusters/atlas/flux-system/applications/kustomization.yaml b/clusters/atlas/flux-system/applications/kustomization.yaml index 1bc2700..daf1c42 100644 --- a/clusters/atlas/flux-system/applications/kustomization.yaml +++ b/clusters/atlas/flux-system/applications/kustomization.yaml @@ -15,3 +15,4 @@ resources: - sui-metrics/kustomization.yaml - keycloak/kustomization.yaml - oauth2-proxy/kustomization.yaml + - mailu/kustomization.yaml diff --git a/clusters/atlas/flux-system/applications/mailu/kustomization.yaml b/clusters/atlas/flux-system/applications/mailu/kustomization.yaml new file mode 100644 index 0000000..09db2fd --- /dev/null +++ b/clusters/atlas/flux-system/applications/mailu/kustomization.yaml @@ -0,0 +1,18 @@ +# clusters/atlas/flux-system/applications/mailu/kustomization.yaml +apiVersion: kustomize.toolkit.fluxcd.io/v1 +kind: Kustomization +metadata: + name: mailu + namespace: flux-system +spec: + interval: 10m + sourceRef: + kind: GitRepository + name: flux-system + namespace: flux-system + path: ./services/mailu + targetNamespace: mailu-mailserver + prune: true + wait: true + dependsOn: + - name: helm diff --git a/clusters/atlas/flux-system/gotk-sync.yaml b/clusters/atlas/flux-system/gotk-sync.yaml index 4076ef6..26dc23f 100644 --- a/clusters/atlas/flux-system/gotk-sync.yaml +++ b/clusters/atlas/flux-system/gotk-sync.yaml @@ -8,7 +8,7 @@ metadata: spec: interval: 1m0s ref: - branch: feature/sso + branch: feature/mailu secretRef: name: flux-system-gitea url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git diff --git a/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml b/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml new file mode 100644 index 0000000..9241a4b --- /dev/null +++ b/clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml @@ -0,0 +1,20 @@ +# clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml +apiVersion: kustomize.toolkit.fluxcd.io/v1 +kind: Kustomization +metadata: + name: gitops-ui + namespace: flux-system +spec: + interval: 10m + timeout: 10m + path: ./services/gitops-ui + prune: true + sourceRef: + kind: GitRepository + name: flux-system + namespace: flux-system + targetNamespace: flux-system + dependsOn: + - name: helm + - name: traefik + wait: true diff --git a/clusters/atlas/flux-system/platform/kustomization.yaml b/clusters/atlas/flux-system/platform/kustomization.yaml index 59d2032..040e478 100644 --- a/clusters/atlas/flux-system/platform/kustomization.yaml +++ b/clusters/atlas/flux-system/platform/kustomization.yaml @@ -5,5 +5,6 @@ resources: - core/kustomization.yaml - helm/kustomization.yaml - traefik/kustomization.yaml + - gitops-ui/kustomization.yaml - monitoring/kustomization.yaml - longhorn-ui/kustomization.yaml diff --git a/clusters/oceanus/README.md b/clusters/oceanus/README.md deleted file mode 100644 index d91b52f..0000000 --- a/clusters/oceanus/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# Oceanus Cluster Scaffold - -This directory prepares the Flux and Kustomize layout for a future Oceanus-managed cluster. -Populate `flux-system/` with `gotk-components.yaml` and related manifests after running `flux bootstrap`. -Define node-specific resources under `infrastructure/modules/profiles/oceanus-validator/` and reference workloads in `applications/` as they come online. diff --git a/docs/topology.md b/docs/topology.md index 27b06f5..1e37235 100644 --- a/docs/topology.md +++ b/docs/topology.md @@ -2,15 +2,14 @@ | Hostname | Role / Function | Managed By | Notes | |------------|--------------------------------|---------------------|-------| +| titan-db | HA control plane database | Ansible | PostgreSQL / etcd backing services | | titan-0a | Kubernetes control-plane | Flux (atlas cluster)| HA leader, tainted for control only | | titan-0b | Kubernetes control-plane | Flux (atlas cluster)| Standby control node | | titan-0c | Kubernetes control-plane | Flux (atlas cluster)| Standby control node | | titan-04-19| Raspberry Pi workers | Flux (atlas cluster)| Workload nodes, labelled per hardware | +| titan-20&21| NVIDIA Jetson workers | Flux (atlas cluster)| Workload nodes, labelled per hardware | | titan-22 | GPU mini-PC (Jellyfin) | Flux + Ansible | NVIDIA runtime managed via `modules/profiles/atlas-ha` | +| titan-23 | Dedicated SUI validator Oceanus| Manual + Ansible | Baremetal validator workloads, exposes metrics to atlas | | titan-24 | Tethys hybrid node | Flux + Ansible | Runs SUI metrics via K8s, validator via Ansible | -| titan-db | HA control plane database | Ansible | PostgreSQL / etcd backing services | -| titan-jh | Jumphost & bastion | Ansible | Entry point / future KVM services | -| oceanus | Dedicated SUI validator host | Ansible / Flux prep | Baremetal validator workloads, exposes metrics to atlas; Kustomize scaffold under `clusters/oceanus/` | +| titan-jh | Jumphost & bastion & lesavka | Ansible | Entry point / future KVM services / custom kvm - lesavaka | | styx | Air-gapped workstation | Manual / Scripts | Remains isolated, scripts tracked in `hosts/styx` | - -Use the `clusters/` directory for cluster-scoped state and the `hosts/` directory for baremetal orchestration. diff --git a/hosts/styx/README.md b/hosts/styx/NOTES.md similarity index 100% rename from hosts/styx/README.md rename to hosts/styx/NOTES.md diff --git a/infrastructure/core/kustomization.yaml b/infrastructure/core/kustomization.yaml index 1f56f6d..14d6a02 100644 --- a/infrastructure/core/kustomization.yaml +++ b/infrastructure/core/kustomization.yaml @@ -5,3 +5,4 @@ resources: - ../modules/base - ../modules/profiles/atlas-ha - ../sources/cert-manager/letsencrypt.yaml + - ../sources/cert-manager/letsencrypt-prod.yaml diff --git a/infrastructure/sources/cert-manager/letsencrypt-prod.yaml b/infrastructure/sources/cert-manager/letsencrypt-prod.yaml new file mode 100644 index 0000000..65bf316 --- /dev/null +++ b/infrastructure/sources/cert-manager/letsencrypt-prod.yaml @@ -0,0 +1,14 @@ +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-prod +spec: + acme: + email: brad.stein@gmail.com + server: https://acme-v02.api.letsencrypt.org/directory + privateKeySecretRef: + name: letsencrypt-prod-account-key + solvers: + - http01: + ingress: + class: traefik diff --git a/infrastructure/sources/helm/kustomization.yaml b/infrastructure/sources/helm/kustomization.yaml new file mode 100644 index 0000000..a0f55b0 --- /dev/null +++ b/infrastructure/sources/helm/kustomization.yaml @@ -0,0 +1,10 @@ +# infrastructure/sources/helm/kustomization.yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: + - grafana.yaml + - hashicorp.yaml + - jetstack.yaml + - mailu.yaml + - prometheus.yaml + - victoria-metrics.yaml diff --git a/infrastructure/sources/helm/mailu.yaml b/infrastructure/sources/helm/mailu.yaml new file mode 100644 index 0000000..5cd56e2 --- /dev/null +++ b/infrastructure/sources/helm/mailu.yaml @@ -0,0 +1,9 @@ +# infrastructure/sources/helm/mailu.yaml +apiVersion: source.toolkit.fluxcd.io/v1 +kind: HelmRepository +metadata: + name: mailu + namespace: flux-system +spec: + interval: 1h + url: https://mailu.github.io/helm-charts diff --git a/scripts/dashboards_render_atlas.py b/scripts/dashboards_render_atlas.py index f577eab..894edb9 100644 --- a/scripts/dashboards_render_atlas.py +++ b/scripts/dashboards_render_atlas.py @@ -36,11 +36,12 @@ PUBLIC_FOLDER = "overview" PRIVATE_FOLDER = "atlas-internal" PERCENT_THRESHOLDS = { - "mode": "percentage", + "mode": "absolute", "steps": [ {"color": "green", "value": None}, - {"color": "yellow", "value": 70}, - {"color": "red", "value": 85}, + {"color": "yellow", "value": 50}, + {"color": "orange", "value": 75}, + {"color": "red", "value": 91.5}, ], } @@ -81,7 +82,7 @@ CONTROL_SUFFIX = f"/{CONTROL_TOTAL}" WORKER_SUFFIX = f"/{WORKER_TOTAL}" CP_ALLOWED_NS = "kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system" LONGHORN_NODE_REGEX = "titan-1[2-9]|titan-2[24]" -GAUGE_WIDTHS = [5, 5, 5, 5, 4] +GAUGE_WIDTHS = [4, 3, 3, 4, 3, 3, 4] CONTROL_WORKLOADS_EXPR = ( f'sum(kube_pod_info{{node=~"{CONTROL_REGEX}",namespace!~"{CP_ALLOWED_NS}"}}) or on() vector(0)' ) @@ -187,17 +188,64 @@ def namespace_gpu_share_expr(): return namespace_share_expr(NAMESPACE_GPU_RAW) -PROBLEM_PODS_EXPR = 'sum(max by (namespace,pod) (kube_pod_status_phase{phase!~"Running|Succeeded"}))' +PROBLEM_PODS_EXPR = ( + 'sum(max by (namespace,pod) (kube_pod_status_phase{phase!~"Running|Succeeded"})) ' + "or on() vector(0)" +) CRASHLOOP_EXPR = ( 'sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason' - '{reason=~"CrashLoopBackOff|ImagePullBackOff"}))' + '{reason=~"CrashLoopBackOff|ImagePullBackOff"})) ' + "or on() vector(0)" ) STUCK_TERMINATING_EXPR = ( 'sum(max by (namespace,pod) (' '((time() - kube_pod_deletion_timestamp{pod!=""}) > bool 600)' ' and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=""} > bool 0)' - '))' + ')) ' + "or on() vector(0)" ) +UPTIME_WINDOW = "30d" +TRAEFIK_READY_EXPR = ( + "(" + 'sum(kube_deployment_status_replicas_available{namespace=~"traefik|kube-system",deployment="traefik"})' + " / clamp_min(" + 'sum(kube_deployment_spec_replicas{namespace=~"traefik|kube-system",deployment="traefik"}), 1)' + ")" +) +CONTROL_READY_FRACTION_EXPR = ( + f"(sum(kube_node_status_condition{{condition=\"Ready\",status=\"true\",node=~\"{CONTROL_REGEX}\"}})" + f" / {CONTROL_TOTAL})" +) +UPTIME_AVAIL_EXPR = ( + f"min(({CONTROL_READY_FRACTION_EXPR}), ({TRAEFIK_READY_EXPR}))" +) + +# Tie-breaker to deterministically pick one node per namespace when shares tie. +NODE_TIEBREAKER = " + ".join( + f"({node_filter(node)}) * 1e-6 * {idx}" + for idx, node in enumerate(CONTROL_ALL + WORKER_NODES, start=1) +) +UPTIME_AVG_EXPR = f"avg_over_time(({UPTIME_AVAIL_EXPR})[{UPTIME_WINDOW}:5m])" +UPTIME_PERCENT_EXPR = UPTIME_AVG_EXPR +UPTIME_NINES_EXPR = f"-log10(1 - clamp_max({UPTIME_AVG_EXPR}, 0.999999999))" +UPTIME_THRESHOLDS = { + "mode": "absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "orange", "value": 2}, + {"color": "yellow", "value": 3}, + {"color": "green", "value": 3.5}, + ], +} +UPTIME_PERCENT_THRESHOLDS = { + "mode": "absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "orange", "value": 0.999}, + {"color": "yellow", "value": 0.9999}, + {"color": "green", "value": 0.99999}, + ], +} PROBLEM_TABLE_EXPR = ( "(time() - kube_pod_created{pod!=\"\"}) " "* on(namespace,pod) group_left(node) kube_pod_info " @@ -291,6 +339,34 @@ NET_INTERNAL_EXPR = ( '+ rate(container_network_transmit_bytes_total{namespace!="traefik",pod!=""}[5m]))' ' or on() vector(0)' ) +APISERVER_5XX_RATE = 'sum(rate(apiserver_request_total{code=~"5.."}[5m]))' +APISERVER_P99_LATENCY_MS = ( + "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000" +) +ETCD_P99_LATENCY_MS = ( + "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000" +) +TRAEFIK_TOTAL_5M = "sum(rate(traefik_entrypoint_requests_total[5m]))" +TRAEFIK_SUCCESS_5M = 'sum(rate(traefik_entrypoint_requests_total{code!~"5.."}[5m]))' +TRAEFIK_SLI_5M = f"({TRAEFIK_SUCCESS_5M}) / clamp_min({TRAEFIK_TOTAL_5M}, 1)" +TRAEFIK_P99_LATENCY_MS = ( + "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000" +) +TRAEFIK_P95_LATENCY_MS = ( + "histogram_quantile(0.95, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000" +) +SLO_AVAILABILITY = 0.999 + + +def traefik_sli(window): + total = f'sum(rate(traefik_entrypoint_requests_total[{window}]))' + success = f'sum(rate(traefik_entrypoint_requests_total{{code!~"5.."}}[{window}]))' + return f"({success}) / clamp_min({total}, 1)" + + +def traefik_burn(window): + sli = traefik_sli(window) + return f"(1 - ({sli})) / {1 - SLO_AVAILABILITY}" # --------------------------------------------------------------------------- # Panel factories @@ -304,6 +380,7 @@ def stat_panel( grid, *, unit="none", + decimals=None, thresholds=None, text_mode="value", legend=None, @@ -313,7 +390,7 @@ def stat_panel( ): """Return a Grafana stat panel definition.""" defaults = { - "color": {"mode": "palette-classic"}, + "color": {"mode": "thresholds"}, "mappings": [], "thresholds": thresholds or { @@ -328,6 +405,8 @@ def stat_panel( } if value_suffix: defaults["custom"]["valueSuffix"] = value_suffix + if decimals is not None: + defaults["decimals"] = decimals panel = { "id": panel_id, "type": "stat", @@ -446,17 +525,32 @@ def table_panel( *, unit="none", transformations=None, + instant=False, + options=None, + filterable=True, + footer=None, + format=None, ): """Return a Grafana table panel definition.""" + # Optional PromQL subquery helpers in expr: share(), etc. + panel_options = {"showHeader": True, "columnFilters": False} + if options: + panel_options.update(options) + if footer is not None: + panel_options["footer"] = footer + field_defaults = {"unit": unit, "custom": {"filterable": filterable}} + target = {"expr": expr, "refId": "A", **({"instant": True} if instant else {})} + if format: + target["format"] = format panel = { "id": panel_id, "type": "table", "title": title, "datasource": PROM_DS, "gridPos": grid, - "targets": [{"expr": expr, "refId": "A"}], - "fieldConfig": {"defaults": {"unit": unit}, "overrides": []}, - "options": {"showHeader": True}, + "targets": [target], + "fieldConfig": {"defaults": field_defaults, "overrides": []}, + "options": panel_options, } if transformations: panel["transformations"] = transformations @@ -482,7 +576,7 @@ def pie_panel(panel_id, title, expr, grid): "options": { "legend": {"displayMode": "list", "placement": "right"}, "pieType": "pie", - "displayLabels": ["percent"], + "displayLabels": [], "tooltip": {"mode": "single"}, "colorScheme": "interpolateSpectral", "colorBy": "value", @@ -491,7 +585,19 @@ def pie_panel(panel_id, title, expr, grid): } -def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None): +def bargauge_panel( + panel_id, + title, + expr, + grid, + *, + unit="none", + links=None, + limit=None, + thresholds=None, + decimals=None, + instant=False, +): """Return a bar gauge panel with label-aware reduction.""" panel = { "id": panel_id, @@ -499,13 +605,16 @@ def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None): "title": title, "datasource": PROM_DS, "gridPos": grid, - "targets": [{"expr": expr, "refId": "A", "legendFormat": "{{node}}"}], + "targets": [ + {"expr": expr, "refId": "A", "legendFormat": "{{node}}", **({"instant": True} if instant else {})} + ], "fieldConfig": { "defaults": { "unit": unit, "min": 0, "max": 100 if unit == "percent" else None, - "thresholds": { + "thresholds": thresholds + or { "mode": "absolute", "steps": [ {"color": "green", "value": None}, @@ -527,8 +636,19 @@ def bargauge_panel(panel_id, title, expr, grid, *, unit="none", links=None): }, }, } + if decimals is not None: + panel["fieldConfig"]["defaults"]["decimals"] = decimals if links: panel["links"] = links + # Keep bars ordered by value descending for readability. + panel["transformations"] = [ + { + "id": "sortBy", + "options": {"fields": ["Value"], "order": "desc"}, + } + ] + if limit: + panel["transformations"].append({"id": "limit", "options": {"limit": limit}}) return panel @@ -555,81 +675,37 @@ def link_to(uid): def build_overview(): panels = [] + count_thresholds = { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 3}, + ], + } + row1_stats = [ - ( - 1, - "Workers Ready", - f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})', - WORKER_SUFFIX, - WORKER_TOTAL, - None, - ), - ( - 2, - "Control Plane Ready", - f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})', - CONTROL_SUFFIX, - CONTROL_TOTAL, - None, - ), - ( - 3, - "Control Plane Workloads", - CONTROL_WORKLOADS_EXPR, - None, - 4, - link_to("atlas-pods"), - ), - ( - 4, - "Problem Pods", - PROBLEM_PODS_EXPR, - None, - 1, - link_to("atlas-pods"), - ), - ( - 5, - "Stuck Terminating", - STUCK_TERMINATING_EXPR, - None, - 1, - link_to("atlas-pods"), - ), - ] - - def gauge_grid(idx): - width = GAUGE_WIDTHS[idx] if idx < len(GAUGE_WIDTHS) else 4 - x = sum(GAUGE_WIDTHS[:idx]) - return width, x - - for idx, (panel_id, title, expr, suffix, ok_value, links) in enumerate(row1_stats): - thresholds = None - min_value = 0 - max_value = ok_value or 5 - if panel_id == 1: - max_value = WORKER_TOTAL - thresholds = { - "mode": "absolute", - "steps": [ - {"color": "red", "value": None}, - {"color": "orange", "value": WORKER_TOTAL - 2}, - {"color": "yellow", "value": WORKER_TOTAL - 1}, - {"color": "green", "value": WORKER_TOTAL}, - ], - } - elif panel_id == 2: - max_value = CONTROL_TOTAL - thresholds = { + { + "id": 2, + "title": "Control Plane Ready", + "expr": f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{CONTROL_REGEX}"}})', + "kind": "gauge", + "max_value": CONTROL_TOTAL, + "thresholds": { "mode": "absolute", "steps": [ {"color": "red", "value": None}, {"color": "green", "value": CONTROL_TOTAL}, ], - } - elif panel_id in (3, 4, 5): - max_value = 4 - thresholds = { + }, + }, + { + "id": 3, + "title": "Control Plane Workloads", + "expr": CONTROL_WORKLOADS_EXPR, + "kind": "stat", + "thresholds": { "mode": "absolute", "steps": [ {"color": "green", "value": None}, @@ -637,40 +713,122 @@ def build_overview(): {"color": "orange", "value": 2}, {"color": "red", "value": 3}, ], - } - else: - thresholds = { + }, + "links": link_to("atlas-pods"), + }, + { + "id": 5, + "title": "Stuck Terminating", + "expr": STUCK_TERMINATING_EXPR, + "kind": "stat", + "thresholds": { "mode": "absolute", "steps": [ {"color": "green", "value": None}, - {"color": "red", "value": max_value}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 3}, ], - } + }, + "links": link_to("atlas-pods"), + }, + { + "id": 27, + "title": "Atlas Availability (30d)", + "expr": UPTIME_PERCENT_EXPR, + "kind": "stat", + "thresholds": UPTIME_PERCENT_THRESHOLDS, + "unit": "percentunit", + "decimals": 3, + "text_mode": "value", + }, + { + "id": 4, + "title": "Problem Pods", + "expr": PROBLEM_PODS_EXPR, + "kind": "stat", + "thresholds": { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 3}, + ], + }, + "links": link_to("atlas-pods"), + }, + { + "id": 6, + "title": "CrashLoop / ImagePull", + "expr": CRASHLOOP_EXPR, + "kind": "stat", + "thresholds": { + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 3}, + ], + }, + "links": link_to("atlas-pods"), + }, + { + "id": 1, + "title": "Workers Ready", + "expr": f'sum(kube_node_status_condition{{condition="Ready",status="true",node=~"{WORKER_REGEX}"}})', + "kind": "gauge", + "max_value": WORKER_TOTAL, + "thresholds": { + "mode": "absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "orange", "value": WORKER_TOTAL - 2}, + {"color": "yellow", "value": WORKER_TOTAL - 1}, + {"color": "green", "value": WORKER_TOTAL}, + ], + }, + }, + ] + + def gauge_grid(idx): + width = GAUGE_WIDTHS[idx] if idx < len(GAUGE_WIDTHS) else 4 + x = sum(GAUGE_WIDTHS[:idx]) + return width, x + + for idx, item in enumerate(row1_stats): + panel_id = item["id"] width, x = gauge_grid(idx) - if panel_id in (3, 4, 5): + grid = {"h": 5, "w": width, "x": x, "y": 0} + kind = item.get("kind", "gauge") + if kind == "stat": panels.append( stat_panel( panel_id, - title, - expr, - {"h": 5, "w": width, "x": x, "y": 0}, - thresholds=thresholds, - legend=None, - links=links, - text_mode="value", - ) - ) + item["title"], + item["expr"], + grid, + thresholds=item.get("thresholds"), + legend=None, + links=item.get("links"), + text_mode=item.get("text_mode", "value"), + value_suffix=item.get("value_suffix"), + unit=item.get("unit", "none"), + decimals=item.get("decimals"), + ) + ) else: panels.append( gauge_panel( panel_id, - title, - expr, - {"h": 5, "w": width, "x": x, "y": 0}, - min_value=min_value, - max_value=max_value, - thresholds=thresholds, - links=links, + item["title"], + item["expr"], + grid, + min_value=0, + max_value=item.get("max_value", 5), + thresholds=item.get("thresholds"), + links=item.get("links"), ) ) @@ -774,7 +932,7 @@ def build_overview(): timeseries_panel( 16, "Control plane CPU", - node_cpu_expr(CONTROL_REGEX), + node_cpu_expr(CONTROL_ALL_REGEX), {"h": 10, "w": 12, "x": 0, "y": 44}, unit="percent", legend="{{node}}", @@ -786,7 +944,7 @@ def build_overview(): timeseries_panel( 17, "Control plane RAM", - node_mem_expr(CONTROL_REGEX), + node_mem_expr(CONTROL_ALL_REGEX), {"h": 10, "w": 12, "x": 12, "y": 44}, unit="percent", legend="{{node}}", @@ -795,6 +953,36 @@ def build_overview(): ) ) + panels.append( + pie_panel( + 28, + "Node Pod Share", + '(sum(kube_pod_info{pod!="" , node!=""}) by (node) / clamp_min(sum(kube_pod_info{pod!="" , node!=""}), 1)) * 100', + {"h": 10, "w": 12, "x": 0, "y": 54}, + ) + ) + panels.append( + bargauge_panel( + 29, + "Top Nodes by Pod Count", + 'topk(12, sum(kube_pod_info{pod!="" , node!=""}) by (node))', + {"h": 10, "w": 12, "x": 12, "y": 54}, + unit="none", + limit=12, + decimals=0, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 50}, + {"color": "orange", "value": 75}, + {"color": "red", "value": 100}, + ], + }, + instant=True, + ) + ) + panels.append( timeseries_panel( 18, @@ -840,7 +1028,7 @@ def build_overview(): 21, "Root Filesystem Usage", root_usage_expr(), - {"h": 16, "w": 12, "x": 0, "y": 54}, + {"h": 16, "w": 12, "x": 0, "y": 64}, unit="percent", legend="{{node}}", legend_calcs=["last"], @@ -855,8 +1043,9 @@ def build_overview(): 22, "Nodes Closest to Full Root Disks", f"topk(12, {root_usage_expr()})", - {"h": 16, "w": 12, "x": 12, "y": 54}, + {"h": 16, "w": 12, "x": 12, "y": 64}, unit="percent", + thresholds=PERCENT_THRESHOLDS, links=link_to("atlas-storage"), ) ) @@ -874,13 +1063,7 @@ def build_overview(): "templating": {"list": []}, "time": {"from": "now-1h", "to": "now"}, "refresh": "1m", - "links": [ - {"title": "Atlas Pods", "type": "dashboard", "dashboardUid": "atlas-pods", "keepTime": False}, - {"title": "Atlas Nodes", "type": "dashboard", "dashboardUid": "atlas-nodes", "keepTime": False}, - {"title": "Atlas Storage", "type": "dashboard", "dashboardUid": "atlas-storage", "keepTime": False}, - {"title": "Atlas Network", "type": "dashboard", "dashboardUid": "atlas-network", "keepTime": False}, - {"title": "Atlas GPU", "type": "dashboard", "dashboardUid": "atlas-gpu", "keepTime": False}, - ], + "links": [], } @@ -980,6 +1163,91 @@ def build_pods_dashboard(): ], ) ) + panels.append( + pie_panel( + 8, + "Node Pod Share", + '(sum(kube_pod_info{pod!="" , node!=""}) by (node) / clamp_min(sum(kube_pod_info{pod!="" , node!=""}), 1)) * 100', + {"h": 8, "w": 12, "x": 12, "y": 34}, + ) + ) + panels.append( + bargauge_panel( + 9, + "Top Nodes by Pod Count", + 'topk(12, sum(kube_pod_info{pod!="" , node!=""}) by (node))', + {"h": 8, "w": 12, "x": 0, "y": 34}, + unit="none", + limit=12, + decimals=0, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 50}, + {"color": "orange", "value": 75}, + {"color": "red", "value": 100}, + ], + }, + instant=True, + ) + ) + + share_expr = ( + '(sum by (namespace,node) (kube_pod_info{pod!="" , node!=""}) ' + '/ on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=""}), 1) * 100)' + ) + rank_terms = [ + f"(sum by (node) (kube_node_info{{node=\"{node}\"}}) * 0 + {idx * 1e-3})" + for idx, node in enumerate(CONTROL_ALL + WORKER_NODES, start=1) + ] + rank_expr = " or ".join(rank_terms) + score_expr = f"{share_expr} + on(node) group_left() ({rank_expr})" + mask_expr = ( + f"{score_expr} == bool on(namespace) group_left() " + f"(max by (namespace) ({score_expr}))" + ) + panels.append( + table_panel( + 10, + "Namespace Plurality by Node v27", + ( + f"{share_expr} * on(namespace,node) group_left() " + f"({mask_expr})" + ), + {"h": 8, "w": 24, "x": 0, "y": 42}, + unit="percent", + transformations=[ + {"id": "labelsToFields", "options": {}}, + {"id": "organize", "options": {"excludeByName": {"Time": True}}}, + {"id": "filterByValue", "options": {"match": "Value", "operator": "gt", "value": 0}}, + { + "id": "sortBy", + "options": {"fields": ["Value"], "order": "desc"}, + }, + { + "id": "groupBy", + "options": { + "fields": { + "namespace": { + "aggregations": [ + {"field": "Value", "operation": "max"}, + {"field": "node", "operation": "first"}, + ] + } + }, + "rowBy": ["namespace"], + }, + }, + ], + instant=True, + options={"showColumnFilters": False}, + filterable=False, + footer={"show": False, "fields": "", "calcs": []}, + format="table", + ) + ) + return { "uid": "atlas-pods", "title": "Atlas Pods", @@ -1022,12 +1290,69 @@ def build_nodes_dashboard(): {"h": 4, "w": 8, "x": 16, "y": 0}, ) ) + panels.append( + stat_panel( + 9, + "API Server 5xx rate", + APISERVER_5XX_RATE, + {"h": 4, "w": 8, "x": 0, "y": 4}, + unit="req/s", + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 0.05}, + {"color": "orange", "value": 0.2}, + {"color": "red", "value": 0.5}, + ], + }, + decimals=3, + ) + ) + panels.append( + stat_panel( + 10, + "API Server P99 latency", + APISERVER_P99_LATENCY_MS, + {"h": 4, "w": 8, "x": 8, "y": 4}, + unit="ms", + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 250}, + {"color": "orange", "value": 400}, + {"color": "red", "value": 600}, + ], + }, + decimals=1, + ) + ) + panels.append( + stat_panel( + 11, + "etcd P99 latency", + ETCD_P99_LATENCY_MS, + {"h": 4, "w": 8, "x": 16, "y": 4}, + unit="ms", + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 50}, + {"color": "orange", "value": 100}, + {"color": "red", "value": 200}, + ], + }, + decimals=1, + ) + ) panels.append( timeseries_panel( 4, "Node CPU", node_cpu_expr(), - {"h": 9, "w": 24, "x": 0, "y": 4}, + {"h": 9, "w": 24, "x": 0, "y": 8}, unit="percent", legend="{{node}}", legend_calcs=["last"], @@ -1040,7 +1365,7 @@ def build_nodes_dashboard(): 5, "Node RAM", node_mem_expr(), - {"h": 9, "w": 24, "x": 0, "y": 13}, + {"h": 9, "w": 24, "x": 0, "y": 17}, unit="percent", legend="{{node}}", legend_calcs=["last"], @@ -1053,7 +1378,7 @@ def build_nodes_dashboard(): 6, "Control Plane (incl. titan-db) CPU", node_cpu_expr(CONTROL_ALL_REGEX), - {"h": 9, "w": 12, "x": 0, "y": 22}, + {"h": 9, "w": 12, "x": 0, "y": 26}, unit="percent", legend="{{node}}", legend_display="table", @@ -1065,7 +1390,7 @@ def build_nodes_dashboard(): 7, "Control Plane (incl. titan-db) RAM", node_mem_expr(CONTROL_ALL_REGEX), - {"h": 9, "w": 12, "x": 12, "y": 22}, + {"h": 9, "w": 12, "x": 12, "y": 26}, unit="percent", legend="{{node}}", legend_display="table", @@ -1077,7 +1402,7 @@ def build_nodes_dashboard(): 8, "Root Filesystem Usage", root_usage_expr(), - {"h": 9, "w": 24, "x": 0, "y": 31}, + {"h": 9, "w": 24, "x": 0, "y": 35}, unit="percent", legend="{{node}}", legend_display="table", @@ -1204,43 +1529,107 @@ def build_network_dashboard(): panels.append( stat_panel( 1, - "Ingress Traffic", - NET_INGRESS_EXPR, - {"h": 4, "w": 8, "x": 0, "y": 0}, - unit="Bps", + "Ingress Success Rate (5m)", + TRAEFIK_SLI_5M, + {"h": 4, "w": 6, "x": 0, "y": 0}, + unit="percentunit", + decimals=2, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "red", "value": None}, + {"color": "orange", "value": 0.995}, + {"color": "yellow", "value": 0.999}, + {"color": "green", "value": 0.9995}, + ], + }, ) ) panels.append( stat_panel( 2, - "Egress Traffic", - NET_EGRESS_EXPR, - {"h": 4, "w": 8, "x": 8, "y": 0}, - unit="Bps", + "Error Budget Burn (1h)", + traefik_burn("1h"), + {"h": 4, "w": 6, "x": 6, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 4}, + ], + }, + decimals=2, ) ) panels.append( stat_panel( 3, - "Intra-Cluster Traffic", - NET_INTERNAL_EXPR, - {"h": 4, "w": 8, "x": 16, "y": 0}, - unit="Bps", + "Error Budget Burn (6h)", + traefik_burn("6h"), + {"h": 4, "w": 6, "x": 12, "y": 0}, + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 1}, + {"color": "orange", "value": 2}, + {"color": "red", "value": 4}, + ], + }, + decimals=2, ) ) panels.append( stat_panel( 4, - "Top Router req/s", - f"topk(1, {TRAEFIK_ROUTER_EXPR})", + "Edge P99 Latency (ms)", + TRAEFIK_P99_LATENCY_MS, + {"h": 4, "w": 6, "x": 18, "y": 0}, + unit="ms", + thresholds={ + "mode": "absolute", + "steps": [ + {"color": "green", "value": None}, + {"color": "yellow", "value": 200}, + {"color": "orange", "value": 350}, + {"color": "red", "value": 500}, + ], + }, + decimals=1, + ) + ) + panels.append( + stat_panel( + 5, + "Ingress Traffic", + NET_INGRESS_EXPR, {"h": 4, "w": 8, "x": 0, "y": 4}, - unit="req/s", - legend="{{router}}", + unit="Bps", + ) + ) + panels.append( + stat_panel( + 6, + "Egress Traffic", + NET_EGRESS_EXPR, + {"h": 4, "w": 8, "x": 8, "y": 4}, + unit="Bps", + ) + ) + panels.append( + stat_panel( + 7, + "Intra-Cluster Traffic", + NET_INTERNAL_EXPR, + {"h": 4, "w": 8, "x": 16, "y": 4}, + unit="Bps", ) ) panels.append( timeseries_panel( - 5, + 8, "Per-Node Throughput", f'avg by (node) (({NET_NODE_TX_PHYS} + {NET_NODE_RX_PHYS}) * on(instance) group_left(node) {NODE_INFO})', {"h": 8, "w": 24, "x": 0, "y": 8}, @@ -1252,7 +1641,7 @@ def build_network_dashboard(): ) panels.append( table_panel( - 6, + 9, "Top Namespaces", 'topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=""}[5m]) ' '+ rate(container_network_receive_bytes_total{namespace!=""}[5m])) by (namespace))', @@ -1263,7 +1652,7 @@ def build_network_dashboard(): ) panels.append( table_panel( - 7, + 10, "Top Pods", 'topk(10, sum(rate(container_network_transmit_bytes_total{pod!=""}[5m]) ' '+ rate(container_network_receive_bytes_total{pod!=""}[5m])) by (namespace,pod))', @@ -1274,7 +1663,7 @@ def build_network_dashboard(): ) panels.append( timeseries_panel( - 8, + 11, "Traefik Routers (req/s)", f"topk(10, {TRAEFIK_ROUTER_EXPR})", {"h": 9, "w": 12, "x": 0, "y": 25}, @@ -1286,7 +1675,7 @@ def build_network_dashboard(): ) panels.append( timeseries_panel( - 9, + 12, "Traefik Entrypoints (req/s)", 'sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))', {"h": 9, "w": 12, "x": 12, "y": 25}, diff --git a/scripts/mailu_sync.py b/scripts/mailu_sync.py new file mode 100644 index 0000000..ee8aa18 --- /dev/null +++ b/scripts/mailu_sync.py @@ -0,0 +1,204 @@ +#!/usr/bin/env python3 +""" +Sync Keycloak users to Mailu mailboxes. + - Generates/stores a mailu_app_password attribute in Keycloak (admin-only) + - Upserts the mailbox in Mailu Postgres using that password +""" + +import os +import sys +import json +import time +import secrets +import string +import datetime +import requests +import psycopg2 +from psycopg2.extras import RealDictCursor +from passlib.hash import bcrypt_sha256 + + +KC_BASE = os.environ["KEYCLOAK_BASE_URL"].rstrip("/") +KC_REALM = os.environ["KEYCLOAK_REALM"] +KC_CLIENT_ID = os.environ["KEYCLOAK_CLIENT_ID"] +KC_CLIENT_SECRET = os.environ["KEYCLOAK_CLIENT_SECRET"] + +MAILU_DOMAIN = os.environ["MAILU_DOMAIN"] +MAILU_DEFAULT_QUOTA = int(os.environ.get("MAILU_DEFAULT_QUOTA", "20000000000")) + +DB_CONFIG = { + "host": os.environ["MAILU_DB_HOST"], + "port": int(os.environ.get("MAILU_DB_PORT", "5432")), + "dbname": os.environ["MAILU_DB_NAME"], + "user": os.environ["MAILU_DB_USER"], + "password": os.environ["MAILU_DB_PASSWORD"], +} + +SESSION = requests.Session() + + +def log(msg): + sys.stdout.write(f"{msg}\n") + sys.stdout.flush() + + +def get_kc_token(): + resp = SESSION.post( + f"{KC_BASE}/realms/{KC_REALM}/protocol/openid-connect/token", + data={ + "grant_type": "client_credentials", + "client_id": KC_CLIENT_ID, + "client_secret": KC_CLIENT_SECRET, + }, + timeout=15, + ) + resp.raise_for_status() + return resp.json()["access_token"] + + +def kc_get_users(token): + users = [] + first = 0 + max_results = 200 + headers = {"Authorization": f"Bearer {token}"} + while True: + resp = SESSION.get( + f"{KC_BASE}/admin/realms/{KC_REALM}/users", + params={"first": first, "max": max_results, "enabled": "true"}, + headers=headers, + timeout=20, + ) + resp.raise_for_status() + batch = resp.json() + users.extend(batch) + if len(batch) < max_results: + break + first += max_results + return users + + +def kc_update_attributes(token, user, attributes): + headers = { + "Authorization": f"Bearer {token}", + "Content-Type": "application/json", + } + payload = { + "firstName": user.get("firstName"), + "lastName": user.get("lastName"), + "email": user.get("email"), + "enabled": user.get("enabled", True), + "username": user["username"], + "emailVerified": user.get("emailVerified", False), + "attributes": attributes, + } + user_url = f"{KC_BASE}/admin/realms/{KC_REALM}/users/{user['id']}" + resp = SESSION.put(user_url, headers=headers, json=payload, timeout=20) + resp.raise_for_status() + verify = SESSION.get( + user_url, + headers={"Authorization": f"Bearer {token}"}, + params={"briefRepresentation": "false"}, + timeout=15, + ) + verify.raise_for_status() + attrs = verify.json().get("attributes") or {} + if not attrs.get("mailu_app_password"): + raise Exception(f"attribute not persisted for {user.get('email') or user['username']}") + + +def random_password(): + alphabet = string.ascii_letters + string.digits + return "".join(secrets.choice(alphabet) for _ in range(24)) + + +def ensure_mailu_user(cursor, email, password, display_name): + localpart, domain = email.split("@", 1) + if domain.lower() != MAILU_DOMAIN.lower(): + return + hashed = bcrypt_sha256.hash(password) + now = datetime.datetime.utcnow() + cursor.execute( + """ + INSERT INTO "user" ( + email, localpart, domain_name, password, + quota_bytes, quota_bytes_used, + global_admin, enabled, enable_imap, enable_pop, allow_spoofing, + forward_enabled, forward_destination, forward_keep, + reply_enabled, reply_subject, reply_body, reply_startdate, reply_enddate, + displayed_name, spam_enabled, spam_mark_as_read, spam_threshold, + change_pw_next_login, created_at, updated_at, comment + ) + VALUES ( + %(email)s, %(localpart)s, %(domain)s, %(password)s, + %(quota)s, 0, + false, true, true, true, false, + false, '', true, + false, NULL, NULL, DATE '1900-01-01', DATE '2999-12-31', + %(display)s, true, true, 80, + false, CURRENT_DATE, %(now)s, '' + ) + ON CONFLICT (email) DO UPDATE + SET password = EXCLUDED.password, + enabled = true, + updated_at = EXCLUDED.updated_at + """, + { + "email": email, + "localpart": localpart, + "domain": domain, + "password": hashed, + "quota": MAILU_DEFAULT_QUOTA, + "display": display_name or localpart, + "now": now, + }, + ) + + +def main(): + token = get_kc_token() + users = kc_get_users(token) + if not users: + log("No users found; exiting.") + return + + conn = psycopg2.connect(**DB_CONFIG) + conn.autocommit = True + cursor = conn.cursor(cursor_factory=RealDictCursor) + + for user in users: + attrs = user.get("attributes", {}) or {} + app_pw_value = attrs.get("mailu_app_password") + if isinstance(app_pw_value, list): + app_pw = app_pw_value[0] if app_pw_value else None + elif isinstance(app_pw_value, str): + app_pw = app_pw_value + else: + app_pw = None + + email = user.get("email") + if not email: + email = f"{user['username']}@{MAILU_DOMAIN}" + + if not app_pw: + app_pw = random_password() + attrs["mailu_app_password"] = app_pw + kc_update_attributes(token, user, attrs) + log(f"Set mailu_app_password for {email}") + + display_name = " ".join( + part for part in [user.get("firstName"), user.get("lastName")] if part + ).strip() + + ensure_mailu_user(cursor, email, app_pw, display_name) + log(f"Synced mailbox for {email}") + + cursor.close() + conn.close() + + +if __name__ == "__main__": + try: + main() + except Exception as exc: + log(f"ERROR: {exc}") + sys.exit(1) diff --git a/scripts/nextcloud-mail-sync.sh b/scripts/nextcloud-mail-sync.sh new file mode 100755 index 0000000..7feeec6 --- /dev/null +++ b/scripts/nextcloud-mail-sync.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -euo pipefail + +KC_BASE="${KC_BASE:?}" +KC_REALM="${KC_REALM:?}" +KC_ADMIN_USER="${KC_ADMIN_USER:?}" +KC_ADMIN_PASS="${KC_ADMIN_PASS:?}" + +if ! command -v jq >/dev/null 2>&1; then + apt-get update && apt-get install -y jq curl >/dev/null +fi + +account_exists() { + # Skip if the account email is already present in the mail app. + runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq " ${1}" || \ + runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq "${1} " +} + +token=$( + curl -s -d "grant_type=password" \ + -d "client_id=admin-cli" \ + -d "username=${KC_ADMIN_USER}" \ + -d "password=${KC_ADMIN_PASS}" \ + "${KC_BASE}/realms/master/protocol/openid-connect/token" | jq -r '.access_token' +) + +if [[ -z "${token}" || "${token}" == "null" ]]; then + echo "Failed to obtain admin token" + exit 1 +fi + +users=$(curl -s -H "Authorization: Bearer ${token}" \ + "${KC_BASE}/admin/realms/${KC_REALM}/users?max=2000") + +echo "${users}" | jq -c '.[]' | while read -r user; do + username=$(echo "${user}" | jq -r '.username') + email=$(echo "${user}" | jq -r '.email // empty') + app_pw=$(echo "${user}" | jq -r '.attributes.mailu_app_password[0] // empty') + [[ -z "${email}" || -z "${app_pw}" ]] && continue + if account_exists "${email}"; then + echo "Skipping ${email}, already exists" + continue + fi + echo "Syncing ${email}" + runuser -u www-data -- php occ mail:account:create \ + "${username}" "${username}" "${email}" \ + mail.bstein.dev 993 ssl "${email}" "${app_pw}" \ + mail.bstein.dev 587 tls "${email}" "${app_pw}" login || true +done diff --git a/scripts/nextcloud-maintenance.sh b/scripts/nextcloud-maintenance.sh new file mode 100755 index 0000000..e8ea18c --- /dev/null +++ b/scripts/nextcloud-maintenance.sh @@ -0,0 +1,65 @@ +#!/bin/bash +set -euo pipefail + +NC_URL="${NC_URL:-https://cloud.bstein.dev}" +ADMIN_USER="${ADMIN_USER:?}" +ADMIN_PASS="${ADMIN_PASS:?}" + +export DEBIAN_FRONTEND=noninteractive +apt-get update -qq +apt-get install -y -qq curl jq >/dev/null + +run_occ() { + runuser -u www-data -- php occ "$@" +} + +log() { echo "[$(date -Is)] $*"; } + +log "Applying Atlas theming" +run_occ theming:config name "Atlas Cloud" +run_occ theming:config slogan "Unified access to Atlas services" +run_occ theming:config url "https://cloud.bstein.dev" +run_occ theming:config color "#0f172a" +run_occ theming:config disable-user-theming yes + +log "Setting default quota to 200 GB" +run_occ config:app:set files default_quota --value "200 GB" + +API_BASE="${NC_URL}/ocs/v2.php/apps/external/api/v1" +AUTH=(-u "${ADMIN_USER}:${ADMIN_PASS}" -H "OCS-APIRequest: true") + +log "Removing existing external links" +existing=$(curl -sf "${AUTH[@]}" "${API_BASE}?format=json" | jq -r '.ocs.data[].id // empty') +for id in ${existing}; do + curl -sf "${AUTH[@]}" -X DELETE "${API_BASE}/sites/${id}?format=json" >/dev/null || true +done + +SITES=( + "Vaultwarden|https://vault.bstein.dev" + "Jellyfin|https://stream.bstein.dev" + "Gitea|https://scm.bstein.dev" + "Jenkins|https://ci.bstein.dev" + "Zot|https://registry.bstein.dev" + "Vault|https://secret.bstein.dev" + "Jitsi|https://meet.bstein.dev" + "Grafana|https://metrics.bstein.dev" + "Chat LLM|https://chat.ai.bstein.dev" + "Vision|https://draw.ai.bstein.dev" + "STT/TTS|https://talk.ai.bstein.dev" +) + +log "Seeding external links" +for entry in "${SITES[@]}"; do + IFS="|" read -r name url <<<"${entry}" + curl -sf "${AUTH[@]}" -X POST "${API_BASE}/sites?format=json" \ + -d "name=${name}" \ + -d "url=${url}" \ + -d "lang=" \ + -d "type=link" \ + -d "device=" \ + -d "icon=" \ + -d "groups[]=" \ + -d "redirect=1" >/dev/null +done + +log "Maintenance run completed" diff --git a/scripts/tests/test_dashboards_render_atlas.py b/scripts/tests/test_dashboards_render_atlas.py new file mode 100644 index 0000000..865aa68 --- /dev/null +++ b/scripts/tests/test_dashboards_render_atlas.py @@ -0,0 +1,58 @@ +import importlib.util +import pathlib + + +def load_module(): + path = pathlib.Path(__file__).resolve().parents[1] / "dashboards_render_atlas.py" + spec = importlib.util.spec_from_file_location("dashboards_render_atlas", path) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + spec.loader.exec_module(module) + return module + + +def test_table_panel_options_and_filterable(): + mod = load_module() + panel = mod.table_panel( + 1, + "test", + "metric", + {"h": 1, "w": 1, "x": 0, "y": 0}, + unit="percent", + transformations=[{"id": "labelsToFields", "options": {}}], + instant=True, + options={"showColumnFilters": False}, + filterable=False, + footer={"show": False, "fields": "", "calcs": []}, + format="table", + ) + assert panel["fieldConfig"]["defaults"]["unit"] == "percent" + assert panel["fieldConfig"]["defaults"]["custom"]["filterable"] is False + assert panel["options"]["showHeader"] is True + assert panel["targets"][0]["format"] == "table" + + +def test_node_filter_and_expr_helpers(): + mod = load_module() + expr = mod.node_filter("titan-.*") + assert "label_replace" in expr + cpu_expr = mod.node_cpu_expr("titan-.*") + mem_expr = mod.node_mem_expr("titan-.*") + assert "node_cpu_seconds_total" in cpu_expr + assert "node_memory_MemAvailable_bytes" in mem_expr + + +def test_render_configmap_writes(tmp_path): + mod = load_module() + mod.DASHBOARD_DIR = tmp_path / "dash" + mod.ROOT = tmp_path + uid = "atlas-test" + info = {"configmap": tmp_path / "cm.yaml"} + data = {"title": "Atlas Test"} + mod.write_json(uid, data) + mod.render_configmap(uid, info) + json_path = mod.DASHBOARD_DIR / f"{uid}.json" + assert json_path.exists() + content = (tmp_path / "cm.yaml").read_text() + assert "kind: ConfigMap" in content + assert f"{uid}.json" in content diff --git a/scripts/tests/test_mailu_sync.py b/scripts/tests/test_mailu_sync.py new file mode 100644 index 0000000..41616b2 --- /dev/null +++ b/scripts/tests/test_mailu_sync.py @@ -0,0 +1,181 @@ +import importlib.util +import pathlib + +import pytest + + +def load_sync_module(monkeypatch): + # Minimal env required by module import + env = { + "KEYCLOAK_BASE_URL": "http://keycloak", + "KEYCLOAK_REALM": "atlas", + "KEYCLOAK_CLIENT_ID": "mailu-sync", + "KEYCLOAK_CLIENT_SECRET": "secret", + "MAILU_DOMAIN": "example.com", + "MAILU_DB_HOST": "localhost", + "MAILU_DB_PORT": "5432", + "MAILU_DB_NAME": "mailu", + "MAILU_DB_USER": "mailu", + "MAILU_DB_PASSWORD": "pw", + } + for k, v in env.items(): + monkeypatch.setenv(k, v) + module_path = pathlib.Path(__file__).resolve().parents[1] / "mailu_sync.py" + spec = importlib.util.spec_from_file_location("mailu_sync_testmod", module_path) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + spec.loader.exec_module(module) + return module + + +def test_random_password_length_and_charset(monkeypatch): + sync = load_sync_module(monkeypatch) + pw = sync.random_password() + assert len(pw) == 24 + assert all(ch.isalnum() for ch in pw) + + +class _FakeResponse: + def __init__(self, json_data=None, status=200): + self._json_data = json_data or {} + self.status_code = status + + def raise_for_status(self): + if self.status_code >= 400: + raise AssertionError(f"status {self.status_code}") + + def json(self): + return self._json_data + + +class _FakeSession: + def __init__(self, put_resp, get_resp): + self.put_resp = put_resp + self.get_resp = get_resp + self.put_called = False + self.get_called = False + + def post(self, *args, **kwargs): + return _FakeResponse({"access_token": "dummy"}) + + def put(self, *args, **kwargs): + self.put_called = True + return self.put_resp + + def get(self, *args, **kwargs): + self.get_called = True + return self.get_resp + + +def test_kc_update_attributes_succeeds(monkeypatch): + sync = load_sync_module(monkeypatch) + ok_resp = _FakeResponse({"attributes": {"mailu_app_password": ["abc"]}}) + sync.SESSION = _FakeSession(_FakeResponse({}), ok_resp) + sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"}) + assert sync.SESSION.put_called and sync.SESSION.get_called + + +def test_kc_update_attributes_raises_without_attribute(monkeypatch): + sync = load_sync_module(monkeypatch) + missing_attr_resp = _FakeResponse({"attributes": {}}, status=200) + sync.SESSION = _FakeSession(_FakeResponse({}), missing_attr_resp) + with pytest.raises(Exception): + sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"}) + + +def test_kc_get_users_paginates(monkeypatch): + sync = load_sync_module(monkeypatch) + + class _PagedSession: + def __init__(self): + self.calls = 0 + + def post(self, *_, **__): + return _FakeResponse({"access_token": "tok"}) + + def get(self, *_, **__): + self.calls += 1 + if self.calls == 1: + return _FakeResponse([{"id": "u1"}, {"id": "u2"}]) + return _FakeResponse([]) # stop pagination + + sync.SESSION = _PagedSession() + users = sync.kc_get_users("tok") + assert [u["id"] for u in users] == ["u1", "u2"] + assert sync.SESSION.calls == 2 + + +def test_ensure_mailu_user_skips_foreign_domain(monkeypatch): + sync = load_sync_module(monkeypatch) + executed = [] + + class _Cursor: + def execute(self, sql, params): + executed.append((sql, params)) + + sync.ensure_mailu_user(_Cursor(), "user@other.com", "pw", "User") + assert not executed + + +def test_ensure_mailu_user_upserts(monkeypatch): + sync = load_sync_module(monkeypatch) + captured = {} + + class _Cursor: + def execute(self, sql, params): + captured.update(params) + + sync.ensure_mailu_user(_Cursor(), "user@example.com", "pw", "User Example") + assert captured["email"] == "user@example.com" + assert captured["localpart"] == "user" + # password should be hashed, not the raw string + assert captured["password"] != "pw" + + +def test_main_generates_password_and_upserts(monkeypatch): + sync = load_sync_module(monkeypatch) + users = [ + {"id": "u1", "username": "user1", "email": "user1@example.com", "attributes": {}}, + {"id": "u2", "username": "user2", "email": "user2@example.com", "attributes": {"mailu_app_password": ["keepme"]}}, + {"id": "u3", "username": "user3", "email": "user3@other.com", "attributes": {}}, + ] + updated = [] + + class _Cursor: + def __init__(self): + self.executions = [] + + def execute(self, sql, params): + self.executions.append(params) + + def close(self): + return None + + class _Conn: + def __init__(self): + self.autocommit = False + self._cursor = _Cursor() + + def cursor(self, cursor_factory=None): + return self._cursor + + def close(self): + return None + + monkeypatch.setattr(sync, "get_kc_token", lambda: "tok") + monkeypatch.setattr(sync, "kc_get_users", lambda token: users) + monkeypatch.setattr(sync, "kc_update_attributes", lambda token, user, attrs: updated.append((user["id"], attrs["mailu_app_password"]))) + conns = [] + + def _connect(**kwargs): + conn = _Conn() + conns.append(conn) + return conn + + monkeypatch.setattr(sync.psycopg2, "connect", _connect) + + sync.main() + + # Should attempt two inserts (third user skipped due to domain mismatch) + assert len(updated) == 1 # only one missing attr was backfilled + assert conns and len(conns[0]._cursor.executions) == 2 diff --git a/services/gitea/ingress.yaml b/services/gitea/ingress.yaml index 375dba3..0077ba4 100644 --- a/services/gitea/ingress.yaml +++ b/services/gitea/ingress.yaml @@ -5,7 +5,7 @@ metadata: name: gitea-ingress namespace: gitea annotations: - cert-manager.io/cluster-issuer: "letsencrypt-prod" + cert-manager.io/cluster-issuer: letsencrypt nginx.ingress.kubernetes.io/ssl-redirect: "true" spec: tls: diff --git a/services/gitops-ui/helmrelease.yaml b/services/gitops-ui/helmrelease.yaml new file mode 100644 index 0000000..27b610d --- /dev/null +++ b/services/gitops-ui/helmrelease.yaml @@ -0,0 +1,49 @@ +# services/gitops-ui/helmrelease.yaml +apiVersion: helm.toolkit.fluxcd.io/v2 +kind: HelmRelease +metadata: + name: weave-gitops + namespace: flux-system +spec: + interval: 30m + chart: + spec: + chart: ./charts/gitops-server + sourceRef: + kind: GitRepository + name: weave-gitops-upstream + namespace: flux-system + # track upstream tag; see source object for version pin + install: + remediation: + retries: 3 + upgrade: + remediation: + retries: 3 + remediateLastFailure: true + cleanupOnFail: true + values: + adminUser: + create: true + createClusterRole: true + createSecret: true + username: admin + # bcrypt hash for temporary password "G1tOps!2025" (rotate after login) + passwordHash: "$2y$12$wDEOzR1Gc2dbvNSJ3ZXNdOBVFEjC6YASIxnZmHIbO.W1m0fie/QVi" + ingress: + enabled: true + className: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod + traefik.ingress.kubernetes.io/router.entrypoints: websecure + hosts: + - host: cd.bstein.dev + paths: + - path: / + pathType: Prefix + tls: + - secretName: gitops-ui-tls + hosts: + - cd.bstein.dev + metrics: + enabled: true diff --git a/services/gitops-ui/kustomization.yaml b/services/gitops-ui/kustomization.yaml new file mode 100644 index 0000000..53a903e --- /dev/null +++ b/services/gitops-ui/kustomization.yaml @@ -0,0 +1,7 @@ +# services/gitops-ui/kustomization.yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: flux-system +resources: + - source.yaml + - helmrelease.yaml diff --git a/services/gitops-ui/source.yaml b/services/gitops-ui/source.yaml new file mode 100644 index 0000000..0e87524 --- /dev/null +++ b/services/gitops-ui/source.yaml @@ -0,0 +1,11 @@ +# services/gitops-ui/source.yaml +apiVersion: source.toolkit.fluxcd.io/v1 +kind: GitRepository +metadata: + name: weave-gitops-upstream + namespace: flux-system +spec: + interval: 1h + url: https://github.com/weaveworks/weave-gitops.git + ref: + tag: v0.38.0 diff --git a/services/jitsi/ingress.yaml b/services/jitsi/ingress.yaml index c09b669..3336c37 100644 --- a/services/jitsi/ingress.yaml +++ b/services/jitsi/ingress.yaml @@ -5,7 +5,7 @@ metadata: name: jitsi namespace: jitsi annotations: - cert-manager.io/cluster-issuer: "letsencrypt-prod" + cert-manager.io/cluster-issuer: letsencrypt spec: ingressClassName: traefik tls: diff --git a/services/keycloak/README.md b/services/keycloak/NOTES.md similarity index 100% rename from services/keycloak/README.md rename to services/keycloak/NOTES.md diff --git a/services/keycloak/deployment.yaml b/services/keycloak/deployment.yaml index af7839f..c4ffcda 100644 --- a/services/keycloak/deployment.yaml +++ b/services/keycloak/deployment.yaml @@ -48,6 +48,20 @@ spec: runAsGroup: 0 fsGroup: 1000 fsGroupChangePolicy: OnRootMismatch + imagePullSecrets: + - name: zot-regcred + initContainers: + - name: mailu-http-listener + image: registry.bstein.dev/sso/mailu-http-listener:0.1.0 + imagePullPolicy: IfNotPresent + command: ["/bin/sh", "-c"] + args: + - | + cp /plugin/mailu-http-listener-0.1.0.jar /providers/ + cp -r /plugin/src /providers/src + volumeMounts: + - name: providers + mountPath: /providers containers: - name: keycloak image: quay.io/keycloak/keycloak:26.0.7 @@ -104,6 +118,10 @@ spec: secretKeyRef: name: keycloak-admin key: password + - name: KC_EVENTS_LISTENERS + value: jboss-logging,mailu-http + - name: KC_SPI_EVENTS_LISTENER_MAILU-HTTP_ENDPOINT + value: http://mailu-sync-listener.mailu-mailserver.svc.cluster.local:8080/events ports: - containerPort: 8080 name: http @@ -126,7 +144,11 @@ spec: volumeMounts: - name: data mountPath: /opt/keycloak/data + - name: providers + mountPath: /opt/keycloak/providers volumes: - name: data persistentVolumeClaim: claimName: keycloak-data + - name: providers + emptyDir: {} diff --git a/services/mailu/certificate.yaml b/services/mailu/certificate.yaml new file mode 100644 index 0000000..83cc17c --- /dev/null +++ b/services/mailu/certificate.yaml @@ -0,0 +1,13 @@ +# services/mailu/certificate.yaml +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: mailu-tls + namespace: mailu-mailserver +spec: + secretName: mailu-certificates + issuerRef: + kind: ClusterIssuer + name: letsencrypt-prod + dnsNames: + - mail.bstein.dev diff --git a/services/mailu/helmrelease.yaml b/services/mailu/helmrelease.yaml new file mode 100644 index 0000000..c8b0975 --- /dev/null +++ b/services/mailu/helmrelease.yaml @@ -0,0 +1,287 @@ +# services/mailu/helmrelease.yaml +apiVersion: helm.toolkit.fluxcd.io/v2 +kind: HelmRelease +metadata: + name: mailu + namespace: mailu-mailserver +spec: + interval: 30m + chart: + spec: + chart: mailu + version: 2.1.2 + sourceRef: + kind: HelmRepository + name: mailu + namespace: flux-system + install: + remediation: { retries: 3 } + timeout: 10m + upgrade: + remediation: + retries: 3 + remediateLastFailure: true + cleanupOnFail: true + timeout: 10m + values: + mailuVersion: "2024.06" + domain: bstein.dev + hostnames: [mail.bstein.dev] + domains: + - name: bstein.dev + enabled: true + dkim: + enabled: true + externalRelay: + host: "[email-smtp.us-east-2.amazonaws.com]:587" + existingSecret: mailu-ses-relay + usernameKey: relay-username + passwordKey: relay-password + timezone: Etc/UTC + subnet: 10.42.0.0/16 + existingSecret: mailu-secret + tls: + outboundLevel: encrypt + externalDatabase: + enabled: true + type: postgresql + host: postgres-service.postgres.svc.cluster.local + port: 5432 + database: mailu + username: mailu + existingSecret: mailu-db-secret + existingSecretUsernameKey: username + existingSecretPasswordKey: password + existingSecretDatabaseKey: database + initialAccount: + enabled: true + username: test + domain: bstein.dev + existingSecret: mailu-initial-account-secret + existingSecretPasswordKey: password + persistence: + accessModes: [ReadWriteMany] + size: 100Gi + storageClass: astreae + single_pvc: true + front: + hostnames: [mail.bstein.dev] + proxied: true + hostPort: + enabled: false + https: + enabled: false + external: false + forceHttps: false + externalService: + enabled: true + type: LoadBalancer + externalTrafficPolicy: Cluster + ports: + submission: true + nodePorts: + pop3: 30010 + pop3s: 30011 + imap: 30143 + imaps: 30993 + manageSieve: 30419 + smtp: 30025 + smtps: 30465 + submission: 30587 + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + admin: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + podLivenessProbe: + enabled: true + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 6 + successThreshold: 1 + podReadinessProbe: + enabled: true + initialDelaySeconds: 20 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 6 + successThreshold: 1 + extraEnvVars: + - name: FLASK_DEBUG + value: "1" + - name: ACCESSLOG + value: /dev/stdout + - name: ERRORLOG + value: /dev/stderr + - name: WEBROOT_REDIRECT + value: "" + - name: FORWARDED_ALLOW_IPS + value: 127.0.0.1,10.42.0.0/16 + - name: DNS_RESOLVERS + value: 1.1.1.1,9.9.9.9 + extraVolumes: + - name: unbound-config + configMap: + name: mailu-unbound + - name: unbound-run + emptyDir: {} + extraVolumeMounts: + - name: unbound-run + mountPath: /var/lib/unbound + extraContainers: + - name: unbound + image: docker.io/alpine:3.20 + command: ["/bin/sh", "-c"] + args: + - | + while :; do + printf "nameserver 10.43.0.10\n" > /etc/resolv.conf + if apk add --no-cache unbound bind-tools; then + break + fi + echo "apk failed, retrying" >&2 + sleep 10 + done + cat >/etc/resolv.conf <<'EOF' + search mailu-mailserver.svc.cluster.local svc.cluster.local cluster.local + nameserver 127.0.0.1 + EOF + unbound-anchor -a /var/lib/unbound/root.key || true + exec unbound -d -c /opt/unbound/etc/unbound/unbound.conf + ports: + - containerPort: 53 + protocol: UDP + - containerPort: 53 + protocol: TCP + volumeMounts: + - name: unbound-config + mountPath: /opt/unbound/etc/unbound + - name: unbound-run + mountPath: /var/lib/unbound + dnsPolicy: None + dnsConfig: + nameservers: + - 127.0.0.1 + searches: + - mailu-mailserver.svc.cluster.local + - svc.cluster.local + - cluster.local + clamav: + image: + repository: clamav/clamav-debian + tag: "1.4" + logLevel: DEBUG + nodeSelector: + hardware: rpi5 + resources: + requests: + cpu: 200m + memory: 1Gi + limits: + cpu: 500m + memory: 3Gi + livenessProbe: + enabled: false + initialDelaySeconds: 300 + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 6 + successThreshold: 1 + startupProbe: + enabled: false + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 20 + successThreshold: 1 + readinessProbe: + enabled: false + initialDelaySeconds: 300 + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 6 + successThreshold: 1 + dovecot: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + oletools: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + postfix: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + overrides: + smtp_use_tls: "yes" + smtp_tls_security_level: "encrypt" + smtp_sasl_security_options: "noanonymous" + redis: + enabled: true + architecture: standalone + logLevel: DEBUG + image: + repository: bitnamilegacy/redis + tag: 8.0.3-debian-12-r3 + master: + nodeSelector: + hardware: rpi4 + persistence: + enabled: true + accessModes: [ReadWriteMany] + size: 8Gi + storageClass: astreae + rspamd: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + persistence: + accessModes: [ReadWriteOnce] + size: 8Gi + storageClass: astreae + tika: + logLevel: DEBUG + nodeSelector: + hardware: rpi4 + global: + logLevel: DEBUG + storageClass: astreae + webmail: + enabled: false + nodeSelector: + hardware: rpi4 + ingress: + enabled: false + ingressClassName: traefik + tls: true + existingSecret: mailu-certificates + annotations: + traefik.ingress.kubernetes.io/router.entrypoints: websecure + traefik.ingress.kubernetes.io/service.serversscheme: https + traefik.ingress.kubernetes.io/service.serverstransport: mailu-transport@kubernetescrd + extraRules: + - host: mail.bstein.dev + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mailu-front + port: + number: 443 + service: + ports: + smtp: + port: 25 + targetPort: 25 + smtps: + port: 465 + targetPort: 465 + submission: + port: 587 + targetPort: 587 diff --git a/services/mailu/ingressroute.yaml b/services/mailu/ingressroute.yaml new file mode 100644 index 0000000..d4bc4f6 --- /dev/null +++ b/services/mailu/ingressroute.yaml @@ -0,0 +1,19 @@ +# services/mailu/ingressroute.yaml +apiVersion: traefik.io/v1alpha1 +kind: IngressRoute +metadata: + name: mailu + namespace: mailu-mailserver +spec: + entryPoints: + - websecure + routes: + - match: Host(`mail.bstein.dev`) + kind: Rule + services: + - name: mailu-front + port: 443 + scheme: https + serversTransport: mailu-transport + tls: + secretName: mailu-certificates diff --git a/services/mailu/kustomization.yaml b/services/mailu/kustomization.yaml new file mode 100644 index 0000000..2df7440 --- /dev/null +++ b/services/mailu/kustomization.yaml @@ -0,0 +1,23 @@ +# services/mailu/kustomization.yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: mailu-mailserver +resources: + - namespace.yaml + - helmrelease.yaml + - certificate.yaml + - vip-controller.yaml + - unbound-configmap.yaml + - serverstransport.yaml + - ingressroute.yaml + - mailu-sync-job.yaml + - mailu-sync-cronjob.yaml + - mailu-sync-listener.yaml + +configMapGenerator: + - name: mailu-sync-script + namespace: mailu-mailserver + files: + - sync.py=../../scripts/mailu_sync.py + options: + disableNameSuffixHash: true diff --git a/services/mailu/mailu-sync-cronjob.yaml b/services/mailu/mailu-sync-cronjob.yaml new file mode 100644 index 0000000..268680f --- /dev/null +++ b/services/mailu/mailu-sync-cronjob.yaml @@ -0,0 +1,77 @@ +# services/mailu/mailu-sync-cronjob.yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: mailu-sync-nightly + namespace: mailu-mailserver +spec: + schedule: "30 4 * * *" + concurrencyPolicy: Forbid + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: mailu-sync + image: python:3.11-alpine + imagePullPolicy: IfNotPresent + command: ["/bin/sh", "-c"] + args: + - | + pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \ + && python /app/sync.py + env: + - name: KEYCLOAK_BASE_URL + value: http://keycloak.sso.svc.cluster.local + - name: KEYCLOAK_REALM + value: atlas + - name: MAILU_DOMAIN + value: bstein.dev + - name: MAILU_DEFAULT_QUOTA + value: "20000000000" + - name: MAILU_DB_HOST + value: postgres-service.postgres.svc.cluster.local + - name: MAILU_DB_PORT + value: "5432" + - name: MAILU_DB_NAME + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: database + - name: MAILU_DB_USER + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: username + - name: MAILU_DB_PASSWORD + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: password + - name: KEYCLOAK_CLIENT_ID + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-id + - name: KEYCLOAK_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-secret + volumeMounts: + - name: sync-script + mountPath: /app/sync.py + subPath: sync.py + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 200m + memory: 256Mi + volumes: + - name: sync-script + configMap: + name: mailu-sync-script + defaultMode: 0444 diff --git a/services/mailu/mailu-sync-job.yaml b/services/mailu/mailu-sync-job.yaml new file mode 100644 index 0000000..7230c1d --- /dev/null +++ b/services/mailu/mailu-sync-job.yaml @@ -0,0 +1,73 @@ +# services/mailu/mailu-sync-job.yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: mailu-sync + namespace: mailu-mailserver +spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: mailu-sync + image: python:3.11-alpine + imagePullPolicy: IfNotPresent + command: ["/bin/sh", "-c"] + args: + - | + pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \ + && python /app/sync.py + env: + - name: KEYCLOAK_BASE_URL + value: http://keycloak.sso.svc.cluster.local + - name: KEYCLOAK_REALM + value: atlas + - name: MAILU_DOMAIN + value: bstein.dev + - name: MAILU_DEFAULT_QUOTA + value: "20000000000" + - name: MAILU_DB_HOST + value: postgres-service.postgres.svc.cluster.local + - name: MAILU_DB_PORT + value: "5432" + - name: MAILU_DB_NAME + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: database + - name: MAILU_DB_USER + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: username + - name: MAILU_DB_PASSWORD + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: password + - name: KEYCLOAK_CLIENT_ID + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-id + - name: KEYCLOAK_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-secret + volumeMounts: + - name: sync-script + mountPath: /app/sync.py + subPath: sync.py + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 200m + memory: 256Mi + volumes: + - name: sync-script + configMap: + name: mailu-sync-script + defaultMode: 0444 diff --git a/services/mailu/mailu-sync-listener.yaml b/services/mailu/mailu-sync-listener.yaml new file mode 100644 index 0000000..04e8070 --- /dev/null +++ b/services/mailu/mailu-sync-listener.yaml @@ -0,0 +1,154 @@ +# services/mailu/mailu-sync-listener.yaml +apiVersion: v1 +kind: Service +metadata: + name: mailu-sync-listener + namespace: mailu-mailserver +spec: + selector: + app: mailu-sync-listener + ports: + - name: http + port: 8080 + targetPort: 8080 +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: mailu-sync-listener + namespace: mailu-mailserver + labels: + app: mailu-sync-listener +spec: + replicas: 1 + selector: + matchLabels: + app: mailu-sync-listener + template: + metadata: + labels: + app: mailu-sync-listener + spec: + restartPolicy: Always + containers: + - name: listener + image: python:3.11-alpine + imagePullPolicy: IfNotPresent + command: ["/bin/sh", "-c"] + args: + - | + pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \ + && python /app/listener.py + env: + - name: KEYCLOAK_BASE_URL + value: http://keycloak.sso.svc.cluster.local + - name: KEYCLOAK_REALM + value: atlas + - name: MAILU_DOMAIN + value: bstein.dev + - name: MAILU_DEFAULT_QUOTA + value: "20000000000" + - name: MAILU_DB_HOST + value: postgres-service.postgres.svc.cluster.local + - name: MAILU_DB_PORT + value: "5432" + - name: MAILU_DB_NAME + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: database + - name: MAILU_DB_USER + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: username + - name: MAILU_DB_PASSWORD + valueFrom: + secretKeyRef: + name: mailu-db-secret + key: password + - name: KEYCLOAK_CLIENT_ID + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-id + - name: KEYCLOAK_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: mailu-sync-credentials + key: client-secret + volumeMounts: + - name: sync-script + mountPath: /app/sync.py + subPath: sync.py + - name: listener-script + mountPath: /app/listener.py + subPath: listener.py + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 200m + memory: 256Mi + volumes: + - name: sync-script + configMap: + name: mailu-sync-script + defaultMode: 0444 + - name: listener-script + configMap: + name: mailu-sync-listener + defaultMode: 0444 +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: mailu-sync-listener + namespace: mailu-mailserver +data: + listener.py: | + import http.server + import json + import os + import subprocess + import threading + + from time import time + + # Simple debounce to avoid hammering on bursts + MIN_INTERVAL_SECONDS = 10 + last_run = 0.0 + lock = threading.Lock() + + def trigger_sync(): + global last_run + with lock: + now = time() + if now - last_run < MIN_INTERVAL_SECONDS: + return + last_run = now + # Fire and forget; output to stdout + subprocess.Popen(["python", "/app/sync.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + + class Handler(http.server.BaseHTTPRequestHandler): + def do_POST(self): + length = int(self.headers.get("Content-Length", 0)) + body = self.rfile.read(length) if length else b"" + try: + json.loads(body or b"{}") + except json.JSONDecodeError: + self.send_response(400) + self.end_headers() + return + trigger_sync() + self.send_response(202) + self.end_headers() + + def log_message(self, fmt, *args): + # Quiet logging + return + + if __name__ == "__main__": + server = http.server.ThreadingHTTPServer(("", 8080), Handler) + server.serve_forever() diff --git a/services/mailu/namespace.yaml b/services/mailu/namespace.yaml new file mode 100644 index 0000000..1f3831b --- /dev/null +++ b/services/mailu/namespace.yaml @@ -0,0 +1,5 @@ +# services/mailu/namespace.yaml +apiVersion: v1 +kind: Namespace +metadata: + name: mailu-mailserver diff --git a/services/mailu/serverstransport.yaml b/services/mailu/serverstransport.yaml new file mode 100644 index 0000000..ace7b31 --- /dev/null +++ b/services/mailu/serverstransport.yaml @@ -0,0 +1,10 @@ +# services/mailu/serverstransport.yaml +apiVersion: traefik.io/v1alpha1 +kind: ServersTransport +metadata: + name: mailu-transport + namespace: mailu-mailserver +spec: + # Force SNI to mail.bstein.dev and skip backend cert verification (backend cert is for the host, not the pod IP). + serverName: mail.bstein.dev + insecureSkipVerify: true diff --git a/services/mailu/unbound-configmap.yaml b/services/mailu/unbound-configmap.yaml new file mode 100644 index 0000000..219cafb --- /dev/null +++ b/services/mailu/unbound-configmap.yaml @@ -0,0 +1,49 @@ +# services/mailu/unbound-configmap.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: mailu-unbound + namespace: mailu-mailserver +data: + unbound.conf: | + server: + verbosity: 1 + interface: 0.0.0.0 + do-ip4: yes + do-ip6: no + do-udp: yes + do-tcp: yes + auto-trust-anchor-file: "/var/lib/unbound/root.key" + prefetch: yes + qname-minimisation: yes + harden-dnssec-stripped: yes + val-clean-additional: yes + domain-insecure: "mailu-mailserver.svc.cluster.local." + domain-insecure: "svc.cluster.local." + domain-insecure: "cluster.local." + cache-min-ttl: 120 + cache-max-ttl: 86400 + access-control: 0.0.0.0/0 allow + + forward-zone: + name: "mailu-mailserver.svc.cluster.local." + forward-addr: 10.43.0.10 + forward-no-cache: yes + forward-first: yes + + forward-zone: + name: "svc.cluster.local." + forward-addr: 10.43.0.10 + forward-no-cache: yes + forward-first: yes + + forward-zone: + name: "cluster.local." + forward-addr: 10.43.0.10 + forward-no-cache: yes + forward-first: yes + + forward-zone: + name: "." + forward-addr: 9.9.9.9 + forward-addr: 1.1.1.1 diff --git a/services/mailu/vip-controller.yaml b/services/mailu/vip-controller.yaml new file mode 100644 index 0000000..a6d8c1f --- /dev/null +++ b/services/mailu/vip-controller.yaml @@ -0,0 +1,71 @@ +# services/mailu/vip-controller.yaml +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: vip-controller + namespace: mailu-mailserver +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: vip-controller-role + namespace: mailu-mailserver +rules: + - apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["get", "list", "patch", "update"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: vip-controller-binding + namespace: mailu-mailserver +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: vip-controller-role +subjects: + - kind: ServiceAccount + name: vip-controller + namespace: mailu-mailserver +--- +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: vip-controller + namespace: mailu-mailserver +spec: + selector: + matchLabels: + app: vip-controller + template: + metadata: + labels: + app: vip-controller + spec: + serviceAccountName: vip-controller + hostNetwork: true + nodeSelector: + mailu.bstein.dev/vip: "true" + containers: + - name: vip-controller + image: lachlanevenson/k8s-kubectl:latest + imagePullPolicy: IfNotPresent + command: + - /bin/sh + - -c + args: + - | + set -e + while true; do + if ip addr show end0 | grep -q 'inet 192\.168\.22\.9/32'; then + NODE=$(hostname) + echo "VIP found on node ${NODE}." + kubectl patch deployment mailu-front -n mailu-mailserver --type='merge' \ + -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"${NODE}\"}}}}}" + else + echo "No VIP on node ${HOSTNAME}." + fi + sleep 60 + done diff --git a/services/monitoring/README.md b/services/monitoring/README.md deleted file mode 100644 index 835ae1d..0000000 --- a/services/monitoring/README.md +++ /dev/null @@ -1,28 +0,0 @@ -# services/monitoring - -## Grafana admin secret - -The Grafana Helm release expects a pre-existing secret named `grafana-admin` -in the `monitoring` namespace. Create or rotate it with: - -```bash -kubectl create secret generic grafana-admin \ - --namespace monitoring \ - --from-literal=admin-user=admin \ - --from-literal=admin-password='REPLACE_ME' -``` - -Update the password whenever you rotate credentials. - -## DCGM exporter image - -The NVIDIA GPU metrics DaemonSet expects `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`, mirrored from `docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04`. Refresh it in Zot when bumping versions: - -```bash -skopeo copy \ - --all \ - docker://docker.io/nvidia/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 \ - docker://registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 -``` - -When finished mirroring from the control-plane, you can remove temporary tooling with `sudo apt-get purge -y skopeo && sudo apt-get autoremove -y` and clear `~/.config/containers/auth.json`. diff --git a/services/monitoring/dashboards/atlas-gpu.json b/services/monitoring/dashboards/atlas-gpu.json index 9071b0a..572c2c6 100644 --- a/services/monitoring/dashboards/atlas-gpu.json +++ b/services/monitoring/dashboards/atlas-gpu.json @@ -40,9 +40,7 @@ "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -153,12 +151,16 @@ ], "fieldConfig": { "defaults": { - "unit": "percent" + "unit": "percent", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { diff --git a/services/monitoring/dashboards/atlas-network.json b/services/monitoring/dashboards/atlas-network.json index ff0af9b..09e9383 100644 --- a/services/monitoring/dashboards/atlas-network.json +++ b/services/monitoring/dashboards/atlas-network.json @@ -7,46 +7,55 @@ { "id": 1, "type": "stat", - "title": "Ingress Traffic", + "title": "Ingress Success Rate (5m)", "datasource": { "type": "prometheus", "uid": "atlas-vm" }, "gridPos": { "h": 4, - "w": 8, + "w": 6, "x": 0, "y": 0 }, "targets": [ { - "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { - "color": "rgba(115, 115, 115, 1)", + "color": "red", "value": null }, + { + "color": "orange", + "value": 0.995 + }, + { + "color": "yellow", + "value": 0.999 + }, { "color": "green", - "value": 1 + "value": 0.9995 } ] }, - "unit": "Bps", + "unit": "percentunit", "custom": { "displayMode": "auto" - } + }, + "decimals": 2 }, "overrides": [] }, @@ -67,46 +76,55 @@ { "id": 2, "type": "stat", - "title": "Egress Traffic", + "title": "Error Budget Burn (1h)", "datasource": { "type": "prometheus", "uid": "atlas-vm" }, "gridPos": { "h": 4, - "w": 8, - "x": 8, + "w": 6, + "x": 6, "y": 0 }, "targets": [ { - "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.0010000000000000009", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { - "color": "rgba(115, 115, 115, 1)", + "color": "green", "value": null }, { - "color": "green", + "color": "yellow", "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 4 } ] }, - "unit": "Bps", + "unit": "none", "custom": { "displayMode": "auto" - } + }, + "decimals": 2 }, "overrides": [] }, @@ -127,7 +145,145 @@ { "id": 3, "type": "stat", - "title": "Intra-Cluster Traffic", + "title": "Error Budget Burn (6h)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.0010000000000000009", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 4 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + }, + "decimals": 2 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Edge P99 Latency (ms)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 200 + }, + { + "color": "orange", + "value": 350 + }, + { + "color": "red", + "value": 500 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "stat", + "title": "Ingress Traffic", "datasource": { "type": "prometheus", "uid": "atlas-vm" @@ -135,19 +291,19 @@ "gridPos": { "h": 4, "w": 8, - "x": 16, - "y": 0 + "x": 0, + "y": 4 }, "targets": [ { - "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -185,9 +341,9 @@ } }, { - "id": 4, + "id": 6, "type": "stat", - "title": "Top Router req/s", + "title": "Egress Traffic", "datasource": { "type": "prometheus", "uid": "atlas-vm" @@ -195,20 +351,19 @@ "gridPos": { "h": 4, "w": 8, - "x": 0, + "x": 8, "y": 4 }, "targets": [ { - "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))", - "refId": "A", - "legendFormat": "{{router}}" + "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -224,7 +379,7 @@ } ] }, - "unit": "req/s", + "unit": "Bps", "custom": { "displayMode": "auto" } @@ -246,7 +401,67 @@ } }, { - "id": 5, + "id": 7, + "type": "stat", + "title": "Intra-Cluster Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 4 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 8, "type": "timeseries", "title": "Per-Node Throughput", "datasource": { @@ -283,7 +498,7 @@ } }, { - "id": 6, + "id": 9, "type": "table", "title": "Top Namespaces", "datasource": { @@ -304,12 +519,16 @@ ], "fieldConfig": { "defaults": { - "unit": "Bps" + "unit": "Bps", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -319,7 +538,7 @@ ] }, { - "id": 7, + "id": 10, "type": "table", "title": "Top Pods", "datasource": { @@ -340,12 +559,16 @@ ], "fieldConfig": { "defaults": { - "unit": "Bps" + "unit": "Bps", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -355,7 +578,7 @@ ] }, { - "id": 8, + "id": 11, "type": "timeseries", "title": "Traefik Routers (req/s)", "datasource": { @@ -392,7 +615,7 @@ } }, { - "id": 9, + "id": 12, "type": "timeseries", "title": "Traefik Entrypoints (req/s)", "datasource": { diff --git a/services/monitoring/dashboards/atlas-nodes.json b/services/monitoring/dashboards/atlas-nodes.json index 802fe5a..495c622 100644 --- a/services/monitoring/dashboards/atlas-nodes.json +++ b/services/monitoring/dashboards/atlas-nodes.json @@ -27,7 +27,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -88,7 +88,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -149,7 +149,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -186,6 +186,213 @@ "textMode": "value" } }, + { + "id": 9, + "type": "stat", + "title": "API Server 5xx rate", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 0.05 + }, + { + "color": "orange", + "value": 0.2 + }, + { + "color": "red", + "value": 0.5 + } + ] + }, + "unit": "req/s", + "custom": { + "displayMode": "auto" + }, + "decimals": 3 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 10, + "type": "stat", + "title": "API Server P99 latency", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 4 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 250 + }, + { + "color": "orange", + "value": 400 + }, + { + "color": "red", + "value": 600 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 11, + "type": "stat", + "title": "etcd P99 latency", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 4 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 100 + }, + { + "color": "red", + "value": 200 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, { "id": 4, "type": "timeseries", @@ -198,7 +405,7 @@ "h": 9, "w": 24, "x": 0, - "y": 4 + "y": 8 }, "targets": [ { @@ -238,7 +445,7 @@ "h": 9, "w": 24, "x": 0, - "y": 13 + "y": 17 }, "targets": [ { @@ -278,7 +485,7 @@ "h": 9, "w": 12, "x": 0, - "y": 22 + "y": 26 }, "targets": [ { @@ -315,7 +522,7 @@ "h": 9, "w": 12, "x": 12, - "y": 22 + "y": 26 }, "targets": [ { @@ -352,7 +559,7 @@ "h": 9, "w": 24, "x": 0, - "y": 31 + "y": 35 }, "targets": [ { diff --git a/services/monitoring/dashboards/atlas-overview.json b/services/monitoring/dashboards/atlas-overview.json index beb676e..93ee927 100644 --- a/services/monitoring/dashboards/atlas-overview.json +++ b/services/monitoring/dashboards/atlas-overview.json @@ -7,67 +7,6 @@ "list": [] }, "panels": [ - { - "id": 1, - "type": "gauge", - "title": "Workers Ready", - "datasource": { - "type": "prometheus", - "uid": "atlas-vm" - }, - "gridPos": { - "h": 5, - "w": 5, - "x": 0, - "y": 0 - }, - "targets": [ - { - "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", - "refId": "A" - } - ], - "fieldConfig": { - "defaults": { - "min": 0, - "max": 18, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "orange", - "value": 16 - }, - { - "color": "yellow", - "value": 17 - }, - { - "color": "green", - "value": 18 - } - ] - } - }, - "overrides": [] - }, - "options": { - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "orientation": "auto", - "showThresholdMarkers": false, - "showThresholdLabels": false - } - }, { "id": 2, "type": "gauge", @@ -78,8 +17,8 @@ }, "gridPos": { "h": 5, - "w": 5, - "x": 5, + "w": 4, + "x": 0, "y": 0 }, "targets": [ @@ -131,8 +70,8 @@ }, "gridPos": { "h": 5, - "w": 5, - "x": 10, + "w": 3, + "x": 4, "y": 0 }, "targets": [ @@ -144,82 +83,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "yellow", - "value": 1 - }, - { - "color": "orange", - "value": 2 - }, - { - "color": "red", - "value": 3 - } - ] - }, - "unit": "none", - "custom": { - "displayMode": "auto" - } - }, - "overrides": [] - }, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "center", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "links": [ - { - "title": "Open atlas-pods dashboard", - "url": "/d/atlas-pods", - "targetBlank": true - } - ] - }, - { - "id": 4, - "type": "stat", - "title": "Problem Pods", - "datasource": { - "type": "prometheus", - "uid": "atlas-vm" - }, - "gridPos": { - "h": 5, - "w": 5, - "x": 15, - "y": 0 - }, - "targets": [ - { - "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", - "refId": "A" - } - ], - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -281,20 +145,20 @@ }, "gridPos": { "h": 5, - "w": 4, - "x": 20, + "w": 3, + "x": 7, "y": 0 }, "targets": [ { - "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -346,6 +210,286 @@ } ] }, + { + "id": 27, + "type": "stat", + "title": "Atlas Availability (30d)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 10, + "y": 0 + }, + "targets": [ + { + "expr": "avg_over_time((min(((sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"}) / 3)), ((sum(kube_deployment_status_replicas_available{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}) / clamp_min(sum(kube_deployment_spec_replicas{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}), 1)))))[30d:5m])", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 0.999 + }, + { + "color": "yellow", + "value": 0.9999 + }, + { + "color": "green", + "value": 0.99999 + } + ] + }, + "unit": "percentunit", + "custom": { + "displayMode": "auto" + }, + "decimals": 3 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 3, + "x": 14, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 6, + "type": "stat", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 3, + "x": 17, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 1, + "type": "gauge", + "title": "Workers Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 18, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 16 + }, + { + "color": "yellow", + "value": 17 + }, + { + "color": "green", + "value": 18 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, { "id": 7, "type": "stat", @@ -371,11 +515,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -383,11 +527,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -444,11 +592,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -456,11 +604,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -517,7 +669,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -586,7 +738,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -653,11 +805,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -665,11 +817,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -724,11 +880,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -736,11 +892,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -795,7 +955,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -862,7 +1022,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -942,9 +1102,7 @@ "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -995,9 +1153,7 @@ "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -1048,9 +1204,7 @@ "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -1175,7 +1329,7 @@ }, "targets": [ { - "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", "refId": "A", "legendFormat": "{{node}}" } @@ -1212,7 +1366,7 @@ }, "targets": [ { - "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", "refId": "A", "legendFormat": "{{node}}" } @@ -1233,6 +1387,138 @@ } } }, + { + "id": 28, + "type": "piechart", + "title": "Node Pod Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 54 + }, + "targets": [ + { + "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 29, + "type": "bargauge", + "title": "Top Nodes by Pod Count", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 54 + }, + "targets": [ + { + "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "min": 0, + "max": null, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 75 + }, + { + "color": "red", + "value": 100 + } + ] + }, + "decimals": 0 + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "limit", + "options": { + "limit": 12 + } + } + ] + }, { "id": 18, "type": "timeseries", @@ -1377,7 +1663,7 @@ "h": 16, "w": 12, "x": 0, - "y": 54 + "y": 64 }, "targets": [ { @@ -1425,7 +1711,7 @@ "h": 16, "w": 12, "x": 12, - "y": 54 + "y": 64 }, "targets": [ { @@ -1452,11 +1738,11 @@ }, { "color": "orange", - "value": 70 + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] } @@ -1480,6 +1766,17 @@ "url": "/d/atlas-storage", "targetBlank": true } + ], + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + } ] } ], @@ -1497,36 +1794,5 @@ "to": "now" }, "refresh": "1m", - "links": [ - { - "title": "Atlas Pods", - "type": "dashboard", - "dashboardUid": "atlas-pods", - "keepTime": false - }, - { - "title": "Atlas Nodes", - "type": "dashboard", - "dashboardUid": "atlas-nodes", - "keepTime": false - }, - { - "title": "Atlas Storage", - "type": "dashboard", - "dashboardUid": "atlas-storage", - "keepTime": false - }, - { - "title": "Atlas Network", - "type": "dashboard", - "dashboardUid": "atlas-network", - "keepTime": false - }, - { - "title": "Atlas GPU", - "type": "dashboard", - "dashboardUid": "atlas-gpu", - "keepTime": false - } - ] + "links": [] } diff --git a/services/monitoring/dashboards/atlas-pods.json b/services/monitoring/dashboards/atlas-pods.json index ef616e0..4b2a54a 100644 --- a/services/monitoring/dashboards/atlas-pods.json +++ b/services/monitoring/dashboards/atlas-pods.json @@ -20,14 +20,14 @@ }, "targets": [ { - "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -80,14 +80,14 @@ }, "targets": [ { - "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))", + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -140,14 +140,14 @@ }, "targets": [ { - "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -207,7 +207,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -266,12 +266,16 @@ ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -302,12 +306,16 @@ ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -338,12 +346,16 @@ ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -359,6 +371,233 @@ } } ] + }, + { + "id": 8, + "type": "piechart", + "title": "Node Pod Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 34 + }, + "targets": [ + { + "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 9, + "type": "bargauge", + "title": "Top Nodes by Pod Count", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 34 + }, + "targets": [ + { + "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "min": 0, + "max": null, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 75 + }, + { + "color": "red", + "value": 100 + } + ] + }, + "decimals": 0 + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "limit", + "options": { + "limit": 12 + } + } + ] + }, + { + "id": 10, + "type": "table", + "title": "Namespace Plurality by Node v27", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 42 + }, + "targets": [ + { + "expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))", + "refId": "A", + "instant": true, + "format": "table" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "custom": { + "filterable": false + } + }, + "overrides": [] + }, + "options": { + "showHeader": true, + "columnFilters": false, + "showColumnFilters": false, + "footer": { + "show": false, + "fields": "", + "calcs": [] + } + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true + } + } + }, + { + "id": "filterByValue", + "options": { + "match": "Value", + "operator": "gt", + "value": 0 + } + }, + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "namespace": { + "aggregations": [ + { + "field": "Value", + "operation": "max" + }, + { + "field": "node", + "operation": "first" + } + ] + } + }, + "rowBy": [ + "namespace" + ] + } + } + ] } ], "time": { diff --git a/services/monitoring/dashboards/atlas-storage.json b/services/monitoring/dashboards/atlas-storage.json index 1d07040..2e548b2 100644 --- a/services/monitoring/dashboards/atlas-storage.json +++ b/services/monitoring/dashboards/atlas-storage.json @@ -27,11 +27,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -39,11 +39,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -91,11 +95,11 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -103,11 +107,15 @@ }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -155,7 +163,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -215,7 +223,7 @@ "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { diff --git a/services/monitoring/grafana-dashboard-gpu.yaml b/services/monitoring/grafana-dashboard-gpu.yaml index b5c2c18..48725de 100644 --- a/services/monitoring/grafana-dashboard-gpu.yaml +++ b/services/monitoring/grafana-dashboard-gpu.yaml @@ -49,9 +49,7 @@ data: "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -162,12 +160,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "percent" + "unit": "percent", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { diff --git a/services/monitoring/grafana-dashboard-network.yaml b/services/monitoring/grafana-dashboard-network.yaml index fd1f5d6..a87600f 100644 --- a/services/monitoring/grafana-dashboard-network.yaml +++ b/services/monitoring/grafana-dashboard-network.yaml @@ -16,46 +16,55 @@ data: { "id": 1, "type": "stat", - "title": "Ingress Traffic", + "title": "Ingress Success Rate (5m)", "datasource": { "type": "prometheus", "uid": "atlas-vm" }, "gridPos": { "h": 4, - "w": 8, + "w": 6, "x": 0, "y": 0 }, "targets": [ { - "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { - "color": "rgba(115, 115, 115, 1)", + "color": "red", "value": null }, + { + "color": "orange", + "value": 0.995 + }, + { + "color": "yellow", + "value": 0.999 + }, { "color": "green", - "value": 1 + "value": 0.9995 } ] }, - "unit": "Bps", + "unit": "percentunit", "custom": { "displayMode": "auto" - } + }, + "decimals": 2 }, "overrides": [] }, @@ -76,46 +85,55 @@ data: { "id": 2, "type": "stat", - "title": "Egress Traffic", + "title": "Error Budget Burn (1h)", "datasource": { "type": "prometheus", "uid": "atlas-vm" }, "gridPos": { "h": 4, - "w": 8, - "x": 8, + "w": 6, + "x": 6, "y": 0 }, "targets": [ { - "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.0010000000000000009", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { - "color": "rgba(115, 115, 115, 1)", + "color": "green", "value": null }, { - "color": "green", + "color": "yellow", "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 4 } ] }, - "unit": "Bps", + "unit": "none", "custom": { "displayMode": "auto" - } + }, + "decimals": 2 }, "overrides": [] }, @@ -136,7 +154,145 @@ data: { "id": 3, "type": "stat", - "title": "Intra-Cluster Traffic", + "title": "Error Budget Burn (6h)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 12, + "y": 0 + }, + "targets": [ + { + "expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.0010000000000000009", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 4 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + }, + "decimals": 2 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Edge P99 Latency (ms)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 6, + "x": 18, + "y": 0 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 200 + }, + { + "color": "orange", + "value": 350 + }, + { + "color": "red", + "value": 500 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 5, + "type": "stat", + "title": "Ingress Traffic", "datasource": { "type": "prometheus", "uid": "atlas-vm" @@ -144,19 +300,19 @@ data: "gridPos": { "h": 4, "w": 8, - "x": 16, - "y": 0 + "x": 0, + "y": 4 }, "targets": [ { - "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -194,9 +350,9 @@ data: } }, { - "id": 4, + "id": 6, "type": "stat", - "title": "Top Router req/s", + "title": "Egress Traffic", "datasource": { "type": "prometheus", "uid": "atlas-vm" @@ -204,20 +360,19 @@ data: "gridPos": { "h": 4, "w": 8, - "x": 0, + "x": 8, "y": 4 }, "targets": [ { - "expr": "topk(1, sum by (router) (rate(traefik_router_requests_total[5m])))", - "refId": "A", - "legendFormat": "{{router}}" + "expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)", + "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -233,7 +388,7 @@ data: } ] }, - "unit": "req/s", + "unit": "Bps", "custom": { "displayMode": "auto" } @@ -255,7 +410,67 @@ data: } }, { - "id": 5, + "id": 7, + "type": "stat", + "title": "Intra-Cluster Traffic", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 4 + }, + "targets": [ + { + "expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(115, 115, 115, 1)", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "Bps", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 8, "type": "timeseries", "title": "Per-Node Throughput", "datasource": { @@ -292,7 +507,7 @@ data: } }, { - "id": 6, + "id": 9, "type": "table", "title": "Top Namespaces", "datasource": { @@ -313,12 +528,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "Bps" + "unit": "Bps", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -328,7 +547,7 @@ data: ] }, { - "id": 7, + "id": 10, "type": "table", "title": "Top Pods", "datasource": { @@ -349,12 +568,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "Bps" + "unit": "Bps", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -364,7 +587,7 @@ data: ] }, { - "id": 8, + "id": 11, "type": "timeseries", "title": "Traefik Routers (req/s)", "datasource": { @@ -401,7 +624,7 @@ data: } }, { - "id": 9, + "id": 12, "type": "timeseries", "title": "Traefik Entrypoints (req/s)", "datasource": { diff --git a/services/monitoring/grafana-dashboard-nodes.yaml b/services/monitoring/grafana-dashboard-nodes.yaml index 2facfed..542daca 100644 --- a/services/monitoring/grafana-dashboard-nodes.yaml +++ b/services/monitoring/grafana-dashboard-nodes.yaml @@ -36,7 +36,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -97,7 +97,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -158,7 +158,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -195,6 +195,213 @@ data: "textMode": "value" } }, + { + "id": 9, + "type": "stat", + "title": "API Server 5xx rate", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 0, + "y": 4 + }, + "targets": [ + { + "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 0.05 + }, + { + "color": "orange", + "value": 0.2 + }, + { + "color": "red", + "value": 0.5 + } + ] + }, + "unit": "req/s", + "custom": { + "displayMode": "auto" + }, + "decimals": 3 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 10, + "type": "stat", + "title": "API Server P99 latency", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 8, + "y": 4 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 250 + }, + { + "color": "orange", + "value": 400 + }, + { + "color": "red", + "value": 600 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 11, + "type": "stat", + "title": "etcd P99 latency", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 4, + "w": 8, + "x": 16, + "y": 4 + }, + "targets": [ + { + "expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 100 + }, + { + "color": "red", + "value": 200 + } + ] + }, + "unit": "ms", + "custom": { + "displayMode": "auto" + }, + "decimals": 1 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, { "id": 4, "type": "timeseries", @@ -207,7 +414,7 @@ data: "h": 9, "w": 24, "x": 0, - "y": 4 + "y": 8 }, "targets": [ { @@ -247,7 +454,7 @@ data: "h": 9, "w": 24, "x": 0, - "y": 13 + "y": 17 }, "targets": [ { @@ -287,7 +494,7 @@ data: "h": 9, "w": 12, "x": 0, - "y": 22 + "y": 26 }, "targets": [ { @@ -324,7 +531,7 @@ data: "h": 9, "w": 12, "x": 12, - "y": 22 + "y": 26 }, "targets": [ { @@ -361,7 +568,7 @@ data: "h": 9, "w": 24, "x": 0, - "y": 31 + "y": 35 }, "targets": [ { diff --git a/services/monitoring/grafana-dashboard-overview.yaml b/services/monitoring/grafana-dashboard-overview.yaml index ef17ebf..f4165bd 100644 --- a/services/monitoring/grafana-dashboard-overview.yaml +++ b/services/monitoring/grafana-dashboard-overview.yaml @@ -16,67 +16,6 @@ data: "list": [] }, "panels": [ - { - "id": 1, - "type": "gauge", - "title": "Workers Ready", - "datasource": { - "type": "prometheus", - "uid": "atlas-vm" - }, - "gridPos": { - "h": 5, - "w": 5, - "x": 0, - "y": 0 - }, - "targets": [ - { - "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", - "refId": "A" - } - ], - "fieldConfig": { - "defaults": { - "min": 0, - "max": 18, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "orange", - "value": 16 - }, - { - "color": "yellow", - "value": 17 - }, - { - "color": "green", - "value": 18 - } - ] - } - }, - "overrides": [] - }, - "options": { - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "orientation": "auto", - "showThresholdMarkers": false, - "showThresholdLabels": false - } - }, { "id": 2, "type": "gauge", @@ -87,8 +26,8 @@ data: }, "gridPos": { "h": 5, - "w": 5, - "x": 5, + "w": 4, + "x": 0, "y": 0 }, "targets": [ @@ -140,8 +79,8 @@ data: }, "gridPos": { "h": 5, - "w": 5, - "x": 10, + "w": 3, + "x": 4, "y": 0 }, "targets": [ @@ -153,82 +92,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "yellow", - "value": 1 - }, - { - "color": "orange", - "value": 2 - }, - { - "color": "red", - "value": 3 - } - ] - }, - "unit": "none", - "custom": { - "displayMode": "auto" - } - }, - "overrides": [] - }, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "center", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "links": [ - { - "title": "Open atlas-pods dashboard", - "url": "/d/atlas-pods", - "targetBlank": true - } - ] - }, - { - "id": 4, - "type": "stat", - "title": "Problem Pods", - "datasource": { - "type": "prometheus", - "uid": "atlas-vm" - }, - "gridPos": { - "h": 5, - "w": 5, - "x": 15, - "y": 0 - }, - "targets": [ - { - "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", - "refId": "A" - } - ], - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -290,20 +154,20 @@ data: }, "gridPos": { "h": 5, - "w": 4, - "x": 20, + "w": 3, + "x": 7, "y": 0 }, "targets": [ { - "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -355,6 +219,286 @@ data: } ] }, + { + "id": 27, + "type": "stat", + "title": "Atlas Availability (30d)", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 10, + "y": 0 + }, + "targets": [ + { + "expr": "avg_over_time((min(((sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"}) / 3)), ((sum(kube_deployment_status_replicas_available{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}) / clamp_min(sum(kube_deployment_spec_replicas{namespace=~\"traefik|kube-system\",deployment=\"traefik\"}), 1)))))[30d:5m])", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 0.999 + }, + { + "color": "yellow", + "value": 0.9999 + }, + { + "color": "green", + "value": 0.99999 + } + ] + }, + "unit": "percentunit", + "custom": { + "displayMode": "auto" + }, + "decimals": 3 + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + } + }, + { + "id": 4, + "type": "stat", + "title": "Problem Pods", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 3, + "x": 14, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 6, + "type": "stat", + "title": "CrashLoop / ImagePull", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 3, + "x": 17, + "y": 0 + }, + "targets": [ + { + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1 + }, + { + "color": "orange", + "value": 2 + }, + { + "color": "red", + "value": 3 + } + ] + }, + "unit": "none", + "custom": { + "displayMode": "auto" + } + }, + "overrides": [] + }, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "value" + }, + "links": [ + { + "title": "Open atlas-pods dashboard", + "url": "/d/atlas-pods", + "targetBlank": true + } + ] + }, + { + "id": 1, + "type": "gauge", + "title": "Workers Ready", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 0 + }, + "targets": [ + { + "expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})", + "refId": "A" + } + ], + "fieldConfig": { + "defaults": { + "min": 0, + "max": 18, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "orange", + "value": 16 + }, + { + "color": "yellow", + "value": 17 + }, + { + "color": "green", + "value": 18 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "showThresholdMarkers": false, + "showThresholdLabels": false + } + }, { "id": 7, "type": "stat", @@ -380,11 +524,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -392,11 +536,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -453,11 +601,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -465,11 +613,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -526,7 +678,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -595,7 +747,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -662,11 +814,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -674,11 +826,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -733,11 +889,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -745,11 +901,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -804,7 +964,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -871,7 +1031,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -951,9 +1111,7 @@ data: "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -1004,9 +1162,7 @@ data: "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -1057,9 +1213,7 @@ data: "placement": "right" }, "pieType": "pie", - "displayLabels": [ - "percent" - ], + "displayLabels": [], "tooltip": { "mode": "single" }, @@ -1184,7 +1338,7 @@ data: }, "targets": [ { - "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", "refId": "A", "legendFormat": "{{node}}" } @@ -1221,7 +1375,7 @@ data: }, "targets": [ { - "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", + "expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")", "refId": "A", "legendFormat": "{{node}}" } @@ -1242,6 +1396,138 @@ data: } } }, + { + "id": 28, + "type": "piechart", + "title": "Node Pod Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 54 + }, + "targets": [ + { + "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 29, + "type": "bargauge", + "title": "Top Nodes by Pod Count", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 54 + }, + "targets": [ + { + "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "min": 0, + "max": null, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 75 + }, + { + "color": "red", + "value": 100 + } + ] + }, + "decimals": 0 + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "limit", + "options": { + "limit": 12 + } + } + ] + }, { "id": 18, "type": "timeseries", @@ -1386,7 +1672,7 @@ data: "h": 16, "w": 12, "x": 0, - "y": 54 + "y": 64 }, "targets": [ { @@ -1434,7 +1720,7 @@ data: "h": 16, "w": 12, "x": 12, - "y": 54 + "y": 64 }, "targets": [ { @@ -1461,11 +1747,11 @@ data: }, { "color": "orange", - "value": 70 + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] } @@ -1489,6 +1775,17 @@ data: "url": "/d/atlas-storage", "targetBlank": true } + ], + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + } ] } ], @@ -1506,36 +1803,5 @@ data: "to": "now" }, "refresh": "1m", - "links": [ - { - "title": "Atlas Pods", - "type": "dashboard", - "dashboardUid": "atlas-pods", - "keepTime": false - }, - { - "title": "Atlas Nodes", - "type": "dashboard", - "dashboardUid": "atlas-nodes", - "keepTime": false - }, - { - "title": "Atlas Storage", - "type": "dashboard", - "dashboardUid": "atlas-storage", - "keepTime": false - }, - { - "title": "Atlas Network", - "type": "dashboard", - "dashboardUid": "atlas-network", - "keepTime": false - }, - { - "title": "Atlas GPU", - "type": "dashboard", - "dashboardUid": "atlas-gpu", - "keepTime": false - } - ] + "links": [] } diff --git a/services/monitoring/grafana-dashboard-pods.yaml b/services/monitoring/grafana-dashboard-pods.yaml index f92adf1..b7c49d5 100644 --- a/services/monitoring/grafana-dashboard-pods.yaml +++ b/services/monitoring/grafana-dashboard-pods.yaml @@ -29,14 +29,14 @@ data: }, "targets": [ { - "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"}))", + "expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -89,14 +89,14 @@ data: }, "targets": [ { - "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"}))", + "expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -149,14 +149,14 @@ data: }, "targets": [ { - "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)))", + "expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)", "refId": "A" } ], "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -216,7 +216,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -275,12 +275,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -311,12 +315,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -347,12 +355,16 @@ data: ], "fieldConfig": { "defaults": { - "unit": "s" + "unit": "s", + "custom": { + "filterable": true + } }, "overrides": [] }, "options": { - "showHeader": true + "showHeader": true, + "columnFilters": false }, "transformations": [ { @@ -368,6 +380,233 @@ data: } } ] + }, + { + "id": 8, + "type": "piechart", + "title": "Node Pod Share", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 34 + }, + "targets": [ + { + "expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100", + "refId": "A", + "legendFormat": "{{namespace}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "right" + }, + "pieType": "pie", + "displayLabels": [], + "tooltip": { + "mode": "single" + }, + "colorScheme": "interpolateSpectral", + "colorBy": "value", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + } + }, + { + "id": 9, + "type": "bargauge", + "title": "Top Nodes by Pod Count", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 34 + }, + "targets": [ + { + "expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))", + "refId": "A", + "legendFormat": "{{node}}", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "min": 0, + "max": null, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "orange", + "value": 75 + }, + { + "color": "red", + "value": 100 + } + ] + }, + "decimals": 0 + }, + "overrides": [] + }, + "options": { + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + } + }, + "transformations": [ + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "limit", + "options": { + "limit": 12 + } + } + ] + }, + { + "id": 10, + "type": "table", + "title": "Namespace Plurality by Node v27", + "datasource": { + "type": "prometheus", + "uid": "atlas-vm" + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 42 + }, + "targets": [ + { + "expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))", + "refId": "A", + "instant": true, + "format": "table" + } + ], + "fieldConfig": { + "defaults": { + "unit": "percent", + "custom": { + "filterable": false + } + }, + "overrides": [] + }, + "options": { + "showHeader": true, + "columnFilters": false, + "showColumnFilters": false, + "footer": { + "show": false, + "fields": "", + "calcs": [] + } + }, + "transformations": [ + { + "id": "labelsToFields", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true + } + } + }, + { + "id": "filterByValue", + "options": { + "match": "Value", + "operator": "gt", + "value": 0 + } + }, + { + "id": "sortBy", + "options": { + "fields": [ + "Value" + ], + "order": "desc" + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "namespace": { + "aggregations": [ + { + "field": "Value", + "operation": "max" + }, + { + "field": "node", + "operation": "first" + } + ] + } + }, + "rowBy": [ + "namespace" + ] + } + } + ] } ], "time": { diff --git a/services/monitoring/grafana-dashboard-storage.yaml b/services/monitoring/grafana-dashboard-storage.yaml index 0a534f2..8aef820 100644 --- a/services/monitoring/grafana-dashboard-storage.yaml +++ b/services/monitoring/grafana-dashboard-storage.yaml @@ -36,11 +36,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -48,11 +48,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -100,11 +104,11 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { - "mode": "percentage", + "mode": "absolute", "steps": [ { "color": "green", @@ -112,11 +116,15 @@ data: }, { "color": "yellow", - "value": 70 + "value": 50 + }, + { + "color": "orange", + "value": 75 }, { "color": "red", - "value": 85 + "value": 91.5 } ] }, @@ -164,7 +172,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { @@ -224,7 +232,7 @@ data: "fieldConfig": { "defaults": { "color": { - "mode": "palette-classic" + "mode": "thresholds" }, "mappings": [], "thresholds": { diff --git a/services/monitoring/helmrelease.yaml b/services/monitoring/helmrelease.yaml index d7d7579..3fd76db 100644 --- a/services/monitoring/helmrelease.yaml +++ b/services/monitoring/helmrelease.yaml @@ -65,13 +65,13 @@ spec: namespace: flux-system values: server: - # keep ~3 months; change as you like (supports "d", "y") + # keep 1 year; supports "d", "y" extraArgs: - retentionPeriod: "90d" # VM flag -retentionPeriod=90d. :contentReference[oaicite:11]{index=11} + retentionPeriod: "1y" # VM flag -retentionPeriod=1y. :contentReference[oaicite:11]{index=11} persistentVolume: enabled: true - size: 100Gi + size: 250Gi # Enable built-in Kubernetes scraping scrape: @@ -186,6 +186,15 @@ spec: - targets: ["longhorn-backend.longhorn-system.svc:9500"] metrics_path: /metrics + # --- titan-db node_exporter (external control-plane DB host) --- + - job_name: "titan-db" + static_configs: + - targets: ["192.168.22.10:9100"] + relabel_configs: + - source_labels: [__address__] + target_label: instance + replacement: titan-db + # --- cert-manager (pods expose on 9402) --- - job_name: "cert-manager" kubernetes_sd_configs: [{ role: pod }] @@ -209,16 +218,6 @@ spec: - action: keep source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_part_of] regex: flux-system;flux - - job_name: "titan-db" - static_configs: - - targets: ["titan-db:9100"] - relabel_configs: - - source_labels: [__address__] - target_label: instance - metric_relabel_configs: - - source_labels: [instance] - target_label: node - replacement: titan-db --- diff --git a/services/nextcloud/configmap.yaml b/services/nextcloud/configmap.yaml new file mode 100644 index 0000000..a6e917c --- /dev/null +++ b/services/nextcloud/configmap.yaml @@ -0,0 +1,48 @@ +# services/nextcloud/configmap.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: nextcloud-config + namespace: nextcloud +data: + extra.config.php: | + + array ( + 0 => 'cloud.bstein.dev', + ), + 'overwritehost' => 'cloud.bstein.dev', + 'overwriteprotocol' => 'https', + 'overwrite.cli.url' => 'https://cloud.bstein.dev', + 'default_phone_region' => 'US', + 'mail_smtpmode' => 'smtp', + 'mail_sendmailmode' => 'smtp', + 'mail_smtphost' => 'mail.bstein.dev', + 'mail_smtpport' => '587', + 'mail_smtpsecure' => 'tls', + 'mail_smtpauth' => true, + 'mail_smtpauthtype' => 'LOGIN', + 'mail_domain' => 'bstein.dev', + 'mail_from_address' => 'no-reply', + 'oidc_login_provider_url' => 'https://sso.bstein.dev/realms/atlas', + 'oidc_login_client_id' => getenv('OIDC_CLIENT_ID'), + 'oidc_login_client_secret' => getenv('OIDC_CLIENT_SECRET'), + 'oidc_login_auto_redirect' => false, + 'oidc_login_end_session_redirect' => true, + 'oidc_login_button_text' => 'Login with Keycloak', + 'oidc_login_hide_password_form' => false, + 'oidc_login_attributes' => + array ( + 'id' => 'preferred_username', + 'mail' => 'email', + 'name' => 'name', + ), + 'oidc_login_scope' => 'openid profile email', + 'oidc_login_unique_id' => 'preferred_username', + 'oidc_login_use_pkce' => true, + 'oidc_login_disable_registration' => false, + 'oidc_login_create_groups' => false, + # External storage for user data should be configured to Asteria via the External Storage app (admin UI), + # keeping the astreae PVC for app internals only. + ); diff --git a/services/nextcloud/cronjob.yaml b/services/nextcloud/cronjob.yaml new file mode 100644 index 0000000..86c55e1 --- /dev/null +++ b/services/nextcloud/cronjob.yaml @@ -0,0 +1,32 @@ +# services/nextcloud/cronjob.yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: nextcloud-cron + namespace: nextcloud +spec: + schedule: "*/5 * * * *" + concurrencyPolicy: Forbid + jobTemplate: + spec: + template: + spec: + securityContext: + runAsUser: 33 + runAsGroup: 33 + fsGroup: 33 + restartPolicy: OnFailure + containers: + - name: nextcloud-cron + image: nextcloud:29-apache + imagePullPolicy: IfNotPresent + command: ["/bin/sh", "-c"] + args: + - "cd /var/www/html && php -f cron.php" + volumeMounts: + - name: nextcloud-data + mountPath: /var/www/html + volumes: + - name: nextcloud-data + persistentVolumeClaim: + claimName: nextcloud-data diff --git a/services/nextcloud/deployment.yaml b/services/nextcloud/deployment.yaml new file mode 100644 index 0000000..b2c590f --- /dev/null +++ b/services/nextcloud/deployment.yaml @@ -0,0 +1,143 @@ +# services/nextcloud/deployment.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nextcloud + namespace: nextcloud + labels: + app: nextcloud +spec: + replicas: 1 + selector: + matchLabels: + app: nextcloud + template: + metadata: + labels: + app: nextcloud + spec: + nodeSelector: + hardware: rpi5 + securityContext: + fsGroup: 33 + runAsUser: 33 + runAsGroup: 33 + initContainers: + - name: fix-perms + image: alpine:3.20 + command: ["/bin/sh", "-c"] + args: + - | + chown -R 33:33 /var/www/html/config || true + chown -R 33:33 /var/www/html/data || true + securityContext: + runAsUser: 0 + runAsGroup: 0 + volumeMounts: + - name: nextcloud-data + mountPath: /var/www/html + - name: nextcloud-config + mountPath: /var/www/html/config/extra.config.php + subPath: extra.config.php + containers: + - name: nextcloud + image: nextcloud:29-apache + imagePullPolicy: IfNotPresent + env: + # DB (external secret required: nextcloud-db with keys username,password,database) + - name: POSTGRES_HOST + value: postgres-service.postgres.svc.cluster.local + - name: POSTGRES_DB + valueFrom: + secretKeyRef: + name: nextcloud-db + key: database + - name: POSTGRES_USER + valueFrom: + secretKeyRef: + name: nextcloud-db + key: db-username + - name: POSTGRES_PASSWORD + valueFrom: + secretKeyRef: + name: nextcloud-db + key: db-password + # Admin bootstrap (external secret: nextcloud-admin with keys admin-user, admin-password) + - name: NEXTCLOUD_ADMIN_USER + valueFrom: + secretKeyRef: + name: nextcloud-admin + key: admin-user + - name: NEXTCLOUD_ADMIN_PASSWORD + valueFrom: + secretKeyRef: + name: nextcloud-admin + key: admin-password + - name: NEXTCLOUD_TRUSTED_DOMAINS + value: cloud.bstein.dev + - name: OVERWRITEHOST + value: cloud.bstein.dev + - name: OVERWRITEPROTOCOL + value: https + - name: OVERWRITECLIURL + value: https://cloud.bstein.dev + # SMTP (external secret: nextcloud-smtp with keys username, password) + - name: SMTP_HOST + value: mail.bstein.dev + - name: SMTP_PORT + value: "587" + - name: SMTP_SECURE + value: tls + - name: SMTP_NAME + valueFrom: + secretKeyRef: + name: nextcloud-smtp + key: smtp-username + - name: SMTP_PASSWORD + valueFrom: + secretKeyRef: + name: nextcloud-smtp + key: smtp-password + - name: MAIL_FROM_ADDRESS + value: no-reply + - name: MAIL_DOMAIN + value: bstein.dev + # OIDC (external secret: nextcloud-oidc with keys client-id, client-secret) + - name: OIDC_CLIENT_ID + valueFrom: + secretKeyRef: + name: nextcloud-oidc + key: client-id + - name: OIDC_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: nextcloud-oidc + key: client-secret + - name: NEXTCLOUD_UPDATE + value: "1" + - name: APP_INSTALL + value: "mail,oidc_login,external" + ports: + - containerPort: 80 + name: http + volumeMounts: + - name: nextcloud-data + mountPath: /var/www/html + - name: nextcloud-config + mountPath: /var/www/html/config/extra.config.php + subPath: extra.config.php + resources: + requests: + cpu: 250m + memory: 1Gi + limits: + cpu: 1 + memory: 3Gi + volumes: + - name: nextcloud-data + persistentVolumeClaim: + claimName: nextcloud-data + - name: nextcloud-config + configMap: + name: nextcloud-config + defaultMode: 0444 diff --git a/services/nextcloud/ingress.yaml b/services/nextcloud/ingress.yaml new file mode 100644 index 0000000..1c60282 --- /dev/null +++ b/services/nextcloud/ingress.yaml @@ -0,0 +1,25 @@ +# services/nextcloud/ingress.yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: nextcloud + namespace: nextcloud + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod + traefik.ingress.kubernetes.io/router.entrypoints: websecure +spec: + tls: + - hosts: + - cloud.bstein.dev + secretName: nextcloud-tls + rules: + - host: cloud.bstein.dev + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: nextcloud + port: + number: 80 diff --git a/services/nextcloud/kustomization.yaml b/services/nextcloud/kustomization.yaml new file mode 100644 index 0000000..5e3b414 --- /dev/null +++ b/services/nextcloud/kustomization.yaml @@ -0,0 +1,25 @@ +# services/nextcloud/kustomization.yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: nextcloud +resources: + - namespace.yaml + - configmap.yaml + - pvc.yaml + - deployment.yaml + - service.yaml + - ingress.yaml + - cronjob.yaml + - mail-sync-cronjob.yaml + - maintenance-cronjob.yaml +configMapGenerator: + - name: nextcloud-maintenance-script + files: + - maintenance.sh=../../scripts/nextcloud-maintenance.sh + options: + disableNameSuffixHash: true + - name: nextcloud-mail-sync-script + files: + - sync.sh=../../scripts/nextcloud-mail-sync.sh + options: + disableNameSuffixHash: true diff --git a/services/nextcloud/mail-sync-cronjob.yaml b/services/nextcloud/mail-sync-cronjob.yaml new file mode 100644 index 0000000..52dc3ea --- /dev/null +++ b/services/nextcloud/mail-sync-cronjob.yaml @@ -0,0 +1,58 @@ +# services/nextcloud/mail-sync-cronjob.yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: nextcloud-mail-sync + namespace: nextcloud +spec: + schedule: "0 5 * * *" + concurrencyPolicy: Forbid + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + securityContext: + runAsUser: 0 + runAsGroup: 0 + containers: + - name: mail-sync + image: nextcloud:29-apache + imagePullPolicy: IfNotPresent + command: ["/bin/bash", "/sync/sync.sh"] + env: + - name: KC_BASE + value: https://sso.bstein.dev + - name: KC_REALM + value: atlas + - name: KC_ADMIN_USER + valueFrom: + secretKeyRef: + name: nextcloud-keycloak-admin + key: username + - name: KC_ADMIN_PASS + valueFrom: + secretKeyRef: + name: nextcloud-keycloak-admin + key: password + volumeMounts: + - name: nextcloud-data + mountPath: /var/www/html + - name: sync-script + mountPath: /sync/sync.sh + subPath: sync.sh + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 500m + memory: 512Mi + volumes: + - name: nextcloud-data + persistentVolumeClaim: + claimName: nextcloud-data + - name: sync-script + configMap: + name: nextcloud-mail-sync-script + defaultMode: 0755 diff --git a/services/nextcloud/maintenance-cronjob.yaml b/services/nextcloud/maintenance-cronjob.yaml new file mode 100644 index 0000000..55fcbd1 --- /dev/null +++ b/services/nextcloud/maintenance-cronjob.yaml @@ -0,0 +1,56 @@ +# services/nextcloud/maintenance-cronjob.yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: nextcloud-maintenance + namespace: nextcloud +spec: + schedule: "30 4 * * *" + concurrencyPolicy: Forbid + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + securityContext: + runAsUser: 0 + runAsGroup: 0 + containers: + - name: maintenance + image: nextcloud:29-apache + imagePullPolicy: IfNotPresent + command: ["/bin/bash", "/maintenance/maintenance.sh"] + env: + - name: NC_URL + value: https://cloud.bstein.dev + - name: ADMIN_USER + valueFrom: + secretKeyRef: + name: nextcloud-admin + key: admin-user + - name: ADMIN_PASS + valueFrom: + secretKeyRef: + name: nextcloud-admin + key: admin-password + volumeMounts: + - name: nextcloud-data + mountPath: /var/www/html + - name: maintenance-script + mountPath: /maintenance/maintenance.sh + subPath: maintenance.sh + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 500m + memory: 512Mi + volumes: + - name: nextcloud-data + persistentVolumeClaim: + claimName: nextcloud-data + - name: maintenance-script + configMap: + name: nextcloud-maintenance-script + defaultMode: 0755 diff --git a/services/nextcloud/namespace.yaml b/services/nextcloud/namespace.yaml new file mode 100644 index 0000000..fe63672 --- /dev/null +++ b/services/nextcloud/namespace.yaml @@ -0,0 +1,5 @@ +# services/nextcloud/namespace.yaml +apiVersion: v1 +kind: Namespace +metadata: + name: nextcloud diff --git a/services/nextcloud/pvc.yaml b/services/nextcloud/pvc.yaml new file mode 100644 index 0000000..dd929b6 --- /dev/null +++ b/services/nextcloud/pvc.yaml @@ -0,0 +1,13 @@ +# services/nextcloud/pvc.yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: nextcloud-data + namespace: nextcloud +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 200Gi + storageClassName: astreae diff --git a/services/nextcloud/service.yaml b/services/nextcloud/service.yaml new file mode 100644 index 0000000..ab160fb --- /dev/null +++ b/services/nextcloud/service.yaml @@ -0,0 +1,13 @@ +# services/nextcloud/service.yaml +apiVersion: v1 +kind: Service +metadata: + name: nextcloud + namespace: nextcloud +spec: + selector: + app: nextcloud + ports: + - name: http + port: 80 + targetPort: http diff --git a/services/pegasus/ingress.yaml b/services/pegasus/ingress.yaml index 48d22c3..2ab7a2e 100644 --- a/services/pegasus/ingress.yaml +++ b/services/pegasus/ingress.yaml @@ -8,7 +8,7 @@ metadata: kubernetes.io/ingress.class: traefik traefik.ingress.kubernetes.io/router.entrypoints: websecure traefik.ingress.kubernetes.io/router.tls: "true" - cert-manager.io/cluster-issuer: letsencrypt-prod + cert-manager.io/cluster-issuer: letsencrypt spec: tls: - hosts: [ "pegasus.bstein.dev" ] diff --git a/services/vault/certificate.yaml b/services/vault/certificate.yaml index 983c7fe..2d32f65 100644 --- a/services/vault/certificate.yaml +++ b/services/vault/certificate.yaml @@ -8,7 +8,7 @@ spec: secretName: vault-server-tls issuerRef: kind: ClusterIssuer - name: letsencrypt-prod + name: letsencrypt commonName: secret.bstein.dev dnsNames: - secret.bstein.dev diff --git a/services/zot/ingress.yaml b/services/zot/ingress.yaml index 3425535..12f6c60 100644 --- a/services/zot/ingress.yaml +++ b/services/zot/ingress.yaml @@ -5,7 +5,7 @@ metadata: name: zot namespace: zot annotations: - cert-manager.io/cluster-issuer: letsencrypt-prod + cert-manager.io/cluster-issuer: letsencrypt traefik.ingress.kubernetes.io/router.entrypoints: websecure traefik.ingress.kubernetes.io/router.tls: "true" traefik.ingress.kubernetes.io/router.middlewares: zot-zot-resp-headers@kubernetescrd