Compare commits

..

No commits in common. "main" and "restructure/hybrid-clusters" have entirely different histories.

92 changed files with 35 additions and 13942 deletions

6
.gitignore vendored
View File

@ -1,5 +1 @@
# Ignore markdown by default, but keep top-level docs
*.md
!README.md
!AGENTS.md
!**/NOTES.md
AGENTS.md

View File

@ -1,81 +0,0 @@
Repository Guidelines
> Local-only note: apply changes through Flux-tracked manifests, not by manual kubectl edits in-cluster—manual tweaks will be reverted by Flux.
## Project Structure & Module Organization
- `infrastructure/`: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout.
- `services/`: workload manifests per app (`services/gitea/`, etc.) with `kustomization.yaml` plus one file per kind; keep diffs small and focused.
- `dockerfiles/` hosts bespoke images, while `scripts/` stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands.
## Build, Test, and Development Commands
- `kustomize build services/<app>` (or `kubectl kustomize ...`) renders manifests exactly as Flux will.
- `kubectl apply --server-side --dry-run=client -k services/<app>` checks schema compatibility without touching the cluster.
- `flux reconcile kustomization <name> --namespace flux-system --with-source` pulls the latest Git state after merges or hotfixes.
- `fish scripts/flux_hammer.fish --help` explains the recovery tool; read it before running against production workloads.
## Coding Style & Naming Conventions
- YAML uses two-space indents; retain the leading path comment (e.g. `# services/gitea/deployment.yaml`) to speed code review.
- Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names.
- List resources in `kustomization.yaml` from namespace/config, through storage, then workloads and networking for predictable diffs.
- Scripts start with `#!/usr/bin/env fish` or bash, stay executable, and follow snake_case names such as `flux_hammer.fish`.
## Testing Guidelines
- Run `kustomize build` and the dry-run apply for every service you touch; capture failures before opening a PR.
- `flux diff kustomization <name> --path services/<app>` previews reconciliations—link notable output when behavior shifts.
- Docker edits: `docker build -f dockerfiles/Dockerfile.monerod .` (swap the file you changed) to verify image builds.
## Commit & Pull Request Guidelines
- Keep commit subjects short, present-tense, and optionally scoped (`gpu(titan-24): add RuntimeClass`); squash fixups before review.
- Describe linked issues, affected services, and required operator steps (e.g. `flux reconcile kustomization services-gitea`) in the PR body.
- Focus each PR on one kustomization or service and update `infrastructure/flux-system` when Flux must track new folders.
- Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes.
## Security & Configuration Tips
- Never commit credentials; use Vault workflows (`services/vault/`) or SOPS-encrypted manifests wired through `infrastructure/flux-system`.
- Node selectors and tolerations gate workloads to hardware like `hardware: rpi4`; confirm labels before scaling or renaming nodes.
- Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift.
## Dashboard roadmap / context (2025-12-02)
- Atlas dashboards are generated via `scripts/dashboards_render_atlas.py --build`, which writes JSON under `services/monitoring/dashboards/` and ConfigMaps under `services/monitoring/`. Keep the Grafana manifests in sync by regenerating after edits.
- Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie.
- Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned.
- Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview.
## Monitoring state (2025-12-03)
- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady.
- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub).
- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity.
- Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed.
- GPU share panel updated (feature/sso) to use `max_over_time(…[$__range])`, so longer ranges (e.g., 12h) keep recent activity visible. Flux tracking `feature/sso`.
## Upcoming priorities (SSO/storage/mail)
- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe.
- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress.
- Plan mail: mostly self-hosted, relay through trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows.
## SSO plan sketch (2025-12-03)
- IdP: use Keycloak (preferred) in a new `sso` namespace, Bitnami or codecentric chart with Postgres backing store (single PVC), ingress `sso.bstein.dev`, admin user bound to brad@bstein.dev; stick with local DB initially (no external IdP).
- Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via Jellyfin OAuth/OpenID plugin (map existing usernames; run migration to pre-create users in Keycloak with same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens.
- Steps to implement:
1) Add service folder `services/keycloak/` (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile.
2) Seed realm `atlas` with users (import CSV/realm). Create client for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set email for brad@bstein.dev as admin.
3) Reconfigure Grafana to OIDC (disable anonymous to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini).
4) Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth.
5) Deploy Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate password reset path.
- Migration caution: do not delete existing local creds until SSO validated; keep Pegasus working via Jellyfin tokens during transition.
## Postgres centralization (2025-12-03)
- Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes. Use it for services that can easily point at an external DB.
- Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported.
## SSO integration snapshot (2025-12-08)
- Current blockers: Zot still prompts for basic auth/double-login; Vault still wants the token UI after Keycloak (previously 502/404 when vault-0 sealed). Forward-auth middleware on Zot Ingress likely still causing the 401/Found hop; Vault OIDC mount not completing UI flow unless unsealed and preferred login is set.
- Flux-only changes required: remove zot forward-auth middleware from Ingress (let oauth2-proxy handle redirect), ensure Vault OIDC mount is preferred UI login and bound to admin group; keep all edits in repo so Flux enforces them.
- Secrets present (per user): `zot-oidc-client` (client_secret only), `oauth2-proxy-zot-oidc`, `oauth2-proxy-vault-oidc`, `vault-oidc-admin-token`. Zot needs its regcred in the zot namespace if image pulls fail.
- Cluster validation blocked here: `kubectl get nodes` fails (403/permission) and DNS to `*.bstein.dev` fails in this session, so no live curl verification could be run. Re-test on a host with cluster/DNS access after Flux applies fixes.
## Docs hygiene
- Do not add per-service `README.md` files; use `NOTES.md` if documentation is needed inside service folders. Keep only the top-level repo README.
- Keep comments succinct and in a human voice—no AI-sounding notes. Use `NOTES.md` for scratch notes instead of sprinkling reminders into code or extra READMEs.

View File

@ -1,3 +0,0 @@
# Rotation reminders (temporary secrets set by automation)
- Weave GitOps UI (`cd.bstein.dev`) admin: `admin` / `G1tOps!2025` — rotate immediately after first login.

View File

@ -1,3 +0,0 @@
# titan-iac
Flux-managed Kubernetes cluster for bstein.dev services.

View File

@ -1,15 +0,0 @@
# clusters/atlas/flux-system/applications/keycloak/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: keycloak
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: flux-system
path: ./services/keycloak
targetNamespace: sso
timeout: 2m

View File

@ -13,6 +13,3 @@ resources:
- jellyfin/kustomization.yaml
- xmr-miner/kustomization.yaml
- sui-metrics/kustomization.yaml
- keycloak/kustomization.yaml
- oauth2-proxy/kustomization.yaml
- mailu/kustomization.yaml

View File

@ -1,18 +0,0 @@
# clusters/atlas/flux-system/applications/mailu/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: mailu
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: flux-system
namespace: flux-system
path: ./services/mailu
targetNamespace: mailu-mailserver
prune: true
wait: true
dependsOn:
- name: helm

View File

@ -1,15 +0,0 @@
# clusters/atlas/flux-system/applications/oauth2-proxy/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: oauth2-proxy
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: flux-system
path: ./services/oauth2-proxy
targetNamespace: sso
timeout: 2m

View File

@ -8,7 +8,7 @@ metadata:
spec:
interval: 1m0s
ref:
branch: feature/mailu
branch: restructure/hybrid-clusters
secretRef:
name: flux-system-gitea
url: ssh://git@scm.bstein.dev:2242/bstein/titan-iac.git

View File

@ -1,20 +0,0 @@
# clusters/atlas/flux-system/platform/gitops-ui/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: gitops-ui
namespace: flux-system
spec:
interval: 10m
timeout: 10m
path: ./services/gitops-ui
prune: true
sourceRef:
kind: GitRepository
name: flux-system
namespace: flux-system
targetNamespace: flux-system
dependsOn:
- name: helm
- name: traefik
wait: true

View File

@ -5,6 +5,5 @@ resources:
- core/kustomization.yaml
- helm/kustomization.yaml
- traefik/kustomization.yaml
- gitops-ui/kustomization.yaml
- monitoring/kustomization.yaml
- longhorn-ui/kustomization.yaml

View File

@ -11,4 +11,4 @@ spec:
sourceRef:
kind: GitRepository
name: flux-system
wait: false
wait: true

View File

@ -0,0 +1,5 @@
# Oceanus Cluster Scaffold
This directory prepares the Flux and Kustomize layout for a future Oceanus-managed cluster.
Populate `flux-system/` with `gotk-components.yaml` and related manifests after running `flux bootstrap`.
Define node-specific resources under `infrastructure/modules/profiles/oceanus-validator/` and reference workloads in `applications/` as they come online.

View File

@ -2,14 +2,15 @@
| Hostname | Role / Function | Managed By | Notes |
|------------|--------------------------------|---------------------|-------|
| titan-db | HA control plane database | Ansible | PostgreSQL / etcd backing services |
| titan-0a | Kubernetes control-plane | Flux (atlas cluster)| HA leader, tainted for control only |
| titan-0b | Kubernetes control-plane | Flux (atlas cluster)| Standby control node |
| titan-0c | Kubernetes control-plane | Flux (atlas cluster)| Standby control node |
| titan-04-19| Raspberry Pi workers | Flux (atlas cluster)| Workload nodes, labelled per hardware |
| titan-20&21| NVIDIA Jetson workers | Flux (atlas cluster)| Workload nodes, labelled per hardware |
| titan-22 | GPU mini-PC (Jellyfin) | Flux + Ansible | NVIDIA runtime managed via `modules/profiles/atlas-ha` |
| titan-23 | Dedicated SUI validator Oceanus| Manual + Ansible | Baremetal validator workloads, exposes metrics to atlas |
| titan-24 | Tethys hybrid node | Flux + Ansible | Runs SUI metrics via K8s, validator via Ansible |
| titan-jh | Jumphost & bastion & lesavka | Ansible | Entry point / future KVM services / custom kvm - lesavaka |
| titan-db | HA control plane database | Ansible | PostgreSQL / etcd backing services |
| titan-jh | Jumphost & bastion | Ansible | Entry point / future KVM services |
| oceanus | Dedicated SUI validator host | Ansible / Flux prep | Baremetal validator workloads, exposes metrics to atlas; Kustomize scaffold under `clusters/oceanus/` |
| styx | Air-gapped workstation | Manual / Scripts | Remains isolated, scripts tracked in `hosts/styx` |
Use the `clusters/` directory for cluster-scoped state and the `hosts/` directory for baremetal orchestration.

View File

@ -5,4 +5,3 @@ resources:
- ../modules/base
- ../modules/profiles/atlas-ha
- ../sources/cert-manager/letsencrypt.yaml
- ../sources/cert-manager/letsencrypt-prod.yaml

View File

@ -7,7 +7,7 @@ metadata:
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
traefik.ingress.kubernetes.io/router.middlewares: ""
traefik.ingress.kubernetes.io/router.middlewares: longhorn-system-longhorn-basicauth@kubernetescrd,longhorn-system-longhorn-headers@kubernetescrd
spec:
ingressClassName: traefik
tls:
@ -21,6 +21,6 @@ spec:
pathType: Prefix
backend:
service:
name: oauth2-proxy-longhorn
name: longhorn-frontend
port:
number: 80

View File

@ -4,4 +4,3 @@ kind: Kustomization
resources:
- middleware.yaml
- ingress.yaml
- oauth2-proxy-longhorn.yaml

View File

@ -20,20 +20,3 @@ spec:
headers:
customRequestHeaders:
X-Forwarded-Proto: "https"
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: longhorn-forward-auth
namespace: longhorn-system
spec:
forwardAuth:
address: https://auth.bstein.dev/oauth2/auth
trustForwardHeader: true
authResponseHeaders:
- Authorization
- X-Auth-Request-Email
- X-Auth-Request-User
- X-Auth-Request-Groups

View File

@ -1,102 +0,0 @@
# infrastructure/longhorn/ui-ingress/oauth2-proxy-longhorn.yaml
apiVersion: v1
kind: Service
metadata:
name: oauth2-proxy-longhorn
namespace: longhorn-system
labels:
app: oauth2-proxy-longhorn
spec:
ports:
- name: http
port: 80
targetPort: 4180
selector:
app: oauth2-proxy-longhorn
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: oauth2-proxy-longhorn
namespace: longhorn-system
labels:
app: oauth2-proxy-longhorn
spec:
replicas: 2
selector:
matchLabels:
app: oauth2-proxy-longhorn
template:
metadata:
labels:
app: oauth2-proxy-longhorn
spec:
nodeSelector:
node-role.kubernetes.io/worker: "true"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 90
preference:
matchExpressions:
- key: hardware
operator: In
values: ["rpi5","rpi4"]
containers:
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
imagePullPolicy: IfNotPresent
args:
- --provider=oidc
- --redirect-url=https://longhorn.bstein.dev/oauth2/callback
- --oidc-issuer-url=https://sso.bstein.dev/realms/atlas
- --scope=openid profile email groups
- --email-domain=*
- --allowed-group=admin
- --set-xauthrequest=true
- --pass-access-token=true
- --set-authorization-header=true
- --cookie-secure=true
- --cookie-samesite=lax
- --cookie-refresh=20m
- --cookie-expire=168h
- --insecure-oidc-allow-unverified-email=true
- --upstream=http://longhorn-frontend.longhorn-system.svc.cluster.local
- --http-address=0.0.0.0:4180
- --skip-provider-button=true
- --skip-jwt-bearer-tokens=true
- --oidc-groups-claim=groups
- --cookie-domain=longhorn.bstein.dev
env:
- name: OAUTH2_PROXY_CLIENT_ID
valueFrom:
secretKeyRef:
name: oauth2-proxy-longhorn-oidc
key: client_id
- name: OAUTH2_PROXY_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-longhorn-oidc
key: client_secret
- name: OAUTH2_PROXY_COOKIE_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-longhorn-oidc
key: cookie_secret
ports:
- containerPort: 4180
name: http
readinessProbe:
httpGet:
path: /ping
port: 4180
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /ping
port: 4180
initialDelaySeconds: 20
periodSeconds: 20

View File

@ -1,14 +0,0 @@
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
email: brad.stein@gmail.com
server: https://acme-v02.api.letsencrypt.org/directory
privateKeySecretRef:
name: letsencrypt-prod-account-key
solvers:
- http01:
ingress:
class: traefik

View File

@ -1,10 +0,0 @@
# infrastructure/sources/helm/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- grafana.yaml
- hashicorp.yaml
- jetstack.yaml
- mailu.yaml
- prometheus.yaml
- victoria-metrics.yaml

View File

@ -1,9 +0,0 @@
# infrastructure/sources/helm/mailu.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: mailu
namespace: flux-system
spec:
interval: 1h
url: https://mailu.github.io/helm-charts

View File

@ -39,12 +39,6 @@ items:
- --metrics.prometheus.addEntryPointsLabels=true
- --metrics.prometheus.addRoutersLabels=true
- --metrics.prometheus.addServicesLabels=true
- --entrypoints.web.transport.respondingTimeouts.readTimeout=0s
- --entrypoints.web.transport.respondingTimeouts.writeTimeout=0s
- --entrypoints.web.transport.respondingTimeouts.idleTimeout=0s
- --entrypoints.websecure.transport.respondingTimeouts.readTimeout=0s
- --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=0s
- --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=0s
- --entrypoints.metrics.address=:9100
- --metrics.prometheus.entryPoint=metrics
image: traefik:v3.3.3

File diff suppressed because it is too large Load Diff

View File

@ -1,204 +0,0 @@
#!/usr/bin/env python3
"""
Sync Keycloak users to Mailu mailboxes.
- Generates/stores a mailu_app_password attribute in Keycloak (admin-only)
- Upserts the mailbox in Mailu Postgres using that password
"""
import os
import sys
import json
import time
import secrets
import string
import datetime
import requests
import psycopg2
from psycopg2.extras import RealDictCursor
from passlib.hash import bcrypt_sha256
KC_BASE = os.environ["KEYCLOAK_BASE_URL"].rstrip("/")
KC_REALM = os.environ["KEYCLOAK_REALM"]
KC_CLIENT_ID = os.environ["KEYCLOAK_CLIENT_ID"]
KC_CLIENT_SECRET = os.environ["KEYCLOAK_CLIENT_SECRET"]
MAILU_DOMAIN = os.environ["MAILU_DOMAIN"]
MAILU_DEFAULT_QUOTA = int(os.environ.get("MAILU_DEFAULT_QUOTA", "20000000000"))
DB_CONFIG = {
"host": os.environ["MAILU_DB_HOST"],
"port": int(os.environ.get("MAILU_DB_PORT", "5432")),
"dbname": os.environ["MAILU_DB_NAME"],
"user": os.environ["MAILU_DB_USER"],
"password": os.environ["MAILU_DB_PASSWORD"],
}
SESSION = requests.Session()
def log(msg):
sys.stdout.write(f"{msg}\n")
sys.stdout.flush()
def get_kc_token():
resp = SESSION.post(
f"{KC_BASE}/realms/{KC_REALM}/protocol/openid-connect/token",
data={
"grant_type": "client_credentials",
"client_id": KC_CLIENT_ID,
"client_secret": KC_CLIENT_SECRET,
},
timeout=15,
)
resp.raise_for_status()
return resp.json()["access_token"]
def kc_get_users(token):
users = []
first = 0
max_results = 200
headers = {"Authorization": f"Bearer {token}"}
while True:
resp = SESSION.get(
f"{KC_BASE}/admin/realms/{KC_REALM}/users",
params={"first": first, "max": max_results, "enabled": "true"},
headers=headers,
timeout=20,
)
resp.raise_for_status()
batch = resp.json()
users.extend(batch)
if len(batch) < max_results:
break
first += max_results
return users
def kc_update_attributes(token, user, attributes):
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "application/json",
}
payload = {
"firstName": user.get("firstName"),
"lastName": user.get("lastName"),
"email": user.get("email"),
"enabled": user.get("enabled", True),
"username": user["username"],
"emailVerified": user.get("emailVerified", False),
"attributes": attributes,
}
user_url = f"{KC_BASE}/admin/realms/{KC_REALM}/users/{user['id']}"
resp = SESSION.put(user_url, headers=headers, json=payload, timeout=20)
resp.raise_for_status()
verify = SESSION.get(
user_url,
headers={"Authorization": f"Bearer {token}"},
params={"briefRepresentation": "false"},
timeout=15,
)
verify.raise_for_status()
attrs = verify.json().get("attributes") or {}
if not attrs.get("mailu_app_password"):
raise Exception(f"attribute not persisted for {user.get('email') or user['username']}")
def random_password():
alphabet = string.ascii_letters + string.digits
return "".join(secrets.choice(alphabet) for _ in range(24))
def ensure_mailu_user(cursor, email, password, display_name):
localpart, domain = email.split("@", 1)
if domain.lower() != MAILU_DOMAIN.lower():
return
hashed = bcrypt_sha256.hash(password)
now = datetime.datetime.utcnow()
cursor.execute(
"""
INSERT INTO "user" (
email, localpart, domain_name, password,
quota_bytes, quota_bytes_used,
global_admin, enabled, enable_imap, enable_pop, allow_spoofing,
forward_enabled, forward_destination, forward_keep,
reply_enabled, reply_subject, reply_body, reply_startdate, reply_enddate,
displayed_name, spam_enabled, spam_mark_as_read, spam_threshold,
change_pw_next_login, created_at, updated_at, comment
)
VALUES (
%(email)s, %(localpart)s, %(domain)s, %(password)s,
%(quota)s, 0,
false, true, true, true, false,
false, '', true,
false, NULL, NULL, DATE '1900-01-01', DATE '2999-12-31',
%(display)s, true, true, 80,
false, CURRENT_DATE, %(now)s, ''
)
ON CONFLICT (email) DO UPDATE
SET password = EXCLUDED.password,
enabled = true,
updated_at = EXCLUDED.updated_at
""",
{
"email": email,
"localpart": localpart,
"domain": domain,
"password": hashed,
"quota": MAILU_DEFAULT_QUOTA,
"display": display_name or localpart,
"now": now,
},
)
def main():
token = get_kc_token()
users = kc_get_users(token)
if not users:
log("No users found; exiting.")
return
conn = psycopg2.connect(**DB_CONFIG)
conn.autocommit = True
cursor = conn.cursor(cursor_factory=RealDictCursor)
for user in users:
attrs = user.get("attributes", {}) or {}
app_pw_value = attrs.get("mailu_app_password")
if isinstance(app_pw_value, list):
app_pw = app_pw_value[0] if app_pw_value else None
elif isinstance(app_pw_value, str):
app_pw = app_pw_value
else:
app_pw = None
email = user.get("email")
if not email:
email = f"{user['username']}@{MAILU_DOMAIN}"
if not app_pw:
app_pw = random_password()
attrs["mailu_app_password"] = app_pw
kc_update_attributes(token, user, attrs)
log(f"Set mailu_app_password for {email}")
display_name = " ".join(
part for part in [user.get("firstName"), user.get("lastName")] if part
).strip()
ensure_mailu_user(cursor, email, app_pw, display_name)
log(f"Synced mailbox for {email}")
cursor.close()
conn.close()
if __name__ == "__main__":
try:
main()
except Exception as exc:
log(f"ERROR: {exc}")
sys.exit(1)

View File

@ -1,49 +0,0 @@
#!/bin/bash
set -euo pipefail
KC_BASE="${KC_BASE:?}"
KC_REALM="${KC_REALM:?}"
KC_ADMIN_USER="${KC_ADMIN_USER:?}"
KC_ADMIN_PASS="${KC_ADMIN_PASS:?}"
if ! command -v jq >/dev/null 2>&1; then
apt-get update && apt-get install -y jq curl >/dev/null
fi
account_exists() {
# Skip if the account email is already present in the mail app.
runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq " ${1}" || \
runuser -u www-data -- php occ mail:account:list 2>/dev/null | grep -Fq "${1} "
}
token=$(
curl -s -d "grant_type=password" \
-d "client_id=admin-cli" \
-d "username=${KC_ADMIN_USER}" \
-d "password=${KC_ADMIN_PASS}" \
"${KC_BASE}/realms/master/protocol/openid-connect/token" | jq -r '.access_token'
)
if [[ -z "${token}" || "${token}" == "null" ]]; then
echo "Failed to obtain admin token"
exit 1
fi
users=$(curl -s -H "Authorization: Bearer ${token}" \
"${KC_BASE}/admin/realms/${KC_REALM}/users?max=2000")
echo "${users}" | jq -c '.[]' | while read -r user; do
username=$(echo "${user}" | jq -r '.username')
email=$(echo "${user}" | jq -r '.email // empty')
app_pw=$(echo "${user}" | jq -r '.attributes.mailu_app_password[0] // empty')
[[ -z "${email}" || -z "${app_pw}" ]] && continue
if account_exists "${email}"; then
echo "Skipping ${email}, already exists"
continue
fi
echo "Syncing ${email}"
runuser -u www-data -- php occ mail:account:create \
"${username}" "${username}" "${email}" \
mail.bstein.dev 993 ssl "${email}" "${app_pw}" \
mail.bstein.dev 587 tls "${email}" "${app_pw}" login || true
done

View File

@ -1,65 +0,0 @@
#!/bin/bash
set -euo pipefail
NC_URL="${NC_URL:-https://cloud.bstein.dev}"
ADMIN_USER="${ADMIN_USER:?}"
ADMIN_PASS="${ADMIN_PASS:?}"
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y -qq curl jq >/dev/null
run_occ() {
runuser -u www-data -- php occ "$@"
}
log() { echo "[$(date -Is)] $*"; }
log "Applying Atlas theming"
run_occ theming:config name "Atlas Cloud"
run_occ theming:config slogan "Unified access to Atlas services"
run_occ theming:config url "https://cloud.bstein.dev"
run_occ theming:config color "#0f172a"
run_occ theming:config disable-user-theming yes
log "Setting default quota to 200 GB"
run_occ config:app:set files default_quota --value "200 GB"
API_BASE="${NC_URL}/ocs/v2.php/apps/external/api/v1"
AUTH=(-u "${ADMIN_USER}:${ADMIN_PASS}" -H "OCS-APIRequest: true")
log "Removing existing external links"
existing=$(curl -sf "${AUTH[@]}" "${API_BASE}?format=json" | jq -r '.ocs.data[].id // empty')
for id in ${existing}; do
curl -sf "${AUTH[@]}" -X DELETE "${API_BASE}/sites/${id}?format=json" >/dev/null || true
done
SITES=(
"Vaultwarden|https://vault.bstein.dev"
"Jellyfin|https://stream.bstein.dev"
"Gitea|https://scm.bstein.dev"
"Jenkins|https://ci.bstein.dev"
"Zot|https://registry.bstein.dev"
"Vault|https://secret.bstein.dev"
"Jitsi|https://meet.bstein.dev"
"Grafana|https://metrics.bstein.dev"
"Chat LLM|https://chat.ai.bstein.dev"
"Vision|https://draw.ai.bstein.dev"
"STT/TTS|https://talk.ai.bstein.dev"
)
log "Seeding external links"
for entry in "${SITES[@]}"; do
IFS="|" read -r name url <<<"${entry}"
curl -sf "${AUTH[@]}" -X POST "${API_BASE}/sites?format=json" \
-d "name=${name}" \
-d "url=${url}" \
-d "lang=" \
-d "type=link" \
-d "device=" \
-d "icon=" \
-d "groups[]=" \
-d "redirect=1" >/dev/null
done
log "Maintenance run completed"

View File

@ -1,575 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# --- CONFIG (edit if needed) ---
# Leave NVME empty → script will auto-detect the SSK dock.
NVME="${NVME:-}"
FLAVOR="${FLAVOR:-desktop}"
# Persistent cache so the image survives reboots.
IMG_DIR="${IMG_DIR:-/var/cache/styx-rpi}"
IMG_FILE="${IMG_FILE:-ubuntu-24.04.3-preinstalled-${FLAVOR}-arm64+raspi.img}"
IMG_BOOT_MNT="${IMG_BOOT_MNT:-/mnt/img-boot}"
IMG_ROOT_MNT="${IMG_ROOT_MNT:-/mnt/img-root}"
TGT_ROOT="/mnt/target-root"
TGT_BOOT="/mnt/target-boot"
STYX_USER="styx"
STYX_HOSTNAME="titan-ag"
STYX_PASS="TempPass#123" # will be forced to change on first login via cloud-init
SSH_PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOb8oMX6u0z3sH/p/WBGlvPXXdbGETCKzWYwR/dd6fZb titan-bastion"
# Video / input prefs
DSI_FLAGS="video=DSI-1:800x480@60D video=HDMI-A-1:off video=HDMI-A-2:off"
# --- Helpers ---
fatal(){ echo "ERROR: $*" >&2; exit 1; }
need(){ command -v "$1" >/dev/null || fatal "Missing tool: $1"; }
require_root(){ [[ $EUID -eq 0 ]] || exec sudo -E "$0" "$@"; }
part() {
local n="$1"
if [[ "$NVME" =~ [0-9]$ ]]; then
echo "${NVME}p${n}"
else
echo "${NVME}${n}"
fi
}
auto_detect_target_disk() {
# If user already set NVME, validate and return
if [[ -n "${NVME:-}" ]]; then
[[ -b "$NVME" ]] || fatal "NVME='$NVME' is not a block device"
return
fi
# Prefer stable by-id symlinks
local byid
byid=$(ls -1 /dev/disk/by-id/usb-SSK* 2>/dev/null | head -n1 || true)
if [[ -n "$byid" ]]; then
NVME=$(readlink -f "$byid")
else
# Heuristic via lsblk -S: look for USB with SSK/Ingram/Storage in vendor/model
NVME=$(lsblk -S -p -o NAME,TRAN,VENDOR,MODEL | \
awk '/ usb / && ($3 ~ /SSK|Ingram/i || $4 ~ /SSK|Storage/i){print $1; exit}')
fi
[[ -n "${NVME:-}" && -b "$NVME" ]] || fatal "Could not auto-detect SSK USB NVMe dock. Export NVME=/dev/sdX and re-run."
echo "Auto-detected target disk: $NVME"
}
preflight_cleanup() {
local img="$IMG_DIR/$IMG_FILE"
# 1) Unmount image mountpoints and detach only loops for this IMG
umount -lf "$IMG_BOOT_MNT" "$IMG_ROOT_MNT" 2>/dev/null || true
# losetup -j exits non-zero if no association → tolerate it
{ losetup -j "$img" | cut -d: -f1 | xargs -r losetup -d; } 2>/dev/null || true
# 2) Unmount our target mounts
umount -lf "$TGT_ROOT/boot/firmware" "$TGT_BOOT" "$TGT_ROOT" 2>/dev/null || true
# 3) Unmount the actual target partitions if mounted anywhere (tolerate 'not found')
for p in "$(part 1)" "$(part 2)"; do
# findmnt returns 1 when no match → capture and iterate if any
while read -r mnt; do
[ -n "$mnt" ] && umount -lf "$mnt" 2>/dev/null || true
done < <(findmnt -rno TARGET -S "$p" 2>/dev/null || true)
done
# 4) Close dm-crypt mapping (if it exists)
cryptsetup luksClose cryptroot 2>/dev/null || true
dmsetup remove -f cryptroot 2>/dev/null || true
# 5) Let udev settle
command -v udevadm >/dev/null && udevadm settle || true
}
guard_target_device() {
# Refuse to operate if NVME appears to be the current system disk
local root_src root_disk
root_src=$(findmnt -no SOURCE /)
root_disk=$(lsblk -no pkname "$root_src" 2>/dev/null || true)
if [[ -n "$root_disk" && "/dev/$root_disk" == "$NVME" ]]; then
fatal "Refusing to operate on system disk ($NVME). Pick the external NVMe."
fi
}
need_host_fido2() {
if ! command -v fido2-token >/dev/null 2>&1; then
echo "Host is missing fido2-token. On Arch: sudo pacman -S libfido2"
echo "On Debian/Ubuntu host: sudo apt-get install fido2-tools"
exit 1
fi
}
ensure_image() {
mkdir -p "$IMG_DIR"
chmod 755 "$IMG_DIR"
local BASE="https://cdimage.ubuntu.com/releases/noble/release"
local XZ="ubuntu-24.04.3-preinstalled-${FLAVOR}-arm64+raspi.img.xz"
# If the decompressed .img is missing, fetch/decompress into the cache.
if [[ ! -f "$IMG_DIR/$IMG_FILE" ]]; then
need curl; need unxz # Arch: pacman -S curl xz | Ubuntu: apt-get install curl xz-utils
if [[ ! -f "$IMG_DIR/$XZ" ]]; then
echo "Fetching image…"
curl -fL -o "$IMG_DIR/$XZ" "$BASE/$XZ"
fi
echo "Decompressing to $IMG_DIR/$IMG_FILE"
# Keep the .xz for future runs; stream-decompress to the .img
if command -v unxz >/dev/null 2>&1; then
unxz -c "$IMG_DIR/$XZ" > "$IMG_DIR/$IMG_FILE"
else
need xz
xz -dc "$IMG_DIR/$XZ" > "$IMG_DIR/$IMG_FILE"
fi
sync
else
echo "Using cached image: $IMG_DIR/$IMG_FILE"
fi
}
ensure_binfmt_aarch64(){
# Register qemu-aarch64 for chrooted ARM64 apt runs
if [[ ! -e /proc/sys/fs/binfmt_misc/qemu-aarch64 ]]; then
need docker
systemctl enable --now docker >/dev/null 2>&1 || true
docker run --rm --privileged tonistiigi/binfmt --install arm64 >/dev/null
fi
if [[ ! -x /usr/local/bin/qemu-aarch64-static ]]; then
docker rm -f qemu-static >/dev/null 2>&1 || true
docker create --name qemu-static docker.io/multiarch/qemu-user-static:latest >/dev/null
docker cp qemu-static:/usr/bin/qemu-aarch64-static /usr/local/bin/
install -D -m755 /usr/local/bin/qemu-aarch64-static /usr/local/bin/qemu-aarch64-static
docker rm qemu-static >/dev/null
fi
}
open_image() {
[[ -r "$IMG_DIR/$IMG_FILE" ]] || fatal "Image not found: $IMG_DIR/$IMG_FILE"
mkdir -p "$IMG_BOOT_MNT" "$IMG_ROOT_MNT"
# Pre-clean: detach any previous loop(s) for this image (tolerate absence)
umount -lf "$IMG_BOOT_MNT" 2>/dev/null || true
umount -lf "$IMG_ROOT_MNT" 2>/dev/null || true
# If no loop is attached, losetup -j returns non-zero → swallow it
mapfile -t OLD < <({ losetup -j "$IMG_DIR/$IMG_FILE" | cut -d: -f1; } 2>/dev/null || true)
for L in "${OLD[@]:-}"; do losetup -d "$L" 2>/dev/null || true; done
command -v udevadm >/dev/null && udevadm settle || true
# Attach with partition scan; wait for partition nodes to exist
LOOP=$(losetup --find --show --partscan "$IMG_DIR/$IMG_FILE") || fatal "losetup failed"
command -v udevadm >/dev/null && udevadm settle || true
for _ in {1..25}; do
[[ -b "${LOOP}p1" && -b "${LOOP}p2" ]] && break
sleep 0.1
command -v udevadm >/dev/null && udevadm settle || true
done
[[ -b "${LOOP}p1" ]] || fatal "loop partitions not present for $LOOP"
# Cleanup on exit: unmount first, then detach loop (tolerate absence)
trap 'umount -lf "'"$IMG_BOOT_MNT"'" "'"$IMG_ROOT_MNT"'" 2>/dev/null; losetup -d "'"$LOOP"'" 2>/dev/null' EXIT
# Mount image partitions read-only
mount -o ro "${LOOP}p1" "$IMG_BOOT_MNT"
mount -o ro "${LOOP}p2" "$IMG_ROOT_MNT"
# Sanity checks without using failing pipelines
# start*.elf must exist
if ! compgen -G "$IMG_BOOT_MNT/start*.elf" > /dev/null; then
fatal "start*.elf not found in image"
fi
# vmlinuz-* must exist
if ! compgen -G "$IMG_ROOT_MNT/boot/vmlinuz-*" > /dev/null; then
fatal "vmlinuz-* not found in image root"
fi
}
confirm_and_wipe(){
lsblk -o NAME,SIZE,MODEL,TRAN,LABEL "$NVME"
read -rp "Type EXACTLY 'WIPE' to destroy ALL DATA on $NVME: " ACK
[[ "$ACK" == "WIPE" ]] || fatal "Aborted"
wipefs -a "$NVME"
sgdisk -Zo "$NVME"
# GPT: 1: 1MiB..513MiB vfat ESP; 2: rest LUKS
parted -s "$NVME" mklabel gpt \
mkpart system-boot fat32 1MiB 513MiB set 1 esp on \
mkpart cryptroot 513MiB 100%
partprobe "$NVME"; sleep 1
mkfs.vfat -F32 -n system-boot "$(part 1)"
}
setup_luks(){
echo "Create LUKS2 on $(part 2) (you will be prompted for a passphrase; keep it as fallback)"
need cryptsetup
cryptsetup luksFormat --type luks2 "$(part 2)"
cryptsetup open "$(part 2)" cryptroot
mkfs.ext4 -L rootfs /dev/mapper/cryptroot
}
mount_targets(){
mkdir -p "$TGT_ROOT" "$TGT_BOOT"
mount /dev/mapper/cryptroot "$TGT_ROOT"
mkdir -p "$TGT_ROOT/boot/firmware"
mount "$(part 1)" "$TGT_BOOT"
mount --bind "$TGT_BOOT" "$TGT_ROOT/boot/firmware"
}
rsync_root_and_boot(){
need rsync
rsync -aAXH --numeric-ids --delete \
--exclude='/boot/firmware' --exclude='/boot/firmware/**' \
--exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' \
--exclude='/run/*' --exclude='/tmp/*' --exclude='/mnt/*' \
--exclude='/media/*' --exclude='/lost+found' \
"$IMG_ROOT_MNT"/ "$TGT_ROOT"/
rsync -aH --delete "$IMG_BOOT_MNT"/ "$TGT_ROOT/boot/firmware"/
}
write_crypttab_fstab(){
LUUID=$(blkid -s UUID -o value "$(part 2)")
printf 'cryptroot UUID=%s none luks,discard,fido2-device=auto\n' "$LUUID" > "$TGT_ROOT/etc/crypttab"
cat > "$TGT_ROOT/etc/fstab" <<EOF
/dev/mapper/cryptroot / ext4 defaults,discard,errors=remount-ro 0 1
LABEL=system-boot /boot/firmware vfat defaults,umask=0077 0 1
EOF
}
fix_firmware_files(){
local C="$TGT_ROOT/boot/firmware/config.txt"
local CL="$TGT_ROOT/boot/firmware/cmdline.txt"
[[ -f "$C" ]] || fatal "missing $C"
# Always boot the uncompressed Pi 5 kernel
if grep -q '^kernel=' "$C"; then
sed -i 's#^kernel=.*#kernel=kernel_2712.img#' "$C"
else
sed -i '1i kernel=kernel_2712.img' "$C"
fi
# Ensure initramfs and cmdline indirection are set
grep -q '^initramfs ' "$C" || echo 'initramfs initrd.img followkernel' >> "$C"
grep -q '^cmdline=cmdline.txt' "$C" || sed -i '1i cmdline=cmdline.txt' "$C"
# Display & buses (Pi 5)
grep -q '^dtoverlay=vc4-kms-v3d-pi5' "$C" || echo 'dtoverlay=vc4-kms-v3d-pi5' >> "$C"
grep -q '^dtparam=i2c_arm=on' "$C" || echo 'dtparam=i2c_arm=on' >> "$C"
grep -q '^dtparam=pciex1=on' "$C" || echo 'dtparam=pciex1=on' >> "$C"
grep -q '^dtparam=pciex1_gen=2' "$C" || echo 'dtparam=pciex1_gen=2' >> "$C"
grep -q '^enable_uart=1' "$C" || echo 'enable_uart=1' >> "$C"
# Minimal, correct dracut hints using the bare UUID
local LUUID; LUUID=$(blkid -s UUID -o value "$(part 2)")
: > "$CL"
{
echo -n "rd.luks.uuid=$LUUID rd.luks.name=$LUUID=cryptroot "
echo -n "root=/dev/mapper/cryptroot rootfstype=ext4 rootwait fixrtc "
echo "console=serial0,115200 console=tty1 ds=nocloud;s=file:///boot/firmware/ ${DSI_FLAGS} rd.debug"
} >> "$CL"
}
seed_cloud_init(){
# NoCloud seed to create user, lock down SSH, set hostname, and enable avahi.
cat > "$TGT_ROOT/boot/firmware/user-data" <<EOF
#cloud-config
hostname: $STYX_HOSTNAME
manage_etc_hosts: true
users:
- name: $STYX_USER
gecos: "$STYX_USER"
shell: /bin/bash
groups: [sudo,video,i2c]
sudo: ALL=(ALL) NOPASSWD:ALL
lock_passwd: false
ssh_authorized_keys:
- $SSH_PUBKEY
chpasswd:
list: |
$STYX_USER:$STYX_PASS
expire: true
ssh_pwauth: false
package_update: true
packages: [openssh-server, avahi-daemon]
runcmd:
- systemctl enable --now ssh
- systemctl enable --now avahi-daemon || true
EOF
# Minimal meta-data for NoCloud
date +%s | awk '{print "instance-id: iid-titan-ag-"$1"\nlocal-hostname: '"$STYX_HOSTNAME"'"}' \
> "$TGT_ROOT/boot/firmware/meta-data"
}
prep_chroot_mounts(){
for d in dev proc sys; do mount --bind "/$d" "$TGT_ROOT/$d"; done
mount -t devpts devpts "$TGT_ROOT/dev/pts"
# Replace the usual resolv.conf symlink with a real file for apt to work
rm -f "$TGT_ROOT/etc/resolv.conf"
cp /etc/resolv.conf "$TGT_ROOT/etc/resolv.conf"
# Block service starts (no systemd in chroot)
cat > "$TGT_ROOT/usr/sbin/policy-rc.d" <<'EOP'
#!/bin/sh
exit 101
EOP
chmod +x "$TGT_ROOT/usr/sbin/policy-rc.d"
# Ensure qemu static is present inside chroot
install -D -m755 /usr/local/bin/qemu-aarch64-static "$TGT_ROOT/usr/bin/qemu-aarch64-static"
}
in_chroot(){
chroot "$TGT_ROOT" /usr/bin/qemu-aarch64-static /bin/bash -lc '
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC
# --- APT sources (ports) ---
cat > /etc/apt/sources.list <<'"'"'EOS'"'"'
deb http://ports.ubuntu.com/ubuntu-ports noble main restricted universe multiverse
deb http://ports.ubuntu.com/ubuntu-ports noble-updates main restricted universe multiverse
deb http://ports.ubuntu.com/ubuntu-ports noble-security main restricted universe multiverse
EOS
apt-get update
# --- Remove snaps and pin them off ---
apt-get -y purge snapd || true
rm -rf /snap /var/snap /var/lib/snapd /home/*/snap || true
mkdir -p /etc/apt/preferences.d
cat > /etc/apt/preferences.d/nosnap.pref <<'"'"'EOS'"'"'
Package: snapd
Pin: release *
Pin-Priority: -10
EOS
# --- Base tools (no flash-kernel; we use dracut) ---
apt-get install -y --no-install-recommends \
openssh-client openssh-server openssh-sftp-server avahi-daemon \
cryptsetup dracut fido2-tools libfido2-1 i2c-tools \
python3-smbus python3-pil zbar-tools qrencode lm-sensors \
file zstd lz4 || true
# Camera apps: try rpicam-apps; otherwise basic libcamera tools
apt-get install -y rpicam-apps || apt-get install -y libcamera-tools || true
# --- Persistent journal so we can read logs after failed boot ---
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/99-persistent.conf <<'"'"'EOS'"'"'
[Journal]
Storage=persistent
EOS
# --- SSH hardening (ensure file exists even if package was half-installed) ---
if [ ! -f /etc/ssh/sshd_config ]; then
mkdir -p /etc/ssh
cat > /etc/ssh/sshd_config <<'"'"'EOS'"'"'
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
# Accept defaults for the rest
EOS
fi
sed -i -e "s/^#\?PasswordAuthentication .*/PasswordAuthentication no/" \
-e "s/^#\?KbdInteractiveAuthentication .*/KbdInteractiveAuthentication no/" \
-e "s/^#\?PermitRootLogin .*/PermitRootLogin no/" \
-e "s/^#\?PubkeyAuthentication .*/PubkeyAuthentication yes/" /etc/ssh/sshd_config || true
# --- Hostname & hosts ---
echo "'"$STYX_HOSTNAME"'" > /etc/hostname
if grep -q "^127\\.0\\.1\\.1" /etc/hosts; then
sed -i "s/^127\\.0\\.1\\.1.*/127.0.1.1\t'"$STYX_HOSTNAME"'/" /etc/hosts
else
echo -e "127.0.1.1\t'"$STYX_HOSTNAME"'" >> /etc/hosts
fi
# --- Enable services on first boot ---
mkdir -p /etc/systemd/system/multi-user.target.wants
ln -sf /lib/systemd/system/ssh.service /etc/systemd/system/multi-user.target.wants/ssh.service
ln -sf /lib/systemd/system/avahi-daemon.service /etc/systemd/system/multi-user.target.wants/avahi-daemon.service || true
# --- Ensure i2c group ---
getent group i2c >/dev/null || groupadd i2c
# --- Dracut configuration (generic, not host-only) ---
mkdir -p /etc/dracut.conf.d
cat > /etc/dracut.conf.d/00-hostonly.conf <<'"'"'EOS'"'"'
hostonly=no
EOS
cat > /etc/dracut.conf.d/10-systemd-crypt.conf <<'"'"'EOS'"'"'
add_dracutmodules+=" systemd crypt "
EOS
cat > /etc/dracut.conf.d/20-drivers.conf <<'"'"'EOS'"'"'
add_drivers+=" nvme xhci_pci xhci_hcd usbhid hid_generic hid "
EOS
cat > /etc/dracut.conf.d/30-fido2.conf <<'"'"'EOS'"'"'
install_items+="/usr/bin/systemd-cryptsetup /usr/bin/fido2-token /usr/lib/*/libfido2.so* /usr/lib/*/libcbor.so*"
EOS
# --- Build initramfs and place it where firmware expects it ---
KVER=$(ls -1 /lib/modules | sort -V | tail -n1)
dracut --force /boot/initramfs-$KVER.img $KVER
ln -sf initramfs-$KVER.img /boot/initrd.img
ln -sf initramfs-$KVER.img /boot/initrd.img-$KVER
cp -a /boot/initramfs-$KVER.img /boot/firmware/initrd.img
# --- Create uncompressed kernel for Pi 5 firmware ---
if [ -f "/usr/lib/linux-image-$KVER/Image" ]; then
cp -a "/usr/lib/linux-image-$KVER/Image" /boot/firmware/kernel_2712.img
else
FMT=$(file -b "/boot/vmlinuz-$KVER" || true)
case "$FMT" in
*Zstandard*|*zstd*) zstd -dc "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;;
*LZ4*) lz4 -dc "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;;
*gzip*) zcat "/boot/vmlinuz-$KVER" > /boot/firmware/kernel_2712.img ;;
*) cp -a "/boot/vmlinuz-$KVER" /boot/firmware/kernel_2712.img ;;
esac
fi
# --- Ensure Pi 5 DTB is present on the boot partition ---
DTB=$(find /lib/firmware -type f -name "bcm2712-rpi-5-b.dtb" | sort | tail -n1 || true)
[ -n "$DTB" ] && cp -a "$DTB" /boot/firmware/
# --- Dracut hook to copy rdsosreport.txt to the FAT partition on failure ---
mkdir -p /usr/lib/dracut/modules.d/99copylog
cat > /usr/lib/dracut/modules.d/99copylog/module-setup.sh <<'"'"'EOS'"'"'
#!/bin/bash
check() { return 0; }
depends() { echo base; return 0; }
install() {
# Guard $moddir for nounset; derive if absent
local mdir="${moddir:-$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)}"
inst_hook emergency 99 "$mdir/copylog.sh"
}
EOS
chmod +x /usr/lib/dracut/modules.d/99copylog/module-setup.sh
cat > /usr/lib/dracut/modules.d/99copylog/copylog.sh <<'"'"'EOS'"'"'
#!/bin/sh
set -e
for dev in /dev/nvme0n1p1 /dev/sda1 /dev/sdb1 /dev/mmcblk0p1; do
[ -b "$dev" ] || continue
mkdir -p /mnt/bootfat
if mount -t vfat "$dev" /mnt/bootfat 2>/dev/null; then
if [ -s /run/initramfs/rdsosreport.txt ]; then
cp -f /run/initramfs/rdsosreport.txt /mnt/bootfat/rdsosreport.txt 2>/dev/null || true
sync || true
fi
umount /mnt/bootfat || true
break
fi
done
EOS
chmod +x /usr/lib/dracut/modules.d/99copylog/copylog.sh
# Rebuild to ensure the copylog module is included
dracut --force /boot/initramfs-$KVER.img $KVER
ln -sf initramfs-$KVER.img /boot/initrd.img
cp -a /boot/initramfs-$KVER.img /boot/firmware/initrd.img
true
'
}
verify_boot_assets(){
echo "---- verify boot assets on FAT ----"
file "$TGT_ROOT/boot/firmware/kernel_2712.img" || true
ls -lh "$TGT_ROOT/boot/firmware/initrd.img" || true
echo "-- config.txt (key lines) --"
grep -E '^(kernel|initramfs|cmdline)=|^dtoverlay=|^dtparam=' "$TGT_ROOT/boot/firmware/config.txt" || true
echo "-- cmdline.txt --"
cat "$TGT_ROOT/boot/firmware/cmdline.txt" || true
echo "-- firmware blobs (sample) --"
ls -1 "$TGT_ROOT/boot/firmware"/start*.elf "$TGT_ROOT/boot/firmware"/fixup*.dat | head -n 8 || true
echo "-- Pi5 DTB --"
ls -l "$TGT_ROOT/boot/firmware/"*rpi-5-b.dtb || true
}
enroll_fido_tokens(){
echo "Enrolling FIDO2 Solo keys into $(part 2) ..."
need systemd-cryptenroll
need fido2-token
# Collect all hidraw paths from both output styles (some distros print 'Device: /dev/hidrawX')
mapfile -t DEVS < <(
fido2-token -L \
| sed -n 's,^\(/dev/hidraw[0-9]\+\):.*,\1,p; s,^Device:[[:space:]]\+/dev/hidraw\([0-9]\+\).*,/dev/hidraw\1,p' \
| sort -u
)
if (( ${#DEVS[@]} == 0 )); then
echo "No FIDO2 tokens detected; skipping enrollment (you can enroll later)."
echo "Example later: systemd-cryptenroll $(part 2) --fido2-device=/dev/hidrawX --fido2-with-client-pin=no"
return 0
fi
# Recommend keeping exactly ONE key plugged during first enrollment to avoid ambiguity.
if (( ${#DEVS[@]} > 1 )); then
echo "Note: multiple FIDO2 tokens present: ${DEVS[*]}"
echo "If enrollment fails, try with only one key inserted."
fi
local rc=0
for D in "${DEVS[@]}"; do
echo "-> Enrolling $D (you should be asked to touch the key)"
if ! SYSTEMD_LOG_LEVEL=debug systemd-cryptenroll "$(part 2)" \
--fido2-device="$D" \
--fido2-with-client-pin=no \
--fido2-with-user-presence=yes \
--fido2-with-user-verification=no \
--label="solo-$(basename "$D")"; then
echo "WARN: enrollment failed for $D"
rc=1
fi
done
echo "Tokens enrolled (if any):"
systemd-cryptenroll "$(part 2)" --list || true
return $rc
}
cleanup(){
rm -f "$TGT_ROOT/usr/sbin/policy-rc.d" || true
umount -lf "$TGT_ROOT/dev/pts" 2>/dev/null || true
for d in dev proc sys; do umount -lf "$TGT_ROOT/$d" 2>/dev/null || true; done
umount -lf "$TGT_ROOT/boot/firmware" 2>/dev/null || true
umount -lf "$TGT_BOOT" 2>/dev/null || true
umount -lf "$TGT_ROOT" 2>/dev/null || true
cryptsetup close cryptroot 2>/dev/null || true
umount -lf "$IMG_BOOT_MNT" 2>/dev/null || true
umount -lf "$IMG_ROOT_MNT" 2>/dev/null || true
}
main(){
require_root
need losetup; need parted; need rsync
auto_detect_target_disk
echo "Target disk: $NVME"
ensure_binfmt_aarch64
ensure_image
preflight_cleanup
guard_target_device
open_image
confirm_and_wipe
setup_luks
mount_targets
rsync_root_and_boot
write_crypttab_fstab
fix_firmware_files
seed_cloud_init
prep_chroot_mounts
in_chroot
verify_boot_assets
need_host_fido2
enroll_fido_tokens
cleanup
echo "✅ NVMe prepared."
echo " Install in the Pi 5 and boot with no SD."
echo " Expect LUKS to unlock automatically with a Solo key inserted;"
echo " passphrase fallback remains. Hostname: ${STYX_HOSTNAME} User: ${STYX_USER}"
echo " On first boot, reach it via: ssh -i ~/.ssh/id_ed25519_titan styx@titan-ag.local"
}
main "$@"

View File

@ -1,58 +0,0 @@
import importlib.util
import pathlib
def load_module():
path = pathlib.Path(__file__).resolve().parents[1] / "dashboards_render_atlas.py"
spec = importlib.util.spec_from_file_location("dashboards_render_atlas", path)
module = importlib.util.module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(module)
return module
def test_table_panel_options_and_filterable():
mod = load_module()
panel = mod.table_panel(
1,
"test",
"metric",
{"h": 1, "w": 1, "x": 0, "y": 0},
unit="percent",
transformations=[{"id": "labelsToFields", "options": {}}],
instant=True,
options={"showColumnFilters": False},
filterable=False,
footer={"show": False, "fields": "", "calcs": []},
format="table",
)
assert panel["fieldConfig"]["defaults"]["unit"] == "percent"
assert panel["fieldConfig"]["defaults"]["custom"]["filterable"] is False
assert panel["options"]["showHeader"] is True
assert panel["targets"][0]["format"] == "table"
def test_node_filter_and_expr_helpers():
mod = load_module()
expr = mod.node_filter("titan-.*")
assert "label_replace" in expr
cpu_expr = mod.node_cpu_expr("titan-.*")
mem_expr = mod.node_mem_expr("titan-.*")
assert "node_cpu_seconds_total" in cpu_expr
assert "node_memory_MemAvailable_bytes" in mem_expr
def test_render_configmap_writes(tmp_path):
mod = load_module()
mod.DASHBOARD_DIR = tmp_path / "dash"
mod.ROOT = tmp_path
uid = "atlas-test"
info = {"configmap": tmp_path / "cm.yaml"}
data = {"title": "Atlas Test"}
mod.write_json(uid, data)
mod.render_configmap(uid, info)
json_path = mod.DASHBOARD_DIR / f"{uid}.json"
assert json_path.exists()
content = (tmp_path / "cm.yaml").read_text()
assert "kind: ConfigMap" in content
assert f"{uid}.json" in content

View File

@ -1,181 +0,0 @@
import importlib.util
import pathlib
import pytest
def load_sync_module(monkeypatch):
# Minimal env required by module import
env = {
"KEYCLOAK_BASE_URL": "http://keycloak",
"KEYCLOAK_REALM": "atlas",
"KEYCLOAK_CLIENT_ID": "mailu-sync",
"KEYCLOAK_CLIENT_SECRET": "secret",
"MAILU_DOMAIN": "example.com",
"MAILU_DB_HOST": "localhost",
"MAILU_DB_PORT": "5432",
"MAILU_DB_NAME": "mailu",
"MAILU_DB_USER": "mailu",
"MAILU_DB_PASSWORD": "pw",
}
for k, v in env.items():
monkeypatch.setenv(k, v)
module_path = pathlib.Path(__file__).resolve().parents[1] / "mailu_sync.py"
spec = importlib.util.spec_from_file_location("mailu_sync_testmod", module_path)
module = importlib.util.module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(module)
return module
def test_random_password_length_and_charset(monkeypatch):
sync = load_sync_module(monkeypatch)
pw = sync.random_password()
assert len(pw) == 24
assert all(ch.isalnum() for ch in pw)
class _FakeResponse:
def __init__(self, json_data=None, status=200):
self._json_data = json_data or {}
self.status_code = status
def raise_for_status(self):
if self.status_code >= 400:
raise AssertionError(f"status {self.status_code}")
def json(self):
return self._json_data
class _FakeSession:
def __init__(self, put_resp, get_resp):
self.put_resp = put_resp
self.get_resp = get_resp
self.put_called = False
self.get_called = False
def post(self, *args, **kwargs):
return _FakeResponse({"access_token": "dummy"})
def put(self, *args, **kwargs):
self.put_called = True
return self.put_resp
def get(self, *args, **kwargs):
self.get_called = True
return self.get_resp
def test_kc_update_attributes_succeeds(monkeypatch):
sync = load_sync_module(monkeypatch)
ok_resp = _FakeResponse({"attributes": {"mailu_app_password": ["abc"]}})
sync.SESSION = _FakeSession(_FakeResponse({}), ok_resp)
sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"})
assert sync.SESSION.put_called and sync.SESSION.get_called
def test_kc_update_attributes_raises_without_attribute(monkeypatch):
sync = load_sync_module(monkeypatch)
missing_attr_resp = _FakeResponse({"attributes": {}}, status=200)
sync.SESSION = _FakeSession(_FakeResponse({}), missing_attr_resp)
with pytest.raises(Exception):
sync.kc_update_attributes("token", {"id": "u1", "username": "u1"}, {"mailu_app_password": "abc"})
def test_kc_get_users_paginates(monkeypatch):
sync = load_sync_module(monkeypatch)
class _PagedSession:
def __init__(self):
self.calls = 0
def post(self, *_, **__):
return _FakeResponse({"access_token": "tok"})
def get(self, *_, **__):
self.calls += 1
if self.calls == 1:
return _FakeResponse([{"id": "u1"}, {"id": "u2"}])
return _FakeResponse([]) # stop pagination
sync.SESSION = _PagedSession()
users = sync.kc_get_users("tok")
assert [u["id"] for u in users] == ["u1", "u2"]
assert sync.SESSION.calls == 2
def test_ensure_mailu_user_skips_foreign_domain(monkeypatch):
sync = load_sync_module(monkeypatch)
executed = []
class _Cursor:
def execute(self, sql, params):
executed.append((sql, params))
sync.ensure_mailu_user(_Cursor(), "user@other.com", "pw", "User")
assert not executed
def test_ensure_mailu_user_upserts(monkeypatch):
sync = load_sync_module(monkeypatch)
captured = {}
class _Cursor:
def execute(self, sql, params):
captured.update(params)
sync.ensure_mailu_user(_Cursor(), "user@example.com", "pw", "User Example")
assert captured["email"] == "user@example.com"
assert captured["localpart"] == "user"
# password should be hashed, not the raw string
assert captured["password"] != "pw"
def test_main_generates_password_and_upserts(monkeypatch):
sync = load_sync_module(monkeypatch)
users = [
{"id": "u1", "username": "user1", "email": "user1@example.com", "attributes": {}},
{"id": "u2", "username": "user2", "email": "user2@example.com", "attributes": {"mailu_app_password": ["keepme"]}},
{"id": "u3", "username": "user3", "email": "user3@other.com", "attributes": {}},
]
updated = []
class _Cursor:
def __init__(self):
self.executions = []
def execute(self, sql, params):
self.executions.append(params)
def close(self):
return None
class _Conn:
def __init__(self):
self.autocommit = False
self._cursor = _Cursor()
def cursor(self, cursor_factory=None):
return self._cursor
def close(self):
return None
monkeypatch.setattr(sync, "get_kc_token", lambda: "tok")
monkeypatch.setattr(sync, "kc_get_users", lambda token: users)
monkeypatch.setattr(sync, "kc_update_attributes", lambda token, user, attrs: updated.append((user["id"], attrs["mailu_app_password"])))
conns = []
def _connect(**kwargs):
conn = _Conn()
conns.append(conn)
return conn
monkeypatch.setattr(sync.psycopg2, "connect", _connect)
sync.main()
# Should attempt two inserts (third user skipped due to domain mismatch)
assert len(updated) == 1 # only one missing attr was backfilled
assert conns and len(conns[0]._cursor.executions) == 2

View File

@ -5,7 +5,7 @@ metadata:
name: gitea-ingress
namespace: gitea
annotations:
cert-manager.io/cluster-issuer: letsencrypt
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
tls:

View File

@ -1,49 +0,0 @@
# services/gitops-ui/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: weave-gitops
namespace: flux-system
spec:
interval: 30m
chart:
spec:
chart: ./charts/gitops-server
sourceRef:
kind: GitRepository
name: weave-gitops-upstream
namespace: flux-system
# track upstream tag; see source object for version pin
install:
remediation:
retries: 3
upgrade:
remediation:
retries: 3
remediateLastFailure: true
cleanupOnFail: true
values:
adminUser:
create: true
createClusterRole: true
createSecret: true
username: admin
# bcrypt hash for temporary password "G1tOps!2025" (rotate after login)
passwordHash: "$2y$12$wDEOzR1Gc2dbvNSJ3ZXNdOBVFEjC6YASIxnZmHIbO.W1m0fie/QVi"
ingress:
enabled: true
className: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
traefik.ingress.kubernetes.io/router.entrypoints: websecure
hosts:
- host: cd.bstein.dev
paths:
- path: /
pathType: Prefix
tls:
- secretName: gitops-ui-tls
hosts:
- cd.bstein.dev
metrics:
enabled: true

View File

@ -1,7 +0,0 @@
# services/gitops-ui/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: flux-system
resources:
- source.yaml
- helmrelease.yaml

View File

@ -1,11 +0,0 @@
# services/gitops-ui/source.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: weave-gitops-upstream
namespace: flux-system
spec:
interval: 1h
url: https://github.com/weaveworks/weave-gitops.git
ref:
tag: v0.38.0

View File

@ -5,7 +5,7 @@ metadata:
name: jitsi
namespace: jitsi
annotations:
cert-manager.io/cluster-issuer: letsencrypt
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: traefik
tls:

View File

@ -1,27 +0,0 @@
# services/keycloak
Keycloak is deployed via raw manifests and backed by the shared Postgres (`postgres-service.postgres.svc.cluster.local:5432`). Create these secrets before applying:
```bash
# DB creds (per-service DB/user in shared Postgres)
kubectl -n sso create secret generic keycloak-db \
--from-literal=username=keycloak \
--from-literal=password='<DB_PASSWORD>' \
--from-literal=database=keycloak
# Admin console creds (maps to KC admin user)
kubectl -n sso create secret generic keycloak-admin \
--from-literal=username=brad@bstein.dev \
--from-literal=password='<ADMIN_PASSWORD>'
```
Apply:
```bash
kubectl apply -k services/keycloak
```
Notes
- Service: `keycloak.sso.svc:80` (Ingress `sso.bstein.dev`, TLS via cert-manager).
- Uses Postgres schema `public`; DB/user should be provisioned in the shared Postgres instance.
- Health endpoints on :9000 are wired for probes.

View File

@ -1,154 +0,0 @@
# services/keycloak/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: keycloak
namespace: sso
labels:
app: keycloak
spec:
replicas: 1
selector:
matchLabels:
app: keycloak
template:
metadata:
labels:
app: keycloak
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: hardware
operator: In
values: ["rpi5","rpi4"]
- key: node-role.kubernetes.io/worker
operator: Exists
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values: ["titan-24"]
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 90
preference:
matchExpressions:
- key: hardware
operator: In
values: ["rpi5"]
- weight: 70
preference:
matchExpressions:
- key: hardware
operator: In
values: ["rpi4"]
securityContext:
runAsUser: 1000
runAsGroup: 0
fsGroup: 1000
fsGroupChangePolicy: OnRootMismatch
imagePullSecrets:
- name: zot-regcred
initContainers:
- name: mailu-http-listener
image: registry.bstein.dev/sso/mailu-http-listener:0.1.0
imagePullPolicy: IfNotPresent
command: ["/bin/sh", "-c"]
args:
- |
cp /plugin/mailu-http-listener-0.1.0.jar /providers/
cp -r /plugin/src /providers/src
volumeMounts:
- name: providers
mountPath: /providers
containers:
- name: keycloak
image: quay.io/keycloak/keycloak:26.0.7
imagePullPolicy: IfNotPresent
args:
- start
env:
- name: KC_DB
value: postgres
- name: KC_DB_URL_HOST
value: postgres-service.postgres.svc.cluster.local
- name: KC_DB_URL_DATABASE
valueFrom:
secretKeyRef:
name: keycloak-db
key: database
- name: KC_DB_USERNAME
valueFrom:
secretKeyRef:
name: keycloak-db
key: username
- name: KC_DB_PASSWORD
valueFrom:
secretKeyRef:
name: keycloak-db
key: password
- name: KC_DB_SCHEMA
value: public
- name: KC_HOSTNAME
value: sso.bstein.dev
- name: KC_HOSTNAME_URL
value: https://sso.bstein.dev
- name: KC_PROXY
value: edge
- name: KC_PROXY_HEADERS
value: xforwarded
- name: KC_HTTP_ENABLED
value: "true"
- name: KC_HTTP_MANAGEMENT_PORT
value: "9000"
- name: KC_HTTP_MANAGEMENT_BIND_ADDRESS
value: 0.0.0.0
- name: KC_HEALTH_ENABLED
value: "true"
- name: KC_METRICS_ENABLED
value: "true"
- name: KEYCLOAK_ADMIN
valueFrom:
secretKeyRef:
name: keycloak-admin
key: username
- name: KEYCLOAK_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: keycloak-admin
key: password
- name: KC_EVENTS_LISTENERS
value: jboss-logging,mailu-http
- name: KC_SPI_EVENTS_LISTENER_MAILU-HTTP_ENDPOINT
value: http://mailu-sync-listener.mailu-mailserver.svc.cluster.local:8080/events
ports:
- containerPort: 8080
name: http
- containerPort: 9000
name: metrics
readinessProbe:
httpGet:
path: /health/ready
port: 9000
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 6
livenessProbe:
httpGet:
path: /health/live
port: 9000
initialDelaySeconds: 60
periodSeconds: 15
failureThreshold: 6
volumeMounts:
- name: data
mountPath: /opt/keycloak/data
- name: providers
mountPath: /opt/keycloak/providers
volumes:
- name: data
persistentVolumeClaim:
claimName: keycloak-data
- name: providers
emptyDir: {}

View File

@ -1,24 +0,0 @@
# services/keycloak/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: keycloak
namespace: sso
annotations:
cert-manager.io/cluster-issuer: letsencrypt
spec:
ingressClassName: traefik
rules:
- host: sso.bstein.dev
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: keycloak
port:
number: 80
tls:
- hosts: [sso.bstein.dev]
secretName: keycloak-tls

View File

@ -1,10 +0,0 @@
# services/keycloak/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: sso
resources:
- namespace.yaml
- pvc.yaml
- deployment.yaml
- service.yaml
- ingress.yaml

View File

@ -1,5 +0,0 @@
# services/keycloak/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: sso

View File

@ -1,12 +0,0 @@
# services/keycloak/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: keycloak-data
namespace: sso
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
storageClassName: astreae

View File

@ -1,15 +0,0 @@
# services/keycloak/service.yaml
apiVersion: v1
kind: Service
metadata:
name: keycloak
namespace: sso
labels:
app: keycloak
spec:
selector:
app: keycloak
ports:
- name: http
port: 80
targetPort: http

View File

@ -1,13 +0,0 @@
# services/mailu/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: mailu-tls
namespace: mailu-mailserver
spec:
secretName: mailu-certificates
issuerRef:
kind: ClusterIssuer
name: letsencrypt-prod
dnsNames:
- mail.bstein.dev

View File

@ -1,287 +0,0 @@
# services/mailu/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: mailu
namespace: mailu-mailserver
spec:
interval: 30m
chart:
spec:
chart: mailu
version: 2.1.2
sourceRef:
kind: HelmRepository
name: mailu
namespace: flux-system
install:
remediation: { retries: 3 }
timeout: 10m
upgrade:
remediation:
retries: 3
remediateLastFailure: true
cleanupOnFail: true
timeout: 10m
values:
mailuVersion: "2024.06"
domain: bstein.dev
hostnames: [mail.bstein.dev]
domains:
- name: bstein.dev
enabled: true
dkim:
enabled: true
externalRelay:
host: "[email-smtp.us-east-2.amazonaws.com]:587"
existingSecret: mailu-ses-relay
usernameKey: relay-username
passwordKey: relay-password
timezone: Etc/UTC
subnet: 10.42.0.0/16
existingSecret: mailu-secret
tls:
outboundLevel: encrypt
externalDatabase:
enabled: true
type: postgresql
host: postgres-service.postgres.svc.cluster.local
port: 5432
database: mailu
username: mailu
existingSecret: mailu-db-secret
existingSecretUsernameKey: username
existingSecretPasswordKey: password
existingSecretDatabaseKey: database
initialAccount:
enabled: true
username: test
domain: bstein.dev
existingSecret: mailu-initial-account-secret
existingSecretPasswordKey: password
persistence:
accessModes: [ReadWriteMany]
size: 100Gi
storageClass: astreae
single_pvc: true
front:
hostnames: [mail.bstein.dev]
proxied: true
hostPort:
enabled: false
https:
enabled: false
external: false
forceHttps: false
externalService:
enabled: true
type: LoadBalancer
externalTrafficPolicy: Cluster
ports:
submission: true
nodePorts:
pop3: 30010
pop3s: 30011
imap: 30143
imaps: 30993
manageSieve: 30419
smtp: 30025
smtps: 30465
submission: 30587
logLevel: DEBUG
nodeSelector:
hardware: rpi4
admin:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
podLivenessProbe:
enabled: true
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
successThreshold: 1
podReadinessProbe:
enabled: true
initialDelaySeconds: 20
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
successThreshold: 1
extraEnvVars:
- name: FLASK_DEBUG
value: "1"
- name: ACCESSLOG
value: /dev/stdout
- name: ERRORLOG
value: /dev/stderr
- name: WEBROOT_REDIRECT
value: ""
- name: FORWARDED_ALLOW_IPS
value: 127.0.0.1,10.42.0.0/16
- name: DNS_RESOLVERS
value: 1.1.1.1,9.9.9.9
extraVolumes:
- name: unbound-config
configMap:
name: mailu-unbound
- name: unbound-run
emptyDir: {}
extraVolumeMounts:
- name: unbound-run
mountPath: /var/lib/unbound
extraContainers:
- name: unbound
image: docker.io/alpine:3.20
command: ["/bin/sh", "-c"]
args:
- |
while :; do
printf "nameserver 10.43.0.10\n" > /etc/resolv.conf
if apk add --no-cache unbound bind-tools; then
break
fi
echo "apk failed, retrying" >&2
sleep 10
done
cat >/etc/resolv.conf <<'EOF'
search mailu-mailserver.svc.cluster.local svc.cluster.local cluster.local
nameserver 127.0.0.1
EOF
unbound-anchor -a /var/lib/unbound/root.key || true
exec unbound -d -c /opt/unbound/etc/unbound/unbound.conf
ports:
- containerPort: 53
protocol: UDP
- containerPort: 53
protocol: TCP
volumeMounts:
- name: unbound-config
mountPath: /opt/unbound/etc/unbound
- name: unbound-run
mountPath: /var/lib/unbound
dnsPolicy: None
dnsConfig:
nameservers:
- 127.0.0.1
searches:
- mailu-mailserver.svc.cluster.local
- svc.cluster.local
- cluster.local
clamav:
image:
repository: clamav/clamav-debian
tag: "1.4"
logLevel: DEBUG
nodeSelector:
hardware: rpi5
resources:
requests:
cpu: 200m
memory: 1Gi
limits:
cpu: 500m
memory: 3Gi
livenessProbe:
enabled: false
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 6
successThreshold: 1
startupProbe:
enabled: false
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 20
successThreshold: 1
readinessProbe:
enabled: false
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 6
successThreshold: 1
dovecot:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
oletools:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
postfix:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
overrides:
smtp_use_tls: "yes"
smtp_tls_security_level: "encrypt"
smtp_sasl_security_options: "noanonymous"
redis:
enabled: true
architecture: standalone
logLevel: DEBUG
image:
repository: bitnamilegacy/redis
tag: 8.0.3-debian-12-r3
master:
nodeSelector:
hardware: rpi4
persistence:
enabled: true
accessModes: [ReadWriteMany]
size: 8Gi
storageClass: astreae
rspamd:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
persistence:
accessModes: [ReadWriteOnce]
size: 8Gi
storageClass: astreae
tika:
logLevel: DEBUG
nodeSelector:
hardware: rpi4
global:
logLevel: DEBUG
storageClass: astreae
webmail:
enabled: false
nodeSelector:
hardware: rpi4
ingress:
enabled: false
ingressClassName: traefik
tls: true
existingSecret: mailu-certificates
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/service.serversscheme: https
traefik.ingress.kubernetes.io/service.serverstransport: mailu-transport@kubernetescrd
extraRules:
- host: mail.bstein.dev
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: mailu-front
port:
number: 443
service:
ports:
smtp:
port: 25
targetPort: 25
smtps:
port: 465
targetPort: 465
submission:
port: 587
targetPort: 587

View File

@ -1,19 +0,0 @@
# services/mailu/ingressroute.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: mailu
namespace: mailu-mailserver
spec:
entryPoints:
- websecure
routes:
- match: Host(`mail.bstein.dev`)
kind: Rule
services:
- name: mailu-front
port: 443
scheme: https
serversTransport: mailu-transport
tls:
secretName: mailu-certificates

View File

@ -1,23 +0,0 @@
# services/mailu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: mailu-mailserver
resources:
- namespace.yaml
- helmrelease.yaml
- certificate.yaml
- vip-controller.yaml
- unbound-configmap.yaml
- serverstransport.yaml
- ingressroute.yaml
- mailu-sync-job.yaml
- mailu-sync-cronjob.yaml
- mailu-sync-listener.yaml
configMapGenerator:
- name: mailu-sync-script
namespace: mailu-mailserver
files:
- sync.py=../../scripts/mailu_sync.py
options:
disableNameSuffixHash: true

View File

@ -1,77 +0,0 @@
# services/mailu/mailu-sync-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: mailu-sync-nightly
namespace: mailu-mailserver
spec:
schedule: "30 4 * * *"
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: mailu-sync
image: python:3.11-alpine
imagePullPolicy: IfNotPresent
command: ["/bin/sh", "-c"]
args:
- |
pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
&& python /app/sync.py
env:
- name: KEYCLOAK_BASE_URL
value: http://keycloak.sso.svc.cluster.local
- name: KEYCLOAK_REALM
value: atlas
- name: MAILU_DOMAIN
value: bstein.dev
- name: MAILU_DEFAULT_QUOTA
value: "20000000000"
- name: MAILU_DB_HOST
value: postgres-service.postgres.svc.cluster.local
- name: MAILU_DB_PORT
value: "5432"
- name: MAILU_DB_NAME
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: database
- name: MAILU_DB_USER
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: username
- name: MAILU_DB_PASSWORD
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: password
- name: KEYCLOAK_CLIENT_ID
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-id
- name: KEYCLOAK_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-secret
volumeMounts:
- name: sync-script
mountPath: /app/sync.py
subPath: sync.py
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: sync-script
configMap:
name: mailu-sync-script
defaultMode: 0444

View File

@ -1,73 +0,0 @@
# services/mailu/mailu-sync-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: mailu-sync
namespace: mailu-mailserver
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: mailu-sync
image: python:3.11-alpine
imagePullPolicy: IfNotPresent
command: ["/bin/sh", "-c"]
args:
- |
pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
&& python /app/sync.py
env:
- name: KEYCLOAK_BASE_URL
value: http://keycloak.sso.svc.cluster.local
- name: KEYCLOAK_REALM
value: atlas
- name: MAILU_DOMAIN
value: bstein.dev
- name: MAILU_DEFAULT_QUOTA
value: "20000000000"
- name: MAILU_DB_HOST
value: postgres-service.postgres.svc.cluster.local
- name: MAILU_DB_PORT
value: "5432"
- name: MAILU_DB_NAME
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: database
- name: MAILU_DB_USER
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: username
- name: MAILU_DB_PASSWORD
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: password
- name: KEYCLOAK_CLIENT_ID
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-id
- name: KEYCLOAK_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-secret
volumeMounts:
- name: sync-script
mountPath: /app/sync.py
subPath: sync.py
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: sync-script
configMap:
name: mailu-sync-script
defaultMode: 0444

View File

@ -1,154 +0,0 @@
# services/mailu/mailu-sync-listener.yaml
apiVersion: v1
kind: Service
metadata:
name: mailu-sync-listener
namespace: mailu-mailserver
spec:
selector:
app: mailu-sync-listener
ports:
- name: http
port: 8080
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mailu-sync-listener
namespace: mailu-mailserver
labels:
app: mailu-sync-listener
spec:
replicas: 1
selector:
matchLabels:
app: mailu-sync-listener
template:
metadata:
labels:
app: mailu-sync-listener
spec:
restartPolicy: Always
containers:
- name: listener
image: python:3.11-alpine
imagePullPolicy: IfNotPresent
command: ["/bin/sh", "-c"]
args:
- |
pip install --no-cache-dir requests psycopg2-binary passlib >/tmp/pip.log \
&& python /app/listener.py
env:
- name: KEYCLOAK_BASE_URL
value: http://keycloak.sso.svc.cluster.local
- name: KEYCLOAK_REALM
value: atlas
- name: MAILU_DOMAIN
value: bstein.dev
- name: MAILU_DEFAULT_QUOTA
value: "20000000000"
- name: MAILU_DB_HOST
value: postgres-service.postgres.svc.cluster.local
- name: MAILU_DB_PORT
value: "5432"
- name: MAILU_DB_NAME
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: database
- name: MAILU_DB_USER
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: username
- name: MAILU_DB_PASSWORD
valueFrom:
secretKeyRef:
name: mailu-db-secret
key: password
- name: KEYCLOAK_CLIENT_ID
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-id
- name: KEYCLOAK_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: mailu-sync-credentials
key: client-secret
volumeMounts:
- name: sync-script
mountPath: /app/sync.py
subPath: sync.py
- name: listener-script
mountPath: /app/listener.py
subPath: listener.py
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: sync-script
configMap:
name: mailu-sync-script
defaultMode: 0444
- name: listener-script
configMap:
name: mailu-sync-listener
defaultMode: 0444
---
apiVersion: v1
kind: ConfigMap
metadata:
name: mailu-sync-listener
namespace: mailu-mailserver
data:
listener.py: |
import http.server
import json
import os
import subprocess
import threading
from time import time
# Simple debounce to avoid hammering on bursts
MIN_INTERVAL_SECONDS = 10
last_run = 0.0
lock = threading.Lock()
def trigger_sync():
global last_run
with lock:
now = time()
if now - last_run < MIN_INTERVAL_SECONDS:
return
last_run = now
# Fire and forget; output to stdout
subprocess.Popen(["python", "/app/sync.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
class Handler(http.server.BaseHTTPRequestHandler):
def do_POST(self):
length = int(self.headers.get("Content-Length", 0))
body = self.rfile.read(length) if length else b""
try:
json.loads(body or b"{}")
except json.JSONDecodeError:
self.send_response(400)
self.end_headers()
return
trigger_sync()
self.send_response(202)
self.end_headers()
def log_message(self, fmt, *args):
# Quiet logging
return
if __name__ == "__main__":
server = http.server.ThreadingHTTPServer(("", 8080), Handler)
server.serve_forever()

View File

@ -1,5 +0,0 @@
# services/mailu/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: mailu-mailserver

View File

@ -1,10 +0,0 @@
# services/mailu/serverstransport.yaml
apiVersion: traefik.io/v1alpha1
kind: ServersTransport
metadata:
name: mailu-transport
namespace: mailu-mailserver
spec:
# Force SNI to mail.bstein.dev and skip backend cert verification (backend cert is for the host, not the pod IP).
serverName: mail.bstein.dev
insecureSkipVerify: true

View File

@ -1,49 +0,0 @@
# services/mailu/unbound-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: mailu-unbound
namespace: mailu-mailserver
data:
unbound.conf: |
server:
verbosity: 1
interface: 0.0.0.0
do-ip4: yes
do-ip6: no
do-udp: yes
do-tcp: yes
auto-trust-anchor-file: "/var/lib/unbound/root.key"
prefetch: yes
qname-minimisation: yes
harden-dnssec-stripped: yes
val-clean-additional: yes
domain-insecure: "mailu-mailserver.svc.cluster.local."
domain-insecure: "svc.cluster.local."
domain-insecure: "cluster.local."
cache-min-ttl: 120
cache-max-ttl: 86400
access-control: 0.0.0.0/0 allow
forward-zone:
name: "mailu-mailserver.svc.cluster.local."
forward-addr: 10.43.0.10
forward-no-cache: yes
forward-first: yes
forward-zone:
name: "svc.cluster.local."
forward-addr: 10.43.0.10
forward-no-cache: yes
forward-first: yes
forward-zone:
name: "cluster.local."
forward-addr: 10.43.0.10
forward-no-cache: yes
forward-first: yes
forward-zone:
name: "."
forward-addr: 9.9.9.9
forward-addr: 1.1.1.1

View File

@ -1,71 +0,0 @@
# services/mailu/vip-controller.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: vip-controller
namespace: mailu-mailserver
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: vip-controller-role
namespace: mailu-mailserver
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: vip-controller-binding
namespace: mailu-mailserver
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: vip-controller-role
subjects:
- kind: ServiceAccount
name: vip-controller
namespace: mailu-mailserver
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: vip-controller
namespace: mailu-mailserver
spec:
selector:
matchLabels:
app: vip-controller
template:
metadata:
labels:
app: vip-controller
spec:
serviceAccountName: vip-controller
hostNetwork: true
nodeSelector:
mailu.bstein.dev/vip: "true"
containers:
- name: vip-controller
image: lachlanevenson/k8s-kubectl:latest
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
args:
- |
set -e
while true; do
if ip addr show end0 | grep -q 'inet 192\.168\.22\.9/32'; then
NODE=$(hostname)
echo "VIP found on node ${NODE}."
kubectl patch deployment mailu-front -n mailu-mailserver --type='merge' \
-p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"${NODE}\"}}}}}"
else
echo "No VIP on node ${HOSTNAME}."
fi
sleep 60
done

View File

@ -1,186 +0,0 @@
{
"uid": "atlas-gpu",
"title": "Atlas GPU",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "piechart",
"title": "Namespace GPU Share",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "100 * ( ( (sum by (namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[$__range]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by (namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[$__range]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {
"mode": "palette-classic"
}
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "list",
"placement": "right"
},
"pieType": "pie",
"displayLabels": [],
"tooltip": {
"mode": "single"
},
"colorScheme": "interpolateSpectral",
"colorBy": "value",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
}
},
{
"id": 2,
"type": "timeseries",
"title": "GPU Util by Namespace",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}) by (namespace)",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 3,
"type": "timeseries",
"title": "GPU Util by Node",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "sum by (Hostname) (DCGM_FI_DEV_GPU_UTIL{pod!=\"\"})",
"refId": "A",
"legendFormat": "{{Hostname}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 4,
"type": "table",
"title": "Top Pods by GPU Util",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"targets": [
{
"expr": "topk(10, sum(DCGM_FI_DEV_GPU_UTIL{pod!=\"\"}) by (namespace,pod,Hostname))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"gpu"
]
}

View File

@ -1,668 +0,0 @@
{
"uid": "atlas-network",
"title": "Atlas Network",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Ingress Success Rate (5m)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": null
},
{
"color": "orange",
"value": 0.995
},
{
"color": "yellow",
"value": 0.999
},
{
"color": "green",
"value": 0.9995
}
]
},
"unit": "percentunit",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Error Budget Burn (1h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.0010000000000000009",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "orange",
"value": 2
},
{
"color": "red",
"value": 4
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Error Budget Burn (6h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.0010000000000000009",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "orange",
"value": 2
},
{
"color": "red",
"value": 4
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Edge P99 Latency (ms)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 200
},
{
"color": "orange",
"value": 350
},
{
"color": "red",
"value": 500
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "stat",
"title": "Ingress Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 6,
"type": "stat",
"title": "Egress Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 4
},
"targets": [
{
"expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 7,
"type": "stat",
"title": "Intra-Cluster Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 4
},
"targets": [
{
"expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 8,
"type": "timeseries",
"title": "Per-Node Throughput",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "avg by (node) ((sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0) + sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 9,
"type": "table",
"title": "Top Namespaces",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 16
},
"targets": [
{
"expr": "topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=\"\"}[5m]) + rate(container_network_receive_bytes_total{namespace!=\"\"}[5m])) by (namespace))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 10,
"type": "table",
"title": "Top Pods",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 16
},
"targets": [
{
"expr": "topk(10, sum(rate(container_network_transmit_bytes_total{pod!=\"\"}[5m]) + rate(container_network_receive_bytes_total{pod!=\"\"}[5m])) by (namespace,pod))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 11,
"type": "timeseries",
"title": "Traefik Routers (req/s)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 25
},
"targets": [
{
"expr": "topk(10, sum by (router) (rate(traefik_router_requests_total[5m])))",
"refId": "A",
"legendFormat": "{{router}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "req/s"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 12,
"type": "timeseries",
"title": "Traefik Entrypoints (req/s)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 25
},
"targets": [
{
"expr": "sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))",
"refId": "A",
"legendFormat": "{{entrypoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "req/s"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"network"
]
}

View File

@ -1,602 +0,0 @@
{
"uid": "atlas-nodes",
"title": "Atlas Nodes",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Worker Nodes Ready",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto",
"valueSuffix": "/18"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Control Plane Ready",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 0
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto",
"valueSuffix": "/3"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Control Plane Workloads",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 0
},
"targets": [
{
"expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 9,
"type": "stat",
"title": "API Server 5xx rate",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.05
},
{
"color": "orange",
"value": 0.2
},
{
"color": "red",
"value": 0.5
}
]
},
"unit": "req/s",
"custom": {
"displayMode": "auto"
},
"decimals": 3
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 10,
"type": "stat",
"title": "API Server P99 latency",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 4
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 250
},
{
"color": "orange",
"value": 400
},
{
"color": "red",
"value": 600
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 11,
"type": "stat",
"title": "etcd P99 latency",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 4
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 100
},
{
"color": "red",
"value": 200
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "timeseries",
"title": "Node CPU",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": [
"last"
]
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 5,
"type": "timeseries",
"title": "Node RAM",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 17
},
"targets": [
{
"expr": "avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": [
"last"
]
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 6,
"type": "timeseries",
"title": "Control Plane (incl. titan-db) CPU",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 26
},
"targets": [
{
"expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 7,
"type": "timeseries",
"title": "Control Plane (incl. titan-db) RAM",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 26
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 8,
"type": "timeseries",
"title": "Root Filesystem Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 35
},
"targets": [
{
"expr": "avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"nodes"
]
}

File diff suppressed because it is too large Load Diff

View File

@ -1,616 +0,0 @@
{
"uid": "atlas-pods",
"title": "Atlas Pods",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Problem Pods",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "CrashLoop / ImagePull",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Stuck Terminating (>10m)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Control Plane Workloads",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "table",
"title": "Pods Not Running",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod) group_left(phase) max by (namespace,pod,phase) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 6,
"type": "table",
"title": "CrashLoop / ImagePull",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 14
},
"targets": [
{
"expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod,container) group_left(reason) max by (namespace,pod,container,reason) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 7,
"type": "table",
"title": "Terminating >10m",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 24
},
"targets": [
{
"expr": "(((time() - kube_pod_deletion_timestamp{pod!=\"\"}) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)) * on(namespace,pod) group_left(node) kube_pod_info)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "filterByValue",
"options": {
"match": "Value",
"operator": "gt",
"value": 600
}
}
]
},
{
"id": 8,
"type": "piechart",
"title": "Node Pod Share",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 34
},
"targets": [
{
"expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {
"mode": "palette-classic"
}
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "list",
"placement": "right"
},
"pieType": "pie",
"displayLabels": [],
"tooltip": {
"mode": "single"
},
"colorScheme": "interpolateSpectral",
"colorBy": "value",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
}
},
{
"id": 9,
"type": "bargauge",
"title": "Top Nodes by Pod Count",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 34
},
"targets": [
{
"expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
"refId": "A",
"legendFormat": "{{node}}",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "none",
"min": 0,
"max": null,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 100
}
]
},
"decimals": 0
},
"overrides": []
},
"options": {
"displayMode": "gradient",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
},
"transformations": [
{
"id": "sortBy",
"options": {
"fields": [
"Value"
],
"order": "desc"
}
},
{
"id": "limit",
"options": {
"limit": 12
}
}
]
},
{
"id": 10,
"type": "table",
"title": "Namespace Plurality by Node v27",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 42
},
"targets": [
{
"expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))",
"refId": "A",
"instant": true,
"format": "table"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"custom": {
"filterable": false
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false,
"showColumnFilters": false,
"footer": {
"show": false,
"fields": "",
"calcs": []
}
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true
}
}
},
{
"id": "filterByValue",
"options": {
"match": "Value",
"operator": "gt",
"value": 0
}
},
{
"id": "sortBy",
"options": {
"fields": [
"Value"
],
"order": "desc"
}
},
{
"id": "groupBy",
"options": {
"fields": {
"namespace": {
"aggregations": [
{
"field": "Value",
"operation": "max"
},
{
"field": "node",
"operation": "first"
}
]
}
},
"rowBy": [
"namespace"
]
}
}
]
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"pods"
]
}

View File

@ -1,427 +0,0 @@
{
"uid": "atlas-storage",
"title": "Atlas Storage",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Astreae Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 91.5
}
]
},
"unit": "percent",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Asteria Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 91.5
}
]
},
"unit": "percent",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Astreae Free",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "decbytes",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Asteria Free",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "decbytes",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "timeseries",
"title": "Astreae Per-Node Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 5
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
},
{
"id": 6,
"type": "timeseries",
"title": "Asteria Per-Node Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 5
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
},
{
"id": 7,
"type": "timeseries",
"title": "Astreae Usage History",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 14
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "90d"
},
{
"id": 8,
"type": "timeseries",
"title": "Asteria Usage History",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 14
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "90d"
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"storage"
]
}

View File

@ -1,80 +0,0 @@
# services/monitoring/dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: dcgm-exporter
namespace: monitoring
labels:
app: dcgm-exporter
spec:
selector:
matchLabels:
app: dcgm-exporter
updateStrategy:
rollingUpdate:
maxUnavailable: 2
template:
metadata:
labels:
app: dcgm-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9400"
spec:
serviceAccountName: default
runtimeClassName: nvidia
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- titan-20
- titan-21
- titan-22
- titan-24
tolerations:
- operator: Exists
containers:
- name: dcgm-exporter
image: registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
imagePullPolicy: Always
ports:
- name: metrics
containerPort: 9400
env:
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
securityContext:
privileged: true
resources:
requests:
cpu: 50m
memory: 64Mi
volumeMounts:
- name: pod-resources
mountPath: /var/lib/kubelet/pod-resources
imagePullSecrets:
- name: zot-regcred
volumes:
- name: pod-resources
hostPath:
path: /var/lib/kubelet/pod-resources
type: Directory
---
apiVersion: v1
kind: Service
metadata:
name: dcgm-exporter
namespace: monitoring
labels:
app: dcgm-exporter
spec:
selector:
app: dcgm-exporter
ports:
- name: metrics
port: 9400
targetPort: metrics

View File

@ -1,195 +0,0 @@
# services/monitoring/grafana-dashboard-gpu.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-gpu
labels:
grafana_dashboard: "1"
data:
atlas-gpu.json: |
{
"uid": "atlas-gpu",
"title": "Atlas GPU",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "piechart",
"title": "Namespace GPU Share",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "100 * ( ( (sum by (namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[$__range]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ) / clamp_min(sum( ( (sum by (namespace) (max_over_time(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}[$__range]))) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) ) and on(namespace) ( (topk(10, ( sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) ) + (sum(container_memory_working_set_bytes{namespace!=\"\",pod!=\"\",container!=\"\"}) by (namespace) / 1e9) + ((sum((kube_pod_container_resource_requests{namespace!=\"\",resource=\"nvidia.com/gpu\"} or kube_pod_container_resource_limits{namespace!=\"\",resource=\"nvidia.com/gpu\"})) by (namespace)) or on(namespace) (sum(rate(container_cpu_usage_seconds_total{namespace!=\"\",pod!=\"\",container!=\"\"}[5m])) by (namespace) * 0) * 100)) >= bool 0) ) ), 1)",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {
"mode": "palette-classic"
}
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "list",
"placement": "right"
},
"pieType": "pie",
"displayLabels": [],
"tooltip": {
"mode": "single"
},
"colorScheme": "interpolateSpectral",
"colorBy": "value",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
}
},
{
"id": 2,
"type": "timeseries",
"title": "GPU Util by Namespace",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(DCGM_FI_DEV_GPU_UTIL{namespace!=\"\",pod!=\"\"}) by (namespace)",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 3,
"type": "timeseries",
"title": "GPU Util by Node",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "sum by (Hostname) (DCGM_FI_DEV_GPU_UTIL{pod!=\"\"})",
"refId": "A",
"legendFormat": "{{Hostname}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 4,
"type": "table",
"title": "Top Pods by GPU Util",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"targets": [
{
"expr": "topk(10, sum(DCGM_FI_DEV_GPU_UTIL{pod!=\"\"}) by (namespace,pod,Hostname))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"gpu"
]
}

View File

@ -1,677 +0,0 @@
# services/monitoring/grafana-dashboard-network.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-network
labels:
grafana_dashboard: "1"
data:
atlas-network.json: |
{
"uid": "atlas-network",
"title": "Atlas Network",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Ingress Success Rate (5m)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "(sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[5m]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[5m])), 1)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": null
},
{
"color": "orange",
"value": 0.995
},
{
"color": "yellow",
"value": 0.999
},
{
"color": "green",
"value": 0.9995
}
]
},
"unit": "percentunit",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Error Budget Burn (1h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[1h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[1h])), 1))) / 0.0010000000000000009",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "orange",
"value": 2
},
{
"color": "red",
"value": 4
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Error Budget Burn (6h)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "(1 - ((sum(rate(traefik_entrypoint_requests_total{code!~\"5..\"}[6h]))) / clamp_min(sum(rate(traefik_entrypoint_requests_total[6h])), 1))) / 0.0010000000000000009",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "orange",
"value": 2
},
{
"color": "red",
"value": 4
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
},
"decimals": 2
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Edge P99 Latency (ms)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 200
},
{
"color": "orange",
"value": 350
},
{
"color": "red",
"value": 500
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "stat",
"title": "Ingress Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 6,
"type": "stat",
"title": "Egress Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 4
},
"targets": [
{
"expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 7,
"type": "stat",
"title": "Intra-Cluster Traffic",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 4
},
"targets": [
{
"expr": "sum(rate(container_network_receive_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m]) + rate(container_network_transmit_bytes_total{namespace!=\"traefik\",pod!=\"\"}[5m])) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "Bps",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 8,
"type": "timeseries",
"title": "Per-Node Throughput",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "avg by (node) ((sum(rate(node_network_transmit_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0) + sum(rate(node_network_receive_bytes_total{device!~\"lo|cni.*|veth.*|flannel.*|docker.*|virbr.*|vxlan.*|wg.*\"}[5m])) or on() vector(0)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 9,
"type": "table",
"title": "Top Namespaces",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 16
},
"targets": [
{
"expr": "topk(10, sum(rate(container_network_transmit_bytes_total{namespace!=\"\"}[5m]) + rate(container_network_receive_bytes_total{namespace!=\"\"}[5m])) by (namespace))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 10,
"type": "table",
"title": "Top Pods",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 16
},
"targets": [
{
"expr": "topk(10, sum(rate(container_network_transmit_bytes_total{pod!=\"\"}[5m]) + rate(container_network_receive_bytes_total{pod!=\"\"}[5m])) by (namespace,pod))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 11,
"type": "timeseries",
"title": "Traefik Routers (req/s)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 25
},
"targets": [
{
"expr": "topk(10, sum by (router) (rate(traefik_router_requests_total[5m])))",
"refId": "A",
"legendFormat": "{{router}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "req/s"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 12,
"type": "timeseries",
"title": "Traefik Entrypoints (req/s)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 25
},
"targets": [
{
"expr": "sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))",
"refId": "A",
"legendFormat": "{{entrypoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "req/s"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"network"
]
}

View File

@ -1,611 +0,0 @@
# services/monitoring/grafana-dashboard-nodes.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-nodes
labels:
grafana_dashboard: "1"
data:
atlas-nodes.json: |
{
"uid": "atlas-nodes",
"title": "Atlas Nodes",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Worker Nodes Ready",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-04|titan-05|titan-06|titan-07|titan-08|titan-09|titan-10|titan-11|titan-12|titan-13|titan-14|titan-15|titan-16|titan-17|titan-18|titan-19|titan-22|titan-24\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto",
"valueSuffix": "/18"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Control Plane Ready",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 0
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"titan-0a|titan-0b|titan-0c\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto",
"valueSuffix": "/3"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Control Plane Workloads",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 0
},
"targets": [
{
"expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 9,
"type": "stat",
"title": "API Server 5xx rate",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.05
},
{
"color": "orange",
"value": 0.2
},
{
"color": "red",
"value": 0.5
}
]
},
"unit": "req/s",
"custom": {
"displayMode": "auto"
},
"decimals": 3
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 10,
"type": "stat",
"title": "API Server P99 latency",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 4
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 250
},
{
"color": "orange",
"value": 400
},
{
"color": "red",
"value": 600
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 11,
"type": "stat",
"title": "etcd P99 latency",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 4
},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m]))) * 1000",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 100
},
{
"color": "red",
"value": 200
}
]
},
"unit": "ms",
"custom": {
"displayMode": "auto"
},
"decimals": 1
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "timeseries",
"title": "Node CPU",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": [
"last"
]
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 5,
"type": "timeseries",
"title": "Node RAM",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 17
},
"targets": [
{
"expr": "avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": [
"last"
]
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 6,
"type": "timeseries",
"title": "Control Plane (incl. titan-db) CPU",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 26
},
"targets": [
{
"expr": "(avg by (node) (((1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 7,
"type": "timeseries",
"title": "Control Plane (incl. titan-db) RAM",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 26
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-0a|titan-0b|titan-0c|titan-db\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
}
},
{
"id": 8,
"type": "timeseries",
"title": "Root Filesystem Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 24,
"x": 0,
"y": 35
},
"targets": [
{
"expr": "avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"nodes"
]
}

File diff suppressed because it is too large Load Diff

View File

@ -1,625 +0,0 @@
# services/monitoring/grafana-dashboard-pods.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-pods
labels:
grafana_dashboard: "1"
data:
atlas-pods.json: |
{
"uid": "atlas-pods",
"title": "Atlas Pods",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Problem Pods",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "CrashLoop / ImagePull",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Stuck Terminating (>10m)",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(max by (namespace,pod) (((time() - kube_pod_deletion_timestamp{pod!=\"\"}) > bool 600) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0))) or on() vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Control Plane Workloads",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 4,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "sum(kube_pod_info{node=~\"titan-0a|titan-0b|titan-0c\",namespace!~\"kube-system|kube-public|kube-node-lease|longhorn-system|monitoring|flux-system\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "none",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "table",
"title": "Pods Not Running",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 4
},
"targets": [
{
"expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod) group_left(phase) max by (namespace,pod,phase) (kube_pod_status_phase{phase!~\"Running|Succeeded\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 6,
"type": "table",
"title": "CrashLoop / ImagePull",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 14
},
"targets": [
{
"expr": "(time() - kube_pod_created{pod!=\"\"}) * on(namespace,pod) group_left(node) kube_pod_info * on(namespace,pod,container) group_left(reason) max by (namespace,pod,container,reason) (kube_pod_container_status_waiting_reason{reason=~\"CrashLoopBackOff|ImagePullBackOff\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
}
]
},
{
"id": 7,
"type": "table",
"title": "Terminating >10m",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 10,
"w": 24,
"x": 0,
"y": 24
},
"targets": [
{
"expr": "(((time() - kube_pod_deletion_timestamp{pod!=\"\"}) and on(namespace,pod) (kube_pod_deletion_timestamp{pod!=\"\"} > bool 0)) * on(namespace,pod) group_left(node) kube_pod_info)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"filterable": true
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "filterByValue",
"options": {
"match": "Value",
"operator": "gt",
"value": 600
}
}
]
},
{
"id": 8,
"type": "piechart",
"title": "Node Pod Share",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 34
},
"targets": [
{
"expr": "(sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node) / clamp_min(sum(kube_pod_info{pod!=\"\" , node!=\"\"}), 1)) * 100",
"refId": "A",
"legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {
"mode": "palette-classic"
}
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "list",
"placement": "right"
},
"pieType": "pie",
"displayLabels": [],
"tooltip": {
"mode": "single"
},
"colorScheme": "interpolateSpectral",
"colorBy": "value",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
}
},
{
"id": 9,
"type": "bargauge",
"title": "Top Nodes by Pod Count",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 34
},
"targets": [
{
"expr": "topk(12, sum(kube_pod_info{pod!=\"\" , node!=\"\"}) by (node))",
"refId": "A",
"legendFormat": "{{node}}",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "none",
"min": 0,
"max": null,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 100
}
]
},
"decimals": 0
},
"overrides": []
},
"options": {
"displayMode": "gradient",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
}
},
"transformations": [
{
"id": "sortBy",
"options": {
"fields": [
"Value"
],
"order": "desc"
}
},
{
"id": "limit",
"options": {
"limit": 12
}
}
]
},
{
"id": 10,
"type": "table",
"title": "Namespace Plurality by Node v27",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 42
},
"targets": [
{
"expr": "(sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) * on(namespace,node) group_left() ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)) == bool on(namespace) group_left() (max by (namespace) ((sum by (namespace,node) (kube_pod_info{pod!=\"\" , node!=\"\"}) / on(namespace) group_left() clamp_min(sum by (namespace) (kube_pod_info{pod!=\"\"}), 1) * 100) + on(node) group_left() ((sum by (node) (kube_node_info{node=\"titan-0a\"}) * 0 + 0.001) or (sum by (node) (kube_node_info{node=\"titan-0b\"}) * 0 + 0.002) or (sum by (node) (kube_node_info{node=\"titan-0c\"}) * 0 + 0.003) or (sum by (node) (kube_node_info{node=\"titan-db\"}) * 0 + 0.004) or (sum by (node) (kube_node_info{node=\"titan-04\"}) * 0 + 0.005) or (sum by (node) (kube_node_info{node=\"titan-05\"}) * 0 + 0.006) or (sum by (node) (kube_node_info{node=\"titan-06\"}) * 0 + 0.007) or (sum by (node) (kube_node_info{node=\"titan-07\"}) * 0 + 0.008) or (sum by (node) (kube_node_info{node=\"titan-08\"}) * 0 + 0.009000000000000001) or (sum by (node) (kube_node_info{node=\"titan-09\"}) * 0 + 0.01) or (sum by (node) (kube_node_info{node=\"titan-10\"}) * 0 + 0.011) or (sum by (node) (kube_node_info{node=\"titan-11\"}) * 0 + 0.012) or (sum by (node) (kube_node_info{node=\"titan-12\"}) * 0 + 0.013000000000000001) or (sum by (node) (kube_node_info{node=\"titan-13\"}) * 0 + 0.014) or (sum by (node) (kube_node_info{node=\"titan-14\"}) * 0 + 0.015) or (sum by (node) (kube_node_info{node=\"titan-15\"}) * 0 + 0.016) or (sum by (node) (kube_node_info{node=\"titan-16\"}) * 0 + 0.017) or (sum by (node) (kube_node_info{node=\"titan-17\"}) * 0 + 0.018000000000000002) or (sum by (node) (kube_node_info{node=\"titan-18\"}) * 0 + 0.019) or (sum by (node) (kube_node_info{node=\"titan-19\"}) * 0 + 0.02) or (sum by (node) (kube_node_info{node=\"titan-22\"}) * 0 + 0.021) or (sum by (node) (kube_node_info{node=\"titan-24\"}) * 0 + 0.022)))))",
"refId": "A",
"instant": true,
"format": "table"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"custom": {
"filterable": false
}
},
"overrides": []
},
"options": {
"showHeader": true,
"columnFilters": false,
"showColumnFilters": false,
"footer": {
"show": false,
"fields": "",
"calcs": []
}
},
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true
}
}
},
{
"id": "filterByValue",
"options": {
"match": "Value",
"operator": "gt",
"value": 0
}
},
{
"id": "sortBy",
"options": {
"fields": [
"Value"
],
"order": "desc"
}
},
{
"id": "groupBy",
"options": {
"fields": {
"namespace": {
"aggregations": [
{
"field": "Value",
"operation": "max"
},
{
"field": "node",
"operation": "first"
}
]
}
},
"rowBy": [
"namespace"
]
}
}
]
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"pods"
]
}

View File

@ -1,436 +0,0 @@
# services/monitoring/grafana-dashboard-storage.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-storage
labels:
grafana_dashboard: "1"
data:
atlas-storage.json: |
{
"uid": "atlas-storage",
"title": "Atlas Storage",
"folderUid": "atlas-internal",
"editable": true,
"panels": [
{
"id": 1,
"type": "stat",
"title": "Astreae Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 91.5
}
]
},
"unit": "percent",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 2,
"type": "stat",
"title": "Asteria Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 6,
"y": 0
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "orange",
"value": 75
},
{
"color": "red",
"value": 91.5
}
]
},
"unit": "percent",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 3,
"type": "stat",
"title": "Astreae Free",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "decbytes",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 4,
"type": "stat",
"title": "Asteria Free",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 5,
"w": 6,
"x": 18,
"y": 0
},
"targets": [
{
"expr": "sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(115, 115, 115, 1)",
"value": null
},
{
"color": "green",
"value": 1
}
]
},
"unit": "decbytes",
"custom": {
"displayMode": "auto"
}
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "center",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value"
}
},
{
"id": 5,
"type": "timeseries",
"title": "Astreae Per-Node Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 5
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
},
{
"id": 6,
"type": "timeseries",
"title": "Asteria Per-Node Usage",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 5
},
"targets": [
{
"expr": "(avg by (node) ((avg by (instance) ((1 - (node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"} / node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"})) * 100)) * on(instance) group_left(node) label_replace(node_uname_info{nodename!=\"\"}, \"node\", \"$1\", \"nodename\", \"(.*)\"))) * on(node) group_left() label_replace(node_uname_info{nodename=~\"titan-1[2-9]|titan-2[24]\"}, \"node\", \"$1\", \"nodename\", \"(.*)\")",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "30d"
},
{
"id": 7,
"type": "timeseries",
"title": "Astreae Usage History",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 14
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/astreae\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "90d"
},
{
"id": 8,
"type": "timeseries",
"title": "Asteria Usage History",
"datasource": {
"type": "prometheus",
"uid": "atlas-vm"
},
"gridPos": {
"h": 9,
"w": 12,
"x": 12,
"y": 14
},
"targets": [
{
"expr": "100 - (sum(node_filesystem_avail_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/mnt/asteria\",fstype!~\"tmpfs|overlay\"}) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent"
},
"overrides": []
},
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"timeFrom": "90d"
}
],
"time": {
"from": "now-12h",
"to": "now"
},
"annotations": {
"list": []
},
"schemaVersion": 39,
"style": "dark",
"tags": [
"atlas",
"storage"
]
}

View File

@ -1,35 +0,0 @@
# services/monitoring/grafana-folders.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-folders
labels:
app.kubernetes.io/name: grafana
app.kubernetes.io/component: folders
data:
folders.yaml: |
apiVersion: 1
folders:
- uid: overview
title: Overview
permissions:
- role: Viewer
permission: View
- role: Editor
permission: Edit
- role: Admin
permission: Admin
- uid: atlas-internal
title: Atlas Internal
permissions:
- role: Editor
permission: View
- role: Admin
permission: Admin
- uid: oceanus-internal
title: Oceanus Internal
permissions:
- role: Editor
permission: View
- role: Admin
permission: Admin

View File

@ -65,13 +65,14 @@ spec:
namespace: flux-system
values:
server:
# keep 1 year; supports "d", "y"
# keep ~3 months; change as you like (supports "d", "y")
extraArgs:
retentionPeriod: "1y" # VM flag -retentionPeriod=1y. :contentReference[oaicite:11]{index=11}
retentionPeriod: "90d" # VM flag -retentionPeriod=90d. :contentReference[oaicite:11]{index=11}
persistentVolume:
enabled: true
size: 250Gi
size: 100Gi # adjust; uses default StorageClass (Longhorn)
# storageClassName: "" # set if you want a specific class
# Enable built-in Kubernetes scraping
scrape:
@ -186,15 +187,6 @@ spec:
- targets: ["longhorn-backend.longhorn-system.svc:9500"]
metrics_path: /metrics
# --- titan-db node_exporter (external control-plane DB host) ---
- job_name: "titan-db"
static_configs:
- targets: ["192.168.22.10:9100"]
relabel_configs:
- source_labels: [__address__]
target_label: instance
replacement: titan-db
# --- cert-manager (pods expose on 9402) ---
- job_name: "cert-manager"
kubernetes_sd_configs: [{ role: pod }]
@ -218,195 +210,3 @@ spec:
- action: keep
source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_part_of]
regex: flux-system;flux
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: grafana
namespace: monitoring
spec:
interval: 15m
chart:
spec:
chart: grafana
version: "~8.5.0"
sourceRef:
kind: HelmRepository
name: grafana
namespace: flux-system
values:
admin:
existingSecret: grafana-admin
userKey: admin-user
passwordKey: admin-password
persistence:
enabled: true
size: 20Gi
storageClassName: astreae
service:
type: ClusterIP
env:
GF_AUTH_ANONYMOUS_ENABLED: "false"
GF_SECURITY_ALLOW_EMBEDDING: "true"
GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
GF_AUTH_GENERIC_OAUTH_NAME: "Keycloak"
GF_AUTH_GENERIC_OAUTH_ALLOW_SIGN_UP: "true"
GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email groups"
GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://sso.bstein.dev/realms/atlas/protocol/openid-connect/auth"
GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://sso.bstein.dev/realms/atlas/protocol/openid-connect/token"
GF_AUTH_GENERIC_OAUTH_API_URL: "https://sso.bstein.dev/realms/atlas/protocol/openid-connect/userinfo"
GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(groups, 'admin') && 'Admin' || 'Viewer'"
GF_AUTH_GENERIC_OAUTH_TLS_SKIP_VERIFY_INSECURE: "false"
GF_AUTH_SIGNOUT_REDIRECT_URL: "https://sso.bstein.dev/realms/atlas/protocol/openid-connect/logout?redirect_uri=https://metrics.bstein.dev/"
envValueFrom:
GF_AUTH_GENERIC_OAUTH_CLIENT_ID:
secretKeyRef:
name: grafana-oidc
key: client_id
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET:
secretKeyRef:
name: grafana-oidc
key: client_secret
grafana.ini:
server:
domain: metrics.bstein.dev
root_url: https://metrics.bstein.dev/
dashboards:
default_home_dashboard_path: /var/lib/grafana/dashboards/overview/atlas-overview.json
auth.anonymous:
hide_version: true
users:
default_theme: dark
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt
hosts:
- metrics.bstein.dev
path: /
tls:
- secretName: grafana-metrics-tls
hosts:
- metrics.bstein.dev
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: VictoriaMetrics
type: prometheus
access: proxy
url: http://victoria-metrics-single-server:8428
isDefault: true
jsonData:
timeInterval: "15s"
uid: atlas-vm
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: overview
orgId: 1
folder: Overview
type: file
disableDeletion: false
editable: false
options:
path: /var/lib/grafana/dashboards/overview
- name: pods
orgId: 1
folder: Atlas Internal
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/pods
- name: nodes
orgId: 1
folder: Atlas Internal
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/nodes
- name: storage
orgId: 1
folder: Atlas Internal
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/storage
- name: gpu
orgId: 1
folder: Atlas Internal
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/gpu
- name: network
orgId: 1
folder: Atlas Internal
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/network
dashboardsConfigMaps:
overview: grafana-dashboard-overview
pods: grafana-dashboard-pods
nodes: grafana-dashboard-nodes
storage: grafana-dashboard-storage
gpu: grafana-dashboard-gpu
network: grafana-dashboard-network
extraConfigmapMounts:
- name: grafana-folders
mountPath: /etc/grafana/provisioning/folders
configMap: grafana-folders
readOnly: true
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: alertmanager
namespace: monitoring
spec:
interval: 15m
chart:
spec:
chart: alertmanager
version: "~1.9.0"
sourceRef:
kind: HelmRepository
name: prometheus
namespace: flux-system
values:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt
hosts:
- host: alerts.bstein.dev
paths:
- path: /
pathType: Prefix
tls:
- secretName: alerts-bstein-dev-tls
hosts:
- alerts.bstein.dev
config:
global:
resolve_timeout: 5m
route:
receiver: default
group_wait: 30s
group_interval: 5m
repeat_interval: 2h
receivers:
- name: default

View File

@ -5,12 +5,4 @@ namespace: monitoring
resources:
- namespace.yaml
- rbac.yaml
- grafana-dashboard-overview.yaml
- grafana-dashboard-pods.yaml
- grafana-dashboard-nodes.yaml
- grafana-dashboard-storage.yaml
- grafana-dashboard-network.yaml
- grafana-dashboard-gpu.yaml
- dcgm-exporter.yaml
- grafana-folders.yaml
- helmrelease.yaml

View File

@ -1,48 +0,0 @@
# services/nextcloud/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: nextcloud-config
namespace: nextcloud
data:
extra.config.php: |
<?php
$CONFIG = array (
'trusted_domains' =>
array (
0 => 'cloud.bstein.dev',
),
'overwritehost' => 'cloud.bstein.dev',
'overwriteprotocol' => 'https',
'overwrite.cli.url' => 'https://cloud.bstein.dev',
'default_phone_region' => 'US',
'mail_smtpmode' => 'smtp',
'mail_sendmailmode' => 'smtp',
'mail_smtphost' => 'mail.bstein.dev',
'mail_smtpport' => '587',
'mail_smtpsecure' => 'tls',
'mail_smtpauth' => true,
'mail_smtpauthtype' => 'LOGIN',
'mail_domain' => 'bstein.dev',
'mail_from_address' => 'no-reply',
'oidc_login_provider_url' => 'https://sso.bstein.dev/realms/atlas',
'oidc_login_client_id' => getenv('OIDC_CLIENT_ID'),
'oidc_login_client_secret' => getenv('OIDC_CLIENT_SECRET'),
'oidc_login_auto_redirect' => false,
'oidc_login_end_session_redirect' => true,
'oidc_login_button_text' => 'Login with Keycloak',
'oidc_login_hide_password_form' => false,
'oidc_login_attributes' =>
array (
'id' => 'preferred_username',
'mail' => 'email',
'name' => 'name',
),
'oidc_login_scope' => 'openid profile email',
'oidc_login_unique_id' => 'preferred_username',
'oidc_login_use_pkce' => true,
'oidc_login_disable_registration' => false,
'oidc_login_create_groups' => false,
# External storage for user data should be configured to Asteria via the External Storage app (admin UI),
# keeping the astreae PVC for app internals only.
);

View File

@ -1,32 +0,0 @@
# services/nextcloud/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: nextcloud-cron
namespace: nextcloud
spec:
schedule: "*/5 * * * *"
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
securityContext:
runAsUser: 33
runAsGroup: 33
fsGroup: 33
restartPolicy: OnFailure
containers:
- name: nextcloud-cron
image: nextcloud:29-apache
imagePullPolicy: IfNotPresent
command: ["/bin/sh", "-c"]
args:
- "cd /var/www/html && php -f cron.php"
volumeMounts:
- name: nextcloud-data
mountPath: /var/www/html
volumes:
- name: nextcloud-data
persistentVolumeClaim:
claimName: nextcloud-data

View File

@ -1,143 +0,0 @@
# services/nextcloud/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nextcloud
namespace: nextcloud
labels:
app: nextcloud
spec:
replicas: 1
selector:
matchLabels:
app: nextcloud
template:
metadata:
labels:
app: nextcloud
spec:
nodeSelector:
hardware: rpi5
securityContext:
fsGroup: 33
runAsUser: 33
runAsGroup: 33
initContainers:
- name: fix-perms
image: alpine:3.20
command: ["/bin/sh", "-c"]
args:
- |
chown -R 33:33 /var/www/html/config || true
chown -R 33:33 /var/www/html/data || true
securityContext:
runAsUser: 0
runAsGroup: 0
volumeMounts:
- name: nextcloud-data
mountPath: /var/www/html
- name: nextcloud-config
mountPath: /var/www/html/config/extra.config.php
subPath: extra.config.php
containers:
- name: nextcloud
image: nextcloud:29-apache
imagePullPolicy: IfNotPresent
env:
# DB (external secret required: nextcloud-db with keys username,password,database)
- name: POSTGRES_HOST
value: postgres-service.postgres.svc.cluster.local
- name: POSTGRES_DB
valueFrom:
secretKeyRef:
name: nextcloud-db
key: database
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: nextcloud-db
key: db-username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: nextcloud-db
key: db-password
# Admin bootstrap (external secret: nextcloud-admin with keys admin-user, admin-password)
- name: NEXTCLOUD_ADMIN_USER
valueFrom:
secretKeyRef:
name: nextcloud-admin
key: admin-user
- name: NEXTCLOUD_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: nextcloud-admin
key: admin-password
- name: NEXTCLOUD_TRUSTED_DOMAINS
value: cloud.bstein.dev
- name: OVERWRITEHOST
value: cloud.bstein.dev
- name: OVERWRITEPROTOCOL
value: https
- name: OVERWRITECLIURL
value: https://cloud.bstein.dev
# SMTP (external secret: nextcloud-smtp with keys username, password)
- name: SMTP_HOST
value: mail.bstein.dev
- name: SMTP_PORT
value: "587"
- name: SMTP_SECURE
value: tls
- name: SMTP_NAME
valueFrom:
secretKeyRef:
name: nextcloud-smtp
key: smtp-username
- name: SMTP_PASSWORD
valueFrom:
secretKeyRef:
name: nextcloud-smtp
key: smtp-password
- name: MAIL_FROM_ADDRESS
value: no-reply
- name: MAIL_DOMAIN
value: bstein.dev
# OIDC (external secret: nextcloud-oidc with keys client-id, client-secret)
- name: OIDC_CLIENT_ID
valueFrom:
secretKeyRef:
name: nextcloud-oidc
key: client-id
- name: OIDC_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: nextcloud-oidc
key: client-secret
- name: NEXTCLOUD_UPDATE
value: "1"
- name: APP_INSTALL
value: "mail,oidc_login,external"
ports:
- containerPort: 80
name: http
volumeMounts:
- name: nextcloud-data
mountPath: /var/www/html
- name: nextcloud-config
mountPath: /var/www/html/config/extra.config.php
subPath: extra.config.php
resources:
requests:
cpu: 250m
memory: 1Gi
limits:
cpu: 1
memory: 3Gi
volumes:
- name: nextcloud-data
persistentVolumeClaim:
claimName: nextcloud-data
- name: nextcloud-config
configMap:
name: nextcloud-config
defaultMode: 0444

View File

@ -1,25 +0,0 @@
# services/nextcloud/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: nextcloud
namespace: nextcloud
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
tls:
- hosts:
- cloud.bstein.dev
secretName: nextcloud-tls
rules:
- host: cloud.bstein.dev
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: nextcloud
port:
number: 80

View File

@ -1,25 +0,0 @@
# services/nextcloud/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: nextcloud
resources:
- namespace.yaml
- configmap.yaml
- pvc.yaml
- deployment.yaml
- service.yaml
- ingress.yaml
- cronjob.yaml
- mail-sync-cronjob.yaml
- maintenance-cronjob.yaml
configMapGenerator:
- name: nextcloud-maintenance-script
files:
- maintenance.sh=../../scripts/nextcloud-maintenance.sh
options:
disableNameSuffixHash: true
- name: nextcloud-mail-sync-script
files:
- sync.sh=../../scripts/nextcloud-mail-sync.sh
options:
disableNameSuffixHash: true

View File

@ -1,58 +0,0 @@
# services/nextcloud/mail-sync-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: nextcloud-mail-sync
namespace: nextcloud
spec:
schedule: "0 5 * * *"
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
securityContext:
runAsUser: 0
runAsGroup: 0
containers:
- name: mail-sync
image: nextcloud:29-apache
imagePullPolicy: IfNotPresent
command: ["/bin/bash", "/sync/sync.sh"]
env:
- name: KC_BASE
value: https://sso.bstein.dev
- name: KC_REALM
value: atlas
- name: KC_ADMIN_USER
valueFrom:
secretKeyRef:
name: nextcloud-keycloak-admin
key: username
- name: KC_ADMIN_PASS
valueFrom:
secretKeyRef:
name: nextcloud-keycloak-admin
key: password
volumeMounts:
- name: nextcloud-data
mountPath: /var/www/html
- name: sync-script
mountPath: /sync/sync.sh
subPath: sync.sh
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumes:
- name: nextcloud-data
persistentVolumeClaim:
claimName: nextcloud-data
- name: sync-script
configMap:
name: nextcloud-mail-sync-script
defaultMode: 0755

View File

@ -1,56 +0,0 @@
# services/nextcloud/maintenance-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: nextcloud-maintenance
namespace: nextcloud
spec:
schedule: "30 4 * * *"
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
securityContext:
runAsUser: 0
runAsGroup: 0
containers:
- name: maintenance
image: nextcloud:29-apache
imagePullPolicy: IfNotPresent
command: ["/bin/bash", "/maintenance/maintenance.sh"]
env:
- name: NC_URL
value: https://cloud.bstein.dev
- name: ADMIN_USER
valueFrom:
secretKeyRef:
name: nextcloud-admin
key: admin-user
- name: ADMIN_PASS
valueFrom:
secretKeyRef:
name: nextcloud-admin
key: admin-password
volumeMounts:
- name: nextcloud-data
mountPath: /var/www/html
- name: maintenance-script
mountPath: /maintenance/maintenance.sh
subPath: maintenance.sh
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumes:
- name: nextcloud-data
persistentVolumeClaim:
claimName: nextcloud-data
- name: maintenance-script
configMap:
name: nextcloud-maintenance-script
defaultMode: 0755

View File

@ -1,5 +0,0 @@
# services/nextcloud/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: nextcloud

View File

@ -1,13 +0,0 @@
# services/nextcloud/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nextcloud-data
namespace: nextcloud
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 200Gi
storageClassName: astreae

View File

@ -1,13 +0,0 @@
# services/nextcloud/service.yaml
apiVersion: v1
kind: Service
metadata:
name: nextcloud
namespace: nextcloud
spec:
selector:
app: nextcloud
ports:
- name: http
port: 80
targetPort: http

View File

@ -1,82 +0,0 @@
# services/oauth2-proxy/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: oauth2-proxy
namespace: sso
labels:
app: oauth2-proxy
spec:
replicas: 2
selector:
matchLabels:
app: oauth2-proxy
template:
metadata:
labels:
app: oauth2-proxy
spec:
nodeSelector:
node-role.kubernetes.io/worker: "true"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 90
preference:
matchExpressions:
- key: hardware
operator: In
values: ["rpi5","rpi4"]
containers:
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
imagePullPolicy: IfNotPresent
args:
- --provider=oidc
- --redirect-url=https://auth.bstein.dev/oauth2/callback
- --oidc-issuer-url=https://sso.bstein.dev/realms/atlas
- --scope=openid profile email groups
- --email-domain=*
- --set-xauthrequest=true
- --pass-access-token=true
- --set-authorization-header=true
- --cookie-secure=true
- --cookie-samesite=lax
- --cookie-refresh=20m
- --cookie-expire=168h
- --upstream=static://200
- --http-address=0.0.0.0:4180
- --skip-provider-button=true
- --skip-jwt-bearer-tokens=true
- --oidc-groups-claim=groups
env:
- name: OAUTH2_PROXY_CLIENT_ID
valueFrom:
secretKeyRef:
name: oauth2-proxy-oidc
key: client_id
- name: OAUTH2_PROXY_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-oidc
key: client_secret
- name: OAUTH2_PROXY_COOKIE_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-oidc
key: cookie_secret
ports:
- containerPort: 4180
name: http
readinessProbe:
httpGet:
path: /ping
port: 4180
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /ping
port: 4180
initialDelaySeconds: 20
periodSeconds: 20

View File

@ -1,25 +0,0 @@
# services/oauth2-proxy/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: oauth2-proxy
namespace: sso
annotations:
cert-manager.io/cluster-issuer: letsencrypt
traefik.ingress.kubernetes.io/router.middlewares: sso-oauth2-proxy-errors@kubernetescrd
spec:
ingressClassName: traefik
rules:
- host: auth.bstein.dev
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: oauth2-proxy
port:
number: 80
tls:
- hosts: [auth.bstein.dev]
secretName: auth-tls

View File

@ -1,10 +0,0 @@
# services/oauth2-proxy/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: sso
resources:
- deployment.yaml
- service.yaml
- ingress.yaml
- middleware.yaml
- middleware-errors.yaml

View File

@ -1,15 +0,0 @@
# services/oauth2-proxy/middleware-errors.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: oauth2-proxy-errors
namespace: sso
spec:
errors:
status:
- "401"
- "403"
service:
name: oauth2-proxy
port: 80
query: /oauth2/start?rd={url}

View File

@ -1,15 +0,0 @@
# services/oauth2-proxy/middleware.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: oauth2-proxy-forward-auth
namespace: sso
spec:
forwardAuth:
address: http://oauth2-proxy.sso.svc.cluster.local:4180/oauth2/auth
trustForwardHeader: true
authResponseHeaders:
- Authorization
- X-Auth-Request-Email
- X-Auth-Request-User
- X-Auth-Request-Groups

View File

@ -1,15 +0,0 @@
# services/oauth2-proxy/service.yaml
apiVersion: v1
kind: Service
metadata:
name: oauth2-proxy
namespace: sso
labels:
app: oauth2-proxy
spec:
selector:
app: oauth2-proxy
ports:
- name: http
port: 80
targetPort: 4180

View File

@ -8,7 +8,7 @@ metadata:
kubernetes.io/ingress.class: traefik
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
cert-manager.io/cluster-issuer: letsencrypt
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts: [ "pegasus.bstein.dev" ]

View File

@ -8,7 +8,7 @@ spec:
secretName: vault-server-tls
issuerRef:
kind: ClusterIssuer
name: letsencrypt
name: letsencrypt-prod
commonName: secret.bstein.dev
dnsNames:
- secret.bstein.dev

View File

@ -7,6 +7,7 @@ metadata:
annotations:
kubernetes.io/ingress.class: traefik
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.middlewares: vault-vault-basicauth@kubernetescrd
traefik.ingress.kubernetes.io/service.serversscheme: https
traefik.ingress.kubernetes.io/service.serversTransport: vault-vault-to-https@kubernetescrd
spec:

View File

@ -7,4 +7,5 @@ resources:
- helmrelease.yaml
- certificate.yaml
- ingress.yaml
- middleware.yaml
- serverstransport.yaml

View File

@ -0,0 +1,9 @@
# services/vault/middleware.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: vault-basicauth
namespace: vault
spec:
basicAuth:
secret: vault-basic-auth

View File

@ -5,7 +5,7 @@ metadata:
name: zot
namespace: zot
annotations:
cert-manager.io/cluster-issuer: letsencrypt
cert-manager.io/cluster-issuer: letsencrypt-prod
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
traefik.ingress.kubernetes.io/router.middlewares: zot-zot-resp-headers@kubernetescrd