Repository Guidelines

Local-only note: apply changes through Flux-tracked manifests, not by manual kubectl edits in-cluster—manual tweaks will be reverted by Flux.

Project Structure & Module Organization

  • infrastructure/: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout.
  • services/: workload manifests per app (services/gitea/, etc.) with kustomization.yaml plus one file per kind; keep diffs small and focused.
  • dockerfiles/ hosts bespoke images, while scripts/ stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands.

Build, Test, and Development Commands

  • kustomize build services/<app> (or kubectl kustomize ...) renders manifests exactly as Flux will.
  • kubectl apply --server-side --dry-run=server -k services/<app> validates manifests against the live API without persisting any changes (kubectl rejects --dry-run=client combined with --server-side).
  • flux reconcile kustomization <name> --namespace flux-system --with-source pulls the latest Git state after merges or hotfixes.
  • fish scripts/flux_hammer.fish --help explains the recovery tool; read it before running against production workloads.

Coding Style & Naming Conventions

  • YAML uses two-space indents; retain the leading path comment (e.g. # services/gitea/deployment.yaml) to speed code review.
  • Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names.
  • Order resources in kustomization.yaml namespace/config first, then storage, then workloads and networking for predictable diffs (see the sketch after this list).
  • Scripts start with #!/usr/bin/env fish or bash, stay executable, and follow snake_case names such as flux_hammer.fish.
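
A minimal sketch of these conventions, using a hypothetical services/example/ app (the directory and file names are illustrative only):

```yaml
# services/example/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: example
resources:
  # namespace/config first
  - namespace.yaml
  - configmap.yaml
  # then storage
  - pvc.yaml
  # then workloads and networking
  - deployment.yaml
  - service.yaml
  - ingress.yaml
```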

Testing Guidelines

  • Run kustomize build and the dry-run apply for every service you touch; capture failures before opening a PR.
  • flux diff kustomization <name> --path services/<app> previews reconciliations—link notable output when behavior shifts.
  • For Docker edits, run docker build -f dockerfiles/Dockerfile.monerod . (swapping in the file you changed) to verify the image still builds.

Commit & Pull Request Guidelines

  • Keep commit subjects short, present-tense, and optionally scoped (gpu(titan-24): add RuntimeClass); squash fixups before review.
  • Describe linked issues, affected services, and required operator steps (e.g. flux reconcile kustomization services-gitea) in the PR body.
  • Focus each PR on one kustomization or service and update infrastructure/flux-system when Flux must track new folders.
  • Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes.

Security & Configuration Tips

  • Never commit credentials; use Vault workflows (services/vault/) or SOPS-encrypted manifests wired through infrastructure/flux-system.
  • Node selectors and tolerations gate workloads to specific hardware via labels such as hardware: rpi4; confirm labels before scaling or renaming nodes (see the sketch after this list).
  • Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift.
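
A shape-only sketch combining the two points above; the workload name, taint key, registry, and digest are placeholders, and the hardware: rpi4 label must match what is actually on the nodes:

```yaml
# services/example/deployment.yaml (illustrative fragment, not a tracked workload)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: example
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      nodeSelector:
        hardware: rpi4                  # confirm the node label before relying on it
      tolerations:
        - key: example.com/dedicated    # placeholder taint key
          operator: Exists
          effect: NoSchedule
      containers:
        - name: app
          # pin by digest (placeholder below) or let Flux image automation track approved tags
          image: registry.example.org/app@sha256:0000000000000000000000000000000000000000000000000000000000000000
```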

Dashboard roadmap / context (2025-12-02)

  • Atlas dashboards are generated via scripts/dashboards_render_atlas.py --build, which writes JSON under services/monitoring/dashboards/ and ConfigMaps under services/monitoring/ (a shape-only sketch follows this list). Keep the Grafana manifests in sync by regenerating after edits.
  • Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new atlas-gpu internal dashboard holds the detailed GPU metrics that feed the overview share pie.
  • Old Grafana folders (Atlas Storage, Atlas SRE, Atlas Public, Atlas Nodes) should be removed in Grafana UI when convenient; only Atlas Overview and Atlas Internal should remain provisioned.
  • Future work: add a separate generator (e.g., dashboards_render_oceanus.py) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview.
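
For orientation, a shape-only sketch of what a generated dashboard ConfigMap might look like, assuming the common Grafana sidecar pattern with a grafana_dashboard label; the generator's actual output under services/monitoring/ is authoritative and all names here are placeholders:

```yaml
# services/monitoring/dashboard-atlas-overview.yaml (placeholder shape; regenerate rather than hand-edit)
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-atlas-overview
  namespace: monitoring
  labels:
    grafana_dashboard: "1"        # assumes sidecar-based provisioning
data:
  # the real JSON is written by scripts/dashboards_render_atlas.py --build
  atlas-overview.json: |
    { "title": "Atlas Overview", "panels": [] }
```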

Monitoring state (2025-12-03)

  • The dcgm-exporter DaemonSet pulls registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04 with the nvidia runtime and an imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady (see the sketch after this list).
  • Atlas Overview is the Grafana home dashboard (1h range, 1m refresh); the Overview folder UID is overview, the internal folder is atlas-internal (oceanus-internal is a stub).
  • Panels are standardized via the generator; the hottest row is compressed, worker/control rows are taller, and the root disk row is taller with a top-12 bar gauge showing labels. The GPU share pie uses a 1h avg_over_time so idle activity stays visible.
  • Internal dashboards are provisioned without Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed.
  • GPU share panel updated (feature/sso) to use max_over_time(…[$__range]), so longer ranges (e.g., 12h) keep recent activity visible. Flux tracking feature/sso.
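
A shape-only sketch of the dcgm-exporter pod spec described in the first bullet; everything except the image reference is an assumption, and the manifest tracked under services/monitoring/ is authoritative:

```yaml
# services/monitoring/dcgm-exporter-daemonset.yaml (sketch; check the tracked manifest for real values)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      runtimeClassName: nvidia          # "nvidia runtime" from the note above
      imagePullSecrets:
        - name: regcred                 # placeholder pull-secret name
      containers:
        - name: dcgm-exporter
          image: registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04
```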

Upcoming priorities (SSO/storage/mail)

  • Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe.
  • Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress.
  • Plan mail: mostly self-hosted, relaying through a trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows.

SSO plan sketch (2025-12-03)

  • IdP: use Keycloak (preferred) in a new sso namespace via the Bitnami or codecentric chart with a Postgres backing store (single PVC), ingress at sso.bstein.dev, and an admin user bound to brad@bstein.dev; stick with the local DB initially (no external IdP).
  • Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via Jellyfin OAuth/OpenID plugin (map existing usernames; run migration to pre-create users in Keycloak with same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens.
  • Steps to implement:
    1. Add service folder services/keycloak/ (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile; a HelmRelease sketch follows this list.
    2. Seed realm atlas with users (import CSV/realm). Create clients for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set the admin user's email to brad@bstein.dev.
    3. Reconfigure Grafana to OIDC (disable anonymous access to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini).
    4. Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth.
    5. Deploy Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate password reset path.
  • Migration caution: do not delete existing local creds until SSO validated; keep Pegasus working via Jellyfin tokens during transition.
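
A hedged sketch of the HelmRelease that step 1 could add under services/keycloak/, assuming the Bitnami chart and a HelmRepository already registered in flux-system; the values keys must be checked against whichever chart is actually chosen, and the secret name is a placeholder:

```yaml
# services/keycloak/helmrelease.yaml (sketch only; verify keys against the chosen chart)
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: keycloak
  namespace: sso
spec:
  interval: 30m
  chart:
    spec:
      chart: keycloak
      sourceRef:
        kind: HelmRepository
        name: bitnami                    # assumes a Bitnami HelmRepository in flux-system
        namespace: flux-system
  values:
    auth:
      adminUser: admin
      existingSecret: keycloak-admin     # placeholder Secret holding the admin password
    postgresql:
      enabled: true                      # bundled Postgres on a single PVC to start
    ingress:
      enabled: true
      hostname: sso.bstein.dev
      ingressClassName: traefik
```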

Postgres centralization (2025-12-03)

  • Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes; use it for services that can easily point at an external DB (a connection sketch follows this list).
  • Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported.
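
As a sketch of the per-service-database pattern, the non-secret connection details a migrated service might consume could look like this; the shared Postgres Service DNS name, database, and role are placeholders, and the password stays in a Vault/SOPS-managed Secret per the security tips above:

```yaml
# services/gitea/db-config.yaml (illustrative only; names are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gitea-db
  namespace: gitea
data:
  DB_TYPE: postgres
  DB_HOST: postgres.postgres.svc.cluster.local:5432   # assumed shared Postgres Service
  DB_NAME: gitea                                       # one database per service
  DB_USER: gitea                                       # matching role; password lives in a managed Secret
```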

SSO integration snapshot (2025-12-08)

  • Current blockers: Zot still prompts for basic auth/double-login; Vault still lands on the token UI after Keycloak (previously 502/404 when vault-0 was sealed). The forward-auth middleware on the Zot Ingress is likely still causing the 401/Found hop, and the Vault OIDC mount does not complete the UI flow unless Vault is unsealed and the preferred login method is set.
  • Flux-only changes required: remove the zot forward-auth middleware from the Ingress (let oauth2-proxy handle the redirect; see the sketch after this list), and ensure the Vault OIDC mount is the preferred UI login and bound to the admin group; keep all edits in the repo so Flux enforces them.
  • Secrets present (per user): zot-oidc-client (client_secret only), oauth2-proxy-zot-oidc, oauth2-proxy-vault-oidc, vault-oidc-admin-token. Zot needs its regcred in the zot namespace if image pulls fail.
  • Cluster validation blocked here: kubectl get nodes fails (403/permission) and DNS to *.bstein.dev fails in this session, so no live curl verification could be run. Re-test on a host with cluster/DNS access after Flux applies fixes.
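
A hedged sketch of the intended end state for the Zot Ingress, with no forward-auth middleware annotation so oauth2-proxy owns the OIDC redirect; the host, service name, entrypoint, and port are placeholders and the manifest tracked in the repo is authoritative:

```yaml
# services/zot/ingress.yaml (sketch of the intended end state, not the current tracked file)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zot
  namespace: zot
  annotations:
    # intentionally no traefik.ingress.kubernetes.io/router.middlewares forward-auth entry;
    # oauth2-proxy handles the OIDC redirect instead
    traefik.ingress.kubernetes.io/router.entrypoints: websecure   # assumption; match existing entrypoints
spec:
  ingressClassName: traefik
  rules:
    - host: zot.bstein.dev              # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: zot
                port:
                  number: 5000          # Zot's default port; confirm against the Service
```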

Docs hygiene

  • Do not add per-service README.md files; use NOTES.md if documentation is needed inside service folders. Keep only the top-level repo README.
  • Keep comments succinct and in a human voice—no AI-sounding notes. Use NOTES.md for scratch notes instead of sprinkling reminders into code or extra READMEs.