diff --git a/AGENTS.md b/AGENTS.md
deleted file mode 100644
index 9dc36ac..0000000
--- a/AGENTS.md
+++ /dev/null
@@ -1,81 +0,0 @@
-
-
-Repository Guidelines
-
-> Local-only note: apply changes through Flux-tracked manifests, not by manual kubectl edits in-cluster—manual tweaks will be reverted by Flux.
-
-## Project Structure & Module Organization
-- `infrastructure/`: cluster-scoped building blocks (core, flux-system, traefik, longhorn). Add new platform features by mirroring this layout.
-- `services/`: workload manifests per app (`services/gitea/`, etc.) with `kustomization.yaml` plus one file per kind; keep diffs small and focused.
-- `dockerfiles/` hosts bespoke images, while `scripts/` stores operational Fish/Bash helpers—extend these directories instead of relying on ad-hoc commands.
-
-## Build, Test, and Development Commands
-- `kustomize build services/` (or `kubectl kustomize ...`) renders manifests exactly as Flux will.
-- `kubectl apply --server-side --dry-run=server -k services/` validates schemas against the live API without persisting anything.
-- `flux reconcile kustomization flux-system --namespace flux-system --with-source` pulls the latest Git state after merges or hotfixes.
-- `fish scripts/flux_hammer.fish --help` explains the recovery tool; read it before running against production workloads.
-
-## Coding Style & Naming Conventions
-- YAML uses two-space indents; retain the leading path comment (e.g. `# services/gitea/deployment.yaml`) to speed code review.
-- Keep resource names lowercase kebab-case, align labels/selectors, and mirror namespaces with directory names.
-- List resources in `kustomization.yaml` in order: namespace/config, then storage, then workloads and networking, for predictable diffs.
-- Scripts start with `#!/usr/bin/env fish` or bash, stay executable, and follow snake_case names such as `flux_hammer.fish`.
-
-## Testing Guidelines
-- Run `kustomize build` and the dry-run apply for every service you touch; capture failures before opening a PR.
-- `flux diff kustomization <name> --path services/<name>` previews reconciliations—link notable output when behavior shifts.
-- Docker edits: run `docker build -f dockerfiles/Dockerfile.monerod .` (swap in the file you changed) to verify the image builds.
-
-## Commit & Pull Request Guidelines
-- Keep commit subjects short, present-tense, and optionally scoped (`gpu(titan-24): add RuntimeClass`); squash fixups before review.
-- Describe linked issues, affected services, and required operator steps (e.g. `flux reconcile kustomization services-gitea`) in the PR body.
-- Focus each PR on one kustomization or service and update `infrastructure/flux-system` when Flux must track new folders.
-- Record the validation you ran (dry-runs, diffs, builds) and add screenshots only when ingress or UI behavior changes.
-
-## Security & Configuration Tips
-- Never commit credentials; use Vault workflows (`services/vault/`) or SOPS-encrypted manifests wired through `infrastructure/flux-system`.
-- Node selectors and tolerations gate workloads to hardware like `hardware: rpi4`; confirm labels before scaling or renaming nodes.
-- Pin external images by digest or rely on Flux image automation to follow approved tags and avoid drift.
-
-## Dashboard roadmap / context (2025-12-02)
-- Atlas dashboards are generated via `scripts/dashboards_render_atlas.py --build`, which writes JSON under `services/monitoring/dashboards/` and ConfigMaps under `services/monitoring/`. Keep the Grafana manifests in sync by regenerating after edits.
-- Atlas Overview panels are paired with internal dashboards (pods, nodes, storage, network, GPU). A new `atlas-gpu` internal dashboard holds the detailed GPU metrics that feed the overview share pie.
-- Old Grafana folders (`Atlas Storage`, `Atlas SRE`, `Atlas Public`, `Atlas Nodes`) should be removed in the Grafana UI when convenient; only `Atlas Overview` and `Atlas Internal` should remain provisioned.
-- Future work: add a separate generator (e.g., `dashboards_render_oceanus.py`) for SUI/oceanus validation dashboards, mirroring the atlas pattern of internal dashboards feeding a public overview.
-
-## Monitoring state (2025-12-03)
-- dcgm-exporter DaemonSet pulls `registry.bstein.dev/monitoring/dcgm-exporter:4.4.2-4.7.0-ubuntu22.04` with nvidia runtime/imagePullSecret; titan-24 exports metrics, titan-22 remains NotReady.
-- Atlas Overview is the Grafana home (1h range, 1m refresh), Overview folder UID `overview`, internal folder `atlas-internal` (oceanus-internal stub).
-- Panels standardized via generator; hottest row compressed, worker/control rows taller, root disk row taller and top12 bar gauge with labels. GPU share pie uses 1h avg_over_time to persist idle activity.
-- Internal dashboards are provisioned without the Viewer role; if anonymous still sees them, restart Grafana and tighten auth if needed.
-- GPU share panel updated (feature/sso) to use `max_over_time(…[$__range])`, so longer ranges (e.g., 12h) keep recent activity visible. Flux tracking `feature/sso`.
-
-## Upcoming priorities (SSO/storage/mail)
-- Establish SSO (Keycloak or similar) and federate Grafana, Gitea, Zot, Nextcloud, Pegasus/Jellyfin; keep Vaultwarden separate until safe.
-- Add Nextcloud (limit to rpi5 workers) with office suite; integrate with SSO; plan storage class and ingress.
-- Plan mail: mostly self-hosted, relay through a trusted provider for outbound; integrate with services (Nextcloud, Vaultwarden, etc.) for notifications and account flows.
-
-## SSO plan sketch (2025-12-03)
-- IdP: use Keycloak (preferred) in a new `sso` namespace, Bitnami or codecentric chart with Postgres backing store (single PVC), ingress `sso.bstein.dev`, admin user bound to brad@bstein.dev; stick with local DB initially (no external IdP).
-- Auth flow goals: Grafana (OIDC), Gitea (OAuth2/Keycloak), Zot (via Traefik forward-auth/oauth2-proxy), Jellyfin/Pegasus via the Jellyfin OAuth/OpenID plugin (map existing usernames; run a migration to pre-create users in Keycloak with the same usernames/emails and temporary passwords), Pegasus keeps using Jellyfin tokens.
-- Steps to implement:
-  1) Add service folder `services/keycloak/` (namespace, PVC, HelmRelease, ingress, secret for admin creds). Verify with kustomize + Flux reconcile.
-  2) Seed realm `atlas` with users (import CSV/realm). Create clients for Grafana (public/implicit), Gitea (confidential), and a “jellyfin” client for the OAuth plugin; set email for brad@bstein.dev as admin.
-  3) Reconfigure Grafana to OIDC (disable anonymous access to internal folders, leave Overview public via folder permissions). Reconfigure Gitea to OIDC (app.ini).
-  4) Add Traefik forward-auth (oauth2-proxy) in front of Zot and any other services needing headers-based auth.
-  5) Deploy the Jellyfin OpenID plugin; map Keycloak users to existing Jellyfin usernames; communicate the password reset path.
-- Migration caution: do not delete existing local creds until SSO is validated; keep Pegasus working via Jellyfin tokens during the transition.
-
-## Postgres centralization (2025-12-03)
-- Prefer a shared in-cluster Postgres deployment with per-service databases to reduce resource sprawl on Pi nodes. Use it for services that can easily point at an external DB.
-- Candidates to migrate to shared Postgres: Keycloak (realm DB), Gitea (git DB), Nextcloud (app DB), possibly Grafana (if persistence needed beyond current provisioner), Jitsi prosody/JVB state (if external DB supported). Keep tightly-coupled or lightweight embedded DBs as-is when migration is painful or not supported.
-
-## SSO integration snapshot (2025-12-08)
-- Current blockers: Zot still prompts for basic auth/double-login; Vault still wants the token UI after Keycloak (previously 502/404 when vault-0 sealed). Forward-auth middleware on Zot Ingress likely still causing the 401/Found hop; Vault OIDC mount not completing UI flow unless unsealed and preferred login is set.
-- Flux-only changes required: remove zot forward-auth middleware from Ingress (let oauth2-proxy handle redirect), ensure Vault OIDC mount is preferred UI login and bound to admin group; keep all edits in repo so Flux enforces them.
-- Secrets present (per user): `zot-oidc-client` (client_secret only), `oauth2-proxy-zot-oidc`, `oauth2-proxy-vault-oidc`, `vault-oidc-admin-token`. Zot needs its regcred in the zot namespace if image pulls fail.
-- Cluster validation blocked here: `kubectl get nodes` fails (403/permission) and DNS to `*.bstein.dev` fails in this session, so no live curl verification could be run. Re-test on a host with cluster/DNS access after Flux applies fixes.
-
-## Docs hygiene
-- Do not add per-service `README.md` files; use `NOTES.md` if documentation is needed inside service folders. Keep only the top-level repo README.
-- Keep comments succinct and in a human voice—no AI-sounding notes. Use `NOTES.md` for scratch notes instead of sprinkling reminders into code or extra READMEs.
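For step 3 of the SSO plan sketch above, a minimal sketch of the Grafana side of the OIDC wiring, assuming the `atlas` realm and the `sso.bstein.dev` ingress named in the plan; the `grafana` client ID, the secret mount path, and a chart that accepts `grafana.ini` values are assumptions, not values taken from this repo:

```yaml
# Hypothetical values fragment for a Grafana Helm chart that accepts grafana.ini;
# client_id, secret path, and realm name are assumptions, not repo values.
grafana.ini:
  auth.generic_oauth:
    enabled: true
    name: Keycloak
    client_id: grafana
    client_secret: $__file{/etc/secrets/grafana-oidc/client_secret}
    scopes: openid profile email
    auth_url: https://sso.bstein.dev/realms/atlas/protocol/openid-connect/auth
    token_url: https://sso.bstein.dev/realms/atlas/protocol/openid-connect/token
    api_url: https://sso.bstein.dev/realms/atlas/protocol/openid-connect/userinfo
    allow_sign_up: true
```

Keycloak exposes all of these under the standard `/realms/<realm>/protocol/openid-connect/...` paths, so the Gitea and oauth2-proxy clients can reuse the same issuer via its discovery document at `https://sso.bstein.dev/realms/atlas/.well-known/openid-configuration` (again assuming the realm is named `atlas`).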
diff --git a/docs/topology.md b/docs/topology.md
deleted file mode 100644
index 1e37235..0000000
--- a/docs/topology.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Titan Homelab Topology
-
-| Hostname    | Role / Function                 | Managed By           | Notes |
-|-------------|---------------------------------|----------------------|-------|
-| titan-db    | HA control plane database       | Ansible              | PostgreSQL / etcd backing services |
-| titan-0a    | Kubernetes control-plane        | Flux (atlas cluster) | HA leader, tainted for control only |
-| titan-0b    | Kubernetes control-plane        | Flux (atlas cluster) | Standby control node |
-| titan-0c    | Kubernetes control-plane        | Flux (atlas cluster) | Standby control node |
-| titan-04-19 | Raspberry Pi workers            | Flux (atlas cluster) | Workload nodes, labelled per hardware |
-| titan-20&21 | NVIDIA Jetson workers           | Flux (atlas cluster) | Workload nodes, labelled per hardware |
-| titan-22    | GPU mini-PC (Jellyfin)          | Flux + Ansible       | NVIDIA runtime managed via `modules/profiles/atlas-ha` |
-| titan-23    | Dedicated SUI validator Oceanus | Manual + Ansible     | Baremetal validator workloads, exposes metrics to atlas |
-| titan-24    | Tethys hybrid node              | Flux + Ansible       | Runs SUI metrics via K8s, validator via Ansible |
-| titan-jh    | Jumphost & bastion & lesavka    | Ansible              | Entry point / future KVM services / custom kvm - lesavaka |
-| styx        | Air-gapped workstation          | Manual / Scripts     | Remains isolated, scripts tracked in `hosts/styx` |
diff --git a/hosts/styx/NOTES.md b/hosts/styx/NOTES.md
deleted file mode 100644
index 992bac5..0000000
--- a/hosts/styx/NOTES.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# hosts/styx/README.md
-Styx is air-gapped; provisioning scripts live under `scripts/`.
diff --git a/services/gitops-ui/certificate.yaml b/services/gitops-ui/certificate.yaml
index d16a83a..d2ea1fd 100644
--- a/services/gitops-ui/certificate.yaml
+++ b/services/gitops-ui/certificate.yaml
@@ -8,6 +8,6 @@ spec:
   secretName: gitops-ui-tls
   issuerRef:
     kind: ClusterIssuer
-    name: letsencrypt-prod
+    name: letsencrypt
   dnsNames:
     - cd.bstein.dev
diff --git a/services/gitops-ui/helmrelease.yaml b/services/gitops-ui/helmrelease.yaml
index 27b610d..974251c 100644
--- a/services/gitops-ui/helmrelease.yaml
+++ b/services/gitops-ui/helmrelease.yaml
@@ -34,7 +34,7 @@ spec:
       enabled: true
       className: traefik
       annotations:
-        cert-manager.io/cluster-issuer: letsencrypt-prod
+        cert-manager.io/cluster-issuer: letsencrypt
         traefik.ingress.kubernetes.io/router.entrypoints: websecure
       hosts:
         - host: cd.bstein.dev
diff --git a/services/keycloak/NOTES.md b/services/keycloak/NOTES.md
deleted file mode 100644
index bf7c21b..0000000
--- a/services/keycloak/NOTES.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# services/keycloak
-
-Keycloak is deployed via raw manifests and backed by the shared Postgres (`postgres-service.postgres.svc.cluster.local:5432`). Create these secrets before applying:
-
-```bash
-# DB creds (per-service DB/user in shared Postgres)
-kubectl -n sso create secret generic keycloak-db \
-  --from-literal=username=keycloak \
-  --from-literal=password='' \
-  --from-literal=database=keycloak
-
-# Admin console creds (maps to KC admin user)
-kubectl -n sso create secret generic keycloak-admin \
-  --from-literal=username=brad@bstein.dev \
-  --from-literal=password=''
-```
-
-Apply:
-
-```bash
-kubectl apply -k services/keycloak
-```
-
-Notes
-- Service: `keycloak.sso.svc:80` (Ingress `sso.bstein.dev`, TLS via cert-manager).
-- Uses Postgres schema `public`; DB/user should be provisioned in the shared Postgres instance.
-- Health endpoints on :9000 are wired for probes.
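On the probe note above: recent Keycloak releases serve readiness and liveness on the management interface when health is enabled, which matches the `:9000` wiring mentioned. A minimal sketch of what the container fragment might look like, assuming the deployment sets `KC_HEALTH_ENABLED` and keeps the default management port; it is illustrative, not copied from `services/keycloak/deployment.yaml`:

```yaml
# Illustrative fragment only; the real Deployment in services/keycloak is the
# source of truth. Assumes Keycloak >= 25, which serves health on the 9000
# management port when KC_HEALTH_ENABLED=true.
containers:
  - name: keycloak
    env:
      - name: KC_HEALTH_ENABLED
        value: "true"
    ports:
      - name: management
        containerPort: 9000
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 9000
    livenessProbe:
      httpGet:
        path: /health/live
        port: 9000
```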