2026-06-09 12:54:34 -03:00

Veles Infrastructure Contract

This stack is staged for Flux and intentionally starts the app deployments at replicas: 0 until images, native OIDC/session support, and smoke gates are ready.

Cluster Contract

  • Namespace: veles; no alternate alpha namespace is used.
  • Hostname: https://veles.bstein.dev
  • Backend service: veles-backend.veles.svc.cluster.local:80
  • Frontend service: veles-frontend.veles.svc.cluster.local:80
  • Postgres service: veles-postgres.veles.svc.cluster.local:5432
  • Artifact PVC: veles-artifacts, mounted at /data/veles-artifacts
  • Storage classes: veles-oceanus-db, veles-oceanus-artifacts
  • Images:
    • registry.bstein.dev/veles/veles-backend
    • registry.bstein.dev/veles/veles-frontend
    • registry.bstein.dev/veles/veles-sim-worker
  • Backend http container port: 8796
  • Frontend http container port: 8080
  • Backend/frontend deployments remain scaled to 0 until native OIDC/session support, image tags, and smoke gates are ready. Services route to a named http target port so Ingress does not depend on numeric container ports.
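
As an illustration of the named-port contract above, a backend Deployment and Service could look roughly like the sketch below. The Deployment name, the app label, the container name, and the TAG placeholder are assumptions for illustration; only the namespace, service name, image repository, port numbers, and the http port name come from this contract.

```yaml
# Sketch only: the Deployment name, labels, and image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veles-backend
  namespace: veles
spec:
  replicas: 0  # staged: stays at 0 until OIDC/session support, image tags, and smoke gates are ready
  selector:
    matchLabels:
      app: veles-backend  # assumed label key and value
  template:
    metadata:
      labels:
        app: veles-backend
    spec:
      containers:
        - name: backend
          image: registry.bstein.dev/veles/veles-backend:TAG  # TAG is a placeholder
          ports:
            - name: http  # the Service targets this name, not the number
              containerPort: 8796
---
apiVersion: v1
kind: Service
metadata:
  name: veles-backend
  namespace: veles
spec:
  selector:
    app: veles-backend
  ports:
    - name: http
      port: 80
      targetPort: http  # resolves to the container port named "http"
```

With this shape, changing the backend container port only touches the Deployment; the Service and Ingress keep referring to port 80 and the http name.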

Auth Contract

Veles owns authorization in the app. The veles Ingress does not use oauth2-proxy or Traefik forward-auth, so no ingress/auth layer should strip OIDC token claims. The app should validate tokens from https://sso.bstein.dev/realms/veles and expect stable sub, email, preferred_username, groups, and realm_access.roles claims. Do not scale Veles for real user traffic until native OIDC login/session flow is implemented and smoke-tested.

The Keycloak realm setup creates both groups and realm roles named alpha and admin. Members of the alpha group receive the alpha realm role; members of admin receive both alpha and admin. Built-in/meta strategies can stay universal, while runs and user-created strategies should remain user-scoped in the Veles database.
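
For orientation, the decoded claims of a token for an admin member might look roughly like the following, shown here as YAML. Every value except the issuer and the group/role names is a hypothetical placeholder, and the sketch assumes the Keycloak group mapper emits plain group names rather than full group paths.

```yaml
# Hypothetical decoded token claims for a member of the admin group.
iss: https://sso.bstein.dev/realms/veles
sub: 00000000-0000-0000-0000-000000000000  # placeholder; the app should key user-scoped data on sub
email: user@example.com                    # placeholder
preferred_username: example-user           # placeholder
groups:
  - admin            # assumes the group mapper does not emit full paths such as /admin
realm_access:
  roles:
    - alpha          # admin members receive both realm roles
    - admin
```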

Runtime Env

Veles should consume:

  • VELES_PUBLIC_BASE_URL=https://veles.bstein.dev
  • VELES_OIDC_ISSUER=https://sso.bstein.dev/realms/veles
  • VELES_OIDC_CLIENT_ID=veles-web
  • VELES_OIDC_REQUIRED_GROUPS=alpha,admin
  • VELES_OIDC_GROUPS_CLAIM=groups
  • VELES_OIDC_ROLES_CLAIM=realm_access.roles
  • DATABASE_URL from kv/data/atlas/veles/veles-db
  • VELES_SESSION_SECRET from kv/data/atlas/veles/app-secrets
  • VELES_BYOK_ENCRYPTION_KEY from kv/data/atlas/veles/app-secrets

User OpenAI API keys must stay in the Veles database encrypted with VELES_BYOK_ENCRYPTION_KEY; do not store per-user BYOK secrets in Vault.

Backend runtime secrets are synced from Vault by veles-vault into the generated Kubernetes Secret veles-runtime-secrets; no secret values are committed. The backend consumes that secret with envFrom.
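
A minimal sketch of how the backend container could wire these values follows. It assumes the non-secret settings are set inline with env and that the keys in veles-runtime-secrets match the variable names listed above (DATABASE_URL, VELES_SESSION_SECRET, VELES_BYOK_ENCRYPTION_KEY); the container name is also an assumption.

```yaml
# Fragment of the backend pod template, not a complete manifest.
containers:
  - name: backend  # assumed container name
    env:
      - name: VELES_PUBLIC_BASE_URL
        value: "https://veles.bstein.dev"
      - name: VELES_OIDC_ISSUER
        value: "https://sso.bstein.dev/realms/veles"
      - name: VELES_OIDC_CLIENT_ID
        value: "veles-web"
      - name: VELES_OIDC_REQUIRED_GROUPS
        value: "alpha,admin"
      - name: VELES_OIDC_GROUPS_CLAIM
        value: "groups"
      - name: VELES_OIDC_ROLES_CLAIM
        value: "realm_access.roles"
    envFrom:
      # Every key in the generated Secret becomes an environment variable.
      - secretRef:
          name: veles-runtime-secrets
```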

Artifact Contract

veles-artifacts is an RWO Longhorn PVC mounted into backend pods at /data/veles-artifacts. Backend pods own artifact writes and serving. Simulation Jobs should not mount or write directly to this PVC unless they are explicitly scheduled on Oceanus with the Veles toleration and the app has chosen a same-node direct-write model. Queue-mediated upload/copy through the backend remains the safer default until the app contract settles.

Backend, simulation workers, and retention/cleanup workers must run on Oceanus/titan-23 when they need artifact access. Frontend pods must not mount veles-artifacts.
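
The claim and mount described above could be sketched as follows. The requested size, the volume name, and the container name are placeholders; the PVC name, access mode, storage class, and mount path come from this contract.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: veles-artifacts
  namespace: veles
spec:
  accessModes:
    - ReadWriteOnce  # RWO: usable by pods on one node at a time
  storageClassName: veles-oceanus-artifacts
  resources:
    requests:
      storage: 50Gi  # placeholder size, not taken from this contract
---
# Fragment of the backend pod template only; frontend pods must not include this.
volumes:
  - name: artifacts  # assumed volume name
    persistentVolumeClaim:
      claimName: veles-artifacts
containers:
  - name: backend  # assumed container name
    volumeMounts:
      - name: artifacts
        mountPath: /data/veles-artifacts
```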

Simulation Jobs

The backend service account can create, watch, and delete Jobs only inside the veles namespace. Simulation pods should use service account veles-sim, set automountServiceAccountToken: false, and use:

priorityClassName: veles-sim
nodeSelector:
  veles.bstein.dev/simulation: "true"
tolerations:
  - key: veles.bstein.dev/simulation
    operator: Equal
    value: "true"
    effect: NoSchedule

Retention/cleanup Jobs that touch artifacts should use the same node selector and toleration. If they do not need Kubernetes API access, run them under the veles-sim service account; otherwise keep control-plane actions in the backend/controller and run artifact cleanup through a no-token worker.
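
Putting the scheduling fields above into a full manifest, a simulation Job created by the backend might look like this sketch. The Job name, container name, backoffLimit, and TAG placeholder are assumptions; the service account, token setting, priority class, node selector, and toleration are the ones required above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: veles-sim-example  # hypothetical name; the backend would generate its own
  namespace: veles
spec:
  backoffLimit: 0  # assumption: let the backend decide about retries
  template:
    spec:
      serviceAccountName: veles-sim
      automountServiceAccountToken: false  # simulation pods get no API token
      priorityClassName: veles-sim
      restartPolicy: Never  # Jobs require Never or OnFailure
      nodeSelector:
        veles.bstein.dev/simulation: "true"
      tolerations:
        - key: veles.bstein.dev/simulation
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: sim  # assumed container name
          image: registry.bstein.dev/veles/veles-sim-worker:TAG  # TAG is a placeholder
```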

Staged Operator Steps

  1. Join titan-23/Oceanus to Atlas as a worker.
  2. Use Metis with titan-23 in METIS_FLASH_HOSTS; the existing node secret placeholder uses 192.168.22.23.
  3. Confirm the node normalizer applies the Veles labels and taint.
  4. Add the Oceanus Longhorn disks at the paths that the Longhorn tag-ensure job tags.
  5. Let Vault policy reconciliation run, then unsuspend veles-secrets-ensure-2.
  6. Unsuspend veles-realm-ensure-4 in services/keycloak to create the realm/client secret, groups, and roles.
  7. Create the Harbor veles project or robot access before image automation is enabled in production.
  8. Keep backend/frontend scaled to 0 until native OIDC/session support is implemented, image tags exist, and smoke gates pass.

Assumptions

  • veles-oceanus-artifacts is RWO for alpha; simulation workers should either run on Oceanus with the backend or stream logs to the backend, which owns writes.
  • Longhorn default backup target is s3://atlas-soteria@us-west-004/ with credential secret longhorn-backup-b2; the live BackupTarget/default currently reports available. Postgres and artifact volumes have Longhorn recurring snapshot and backup jobs attached by their StorageClasses. Scheduled snapshots and backups are not a substitute for a tested restore drill.
  • The Jenkins job skeleton points at the Veles repo but stays disabled until that repo provides a Jenkinsfile.
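
To make the Longhorn assumption above more concrete, a StorageClass that attaches recurring jobs could be sketched as below. The recurring job names, the node tag, the replica count, and the reclaim policy are hypothetical; only the class name and the use of Longhorn recurring snapshot and backup jobs come from this document.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: veles-oceanus-artifacts
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain      # assumption
parameters:
  numberOfReplicas: "1"    # assumption: volumes live on the single Oceanus node
  nodeSelector: "oceanus"  # hypothetical Longhorn node tag
  # Hypothetical recurring job names; volumes created from this class inherit them.
  recurringJobSelector: '[{"name":"veles-snapshot","isGroup":false},{"name":"veles-backup","isGroup":false}]'
```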