2026-06-09 01:26:22 -03:00

# Veles Infrastructure Contract
This stack is staged for Flux and intentionally starts the app deployments at `replicas: 0` until images and the app-side runtime contract are ready.
## Cluster Contract
- Namespace: `veles`; no alternate alpha namespace is used.
- Hostname: `https://veles.bstein.dev`
- Backend service: `veles-backend.veles.svc.cluster.local:80`
- Frontend service: `veles-frontend.veles.svc.cluster.local:80`
- Postgres service: `veles-postgres.veles.svc.cluster.local:5432`
- Artifact PVC: `veles-artifacts`, mounted at `/data/veles-artifacts`
- Storage classes: `veles-oceanus-db`, `veles-oceanus-artifacts`
- Images:
  - `registry.bstein.dev/veles/veles-backend`
  - `registry.bstein.dev/veles/veles-frontend`
  - `registry.bstein.dev/veles/veles-sim-worker`
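
As a rough illustration of the artifact volume entry above, a claim for it might look like the following. The access mode follows the RWO assumption stated under Assumptions; the requested size is a placeholder and does not come from this repo.

```yaml
# Sketch only: the storage request is a hypothetical placeholder.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: veles-artifacts
  namespace: veles
spec:
  accessModes:
    - ReadWriteOnce                 # RWO for alpha, per the Assumptions section
  storageClassName: veles-oceanus-artifacts
  resources:
    requests:
      storage: 10Gi                 # hypothetical size
```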
## Runtime Env
Veles should consume:
- `VELES_PUBLIC_BASE_URL=https://veles.bstein.dev`
- `VELES_OIDC_ISSUER=https://sso.bstein.dev/realms/veles`
- `VELES_OIDC_CLIENT_ID=veles-web`
- `VELES_OIDC_REQUIRED_GROUPS=alpha,admin`
- `DATABASE_URL` from `kv/data/atlas/veles/veles-db`
- `VELES_SESSION_SECRET` from `kv/data/atlas/veles/app-secrets`
- `VELES_BYOK_ENCRYPTION_KEY` from `kv/data/atlas/veles/app-secrets`
User OpenAI API keys must stay in the Veles database encrypted with `VELES_BYOK_ENCRYPTION_KEY`; do not store per-user BYOK secrets in Vault.
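
The contract above could be wired into the backend roughly as follows. This is a minimal sketch, not the manifest from this repo: the Deployment name, labels, container name, image tag, and the Kubernetes Secret names and keys (`veles-db`, `veles-app-secrets`) are all assumptions, and it presumes the Vault values are already synced into Kubernetes Secrets by some mechanism this document does not describe.

```yaml
# Sketch only. Secret names/keys, labels, container name, and image tag are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veles-backend               # assumed to match the service name
  namespace: veles
spec:
  replicas: 0                       # staged at zero until images and the runtime contract are ready
  selector:
    matchLabels:
      app: veles-backend
  template:
    metadata:
      labels:
        app: veles-backend
    spec:
      containers:
        - name: backend
          image: registry.bstein.dev/veles/veles-backend:latest   # tag is a placeholder
          env:
            - name: VELES_PUBLIC_BASE_URL
              value: https://veles.bstein.dev
            - name: VELES_OIDC_ISSUER
              value: https://sso.bstein.dev/realms/veles
            - name: VELES_OIDC_CLIENT_ID
              value: veles-web
            - name: VELES_OIDC_REQUIRED_GROUPS
              value: "alpha,admin"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: veles-db              # hypothetical Secret synced from Vault
                  key: DATABASE_URL
            - name: VELES_SESSION_SECRET
              valueFrom:
                secretKeyRef:
                  name: veles-app-secrets     # hypothetical Secret synced from Vault
                  key: VELES_SESSION_SECRET
            - name: VELES_BYOK_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: veles-app-secrets
                  key: VELES_BYOK_ENCRYPTION_KEY
```

Only the three Vault-backed values are read from Secrets; the public URL and OIDC settings are not sensitive and can stay as plain values.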
## Simulation Jobs
The backend service account can create, watch, and delete Jobs only inside the `veles` namespace. Simulation pods should use service account `veles-sim`, set `automountServiceAccountToken: false`, and use:
```yaml
priorityClassName: veles-sim
nodeSelector:
  veles.bstein.dev/simulation: "true"
tolerations:
  - key: veles.bstein.dev/simulation
    operator: Equal
    value: "true"
    effect: NoSchedule
```
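
Put together, the permissions and pod settings described in this section might look something like the sketch below. The Role and RoleBinding names, the backend service account name (`veles-backend`), the Job naming scheme, the retry policy, the container name, and the image tag are assumptions made for illustration; only the verbs, the `veles-sim` service account, and the scheduling fields come from the text above.

```yaml
# Sketch only. Anything marked "hypothetical" does not come from this repo.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: veles-job-manager           # hypothetical
  namespace: veles
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: veles-job-manager           # hypothetical
  namespace: veles
subjects:
  - kind: ServiceAccount
    name: veles-backend             # hypothetical backend service account name
    namespace: veles
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: veles-job-manager
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: veles-sim-          # hypothetical naming scheme; requires create, not apply
  namespace: veles
spec:
  backoffLimit: 0                   # hypothetical retry policy
  template:
    spec:
      serviceAccountName: veles-sim
      automountServiceAccountToken: false
      priorityClassName: veles-sim
      nodeSelector:
        veles.bstein.dev/simulation: "true"
      tolerations:
        - key: veles.bstein.dev/simulation
          operator: Equal
          value: "true"
          effect: NoSchedule
      restartPolicy: Never
      containers:
        - name: sim-worker          # hypothetical container name
          image: registry.bstein.dev/veles/veles-sim-worker:latest   # tag is a placeholder
```

Using a namespaced `Role` rather than a `ClusterRole` is what keeps the backend's Job permissions confined to `veles`. Clients that watch Jobs often need `get` and `list` as well; whether the real Role grants them is not stated here.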
## Staged Operator Steps
1. Join `titan-23`/Oceanus to Atlas as a worker.
2. Use Metis with `titan-23` in `METIS_FLASH_HOSTS`; the existing node secret placeholder uses `192.168.22.23`.
3. Confirm the node normalizer applies the Veles labels and taint.
4. Add the Oceanus Longhorn disks at the paths tagged by the Longhorn tag-ensure job.
5. Let Vault policy reconciliation run, then unsuspend `veles-secrets-ensure-2`.
6. Unsuspend `veles-realm-ensure-3` in `services/keycloak` to create the realm/client secret.
7. Create the Harbor `veles` project or robot access before image automation is enabled in production.
8. Scale `veles-postgres`, then backend/frontend once app images exist.
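
For step 3, the pod settings under Simulation Jobs imply roughly the following state on the node. This is inferred from the `nodeSelector` and toleration in this document, not copied from the node normalizer, so the real node may carry additional Veles labels.

```yaml
# Sketch only: expected excerpt of `kubectl get node titan-23 -o yaml`,
# inferred from the simulation nodeSelector and toleration.
apiVersion: v1
kind: Node
metadata:
  name: titan-23
  labels:
    veles.bstein.dev/simulation: "true"
spec:
  taints:
    - key: veles.bstein.dev/simulation
      value: "true"
      effect: NoSchedule
```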
## Assumptions
- `veles-oceanus-artifacts` is RWO for alpha; simulation workers should either run on Oceanus with the backend or stream logs to the backend, which owns writes.
- Postgres relies on Longhorn recurring backup jobs that store backups off Oceanus. These backups are not a substitute for a tested restore drill.
- The Jenkins job skeleton points at the Veles repo but stays disabled until that repo provides a Jenkinsfile.