# Veles Infrastructure Contract
This stack is staged for Flux and intentionally starts the app deployments at `replicas: 0` until images and the app-side runtime contract are ready.
## Cluster Contract
- Namespace: `veles`; no alternate alpha namespace is used.
- Hostname: `https://veles.bstein.dev`
- Backend service: `veles-backend.veles.svc.cluster.local:80`
- Frontend service: `veles-frontend.veles.svc.cluster.local:80`
- Postgres service: `veles-postgres.veles.svc.cluster.local:5432`
- Artifact PVC: `veles-artifacts`, mounted at `/data/veles-artifacts`
- Storage classes: `veles-oceanus-db`, `veles-oceanus-artifacts`
- Images:
  - `registry.bstein.dev/veles/veles-backend`
  - `registry.bstein.dev/veles/veles-frontend`
  - `registry.bstein.dev/veles/veles-sim-worker`
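
As a rough illustration of the artifact storage entries above, the claim and its mount might look like the sketch below. The pairing of the `veles-artifacts` PVC with the `veles-oceanus-artifacts` storage class, the `ReadWriteOnce` access mode, the requested size, and the container and volume names are assumptions made for illustration, not values taken from the staged manifests.

```yaml
# Sketch only: size, access mode, and class pairing are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: veles-artifacts
  namespace: veles
spec:
  accessModes:
    - ReadWriteOnce                      # assumed access mode
  storageClassName: veles-oceanus-artifacts
  resources:
    requests:
      storage: 10Gi                      # hypothetical size
---
# Pod spec fragment (not a complete manifest) showing the mount path.
containers:
  - name: backend                        # hypothetical container name
    image: registry.bstein.dev/veles/veles-backend
    volumeMounts:
      - name: artifacts                  # hypothetical volume name
        mountPath: /data/veles-artifacts
volumes:
  - name: artifacts
    persistentVolumeClaim:
      claimName: veles-artifacts
```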
## Runtime Env
Veles should consume:
- `VELES_PUBLIC_BASE_URL=https://veles.bstein.dev`
- `VELES_OIDC_ISSUER=https://sso.bstein.dev/realms/veles`
- `VELES_OIDC_CLIENT_ID=veles-web`
- `VELES_OIDC_REQUIRED_GROUPS=alpha,admin`
- `DATABASE_URL` from `kv/data/atlas/veles/veles-db`
- `VELES_SESSION_SECRET` from `kv/data/atlas/veles/app-secrets`
- `VELES_BYOK_ENCRYPTION_KEY` from `kv/data/atlas/veles/app-secrets`
User OpenAI API keys must stay in the Veles database encrypted with `VELES_BYOK_ENCRYPTION_KEY`; do not store per-user BYOK secrets in Vault.
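
One way the variables above could be wired into a container spec is sketched below. It assumes the Vault paths are synced into Kubernetes Secrets; the Secret names and keys shown are hypothetical, and the stack may deliver these values by a different mechanism (for example, an injected file or sidecar).

```yaml
# Container spec fragment. Secret names and keys are hypothetical.
env:
  - name: VELES_PUBLIC_BASE_URL
    value: https://veles.bstein.dev
  - name: VELES_OIDC_ISSUER
    value: https://sso.bstein.dev/realms/veles
  - name: VELES_OIDC_CLIENT_ID
    value: veles-web
  - name: VELES_OIDC_REQUIRED_GROUPS
    value: alpha,admin
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: veles-db                   # assumed to mirror kv/data/atlas/veles/veles-db
        key: DATABASE_URL                # hypothetical key
  - name: VELES_SESSION_SECRET
    valueFrom:
      secretKeyRef:
        name: veles-app-secrets          # assumed to mirror kv/data/atlas/veles/app-secrets
        key: VELES_SESSION_SECRET        # hypothetical key
  - name: VELES_BYOK_ENCRYPTION_KEY
    valueFrom:
      secretKeyRef:
        name: veles-app-secrets
        key: VELES_BYOK_ENCRYPTION_KEY   # hypothetical key
```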
## Simulation Jobs
The backend service account can create, watch, and delete Jobs only inside the `veles` namespace. Simulation pods should use service account `veles-sim`, set `automountServiceAccountToken: false`, and use:
```yaml
priorityClassName: veles-sim
nodeSelector:
  veles.bstein.dev/simulation: "true"
tolerations:
  - key: veles.bstein.dev/simulation
    operator: Equal
    value: "true"
    effect: NoSchedule
```
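
To show how these pieces could fit together, here is a minimal sketch of a namespaced Role for the backend plus a simulation Job that carries the fields above. The Role name, `generateName` prefix, container name, `backoffLimit`, and `restartPolicy` are hypothetical; the verbs are limited to the three named in this section, and a real client may also need `get` or `list`.

```yaml
# Sketch only: names marked hypothetical are not from the staged manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: veles-job-manager                # hypothetical name
  namespace: veles                       # a Role is namespaced, so access stops at veles
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "watch", "delete"]
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: veles-sim-               # hypothetical; requires create, not apply
  namespace: veles
spec:
  backoffLimit: 0                        # hypothetical retry policy
  template:
    spec:
      serviceAccountName: veles-sim
      automountServiceAccountToken: false
      priorityClassName: veles-sim
      nodeSelector:
        veles.bstein.dev/simulation: "true"
      tolerations:
        - key: veles.bstein.dev/simulation
          operator: Equal
          value: "true"
          effect: NoSchedule
      restartPolicy: Never               # hypothetical; Jobs allow Never or OnFailure
      containers:
        - name: sim                      # hypothetical container name
          image: registry.bstein.dev/veles/veles-sim-worker   # pin a tag or digest in practice
```

Binding the Role to the backend service account would need a RoleBinding, which is omitted here because the backend service account name is not stated in this document.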
## Staged Operator Steps
1. Join `titan-23`/Oceanus to Atlas as a worker.
2. Use Metis with `titan-23` in `METIS_FLASH_HOSTS`; the existing node secret placeholder uses `192.168.22.23`.
3. Confirm the node normalizer applies the Veles labels and taint.
4. Add Oceanus Longhorn disks at paths tagged by the Longhorn tag ensure job.
5. Let Vault policy reconciliation run, then unsuspend `veles-secrets-ensure-2`.
6. Unsuspend `veles-realm-ensure-2` in `services/keycloak` to create the realm/client secret.
7. Create the Harbor `veles` project or robot access before image automation is enabled in production.
8. Scale `veles-postgres`, then backend/frontend once app images exist.
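
The final scaling step could be expressed as a Kustomize `replicas` override, sketched below. This assumes the workloads are named after their services (`veles-postgres`, `veles-backend`, `veles-frontend`) and that a count of 1 is appropriate; neither is confirmed by this document, and the actual change may instead be made directly in the workload manifests.

```yaml
# kustomization.yaml fragment. Workload names and counts are assumptions.
replicas:
  - name: veles-postgres
    count: 1
  # Enable these only once the app images exist in the registry.
  # - name: veles-backend
  #   count: 1
  # - name: veles-frontend
  #   count: 1
```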
## Assumptions
- `veles-oceanus-artifacts` is RWO for alpha; simulation workers should either run on Oceanus with the backend or stream logs to the backend, which owns writes.
- Postgres relies on Longhorn recurring backup jobs that keep backups off Oceanus. These backups are not a substitute for a tested restore drill.
- The Jenkins job skeleton points at the Veles repo but stays disabled until that repo provides a Jenkinsfile.