# Veles Infrastructure Contract

This stack is staged for Flux. The app deployments intentionally start at `replicas: 0` until the images exist and the app-side runtime contract is ready.

## Cluster Contract

- Namespace: `veles`; no alternate alpha namespace is used.
- Hostname: `https://veles.bstein.dev`
- Backend service: `veles-backend.veles.svc.cluster.local:80`
- Frontend service: `veles-frontend.veles.svc.cluster.local:80`
- Postgres service: `veles-postgres.veles.svc.cluster.local:5432`
- Artifact PVC: `veles-artifacts`, mounted at `/data/veles-artifacts`
- Storage classes: `veles-oceanus-db`, `veles-oceanus-artifacts`
- Images:
  - `registry.bstein.dev/veles/veles-backend`
  - `registry.bstein.dev/veles/veles-frontend`
  - `registry.bstein.dev/veles/veles-sim-worker`

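
As an illustration of how the service entries above map to a manifest, here is a minimal sketch of the backend Service. The name, namespace, and port 80 come from the contract; the selector label and `targetPort` are assumptions made for the example and may not match the real stack.

```yaml
# Sketch only: the selector and targetPort are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: veles-backend          # resolves as veles-backend.veles.svc.cluster.local
  namespace: veles
spec:
  selector:
    app: veles-backend         # hypothetical pod label
  ports:
    - name: http
      port: 80                 # service port from the contract
      targetPort: 8080         # hypothetical container port
```
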
## Runtime Env

Veles should consume:

- `VELES_PUBLIC_BASE_URL=https://veles.bstein.dev`
- `VELES_OIDC_ISSUER=https://sso.bstein.dev/realms/veles`
- `VELES_OIDC_CLIENT_ID=veles-web`
- `VELES_OIDC_REQUIRED_GROUPS=alpha,admin`
- `DATABASE_URL` from `kv/data/atlas/veles/veles-db`
- `VELES_SESSION_SECRET` from `kv/data/atlas/veles/app-secrets`
- `VELES_BYOK_ENCRYPTION_KEY` from `kv/data/atlas/veles/app-secrets`

User OpenAI API keys must stay in the Veles database encrypted with `VELES_BYOK_ENCRYPTION_KEY`; do not store per-user BYOK secrets in Vault.

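
The variables above could be wired into a container spec roughly as follows. This is a sketch, not the stack's actual manifest: the contract names only the Vault paths, so the Kubernetes Secret names (`veles-db`, `veles-app-secrets`) and their keys are hypothetical and assume the Vault values are synced into Secrets by some mechanism not described here.

```yaml
# Container env fragment. Secret names and keys are hypothetical.
env:
  - name: VELES_PUBLIC_BASE_URL
    value: "https://veles.bstein.dev"
  - name: VELES_OIDC_ISSUER
    value: "https://sso.bstein.dev/realms/veles"
  - name: VELES_OIDC_CLIENT_ID
    value: "veles-web"
  - name: VELES_OIDC_REQUIRED_GROUPS
    value: "alpha,admin"
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: veles-db                  # hypothetical; sourced from kv/data/atlas/veles/veles-db
        key: DATABASE_URL               # hypothetical key
  - name: VELES_SESSION_SECRET
    valueFrom:
      secretKeyRef:
        name: veles-app-secrets         # hypothetical; sourced from kv/data/atlas/veles/app-secrets
        key: VELES_SESSION_SECRET       # hypothetical key
  - name: VELES_BYOK_ENCRYPTION_KEY
    valueFrom:
      secretKeyRef:
        name: veles-app-secrets         # hypothetical; same Vault path as the session secret
        key: VELES_BYOK_ENCRYPTION_KEY  # hypothetical key
```
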
## Simulation Jobs

The backend service account can create, watch, and delete Jobs only inside the `veles` namespace. Simulation pods should use service account `veles-sim`, set `automountServiceAccountToken: false`, and use the following scheduling settings:

```yaml
priorityClassName: veles-sim
# Schedule only onto nodes labeled for simulation work.
nodeSelector:
  veles.bstein.dev/simulation: "true"
# Tolerate the matching taint on those nodes.
tolerations:
  - key: veles.bstein.dev/simulation
    operator: Equal
    value: "true"
    effect: NoSchedule
```

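
Putting the requirements above together, a simulation Job and the backend's Role might look like the sketch below. The Job name, `backoffLimit`, `restartPolicy`, container name, image tag, and Role name are hypothetical; the service account, priority class, node selector, toleration, image repository, and the three verbs come from this document.

```yaml
# Sketch of a simulation Job. Fields marked hypothetical are not from the contract.
apiVersion: batch/v1
kind: Job
metadata:
  name: veles-sim-example                 # hypothetical
  namespace: veles
spec:
  backoffLimit: 0                         # hypothetical
  template:
    spec:
      serviceAccountName: veles-sim
      automountServiceAccountToken: false # no API token inside simulation pods
      priorityClassName: veles-sim
      nodeSelector:
        veles.bstein.dev/simulation: "true"
      tolerations:
        - key: veles.bstein.dev/simulation
          operator: Equal
          value: "true"
          effect: NoSchedule
      restartPolicy: Never                # hypothetical
      containers:
        - name: sim-worker                # hypothetical
          image: registry.bstein.dev/veles/veles-sim-worker:0.0.0  # hypothetical tag
---
# Sketch of a namespaced Role granting the backend only the stated Job verbs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: veles-backend-jobs                # hypothetical
  namespace: veles
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "watch", "delete"]
```

Because a Role is namespaced, binding it to the backend service account keeps these permissions inside `veles`, which matches the restriction stated above.
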
## Staged Operator Steps

1. Join `titan-23`/Oceanus to Atlas as a worker.
2. Use Metis with `titan-23` in `METIS_FLASH_HOSTS`; the existing node secret placeholder uses `192.168.22.23`.
3. Confirm the node normalizer applies the Veles labels and taint.
4. Add Oceanus Longhorn disks at paths tagged by the Longhorn tag ensure job.
5. Let Vault policy reconciliation run, then unsuspend `veles-secrets-ensure-1`.
6. Unsuspend `veles-realm-ensure-1` in `services/keycloak` to create the realm/client secret.
7. Create the Harbor `veles` project or robot access before image automation is enabled in production.
8. Scale `veles-postgres`, then backend/frontend once app images exist.

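
For step 3, the node state to look for can be sketched as below. This assumes the label and taint are exactly the ones the simulation pod spec selects and tolerates; the node normalizer may apply additional labels that this document does not list.

```yaml
# Expected fragment of the titan-23 Node object (sketch, not a manifest to apply).
apiVersion: v1
kind: Node
metadata:
  name: titan-23
  labels:
    veles.bstein.dev/simulation: "true"   # matched by the simulation nodeSelector
spec:
  taints:
    - key: veles.bstein.dev/simulation    # matched by the simulation toleration
      value: "true"
      effect: NoSchedule
```
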
## Assumptions

- `veles-oceanus-artifacts` is RWO (ReadWriteOnce) for alpha; simulation workers should either run on Oceanus with the backend or stream logs to the backend, which owns writes.
- Postgres uses Longhorn backup recurring jobs off Oceanus. This is not a substitute for a tested restore drill.
- The Jenkins job skeleton points at the Veles repo but stays disabled until that repo provides a Jenkinsfile.

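
The RWO assumption for the artifact volume corresponds to a claim along these lines. The PVC name, namespace, and storage class come from this document; the pairing of that class with this PVC is inferred from the names, and the requested size is hypothetical.

```yaml
# Sketch of the artifact PVC under the RWO assumption.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: veles-artifacts
  namespace: veles
spec:
  storageClassName: veles-oceanus-artifacts
  accessModes:
    - ReadWriteOnce        # one node mounts read-write, hence the single-writer guidance
  resources:
    requests:
      storage: 50Gi        # hypothetical size
```
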