# soteria

Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:

- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- A simple built-in UI suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

## Endpoints

Public endpoints:

- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:

- `GET /` UI console
- `GET /v1/whoami`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=<ns>&pvc=<name>`
- `POST /v1/backup`
- `POST /v1/restores`
- `POST /v1/restore-test` legacy alias for `/v1/restores`

## API examples

### POST /v1/backup

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```

Longhorn response:

```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```
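
For scripted use, the backup request can be assembled with the Python standard library. This is a minimal sketch, not part of Soteria: the base URL `http://soteria.soteria.svc:8080` is a hypothetical in-cluster address, and sending the token in an `Authorization: Bearer` header is an assumption about how bearer tokens are presented.

```python
import json
import urllib.request

# Assumption: hypothetical in-cluster Service address; adjust to your deployment.
BASE_URL = "http://soteria.soteria.svc:8080"


def build_backup_request(namespace, pvc, tags=(), dry_run=False, token=None):
    """Build (but do not send) a POST /v1/backup request."""
    payload = {
        "namespace": namespace,
        "pvc": pvc,
        "tags": list(tags),
        "dry_run": dry_run,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: bearer tokens travel in the standard Authorization header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/v1/backup",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Start with dry_run=True to check the request without creating a backup.
req = build_backup_request(
    "ai", "llm-cache", tags=["namespace=ai", "service=llm"], dry_run=True
)
# To send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the payload easy to inspect and log before anything reaches the cluster.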

### GET /v1/inventory

Response shape:

```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```
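
Automation that consumes the inventory usually only cares about PVCs that are not healthy. The sketch below walks the documented response shape and collects them. The helper name `unhealthy_pvcs`, the second sample PVC, and its `"stale"` reason are illustrative assumptions; only `"fresh"` appears in the example above.

```python
import json


def unhealthy_pvcs(inventory):
    """Return (namespace, pvc, health_reason) for each PVC reported as unhealthy."""
    rows = []
    for ns in inventory.get("namespaces", []):
        for entry in ns.get("pvcs", []):
            if not entry.get("healthy", False):
                rows.append(
                    (entry["namespace"], entry["pvc"], entry.get("health_reason", ""))
                )
    return rows


# Trimmed sample in the documented shape. The second PVC and its
# "stale" health_reason are made up for illustration.
sample = json.loads("""
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {"namespace": "ai", "pvc": "llm-cache", "healthy": true, "health_reason": "fresh"},
        {"namespace": "ai", "pvc": "model-store", "healthy": false, "health_reason": "stale"}
      ]
    }
  ]
}
""")

stale = unhealthy_pvcs(sample)
```

Treating a missing `healthy` field as unhealthy is a deliberate choice in this sketch: a PVC with no health data is surfaced rather than silently skipped.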

### GET /v1/backups

```text
/v1/backups?namespace=ai&pvc=llm-cache
```

Returns the resolved volume name and backup records so the UI or automation can select a restore source.

### POST /v1/restores

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```

Notes:

- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.
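
The restore rules above can be mirrored on the client so that obviously invalid requests fail before they reach the API. This is a minimal sketch under stated assumptions: `build_restore_payload` is a hypothetical helper, and requiring exactly one of `snapshot` or `backup_url` is a convention of this sketch rather than documented server behavior.

```python
def build_restore_payload(namespace, pvc, target_pvc, target_namespace=None,
                          snapshot=None, backup_url=None, dry_run=False):
    """Build a POST /v1/restores body, applying the documented rules client-side."""
    if not target_pvc:
        raise ValueError("target_pvc is required")
    # Sketch-only convention: the caller picks exactly one restore source.
    if (snapshot is None) == (backup_url is None):
        raise ValueError("provide exactly one of snapshot or backup_url")
    # target_namespace defaults to the source namespace.
    target_namespace = target_namespace or namespace
    # The source PVC already exists, so restoring onto it would be refused.
    if (target_namespace, target_pvc) == (namespace, pvc):
        raise ValueError("target PVC must differ from the source PVC")
    payload = {
        "namespace": namespace,
        "pvc": pvc,
        "target_namespace": target_namespace,
        "target_pvc": target_pvc,
        "dry_run": dry_run,
    }
    if snapshot is not None:
        payload["snapshot"] = snapshot
    else:
        payload["backup_url"] = backup_url
    return payload


payload = build_restore_payload(
    "ai", "llm-cache", "restore-llm-cache", snapshot="latest"
)
```

The server remains the authority on conflicts. The local check only catches the one case the client can know for certain, which is a target identical to the source.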

## Authentication and authorization

When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:

- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`

Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted.
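
The group comparison described above can be illustrated as follows. This is a sketch of the idea, not Soteria's implementation: it assumes the `X-Auth-Request-Groups` header carries a comma-separated list, that surrounding whitespace is ignored, and that matching is case-sensitive.

```python
def normalize_group(group):
    """Strip surrounding whitespace and any leading '/' from a group name."""
    return group.strip().lstrip("/")


def parse_groups(value):
    """Split a comma-separated group list into a set of normalized names."""
    return {normalize_group(g) for g in value.split(",") if normalize_group(g)}


def is_allowed(groups_header, allowed_groups):
    """True if the user is in at least one allowed group.

    groups_header:  value of X-Auth-Request-Groups (assumed comma-separated)
    allowed_groups: value of SOTERIA_ALLOWED_GROUPS
    """
    return bool(parse_groups(groups_header) & parse_groups(allowed_groups))


# Both spellings of the same group are accepted.
with_slash = is_allowed("/maintenance,dev", "admin,maintenance")
without_slash = is_allowed("maintenance", "admin,maintenance")
denied = is_allowed("dev", "admin,maintenance")
```

Normalizing both sides means the configured list and the proxy header can each use either spelling without affecting the result.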

Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.
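
One way to check a presented token against such a list is shown below. This is a hedged sketch: it assumes clients send `Authorization: Bearer <token>` and uses a constant-time comparison from the Python standard library, which may differ from what Soteria actually does.

```python
import hmac


def parse_tokens(raw):
    """Split a comma-separated token list, dropping blanks and surrounding whitespace."""
    return [t.strip() for t in raw.split(",") if t.strip()]


def bearer_token_ok(authorization_header, tokens):
    """Return True if the header carries one of the configured bearer tokens."""
    scheme, _, presented = authorization_header.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False
    matched = False
    for token in tokens:
        # compare_digest takes time independent of where the inputs differ.
        if hmac.compare_digest(presented.encode("utf-8"), token.encode("utf-8")):
            matched = True
    return matched


# Example values only; real tokens should come from a Secret.
tokens = parse_tokens("alpha-token, beta-token,")
```

The loop checks every configured token instead of returning on the first match, so response time does not reveal which position in the list matched.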

## Prometheus metrics

Soteria exports Prometheus-format metrics at `GET /metrics`.

Implemented metrics:

- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`

`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.
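
The freshness rule can be written out as a small calculation. This sketch makes two assumptions that are not stated above: a backup exactly at the age limit still counts as healthy, and a PVC with no successful backup reports `0`.

```python
from datetime import datetime, timezone


def backup_age_hours(last_success, now):
    """Age of the most recent successful backup, in hours."""
    return (now - last_success).total_seconds() / 3600.0


def backup_health(last_success, now, max_age_hours=24.0):
    """Return 1 if the last successful backup is within max_age_hours, else 0."""
    if last_success is None:
        # Assumption: no successful backup yet counts as unhealthy.
        return 0
    return 1 if backup_age_hours(last_success, now) <= max_age_hours else 0


# Timestamps taken from the inventory example in this README.
now = datetime(2026, 4, 12, 15, 30, tzinfo=timezone.utc)
last = datetime(2026, 4, 12, 14, 55, tzinfo=timezone.utc)

age = round(backup_age_hours(last, now), 2)  # 35 minutes, about 0.58 hours
health = backup_health(last, now)            # well inside the 24-hour default
```

With the default `SOTERIA_BACKUP_MAX_AGE_HOURS` of `24`, the 35-minute-old backup from the inventory example is healthy, which matches the `"healthy": true` field shown there.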

## Configuration

Environment variables:

- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
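
Several of these variables hold comma-separated lists. As an illustration of the `key=value` format used by `SOTERIA_JOB_NODE_SELECTOR`, the sketch below parses such a value into a mapping. The function name, the whitespace handling, and the decision to reject malformed entries are assumptions, not documented behavior.

```python
def parse_node_selector(raw):
    """Parse 'k1=v1,k2=v2' into a dict, rejecting entries without a key or '='."""
    selector = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            # Tolerate empty segments such as a trailing comma.
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"invalid node selector entry: {item!r}")
        selector[key.strip()] = value.strip()
    return selector


# Example value; the labels are illustrative.
selector = parse_node_selector("kubernetes.io/arch=amd64, disk=ssd")
```

Failing loudly on a malformed entry is usually preferable for scheduling constraints, because a silently dropped selector could place a Job on an unintended node.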

## Secrets

When using the restic driver, create a Secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME` to the name of a different Secret. Required keys:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`

For each job, the service copies this Secret into the target namespace and attaches an owner reference so that the copy is cleaned up with the Job.

A template is in `deploy/secret-example.yaml`. Do not commit real credentials.

## Deployment

The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

```sh
kubectl apply -k deploy
```

The example Service is annotated for Prometheus scraping of `/metrics`.

## Notes

- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.