# soteria
Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:
- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.
## Endpoints
Public endpoints:
- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:
- `GET /` UI console
- `GET /v1/whoami`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=<ns>&pvc=<name>`
- `POST /v1/backup`
- `POST /v1/backup/namespace`
- `POST /v1/restores`
- `POST /v1/restores/namespace`
- `POST /v1/restore-test` legacy alias for `/v1/restores`
- `GET /v1/policies`
- `POST /v1/policies`
- `DELETE /v1/policies/<policy-id>`
- `GET /v1/b2`
## API examples
### POST /v1/backup
```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```
Longhorn response:
```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```
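As a rough illustration, a client could assemble this request as sketched below. The base URL and token are placeholders, and sending the token as a standard `Authorization: Bearer` header is an assumption about how `SOTERIA_AUTH_BEARER_TOKENS` is checked, not something this README confirms.

```python
import json
import urllib.request


def build_backup_request(base_url, namespace, pvc, token=None, tags=None, dry_run=False):
    """Build (but do not send) a POST /v1/backup request."""
    payload = {
        "namespace": namespace,
        "pvc": pvc,
        "tags": tags or [],
        "dry_run": dry_run,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: bearer tokens travel in the standard Authorization header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/backup",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical in-cluster address and token, for illustration only.
req = build_backup_request(
    "http://soteria.example.svc:8080",
    "ai",
    "llm-cache",
    token="example-token",
    tags=["namespace=ai", "service=llm"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The request can then be sent with `urllib.request.urlopen(req, timeout=30)`; setting `dry_run=True` first is a cautious way to check the payload.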
### GET /v1/inventory
Response shape:
```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```
### GET /v1/backups
```text
/v1/backups?namespace=ai&pvc=llm-cache
```
Returns the resolved volume name and backup records so the UI or automation can select a restore source.
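When building this URL from automation, encoding the query parameters avoids stray spaces or unescaped characters. A minimal sketch, with a placeholder base URL:

```python
from urllib.parse import urlencode


def backups_url(base_url, namespace, pvc):
    """Return the GET /v1/backups URL for one PVC, with encoded query parameters."""
    query = urlencode({"namespace": namespace, "pvc": pvc})
    return f"{base_url.rstrip('/')}/v1/backups?{query}"


# Hypothetical in-cluster address, for illustration only.
url = backups_url("http://soteria.example.svc:8080", "ai", "llm-cache")
print(url)
```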
### POST /v1/restores
```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```
Notes:
- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.
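The request rules in these notes can be sketched as a small client-side check. This is illustrative only and is not Soteria's server code; `existing_pvcs` is a hypothetical stand-in for a real cluster lookup.

```python
def normalize_restore_request(req, existing_pvcs=frozenset()):
    """Apply the documented restore rules to a request dict.

    existing_pvcs is a hypothetical set of (namespace, pvc_name) tuples
    standing in for a lookup against the cluster.
    """
    target_pvc = (req.get("target_pvc") or "").strip()
    if not target_pvc:
        raise ValueError("target_pvc is required")

    # target_namespace falls back to the source namespace when omitted or blank.
    target_namespace = (req.get("target_namespace") or "").strip() or req.get("namespace", "")

    # Refuse to overwrite an existing target PVC.
    if (target_namespace, target_pvc) in existing_pvcs:
        raise ValueError(f"target PVC {target_namespace}/{target_pvc} already exists")

    return {**req, "target_namespace": target_namespace, "target_pvc": target_pvc}


plan = normalize_restore_request(
    {"namespace": "ai", "pvc": "llm-cache", "snapshot": "latest", "target_pvc": "restore-llm-cache"}
)
print(plan["target_namespace"])
```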
### POST /v1/backup/namespace
```json
{
  "namespace": "ai",
  "dry_run": false
}
```
Runs backup for every currently bound PVC in the namespace and returns a per-PVC result list.
### POST /v1/restores/namespace
```json
{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}
```
Runs restore planning or execution for every bound PVC in the source namespace. `snapshot` is optional; when it is blank, Soteria uses the latest completed backup for each PVC.
### Policy API
Create or update a policy with `POST /v1/policies`:
```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}
```
- Leave `pvc` empty to target all PVCs in that namespace.
- Policies are stored in the secret named by `SOTERIA_POLICY_SECRET_NAME`, under the key `policies.json`.
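To make `interval_hours` concrete, the sketch below shows one plausible way a scheduler could decide whether a policy is due. The comparison against the last backup time is an assumption for illustration; the README does not describe the actual evaluation logic.

```python
from datetime import datetime, timedelta, timezone


def policy_is_due(policy, last_backup_at, now):
    """Return True if an enabled policy's interval has elapsed since the last backup.

    Assumption: a PVC with no previous backup is treated as immediately due.
    """
    if not policy.get("enabled", False):
        return False
    if last_backup_at is None:
        return True
    return now - last_backup_at >= timedelta(hours=policy["interval_hours"])


policy = {"namespace": "ai", "pvc": "llm-cache", "interval_hours": 6, "enabled": True}
now = datetime(2026, 4, 12, 15, 30, tzinfo=timezone.utc)
print(policy_is_due(policy, datetime(2026, 4, 12, 14, 55, tzinfo=timezone.utc), now))
print(policy_is_due(policy, datetime(2026, 4, 12, 9, 30, tzinfo=timezone.utc), now))
```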
### GET /v1/b2
Returns B2 account/bucket consumption based on S3-compatible object scans.
```json
{
  "enabled": true,
  "available": true,
  "endpoint": "https://s3.us-west-004.backblazeb2.com",
  "region": "us-west-004",
  "scanned_at": "2026-04-12T16:00:00Z",
  "scan_duration_ms": 824,
  "total_objects": 1324,
  "total_bytes": 18407542931,
  "recent_objects_24h": 18,
  "recent_bytes_24h": 12245812,
  "buckets": [
    {
      "name": "atlas-backups",
      "object_count": 1240,
      "total_bytes": 18288473811,
      "recent_objects_24h": 12,
      "recent_bytes_24h": 8542198,
      "last_modified_at": "2026-04-12T15:43:19Z"
    }
  ]
}
```
Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.
## Authentication and authorization
When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:
- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)

Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.

Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.
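The group comparison described above can be sketched as follows. The function names are hypothetical, and case-sensitive matching is an assumption; only the separator handling and the leading `/` normalization come from this README.

```python
import re


def normalize_groups(raw):
    """Split a comma- or semicolon-separated group list and drop leading '/' prefixes."""
    groups = set()
    for part in re.split(r"[,;]", raw):
        name = part.strip().lstrip("/")
        if name:
            groups.add(name)
    return groups


def is_allowed(header_groups, allowed_groups):
    """Return True if any group from the proxy header matches the allowlist.

    Assumption: matching is case-sensitive.
    """
    return bool(normalize_groups(header_groups) & normalize_groups(allowed_groups))


print(is_allowed("/maintenance;dev", "admin,maintenance"))
print(is_allowed("dev", "admin,maintenance"))
```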
## Prometheus metrics
Soteria exports Prometheus-format metrics at `GET /metrics`.
Implemented metrics:
- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`
- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
- `soteria_b2_scan_success`
- `soteria_b2_scan_timestamp_seconds`
- `soteria_b2_scan_duration_seconds`
- `soteria_b2_account_objects`
- `soteria_b2_account_bytes`
- `soteria_b2_account_recent_objects_24h`
- `soteria_b2_account_recent_bytes_24h`
- `soteria_b2_bucket_objects{bucket}`
- `soteria_b2_bucket_bytes{bucket}`
- `soteria_b2_bucket_recent_objects_24h{bucket}`
- `soteria_b2_bucket_recent_bytes_24h{bucket}`
- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`

`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.
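The health rule reduces to a single comparison. In this sketch, treating an age exactly equal to the limit as healthy, and a PVC with no successful backup as unhealthy, are assumptions.

```python
def backup_health(last_success_age_hours, max_age_hours=24.0):
    """Return 1 if the latest successful backup is within max_age_hours, else 0.

    Assumptions: the boundary is inclusive, and None (no successful backup yet)
    counts as unhealthy.
    """
    if last_success_age_hours is None:
        return 0
    return 1 if last_success_age_hours <= max_age_hours else 0


print(backup_health(0.58))
print(backup_health(30.0))
```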
## Configuration
Environment variables:
- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3.<region>.backblazeb2.com`)
- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`
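Two of these values have a small amount of structure: `SOTERIA_JOB_NODE_SELECTOR` is a comma-separated `key=value` list, and the B2 region can be inferred from a Backblaze-style endpoint. The helpers below are a hedged sketch of how such values might be parsed; the function names and the exact hostname pattern are assumptions, not Soteria's implementation.

```python
import re
from urllib.parse import urlparse


def parse_node_selector(raw):
    """Parse a comma-separated key=value list into a dict."""
    selector = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        selector[key.strip()] = value.strip()
    return selector


def infer_b2_region(endpoint):
    """Extract <region> from https://s3.<region>.backblazeb2.com, or return None."""
    host = urlparse(endpoint).hostname or ""
    match = re.fullmatch(r"s3\.([a-z0-9-]+)\.backblazeb2\.com", host)
    return match.group(1) if match else None


print(parse_node_selector("kubernetes.io/arch=amd64, disk=ssd"))
print(infer_b2_region("https://s3.us-west-004.backblazeb2.com"))
```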
## Secrets
Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver. Required keys:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`

The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.

For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)

A template is in `deploy/secret-example.yaml`. Do not commit real credentials.
## Deployment
The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:
```sh
kubectl apply -k deploy
```
The example Service is annotated for Prometheus scraping of `/metrics`.
## Notes
- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
- Scheduled policy execution currently applies only to the Longhorn driver.
- Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.