# soteria

Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:

- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

## Endpoints

Public endpoints:

- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:

- `GET /` UI console
- `GET /v1/whoami`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=<ns>&pvc=<name>`
- `POST /v1/backup`
- `POST /v1/backup/namespace`
- `POST /v1/restores`
- `POST /v1/restores/namespace`
- `POST /v1/restore-test` legacy alias for `/v1/restores`
- `GET /v1/policies`
- `POST /v1/policies`
- `DELETE /v1/policies/<policy-id>`
- `GET /v1/b2`

## API examples

### POST /v1/backup

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```

Longhorn response:

```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```

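A client could assemble the backup request roughly as follows. This is a minimal sketch, not part of Soteria: `BASE_URL` and `build_backup_request` are hypothetical names, and the `Authorization: Bearer` header is an assumption about how tokens from `SOTERIA_AUTH_BEARER_TOKENS` are presented.

```python
import json
import urllib.request

# Assumption: address of the Soteria Service; substitute your own URL.
BASE_URL = "http://soteria.soteria.svc:8080"


def build_backup_request(namespace, pvc, tags=None, dry_run=True, token=None):
    """Build (but do not send) a POST /v1/backup request."""
    body = {
        "namespace": namespace,
        "pvc": pvc,
        "tags": tags or [],
        "dry_run": dry_run,  # defaults to a dry run so a test call changes nothing
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: standard bearer scheme for machine-to-machine access.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/v1/backup",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_backup_request("ai", "llm-cache", tags=["namespace=ai", "service=llm"])
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without a cluster.
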
### GET /v1/inventory

Response shape:

```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```

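Automation that consumes this response might filter it for PVCs that need attention. The sketch below assumes only the fields shown in the response shape; `unhealthy_pvcs` is a hypothetical helper, and the `"stale"` reason in the sample data is invented for illustration.

```python
def unhealthy_pvcs(inventory):
    """Return (namespace, pvc, health_reason) for every PVC not marked healthy."""
    found = []
    for ns in inventory.get("namespaces", []):
        for entry in ns.get("pvcs", []):
            if not entry.get("healthy", False):
                found.append(
                    (entry["namespace"], entry["pvc"], entry.get("health_reason", ""))
                )
    return found


# Sample data in the documented shape; the second PVC and its reason are made up.
sample_inventory = {
    "generated_at": "2026-04-12T15:30:00Z",
    "namespaces": [
        {
            "name": "ai",
            "pvcs": [
                {"namespace": "ai", "pvc": "llm-cache", "healthy": True,
                 "health_reason": "fresh"},
                {"namespace": "ai", "pvc": "scratch", "healthy": False,
                 "health_reason": "stale"},
            ],
        }
    ],
}

problems = unhealthy_pvcs(sample_inventory)
```
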
### GET /v1/backups

```text
/v1/backups?namespace=ai&pvc=llm-cache
```

Returns the resolved volume name and backup records so the UI or automation can select a restore source.

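When building this URL in a script, it is safer to let a library encode the query parameters than to concatenate strings. A small sketch, where `backups_url` and the example host are hypothetical:

```python
from urllib.parse import urlencode


def backups_url(base_url, namespace, pvc):
    """Build the GET /v1/backups URL with properly encoded query parameters."""
    query = urlencode({"namespace": namespace, "pvc": pvc})
    return f"{base_url.rstrip('/')}/v1/backups?{query}"


url = backups_url("http://soteria.example/", "ai", "llm-cache")
```
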
### POST /v1/restores

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```

Notes:

- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.

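A client can apply the first three notes before sending a request, which gives earlier and clearer errors than a round trip. The sketch below mirrors those documented rules on the client side only; `normalize_restore_request` is a hypothetical helper and does not describe how the server validates requests.

```python
def normalize_restore_request(request):
    """Check required fields and fill the documented default for target_namespace."""
    out = dict(request)  # copy so the caller's dict is left untouched
    for field in ("namespace", "pvc", "target_pvc"):
        if not out.get(field):
            raise ValueError(f"{field} is required")
    if not out.get("target_namespace"):
        # Documented behaviour: target_namespace defaults to the source namespace.
        out["target_namespace"] = out["namespace"]
    return out


restore = normalize_restore_request(
    {
        "namespace": "ai",
        "pvc": "llm-cache",
        "snapshot": "latest",
        "target_pvc": "restore-llm-cache",
    }
)
```

The conflict check and cleanup on failure happen inside Soteria and are not reproduced here.
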
### POST /v1/backup/namespace

```json
{
  "namespace": "ai",
  "dry_run": false
}
```

Runs backup for every currently bound PVC in the namespace and returns a per-PVC result list.

### POST /v1/restores/namespace

```json
{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}
```

Runs restore planning/execution for every bound PVC in the source namespace. `snapshot` is optional; a blank value selects the latest completed backup for each PVC.

### Policy API

Create or update a policy with `POST /v1/policies`:

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}
```

- Leave `pvc` empty to target all PVCs in that namespace.
- Policies are stored in secret `SOTERIA_POLICY_SECRET_NAME` under key `policies.json`.

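To make `interval_hours` concrete, here is one plausible reading of when a policy would trigger a backup. How Soteria actually evaluates policies is not described in this README, so treat `policy_is_due` as an illustrative assumption rather than the service's logic.

```python
from datetime import datetime, timedelta, timezone


def policy_is_due(policy, last_backup_at, now):
    """Assumed rule: enabled, and no backup yet or the interval has elapsed."""
    if not policy.get("enabled", False):
        return False
    if last_backup_at is None:
        return True  # never backed up, so a backup is due immediately
    return now - last_backup_at >= timedelta(hours=policy["interval_hours"])


policy = {"namespace": "ai", "pvc": "llm-cache", "interval_hours": 6, "enabled": True}
now = datetime(2026, 4, 12, 15, 30, tzinfo=timezone.utc)
due = policy_is_due(policy, now - timedelta(hours=7), now)
not_due = policy_is_due(policy, now - timedelta(hours=2), now)
```
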
### GET /v1/b2

Returns B2 account/bucket consumption based on S3-compatible object scans.

```json
{
  "enabled": true,
  "available": true,
  "endpoint": "https://s3.us-west-004.backblazeb2.com",
  "region": "us-west-004",
  "scanned_at": "2026-04-12T16:00:00Z",
  "scan_duration_ms": 824,
  "total_objects": 1324,
  "total_bytes": 18407542931,
  "recent_objects_24h": 18,
  "recent_bytes_24h": 12245812,
  "buckets": [
    {
      "name": "atlas-backups",
      "object_count": 1240,
      "total_bytes": 18288473811,
      "recent_objects_24h": 12,
      "recent_bytes_24h": 8542198,
      "last_modified_at": "2026-04-12T15:43:19Z"
    }
  ]
}
```

Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.

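A script consuming this response might report how much of the account total each bucket holds. The sketch below uses only fields from the example response; `bucket_shares` is a hypothetical helper.

```python
def bucket_shares(report):
    """Map bucket name to its fraction of the account's total_bytes."""
    total = report.get("total_bytes", 0)
    if not report.get("available", False) or total <= 0:
        return {}  # no usable scan data
    return {b["name"]: b["total_bytes"] / total for b in report.get("buckets", [])}


report = {
    "enabled": True,
    "available": True,
    "total_bytes": 18407542931,
    "buckets": [{"name": "atlas-backups", "total_bytes": 18288473811}],
}
shares = bucket_shares(report)
```
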
## Authentication and authorization

When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:

- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)

Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.

Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.

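The group comparison described above can be sketched as follows. This illustrates the documented normalization (split on commas or semicolons, strip leading `/`); it is not Soteria's source code, and the case-sensitive comparison is an assumption.

```python
import re


def normalize_groups(raw):
    """Split a comma- or semicolon-separated list and strip leading slashes."""
    groups = set()
    for part in re.split(r"[;,]", raw or ""):
        name = part.strip().lstrip("/")
        if name:
            groups.add(name)
    return groups


def is_allowed(header_groups, allowed_groups):
    """True when the caller shares at least one group with the allowlist."""
    # Assumption: names are compared case-sensitively.
    return bool(normalize_groups(header_groups) & normalize_groups(allowed_groups))


allowed = is_allowed("/maintenance;dev", "admin,maintenance")
denied = is_allowed("dev", "admin,maintenance")
```
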
## Prometheus metrics

Soteria exports Prometheus-format metrics at `GET /metrics`.

Implemented metrics:

- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`
- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
- `soteria_b2_scan_success`
- `soteria_b2_scan_timestamp_seconds`
- `soteria_b2_scan_duration_seconds`
- `soteria_b2_account_objects`
- `soteria_b2_account_bytes`
- `soteria_b2_account_recent_objects_24h`
- `soteria_b2_account_recent_bytes_24h`
- `soteria_b2_bucket_objects{bucket}`
- `soteria_b2_bucket_bytes{bucket}`
- `soteria_b2_bucket_recent_objects_24h{bucket}`
- `soteria_b2_bucket_recent_bytes_24h{bucket}`
- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`

`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.

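The `pvc_backup_health` rule can be expressed as a small function. This is a sketch of the rule as stated, with two assumptions labeled in the code: a PVC with no successful backup counts as unhealthy, and a backup exactly at the age limit still counts as healthy.

```python
def backup_health(last_success_ts, now_ts, max_age_hours=24.0):
    """Return 1 if the last successful backup is recent enough, else 0.

    Timestamps are Unix seconds; max_age_hours mirrors
    SOTERIA_BACKUP_MAX_AGE_HOURS (default 24).
    """
    if last_success_ts is None:
        return 0  # assumption: no successful backup means unhealthy
    age_hours = (now_ts - last_success_ts) / 3600.0
    # Assumption: the boundary is inclusive.
    return 1 if age_hours <= max_age_hours else 0


now_ts = 1_776_000_000
fresh = backup_health(now_ts - 2 * 3600, now_ts)
stale = backup_health(now_ts - 30 * 3600, now_ts)
```
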
## Configuration

Environment variables:

- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
- `SOTERIA_USAGE_SECRET_NAME` default `soteria-backup-usage` (stores persisted restic size estimates)
- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3.<region>.backblazeb2.com`)
- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`

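Several of these variables hold comma-separated lists. As an illustration of the `key=value` format used by `SOTERIA_JOB_NODE_SELECTOR`, the sketch below parses such a value into a dictionary. `parse_node_selector` is a hypothetical helper, and rejecting malformed entries is an assumption; the README does not say how Soteria handles them.

```python
def parse_node_selector(raw):
    """Parse 'k1=v1,k2=v2' into a dict, ignoring empty entries."""
    selector = {}
    for item in (raw or "").split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            # Assumption: entries without a key or '=' are treated as errors.
            raise ValueError(f"expected key=value, got {item!r}")
        selector[key.strip()] = value.strip()
    return selector


selector = parse_node_selector("kubernetes.io/arch=arm64, role=storage")
```
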
## Secrets

Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver. Required keys:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`

The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.

For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)

A template is in `deploy/secret-example.yaml`. Do not commit real credentials.

## Deployment

The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

```sh
kubectl apply -k deploy
```

The example Service is annotated for Prometheus scraping of `/metrics`.

## Notes

- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
- Scheduled backup policies apply to both Longhorn and restic drivers.
- Restic size telemetry is estimated from per-job upload summaries; with shared dedupe repositories those values are per-PVC attributions, not exact physical B2 ownership.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.