soteria
Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:
- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A simple built-in UI suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness telemetry for Grafana rollups.
For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.
Endpoints
Public endpoints:
- GET /healthz
- GET /readyz
- GET /metrics
Protected endpoints when SOTERIA_AUTH_REQUIRED=true:
- GET / (UI console)
- GET /v1/whoami
- GET /v1/inventory
- GET /v1/backups?namespace=<ns>&pvc=<name>
- POST /v1/backup
- POST /v1/backup/namespace
- POST /v1/restores
- POST /v1/restores/namespace
- POST /v1/restore-test (legacy alias for /v1/restores)
- GET /v1/policies
- POST /v1/policies
- DELETE /v1/policies/<policy-id>
API examples
POST /v1/backup
{
"namespace": "ai",
"pvc": "llm-cache",
"tags": ["namespace=ai", "service=llm"],
"dry_run": false
}
Longhorn response:
{
"driver": "longhorn",
"volume": "pvc-1234abcd",
"backup": "soteria-backup-ai-llm-cache-20260412-153000",
"namespace": "ai",
"requested_by": "brad",
"dry_run": false
}
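As a minimal client-side sketch, the request above could be built with the Python standard library as follows. The base URL and the helper name `build_backup_request` are hypothetical, and sending the token in an `Authorization: Bearer` header is an assumption about how Soteria reads bearer tokens. The sketch defaults to `dry_run=True` so that an accidental call does not start a backup.

```python
import json
import urllib.request


def build_backup_request(base_url, namespace, pvc, tags=(), dry_run=True, token=None):
    """Build (but do not send) a POST /v1/backup request."""
    body = {
        "namespace": namespace,
        "pvc": pvc,
        "tags": list(tags),
        "dry_run": dry_run,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: bearer tokens travel in the standard Authorization header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/backup",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical in-cluster address; replace with the real Service URL.
request = build_backup_request(
    "http://soteria.example.internal:8080",
    "ai",
    "llm-cache",
    tags=["namespace=ai", "service=llm"],
)
# To send it: urllib.request.urlopen(request, timeout=30)
```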
GET /v1/inventory
Response shape:
{
"generated_at": "2026-04-12T15:30:00Z",
"namespaces": [
{
"name": "ai",
"pvcs": [
{
"namespace": "ai",
"pvc": "llm-cache",
"volume": "pvc-1234abcd",
"storage_class": "longhorn",
"capacity": "50Gi",
"driver": "longhorn",
"last_backup_at": "2026-04-12T14:55:00Z",
"last_backup_age_hours": 0.58,
"backup_count": 14,
"healthy": true,
"health_reason": "fresh"
}
]
}
]
}
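Automation that consumes this response usually wants the PVCs that need attention. The sketch below walks the response shape shown above and collects every entry whose `healthy` field is false. The function name is hypothetical, and treating a missing `healthy` field as unhealthy is an assumption of this sketch.

```python
def unhealthy_pvcs(inventory):
    """Return (namespace, pvc, health_reason) for every PVC not marked healthy."""
    found = []
    for ns in inventory.get("namespaces", []):
        for entry in ns.get("pvcs", []):
            # Assumption: a missing "healthy" field is treated as unhealthy.
            if not entry.get("healthy", False):
                found.append(
                    (entry["namespace"], entry["pvc"], entry.get("health_reason", ""))
                )
    return found
```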
GET /v1/backups
/v1/backups?namespace=ai&pvc=llm-cache
Returns the resolved volume name and backup records so the UI or automation can select a restore source.
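When building this URL from code, it is safer to encode the query parameters than to concatenate strings. A small sketch, with a hypothetical helper name and base URL:

```python
from urllib.parse import urlencode


def backups_url(base_url, namespace, pvc):
    """Return the GET /v1/backups URL for one PVC, with encoded query parameters."""
    query = urlencode({"namespace": namespace, "pvc": pvc})
    return f"{base_url.rstrip('/')}/v1/backups?{query}"
```

The field names in the response body are not listed here, so the sketch stops at building the URL.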
POST /v1/restores
{
"namespace": "ai",
"pvc": "llm-cache",
"snapshot": "latest",
"target_namespace": "ai",
"target_pvc": "restore-llm-cache",
"dry_run": false
}
Notes:
- namespace and pvc identify the source PVC.
- target_pvc is required.
- target_namespace defaults to namespace.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide backup_url directly instead of snapshot.
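The rules in these notes can be mirrored on the client so that malformed requests fail before they reach the service. This is a sketch only: the helper name is hypothetical, rejecting requests that set both `snapshot` and `backup_url` is an assumption about how to read "instead of", and falling back to `"latest"` when neither is given simply follows the example above.

```python
def build_restore_payload(namespace, pvc, target_pvc, target_namespace=None,
                          snapshot=None, backup_url=None, dry_run=True):
    """Build a POST /v1/restores body that follows the documented field rules."""
    if not namespace or not pvc:
        raise ValueError("namespace and pvc identify the source PVC and are required")
    if not target_pvc:
        raise ValueError("target_pvc is required")
    if snapshot and backup_url:
        # Assumption: the two source selectors are mutually exclusive.
        raise ValueError("provide either snapshot or backup_url, not both")

    payload = {
        "namespace": namespace,
        "pvc": pvc,
        # target_namespace defaults to the source namespace.
        "target_namespace": target_namespace or namespace,
        "target_pvc": target_pvc,
        "dry_run": dry_run,
    }
    if backup_url:
        payload["backup_url"] = backup_url
    else:
        # Assumption: "latest" is used when no source is named, as in the example.
        payload["snapshot"] = snapshot or "latest"
    return payload
```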
POST /v1/backup/namespace
{
"namespace": "ai",
"dry_run": false
}
Runs a backup for every currently bound PVC in the namespace and returns a per-PVC result list.
POST /v1/restores/namespace
{
"namespace": "ai",
"target_namespace": "ai-restore",
"target_prefix": "restore-20260412-",
"snapshot": "",
"dry_run": true
}
Runs restore planning/execution for every bound PVC in the source namespace. snapshot is optional; when it is blank, the latest completed backup of each PVC is used.
Policy API
Create or update a policy:
POST /v1/policies
{
"namespace": "ai",
"pvc": "llm-cache",
"interval_hours": 6,
"enabled": true
}
- Leave pvc empty to target all PVCs in that namespace.
- Policies are stored in secret SOTERIA_POLICY_SECRET_NAME under key policies.json.
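To illustrate how such a policy could be evaluated, the sketch below checks whether a policy covers a given PVC and whether a backup is due. It is not Soteria's scheduler: both function names are hypothetical, and the rule that a backup is due once `interval_hours` have passed since the last backup (or when no backup exists) is an assumption about what the interval means.

```python
from datetime import datetime, timedelta, timezone


def policy_matches(policy, namespace, pvc):
    """True if an enabled policy covers the given PVC."""
    if not policy.get("enabled", False):
        return False
    if policy.get("namespace") != namespace:
        return False
    target = policy.get("pvc") or ""
    # An empty pvc targets every PVC in the namespace.
    return target == "" or target == pvc


def is_due(policy, last_backup_at, now):
    """True if no backup exists yet or interval_hours have elapsed (assumed rule)."""
    if last_backup_at is None:
        return True
    return now - last_backup_at >= timedelta(hours=policy["interval_hours"])
```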
Authentication and authorization
When SOTERIA_AUTH_REQUIRED=true, Soteria expects trusted auth headers from a fronting proxy such as oauth2-proxy:
- X-Auth-Request-User
- X-Auth-Request-Email
- X-Auth-Request-Groups
- X-Forwarded-User (fallback)
- X-Forwarded-Email (fallback)
- X-Forwarded-Groups (fallback)
Allowed groups are configured with SOTERIA_ALLOWED_GROUPS and compared after normalizing leading / prefixes, so both maintenance and /maintenance are accepted. Group lists may be comma- or semicolon-separated.
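The group comparison described above can be sketched as follows. The function names are hypothetical, and matching case-sensitively is an assumption, since case handling is not specified here.

```python
import re


def normalize_groups(raw):
    """Split a comma- or semicolon-separated list and strip leading '/' prefixes."""
    groups = set()
    for part in re.split(r"[,;]", raw or ""):
        name = part.strip().lstrip("/")
        if name:
            groups.add(name)
    return groups


def is_allowed(header_groups, allowed_groups):
    """True if the caller belongs to at least one allowed group."""
    # Assumption: comparison is case-sensitive.
    return bool(normalize_groups(header_groups) & normalize_groups(allowed_groups))
```

With `SOTERIA_ALLOWED_GROUPS=admin,maintenance`, a header value of `/maintenance;dev` is accepted and `dev` alone is not.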
Optional machine-to-machine access can be enabled with SOTERIA_AUTH_BEARER_TOKENS, which accepts a comma-separated list of bearer tokens.
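A server-side check against such a token list might look like the sketch below. It is illustrative rather than a description of Soteria's code: the function names are hypothetical, reading the token from an `Authorization: Bearer <token>` header is an assumption, and the constant-time comparison via `hmac.compare_digest` is a general precaution, not a documented behavior.

```python
import hmac


def parse_tokens(raw):
    """Split the comma-separated token list, dropping empty entries."""
    return [t.strip() for t in (raw or "").split(",") if t.strip()]


def bearer_token_allowed(authorization_header, configured):
    """True if the presented bearer token equals one of the configured tokens."""
    scheme, _, presented = (authorization_header or "").partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False
    matched = False
    for token in parse_tokens(configured):
        # Compare every token in constant time instead of returning early.
        if hmac.compare_digest(presented.encode("utf-8"), token.encode("utf-8")):
            matched = True
    return matched
```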
Prometheus metrics
Soteria exports Prometheus-format metrics at GET /metrics.
Implemented metrics:
- soteria_backup_requests_total{driver,result}
- soteria_restore_requests_total{driver,result}
- soteria_policy_backups_total{result}
- soteria_namespace_backup_requests_total{driver,result}
- soteria_namespace_restore_requests_total{driver,result}
- soteria_authz_denials_total{reason}
- soteria_inventory_refresh_failures_total
- soteria_inventory_refresh_timestamp_seconds
- pvc_backup_age_hours{namespace,pvc,volume,driver}
- pvc_backup_health{namespace,pvc,volume,driver}
- pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}
- pvc_backup_count{namespace,pvc,volume,driver}
pvc_backup_health is 1 when the most recent successful backup is within SOTERIA_BACKUP_MAX_AGE_HOURS, otherwise 0.
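The health rule can be expressed as a short calculation. In this sketch the function names are hypothetical, and two details are assumptions: the boundary is inclusive (an age exactly equal to the limit counts as healthy), and a PVC with no successful backup reports 0.

```python
from datetime import datetime, timezone


def backup_age_hours(last_success, now):
    """Age of the most recent successful backup, in hours."""
    return (now - last_success).total_seconds() / 3600.0


def backup_health(last_success, now, max_age_hours=24.0):
    """Return 1 if the last successful backup is within max_age_hours, else 0."""
    if last_success is None:
        # Assumption: a PVC that was never backed up is unhealthy.
        return 0
    return 1 if backup_age_hours(last_success, now) <= max_age_hours else 0
```

For the inventory example earlier in this document, a backup at 14:55 observed at 15:30 is 35 minutes old, which is about 0.58 hours and well inside the default limit of 24.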
Configuration
Environment variables:
- SOTERIA_BACKUP_DRIVER: default longhorn, allowed longhorn, restic
- SOTERIA_LONGHORN_URL: default http://longhorn-backend.longhorn-system.svc:9500
- SOTERIA_LONGHORN_BACKUP_MODE: default incremental, allowed incremental, full
- SOTERIA_RESTIC_REPOSITORY: required for restic driver
- SOTERIA_RESTIC_SECRET_NAME: default soteria-restic
- SOTERIA_SECRET_NAMESPACE: default service namespace
- SOTERIA_RESTIC_IMAGE: default restic/restic:0.16.4
- SOTERIA_RESTIC_BACKUP_ARGS: optional extra args for restic backup
- SOTERIA_RESTIC_FORGET_ARGS: optional extra args for restic forget
- SOTERIA_S3_ENDPOINT: optional S3-compatible endpoint
- SOTERIA_S3_REGION: optional region
- SOTERIA_JOB_TTL_SECONDS: default 86400
- SOTERIA_JOB_NODE_SELECTOR: optional comma-separated key=value list
- SOTERIA_JOB_SERVICE_ACCOUNT: optional ServiceAccount for restic Jobs
- SOTERIA_LISTEN_ADDR: default :8080
- SOTERIA_AUTH_REQUIRED: default false
- SOTERIA_ALLOWED_GROUPS: default admin,maintenance
- SOTERIA_AUTH_BEARER_TOKENS: optional comma-separated bearer tokens
- SOTERIA_BACKUP_MAX_AGE_HOURS: default 24
- SOTERIA_METRICS_REFRESH_SECONDS: default 300
- SOTERIA_POLICY_EVAL_SECONDS: default 300
- SOTERIA_POLICY_SECRET_NAME: default soteria-policies
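As an illustration of the comma-separated key=value format used by SOTERIA_JOB_NODE_SELECTOR, the sketch below parses such a value into a dictionary. The function name is hypothetical, and rejecting entries without an `=` is an assumption; the service may handle malformed entries differently.

```python
def parse_node_selector(raw):
    """Parse 'key=value,key2=value2' into a dict, ignoring empty entries."""
    selector = {}
    for item in (raw or "").split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            # Assumption: entries that are not key=value are treated as errors.
            raise ValueError(f"expected key=value, got {item!r}")
        selector[key.strip()] = value.strip()
    return selector
```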
Secrets
Create a secret named soteria-restic in the Soteria namespace, or set SOTERIA_RESTIC_SECRET_NAME, when using the restic driver. Required keys:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- RESTIC_PASSWORD
The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.
A template is in deploy/secret-example.yaml. Do not commit real credentials.
Deployment
The deploy/ folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.
Apply with:
kubectl apply -k deploy
The example Service is annotated for Prometheus scraping of /metrics.
Notes
- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Scheduled policy execution currently applies to the Longhorn driver.
- Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.