# soteria

Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:

- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A simple built-in UI suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

## Endpoints

Public endpoints:

- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:

- `GET /` UI console
- `GET /v1/whoami`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=&pvc=`
- `POST /v1/backup`
- `POST /v1/backup/namespace`
- `POST /v1/restores`
- `POST /v1/restores/namespace`
- `POST /v1/restore-test` legacy alias for `/v1/restores`
- `GET /v1/policies`
- `POST /v1/policies`
- `DELETE /v1/policies/`

## API examples

### POST /v1/backup

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```

Longhorn response:

```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```

### GET /v1/inventory

Response shape:

```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```

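
A client consuming the inventory response might walk it roughly as follows. This is a minimal sketch, not part of Soteria: the helper names `parse_ts`, `backup_age_hours`, and `unhealthy_pvcs` are hypothetical, and it assumes `last_backup_age_hours` is measured relative to `generated_at`, which matches the sample values but is not confirmed here.

```python
from datetime import datetime


def parse_ts(value):
    """Parse a UTC timestamp such as 2026-04-12T15:30:00Z."""
    # fromisoformat() in older Python versions rejects a trailing "Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def backup_age_hours(generated_at, last_backup_at):
    """Hours between the inventory snapshot and the last backup (2 decimals)."""
    # Assumption: age is relative to generated_at, not wall-clock time.
    delta = parse_ts(generated_at) - parse_ts(last_backup_at)
    return round(delta.total_seconds() / 3600.0, 2)


def unhealthy_pvcs(inventory):
    """Return (namespace, pvc, health_reason) for every PVC not marked healthy."""
    result = []
    for ns in inventory.get("namespaces", []):
        for pvc in ns.get("pvcs", []):
            if not pvc.get("healthy", False):
                result.append(
                    (pvc["namespace"], pvc["pvc"], pvc.get("health_reason", ""))
                )
    return result


# Trimmed copy of the sample response above.
inventory = {
    "generated_at": "2026-04-12T15:30:00Z",
    "namespaces": [
        {
            "name": "ai",
            "pvcs": [
                {
                    "namespace": "ai",
                    "pvc": "llm-cache",
                    "last_backup_at": "2026-04-12T14:55:00Z",
                    "healthy": True,
                    "health_reason": "fresh",
                }
            ],
        }
    ],
}

print(unhealthy_pvcs(inventory))  # prints []
pvc = inventory["namespaces"][0]["pvcs"][0]
print(backup_age_hours(inventory["generated_at"], pvc["last_backup_at"]))  # prints 0.58
```
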
### GET /v1/backups

```text
/v1/backups?namespace=ai&pvc=llm-cache
```

Returns the resolved volume name and backup records so the UI or automation can select a restore source.

### POST /v1/restores

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```

Notes:

- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.

### POST /v1/backup/namespace

```json
{
  "namespace": "ai",
  "dry_run": false
}
```

Runs a backup for every currently bound PVC in the namespace and returns a per-PVC result list.

### POST /v1/restores/namespace

```json
{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}
```

Runs restore planning or execution for every bound PVC in the source namespace. `snapshot` is optional; a blank value selects the latest completed backup for each PVC.

### Policy API

Create or update a policy with `POST /v1/policies`:

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}
```

- Leave `pvc` empty to target all PVCs in that namespace.
- Policies are stored in the secret named by `SOTERIA_POLICY_SECRET_NAME`, under the key `policies.json`.

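
For automation, a policy request could be assembled along these lines. This is a sketch under stated assumptions: the helper names, the base URL `http://soteria.example.internal`, and the client-side validation rules are hypothetical, and sending the token as an `Authorization: Bearer` header is the conventional form rather than something confirmed here. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_policy(namespace, interval_hours, pvc="", enabled=True):
    """Build a policy body; an empty pvc targets all PVCs in the namespace."""
    # Assumption: these checks mirror sensible server-side rules; the real
    # service may validate differently.
    if not namespace:
        raise ValueError("namespace is required")
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return {
        "namespace": namespace,
        "pvc": pvc,
        "interval_hours": interval_hours,
        "enabled": enabled,
    }


def policy_request(base_url, policy, bearer_token=None):
    """Prepare (but do not send) a POST /v1/policies request."""
    headers = {"Content-Type": "application/json"}
    if bearer_token:
        # Assumption: bearer tokens are presented in the Authorization header.
        headers["Authorization"] = "Bearer " + bearer_token
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/policies",
        data=json.dumps(policy).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical base URL; replace with the address of your deployment.
req = policy_request(
    "http://soteria.example.internal",
    build_policy("ai", 6),  # namespace-wide policy, every 6 hours
)
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen(req)` would submit it; the response format for the policy endpoints is not shown in this README.
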
## Authentication and authorization

When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:

- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)

Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.

Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.

## Prometheus metrics

Soteria exports Prometheus-format metrics at `GET /metrics`.

Implemented metrics:

- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`

`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.

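
The `pvc_backup_health` rule can be illustrated with a small function. This is a sketch of the stated rule, not Soteria's code: the function name is hypothetical, and two details are assumptions, namely that a PVC with no successful backup reports `0` and that an age exactly equal to the limit still counts as healthy.

```python
from typing import Optional


def backup_health(
    last_success_ts: Optional[float],
    now_ts: float,
    max_age_hours: float = 24.0,  # documented default of SOTERIA_BACKUP_MAX_AGE_HOURS
) -> int:
    """Return 1 if the last successful backup is recent enough, else 0."""
    if last_success_ts is None:
        # Assumption: no successful backup yet is treated as unhealthy.
        return 0
    age_hours = (now_ts - last_success_ts) / 3600.0
    # Assumption: the boundary is inclusive.
    return 1 if age_hours <= max_age_hours else 0


now = 1_000_000.0
print(backup_health(now - 2 * 3600, now))   # 2 hours old
print(backup_health(now - 30 * 3600, now))  # 30 hours old
print(backup_health(None, now))             # never backed up
```

With the default limit of 24 hours, the three calls print `1`, `0`, and `0` respectively.
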
## Configuration

Environment variables:

- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`

## Secrets

Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver.

Required keys:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`

The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.

A template is in `deploy/secret-example.yaml`. Do not commit real credentials.

## Deployment

The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

```sh
kubectl apply -k deploy
```

The example Service is annotated for Prometheus scraping of `/metrics`.

## Notes

- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Scheduled policy execution currently applies to the Longhorn driver.
- Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.