# soteria

Soteria is an in-cluster service for PVC backup and restore operations.

The current production baseline focuses on Longhorn-backed PVCs and provides:

- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

## Endpoints

Public endpoints:

- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:

- `GET /` UI console
- `GET /v1/whoami`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=&pvc=`
- `POST /v1/backup`
- `POST /v1/backup/namespace`
- `POST /v1/restores`
- `POST /v1/restores/namespace`
- `POST /v1/restore-test` legacy alias for `/v1/restores`
- `GET /v1/policies`
- `POST /v1/policies`
- `DELETE /v1/policies/`
- `GET /v1/b2`

## API examples

### POST /v1/backup

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```

Longhorn response:

```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```

### GET /v1/inventory

Response shape:

```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```

### GET /v1/backups

```text
/v1/backups?namespace=ai&pvc=llm-cache
```

Returns the resolved volume name and backup records so the UI or automation can select a restore source.

### POST /v1/restores

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```

Notes:

- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.

### POST /v1/backup/namespace

```json
{
  "namespace": "ai",
  "dry_run": false
}
```

Runs backup for every currently bound PVC in the namespace and returns a per-PVC result list.

### POST /v1/restores/namespace

```json
{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}
```

Runs restore planning/execution for every bound PVC in the source namespace. `snapshot` is optional and blank means latest completed backup per PVC.

### Policy API

Create or update a policy:

```json
POST /v1/policies
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}
```

- Leave `pvc` empty to target all PVCs in that namespace.
- Policies are stored in secret `SOTERIA_POLICY_SECRET_NAME` under key `policies.json`.

### GET /v1/b2

Returns B2 account/bucket consumption based on S3-compatible object scans.

```json
{
  "enabled": true,
  "available": true,
  "endpoint": "https://s3.us-west-004.backblazeb2.com",
  "region": "us-west-004",
  "scanned_at": "2026-04-12T16:00:00Z",
  "scan_duration_ms": 824,
  "total_objects": 1324,
  "total_bytes": 18407542931,
  "recent_objects_24h": 18,
  "recent_bytes_24h": 12245812,
  "buckets": [
    {
      "name": "atlas-backups",
      "object_count": 1240,
      "total_bytes": 18288473811,
      "recent_objects_24h": 12,
      "recent_bytes_24h": 8542198,
      "last_modified_at": "2026-04-12T15:43:19Z"
    }
  ]
}
```

Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.

## Authentication and authorization

When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:

- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)

Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.

Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.

## Prometheus metrics

Soteria exports Prometheus-format metrics at `GET /metrics`.


Implemented metrics:

- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`
- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
- `soteria_b2_scan_success`
- `soteria_b2_scan_timestamp_seconds`
- `soteria_b2_scan_duration_seconds`
- `soteria_b2_account_objects`
- `soteria_b2_account_bytes`
- `soteria_b2_account_recent_objects_24h`
- `soteria_b2_account_recent_bytes_24h`
- `soteria_b2_bucket_objects{bucket}`
- `soteria_b2_bucket_bytes{bucket}`
- `soteria_b2_bucket_recent_objects_24h{bucket}`
- `soteria_b2_bucket_recent_bytes_24h{bucket}`
- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`

`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.

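The freshness rule behind `pvc_backup_health` can be sketched in a few lines of Python. This is an illustration only, not Soteria's implementation: the function name `backup_health` is hypothetical, and two details are assumptions not stated above, namely that a PVC with no successful backup reports `0` and that a backup exactly at the age limit still counts as fresh.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_health(
    last_success: Optional[datetime],
    now: datetime,
    max_age_hours: float = 24.0,  # mirrors the SOTERIA_BACKUP_MAX_AGE_HOURS default
) -> int:
    """Return 1 if the last successful backup is recent enough, else 0."""
    if last_success is None:
        # Assumption: no successful backup yet is treated as unhealthy.
        return 0
    age = now - last_success
    # Assumption: the boundary is inclusive (age == limit is still healthy).
    return 1 if age <= timedelta(hours=max_age_hours) else 0


# Example using the timestamps from the inventory response shape.
now = datetime(2026, 4, 12, 15, 30, tzinfo=timezone.utc)
fresh = datetime(2026, 4, 12, 14, 55, tzinfo=timezone.utc)
stale = now - timedelta(hours=30)
print(backup_health(fresh, now), backup_health(stale, now))
```

With the default limit of 24 hours, the 35-minute-old backup reports healthy and the 30-hour-old one does not.
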
## Configuration

Environment variables:

- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3..backblazeb2.com`)
- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`

## Secrets

Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver.

Required keys:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`

The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.

For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)

A template is in `deploy/secret-example.yaml`. Do not commit real credentials.

## Deployment

The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

```sh
kubectl apply -k deploy
```

The example Service is annotated for Prometheus scraping of `/metrics`.

## Notes

- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
- Scheduled policy execution currently applies to the Longhorn driver only.
- Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.
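
Because access depends on those proxy-injected headers, it may help to see how the group matching described under "Authentication and authorization" could work. The Python sketch below is illustrative only and not Soteria's code: `normalize_groups` and `is_allowed` are hypothetical names, and whitespace trimming and case-sensitive comparison are assumptions that the description above does not confirm.

```python
import re
from typing import Set


def normalize_groups(raw: str) -> Set[str]:
    """Split a comma- or semicolon-separated group list and strip leading '/'."""
    groups = set()
    for part in re.split(r"[,;]", raw):
        # Assumption: surrounding whitespace is ignored.
        name = part.strip().lstrip("/")
        if name:
            groups.add(name)
    return groups


def is_allowed(header_groups: str, allowed_config: str = "admin,maintenance") -> bool:
    """Return True if any group from the header matches an allowed group.

    header_groups stands in for the X-Auth-Request-Groups (or fallback
    X-Forwarded-Groups) value; allowed_config for SOTERIA_ALLOWED_GROUPS.
    Assumption: comparison is case-sensitive.
    """
    return bool(normalize_groups(header_groups) & normalize_groups(allowed_config))


print(is_allowed("/maintenance; users"))
print(is_allowed("users,dev"))
```

Normalizing both sides with the same function means `maintenance` and `/maintenance` match regardless of which form appears in the header or in the configuration.
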