soteria

Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:

  • Namespace-grouped PVC inventory for backup and restore selection.
  • On-demand backup creation for Longhorn volumes.
  • Namespace-wide backup and restore batch execution.
  • Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
  • Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
  • A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
  • Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

Endpoints

Public endpoints:

  • GET /healthz
  • GET /readyz
  • GET /metrics

Protected endpoints when SOTERIA_AUTH_REQUIRED=true:

  • GET / (UI console)
  • GET /v1/whoami
  • GET /v1/inventory
  • GET /v1/backups?namespace=<ns>&pvc=<name>
  • POST /v1/backup
  • POST /v1/backup/namespace
  • POST /v1/restores
  • POST /v1/restores/namespace
  • POST /v1/restore-test (legacy alias for /v1/restores)
  • GET /v1/policies
  • POST /v1/policies
  • DELETE /v1/policies/<policy-id>
  • GET /v1/b2

API examples

POST /v1/backup

{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}

Longhorn response:

{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
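
A minimal Go client sketch for this call, assuming the request and response fields are exactly those shown above. The base URL http://soteria.example is a placeholder, and the Authorization: Bearer header is an assumption about how tokens from SOTERIA_AUTH_BEARER_TOKENS are presented; neither is confirmed by this README.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// BackupRequest mirrors the POST /v1/backup body shown above.
type BackupRequest struct {
	Namespace string   `json:"namespace"`
	PVC       string   `json:"pvc"`
	Tags      []string `json:"tags,omitempty"`
	DryRun    bool     `json:"dry_run"`
}

// BackupResponse mirrors the Longhorn response shown above.
type BackupResponse struct {
	Driver      string `json:"driver"`
	Volume      string `json:"volume"`
	Backup      string `json:"backup"`
	Namespace   string `json:"namespace"`
	RequestedBy string `json:"requested_by"`
	DryRun      bool   `json:"dry_run"`
}

// newBackupHTTPRequest builds the POST /v1/backup request.
// token may be empty; the Bearer scheme is an assumption.
func newBackupHTTPRequest(baseURL, token string, in BackupRequest) (*http.Request, error) {
	body, err := json.Marshal(in)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/backup", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

// decodeBackupResponse parses a response body into BackupResponse.
func decodeBackupResponse(raw []byte) (BackupResponse, error) {
	var out BackupResponse
	err := json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	req, err := newBackupHTTPRequest("http://soteria.example", "", BackupRequest{
		Namespace: "ai",
		PVC:       "llm-cache",
		Tags:      []string{"namespace=ai", "service=llm"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())

	// Against a real deployment, send it with http.DefaultClient.Do(req)
	// and pass the response body to decodeBackupResponse.
	out, err := decodeBackupResponse([]byte(`{"driver":"longhorn","volume":"pvc-1234abcd"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Driver, out.Volume)
}
```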

GET /v1/inventory

Response shape:

{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
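
Automation can walk this shape to find PVCs that need attention. The sketch below decodes only the fields it uses and lists entries with healthy set to false. The type names and the "stale" reason value in the sample are illustrative assumptions; the only health_reason value shown in this README is "fresh".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PVCEntry holds the per-PVC fields this example reads; other fields are ignored.
type PVCEntry struct {
	Namespace    string  `json:"namespace"`
	PVC          string  `json:"pvc"`
	AgeHours     float64 `json:"last_backup_age_hours"`
	Healthy      bool    `json:"healthy"`
	HealthReason string  `json:"health_reason"`
}

// Inventory mirrors the top-level response shape shown above.
type Inventory struct {
	GeneratedAt string `json:"generated_at"`
	Namespaces  []struct {
		Name string     `json:"name"`
		PVCs []PVCEntry `json:"pvcs"`
	} `json:"namespaces"`
}

// unhealthyPVCs returns "namespace/pvc: reason" for every entry with healthy=false.
func unhealthyPVCs(raw []byte) ([]string, error) {
	var inv Inventory
	if err := json.Unmarshal(raw, &inv); err != nil {
		return nil, err
	}
	var out []string
	for _, ns := range inv.Namespaces {
		for _, p := range ns.PVCs {
			if !p.Healthy {
				out = append(out, p.Namespace+"/"+p.PVC+": "+p.HealthReason)
			}
		}
	}
	return out, nil
}

func main() {
	// Sample payload; "scratch" and "stale" are made-up values for illustration.
	raw := []byte(`{"namespaces":[{"name":"ai","pvcs":[
		{"namespace":"ai","pvc":"llm-cache","healthy":true,"health_reason":"fresh"},
		{"namespace":"ai","pvc":"scratch","healthy":false,"health_reason":"stale"}]}]}`)
	bad, err := unhealthyPVCs(raw)
	if err != nil {
		panic(err)
	}
	for _, line := range bad {
		fmt.Println(line)
	}
}
```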

GET /v1/backups

/v1/backups?namespace=ai&pvc=llm-cache

Returns the resolved volume name and backup records so the UI or automation can select a restore source.
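
When building this URL from code, it is safer to encode the query parameters than to concatenate strings. A small sketch, with http://soteria.example as a placeholder host and backupsURL as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// backupsURL returns base + /v1/backups with namespace and pvc query-escaped.
func backupsURL(base, namespace, pvc string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/v1/backups"
	q := url.Values{}
	q.Set("namespace", namespace)
	q.Set("pvc", pvc)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := backupsURL("http://soteria.example", "ai", "llm-cache")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```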

POST /v1/restores

{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}

Notes:

  • namespace and pvc identify the source PVC.
  • target_pvc is required.
  • target_namespace defaults to namespace.
  • Soteria refuses to overwrite an existing target PVC.
  • If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
  • You may provide backup_url directly instead of snapshot.
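
A client can apply the first three notes before sending the request. The sketch below is a client-side pre-flight check under that reading, not Soteria's own validation code; the type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// RestoreRequest mirrors the POST /v1/restores body, plus the optional backup_url.
type RestoreRequest struct {
	Namespace       string `json:"namespace"`
	PVC             string `json:"pvc"`
	Snapshot        string `json:"snapshot,omitempty"`
	BackupURL       string `json:"backup_url,omitempty"`
	TargetNamespace string `json:"target_namespace,omitempty"`
	TargetPVC       string `json:"target_pvc"`
	DryRun          bool   `json:"dry_run"`
}

// normalizeRestore rejects a missing target_pvc and fills in the
// documented default for target_namespace.
func normalizeRestore(r RestoreRequest) (RestoreRequest, error) {
	if r.TargetPVC == "" {
		return r, errors.New("target_pvc is required")
	}
	if r.TargetNamespace == "" {
		r.TargetNamespace = r.Namespace
	}
	return r, nil
}

func main() {
	r, err := normalizeRestore(RestoreRequest{
		Namespace: "ai",
		PVC:       "llm-cache",
		Snapshot:  "latest",
		TargetPVC: "restore-llm-cache",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(r.TargetNamespace, r.TargetPVC)
}
```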

POST /v1/backup/namespace

{
  "namespace": "ai",
  "dry_run": false
}

Runs a backup for every currently bound PVC in the namespace and returns a per-PVC result list.

POST /v1/restores/namespace

{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}

Runs restore planning or execution for every bound PVC in the source namespace. snapshot is optional; when it is blank, Soteria uses the latest completed backup for each PVC.

Policy API

Create or update a policy:

POST /v1/policies

{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}

  • Leave pvc empty to target all PVCs in that namespace.
  • Policies are stored in the Secret named by SOTERIA_POLICY_SECRET_NAME, under the key policies.json.
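
One way to read these fields: a policy matches a PVC when the namespace agrees and pvc is either empty or equal, and a backup is due once interval_hours have passed since the last one. The sketch below illustrates that reading only. The actual scheduler logic is not described in this README, and treating "no previous backup" as due is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Policy mirrors the POST /v1/policies body shown above.
type Policy struct {
	Namespace     string  `json:"namespace"`
	PVC           string  `json:"pvc"`
	IntervalHours float64 `json:"interval_hours"`
	Enabled       bool    `json:"enabled"`
}

// matches reports whether the policy covers the given PVC.
// An empty PVC field targets every PVC in the namespace.
func (p Policy) matches(namespace, pvc string) bool {
	return p.Namespace == namespace && (p.PVC == "" || p.PVC == pvc)
}

// due reports whether a new backup should run at time now.
// A zero lastBackup (no backup yet) counts as due; that is an assumption.
func (p Policy) due(lastBackup, now time.Time) bool {
	if !p.Enabled {
		return false
	}
	if lastBackup.IsZero() {
		return true
	}
	interval := time.Duration(p.IntervalHours * float64(time.Hour))
	return now.Sub(lastBackup) >= interval
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	p := Policy{Namespace: "ai", IntervalHours: 6, Enabled: true}
	last := mustTime("2026-04-12T09:00:00Z")
	now := mustTime("2026-04-12T15:30:00Z")
	fmt.Println(p.matches("ai", "llm-cache"), p.due(last, now))
}
```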

GET /v1/b2

Returns B2 account/bucket consumption based on S3-compatible object scans.

{
  "enabled": true,
  "available": true,
  "endpoint": "https://s3.us-west-004.backblazeb2.com",
  "region": "us-west-004",
  "scanned_at": "2026-04-12T16:00:00Z",
  "scan_duration_ms": 824,
  "total_objects": 1324,
  "total_bytes": 18407542931,
  "recent_objects_24h": 18,
  "recent_bytes_24h": 12245812,
  "buckets": [
    {
      "name": "atlas-backups",
      "object_count": 1240,
      "total_bytes": 18288473811,
      "recent_objects_24h": 12,
      "recent_bytes_24h": 8542198,
      "last_modified_at": "2026-04-12T15:43:19Z"
    }
  ]
}

Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.

Authentication and authorization

When SOTERIA_AUTH_REQUIRED=true, Soteria expects trusted auth headers from a fronting proxy such as oauth2-proxy:

  • X-Auth-Request-User
  • X-Auth-Request-Email
  • X-Auth-Request-Groups
  • X-Forwarded-User (fallback)
  • X-Forwarded-Email (fallback)
  • X-Forwarded-Groups (fallback)

Allowed groups are configured with SOTERIA_ALLOWED_GROUPS and compared after normalizing leading / prefixes, so both maintenance and /maintenance are accepted. Group lists may be comma- or semicolon-separated.
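
The matching rule can be sketched as follows. This is an illustration of the behavior described above, not the service's actual code: it splits on commas and semicolons, trims whitespace, strips leading slashes, and compares the results case-sensitively (case sensitivity is an assumption). The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGroups splits a comma- or semicolon-separated list and strips
// leading "/" characters so "/maintenance" and "maintenance" compare equal.
func parseGroups(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || r == ';'
	})
	var out []string
	for _, f := range fields {
		g := strings.TrimLeft(strings.TrimSpace(f), "/")
		if g != "" {
			out = append(out, g)
		}
	}
	return out
}

// groupAllowed reports whether any group in userGroups appears in allowedGroups.
func groupAllowed(userGroups, allowedGroups string) bool {
	allow := make(map[string]bool)
	for _, g := range parseGroups(allowedGroups) {
		allow[g] = true
	}
	for _, g := range parseGroups(userGroups) {
		if allow[g] {
			return true
		}
	}
	return false
}

func main() {
	// userGroups would come from X-Auth-Request-Groups or X-Forwarded-Groups.
	fmt.Println(groupAllowed("/maintenance;dev", "admin,maintenance"))
	fmt.Println(groupAllowed("dev", "admin,maintenance"))
}
```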

Optional machine-to-machine access can be enabled with SOTERIA_AUTH_BEARER_TOKENS, which accepts a comma-separated list of bearer tokens.

Prometheus metrics

Soteria exports Prometheus-format metrics at GET /metrics.

Implemented metrics:

  • soteria_backup_requests_total{driver,result}
  • soteria_restore_requests_total{driver,result}
  • soteria_policy_backups_total{result}
  • soteria_namespace_backup_requests_total{driver,result}
  • soteria_namespace_restore_requests_total{driver,result}
  • soteria_authz_denials_total{reason}
  • soteria_inventory_refresh_failures_total
  • soteria_inventory_refresh_timestamp_seconds
  • pvc_backup_age_hours{namespace,pvc,volume,driver}
  • pvc_backup_health{namespace,pvc,volume,driver}
  • pvc_backup_health_reason{namespace,pvc,volume,driver,reason}
  • pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}
  • pvc_backup_count{namespace,pvc,volume,driver}
  • pvc_backup_completed_count{namespace,pvc,volume,driver}
  • pvc_backup_last_size_bytes{namespace,pvc,volume,driver}
  • pvc_backup_total_size_bytes{namespace,pvc,volume,driver}
  • soteria_b2_scan_success
  • soteria_b2_scan_timestamp_seconds
  • soteria_b2_scan_duration_seconds
  • soteria_b2_account_objects
  • soteria_b2_account_bytes
  • soteria_b2_account_recent_objects_24h
  • soteria_b2_account_recent_bytes_24h
  • soteria_b2_bucket_objects{bucket}
  • soteria_b2_bucket_bytes{bucket}
  • soteria_b2_bucket_recent_objects_24h{bucket}
  • soteria_b2_bucket_recent_bytes_24h{bucket}
  • soteria_b2_bucket_last_modified_timestamp_seconds{bucket}

pvc_backup_health is 1 when the most recent successful backup is within SOTERIA_BACKUP_MAX_AGE_HOURS, otherwise 0.
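
As a worked illustration of that rule, the sketch below derives the health value and age from two RFC 3339 timestamps, such as last_backup_at and generated_at from /v1/inventory. Treating an age exactly equal to the limit as healthy, and a missing timestamp as unhealthy, are assumptions; backupHealth is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// backupHealth returns 1 when lastSuccess is at most maxAgeHours older than
// now, otherwise 0, along with the age in hours. An empty lastSuccess
// (no successful backup) yields health 0 and age 0.
func backupHealth(lastSuccess, now string, maxAgeHours float64) (int, float64, error) {
	if lastSuccess == "" {
		return 0, 0, nil
	}
	last, err := time.Parse(time.RFC3339, lastSuccess)
	if err != nil {
		return 0, 0, err
	}
	ref, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return 0, 0, err
	}
	age := ref.Sub(last).Hours()
	if age <= maxAgeHours {
		return 1, age, nil
	}
	return 0, age, nil
}

func main() {
	// Timestamps from the inventory example; 24 is the documented default
	// for SOTERIA_BACKUP_MAX_AGE_HOURS.
	health, age, err := backupHealth("2026-04-12T14:55:00Z", "2026-04-12T15:30:00Z", 24)
	if err != nil {
		panic(err)
	}
	fmt.Printf("health=%d age=%.2fh\n", health, age)
}
```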

Configuration

Environment variables:

  • SOTERIA_BACKUP_DRIVER default longhorn; allowed values: longhorn, restic
  • SOTERIA_LONGHORN_URL default http://longhorn-backend.longhorn-system.svc:9500
  • SOTERIA_LONGHORN_BACKUP_MODE default incremental; allowed values: incremental, full
  • SOTERIA_RESTIC_REPOSITORY required for restic driver
  • SOTERIA_RESTIC_SECRET_NAME default soteria-restic
  • SOTERIA_SECRET_NAMESPACE default service namespace
  • SOTERIA_RESTIC_IMAGE default restic/restic:0.16.4
  • SOTERIA_RESTIC_BACKUP_ARGS optional extra args for restic backup
  • SOTERIA_RESTIC_FORGET_ARGS optional extra args for restic forget
  • SOTERIA_S3_ENDPOINT optional S3-compatible endpoint
  • SOTERIA_S3_REGION optional region
  • SOTERIA_JOB_TTL_SECONDS default 86400
  • SOTERIA_JOB_NODE_SELECTOR optional comma-separated key=value list
  • SOTERIA_JOB_SERVICE_ACCOUNT optional ServiceAccount for restic Jobs
  • SOTERIA_LISTEN_ADDR default :8080
  • SOTERIA_AUTH_REQUIRED default false
  • SOTERIA_ALLOWED_GROUPS default admin,maintenance
  • SOTERIA_AUTH_BEARER_TOKENS optional comma-separated bearer tokens
  • SOTERIA_BACKUP_MAX_AGE_HOURS default 24
  • SOTERIA_METRICS_REFRESH_SECONDS default 300
  • SOTERIA_POLICY_EVAL_SECONDS default 300
  • SOTERIA_POLICY_SECRET_NAME default soteria-policies
  • SOTERIA_B2_ENABLED default false (auto-enabled if endpoint/secret are set)
  • SOTERIA_B2_ENDPOINT optional S3-compatible endpoint (for B2, usually https://s3.<region>.backblazeb2.com)
  • SOTERIA_B2_REGION optional region override (auto-inferred for Backblaze endpoint patterns)
  • SOTERIA_B2_BUCKETS optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
  • SOTERIA_B2_ACCESS_KEY_ID optional static key (can come from secret instead)
  • SOTERIA_B2_SECRET_ACCESS_KEY optional static secret key (can come from secret instead)
  • SOTERIA_B2_SECRET_NAMESPACE optional secret namespace (defaults to service namespace when secret name is set)
  • SOTERIA_B2_SECRET_NAME optional secret containing B2 keys
  • SOTERIA_B2_ACCESS_KEY_FIELD default AWS_ACCESS_KEY_ID
  • SOTERIA_B2_SECRET_KEY_FIELD default AWS_SECRET_ACCESS_KEY
  • SOTERIA_B2_ENDPOINT_FIELD default AWS_ENDPOINTS
  • SOTERIA_B2_SCAN_INTERVAL_SECONDS default 900
  • SOTERIA_B2_SCAN_TIMEOUT_SECONDS default 120
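
For tooling that needs to mirror these defaults, the sketch below shows one way to read a few of the variables with fallbacks and to parse the comma-separated key=value format used by SOTERIA_JOB_NODE_SELECTOR. It illustrates the documented defaults only; the helper names are hypothetical, and treating an empty value the same as unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// envString returns the variable's value, or def when it is unset or empty.
func envString(key, def string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return def
}

// envInt returns the variable parsed as an integer, or def when unset or empty.
func envInt(key string, def int) (int, error) {
	v, ok := os.LookupEnv(key)
	if !ok || v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// parseKeyValueList parses "k1=v1,k2=v2" into a map, ignoring blank entries.
func parseKeyValueList(raw string) (map[string]string, error) {
	out := make(map[string]string)
	for _, item := range strings.Split(raw, ",") {
		item = strings.TrimSpace(item)
		if item == "" {
			continue
		}
		k, v, ok := strings.Cut(item, "=")
		if !ok || k == "" {
			return nil, fmt.Errorf("invalid entry %q, want key=value", item)
		}
		out[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return out, nil
}

func main() {
	addr := envString("SOTERIA_LISTEN_ADDR", ":8080")
	maxAge, err := envInt("SOTERIA_BACKUP_MAX_AGE_HOURS", 24)
	if err != nil {
		panic(err)
	}
	selector, err := parseKeyValueList(os.Getenv("SOTERIA_JOB_NODE_SELECTOR"))
	if err != nil {
		panic(err)
	}
	fmt.Println(addr, maxAge, selector)
}
```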

Secrets

When using the restic driver, create a secret named soteria-restic in the Soteria namespace, or point SOTERIA_RESTIC_SECRET_NAME at a different secret. Required keys:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • RESTIC_PASSWORD

The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.

For B2 scanning, you can point Soteria at a secret via SOTERIA_B2_SECRET_NAME. Expected keys by default:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_ENDPOINTS (optional if SOTERIA_B2_ENDPOINT is set)

A template is in deploy/secret-example.yaml. Do not commit real credentials.

Deployment

The deploy/ folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

kubectl apply -k deploy

The example Service is annotated for Prometheus scraping of /metrics.

Notes

  • Longhorn inventory and metrics are based on discovered backup records per PVC.
  • Inventory Restore buttons load source context into the restore planner; restore execution happens from the planner panel.
  • Scheduled policy execution currently applies only to the Longhorn driver.
  • Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
  • For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.