soteria

Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:

  • Namespace-grouped PVC inventory for backup and restore selection.
  • On-demand backup creation for Longhorn volumes.
  • Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
  • A simple built-in UI suitable for publishing behind an authenticated ingress.
  • Prometheus-format backup freshness telemetry for Grafana rollups.

For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.

Endpoints

Public endpoints:

  • GET /healthz
  • GET /readyz
  • GET /metrics

Protected endpoints when SOTERIA_AUTH_REQUIRED=true:

  • GET / (UI console)
  • GET /v1/whoami
  • GET /v1/inventory
  • GET /v1/backups?namespace=<ns>&pvc=<name>
  • POST /v1/backup
  • POST /v1/restores
  • POST /v1/restore-test (legacy alias for /v1/restores)

API examples

POST /v1/backup

{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}

Longhorn response:

{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
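
The request above could be assembled from Python's standard library roughly as follows. This is a minimal sketch, not part of Soteria: the helper name `build_backup_request` and the base URL are hypothetical, and sending the token in an `Authorization: Bearer` header is an assumption about how SOTERIA_AUTH_BEARER_TOKENS access is presented.

```python
import json
import urllib.request


def build_backup_request(base_url, namespace, pvc, tags=None, dry_run=False, token=None):
    """Build (but do not send) a POST /v1/backup request."""
    payload = {
        "namespace": namespace,
        "pvc": pvc,
        "tags": list(tags or []),
        "dry_run": dry_run,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: bearer tokens travel in the standard Authorization header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/backup",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Setting `dry_run=True` first is a cautious way to check the request before creating a real backup.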

GET /v1/inventory

Response shape:

{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
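
Automation that consumes this response might flag PVCs whose backups are not healthy. The sketch below relies only on the fields shown in the response shape above; the function name `unhealthy_pvcs` is hypothetical.

```python
def unhealthy_pvcs(inventory):
    """Return (namespace, pvc, health_reason) for every PVC not marked healthy."""
    found = []
    for ns in inventory.get("namespaces", []):
        for entry in ns.get("pvcs", []):
            # A missing "healthy" field is treated as unhealthy (a cautious assumption).
            if not entry.get("healthy", False):
                found.append(
                    (entry["namespace"], entry["pvc"], entry.get("health_reason", ""))
                )
    return found
```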

GET /v1/backups

GET /v1/backups?namespace=ai&pvc=llm-cache

Returns the resolved volume name and backup records so the UI or automation can select a restore source.

POST /v1/restores

{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}

Notes:

  • namespace and pvc identify the source PVC.
  • target_pvc is required.
  • target_namespace defaults to namespace.
  • Soteria refuses to overwrite an existing target PVC.
  • If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
  • You may provide backup_url directly instead of snapshot.
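
A client could encode the rules above when building the request body. In this sketch the helper name `build_restore_payload` is hypothetical, and rejecting a request that sets both or neither of snapshot and backup_url is a client-side choice made here, not documented server behavior.

```python
def build_restore_payload(namespace, pvc, target_pvc, target_namespace=None,
                          snapshot=None, backup_url=None, dry_run=False):
    """Build the JSON body for POST /v1/restores."""
    if not target_pvc:
        raise ValueError("target_pvc is required")
    # Client-side assumption: exactly one restore source must be chosen.
    if bool(snapshot) == bool(backup_url):
        raise ValueError("provide exactly one of snapshot or backup_url")

    payload = {
        "namespace": namespace,    # source namespace
        "pvc": pvc,                # source PVC
        "target_pvc": target_pvc,
        "dry_run": dry_run,
    }
    if target_namespace:
        # Omitted otherwise, so the server-side default (namespace) applies.
        payload["target_namespace"] = target_namespace
    if backup_url:
        payload["backup_url"] = backup_url
    else:
        payload["snapshot"] = snapshot
    return payload
```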

Authentication and authorization

When SOTERIA_AUTH_REQUIRED=true, Soteria expects trusted auth headers from a fronting proxy such as oauth2-proxy:

  • X-Auth-Request-User
  • X-Auth-Request-Email
  • X-Auth-Request-Groups

Allowed groups are configured with SOTERIA_ALLOWED_GROUPS and compared after normalizing leading / prefixes, so both maintenance and /maintenance are accepted.
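
The group comparison can be pictured with the following sketch. It is an illustration of the normalization described above, not Soteria's actual code: the function names are hypothetical, and treating the X-Auth-Request-Groups header as a comma-separated list with case-sensitive matching is an assumption.

```python
def normalize_group(group):
    """Trim whitespace and drop any leading '/' so '/maintenance' == 'maintenance'."""
    return group.strip().lstrip("/")


def is_group_allowed(groups_header, allowed_groups):
    """Return True if any presented group matches an allowed group.

    Both arguments are comma-separated strings (an assumption for the header).
    """
    allowed = {normalize_group(g) for g in allowed_groups.split(",") if g.strip()}
    presented = {normalize_group(g) for g in groups_header.split(",") if g.strip()}
    return bool(allowed & presented)
```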

Optional machine-to-machine access can be enabled with SOTERIA_AUTH_BEARER_TOKENS, which accepts a comma-separated list of bearer tokens.
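
As a rough illustration of how a comma-separated token list can be checked, the sketch below parses the configured value and compares a presented token in constant time. The function names are hypothetical, and the `Authorization: Bearer <token>` header format is an assumption; this is not Soteria's implementation.

```python
import hmac


def parse_bearer_tokens(raw):
    """Split a comma-separated token list, ignoring blanks and surrounding spaces."""
    return [t.strip() for t in raw.split(",") if t.strip()]


def bearer_token_allowed(authorization_header, configured_tokens):
    """Check an 'Authorization: Bearer <token>' value against configured tokens."""
    scheme, _, presented = authorization_header.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False
    presented_bytes = presented.encode("utf-8")
    matched = False
    for token in configured_tokens:
        # compare_digest avoids leaking how many leading characters matched.
        if hmac.compare_digest(presented_bytes, token.encode("utf-8")):
            matched = True
    return matched
```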

Prometheus metrics

Soteria exports Prometheus-format metrics at GET /metrics.

Implemented metrics:

  • soteria_backup_requests_total{driver,result}
  • soteria_restore_requests_total{driver,result}
  • soteria_authz_denials_total{reason}
  • soteria_inventory_refresh_failures_total
  • soteria_inventory_refresh_timestamp_seconds
  • pvc_backup_age_hours{namespace,pvc,volume,driver}
  • pvc_backup_health{namespace,pvc,volume,driver}
  • pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}
  • pvc_backup_count{namespace,pvc,volume,driver}

pvc_backup_health is 1 when the most recent successful backup is within SOTERIA_BACKUP_MAX_AGE_HOURS, otherwise 0.
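
The health rule can be written out as a small sketch. The function names are hypothetical, and two details are assumptions: a backup exactly at the age limit counts as healthy, and a PVC with no successful backup reports 0.

```python
from datetime import datetime, timezone


def backup_age_hours(last_success: datetime, now: datetime) -> float:
    """Age of the most recent successful backup, in hours."""
    return (now - last_success).total_seconds() / 3600.0


def backup_health(last_success, now=None, max_age_hours=24.0) -> int:
    """Return 1 if the last successful backup is within max_age_hours, else 0."""
    if last_success is None:
        return 0  # assumption: no successful backup means unhealthy
    now = now or datetime.now(timezone.utc)
    return 1 if backup_age_hours(last_success, now) <= max_age_hours else 0
```

With the timestamps from the inventory example (last backup at 14:55Z, inventory generated at 15:30Z), the age is 35 minutes, or about 0.58 hours, which is well inside the default 24-hour limit.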

Configuration

Environment variables:

  • SOTERIA_BACKUP_DRIVER: default longhorn; allowed values longhorn, restic
  • SOTERIA_LONGHORN_URL: default http://longhorn-backend.longhorn-system.svc:9500
  • SOTERIA_LONGHORN_BACKUP_MODE: default incremental; allowed values incremental, full
  • SOTERIA_RESTIC_REPOSITORY: required for the restic driver
  • SOTERIA_RESTIC_SECRET_NAME: default soteria-restic
  • SOTERIA_SECRET_NAMESPACE: defaults to the Soteria service's namespace
  • SOTERIA_RESTIC_IMAGE: default restic/restic:0.16.4
  • SOTERIA_RESTIC_BACKUP_ARGS: optional extra arguments for restic backup
  • SOTERIA_RESTIC_FORGET_ARGS: optional extra arguments for restic forget
  • SOTERIA_S3_ENDPOINT: optional S3-compatible endpoint
  • SOTERIA_S3_REGION: optional region
  • SOTERIA_JOB_TTL_SECONDS: default 86400
  • SOTERIA_JOB_NODE_SELECTOR: optional comma-separated key=value list
  • SOTERIA_JOB_SERVICE_ACCOUNT: optional ServiceAccount for restic Jobs
  • SOTERIA_LISTEN_ADDR: default :8080
  • SOTERIA_AUTH_REQUIRED: default false
  • SOTERIA_ALLOWED_GROUPS: default admin,maintenance
  • SOTERIA_AUTH_BEARER_TOKENS: optional comma-separated bearer tokens
  • SOTERIA_BACKUP_MAX_AGE_HOURS: default 24
  • SOTERIA_METRICS_REFRESH_SECONDS: default 300
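
To make the list formats concrete, the sketch below reads a few of these variables with the defaults listed above and parses the comma-separated key=value selector. It is illustrative only: the function names are hypothetical, and treating only the literal string "true" as enabling SOTERIA_AUTH_REQUIRED is an assumption.

```python
import os


def parse_node_selector(raw):
    """Parse 'key=value,key2=value2' into a dict, rejecting malformed entries."""
    selector = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"invalid node selector entry: {item!r}")
        selector[key.strip()] = value.strip()
    return selector


def load_config(env=None):
    """Read a subset of the Soteria settings from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    driver = env.get("SOTERIA_BACKUP_DRIVER", "longhorn")
    if driver not in ("longhorn", "restic"):
        raise ValueError(f"unsupported backup driver: {driver!r}")
    if driver == "restic" and not env.get("SOTERIA_RESTIC_REPOSITORY"):
        raise ValueError("SOTERIA_RESTIC_REPOSITORY is required for the restic driver")
    return {
        "driver": driver,
        "listen_addr": env.get("SOTERIA_LISTEN_ADDR", ":8080"),
        # Assumption: only the literal "true" (any case) enables auth.
        "auth_required": env.get("SOTERIA_AUTH_REQUIRED", "false").lower() == "true",
        "allowed_groups": env.get("SOTERIA_ALLOWED_GROUPS", "admin,maintenance").split(","),
        "backup_max_age_hours": float(env.get("SOTERIA_BACKUP_MAX_AGE_HOURS", "24")),
        "job_node_selector": parse_node_selector(env.get("SOTERIA_JOB_NODE_SELECTOR", "")),
    }
```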

Secrets

When using the restic driver, create a secret named soteria-restic in the Soteria namespace, or set SOTERIA_RESTIC_SECRET_NAME to point at a differently named secret. Required keys:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • RESTIC_PASSWORD

For each job, the service copies this secret into the target namespace and attaches an owner reference so that the copy is cleaned up with the Job.

A template is in deploy/secret-example.yaml. Do not commit real credentials.

Deployment

The deploy/ folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.

Apply with:

kubectl apply -k deploy

The example Service is annotated for Prometheus scraping of /metrics.

Notes

  • Longhorn inventory and metrics are based on discovered backup records per PVC.
  • Restic backup and restore execution exists, but inventory-style telemetry is currently Longhorn-focused.
  • For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.