From 105f88d89cf0ce16e4ab412def2252d854810d7e Mon Sep 17 00:00:00 2001
From: codex
Date: Fri, 19 Jun 2026 15:46:22 -0300
Subject: [PATCH] docs: shorten soteria README

---
 README.md | 331 +++++------------------------------------------------
 1 file changed, 25 insertions(+), 306 deletions(-)

diff --git a/README.md b/README.md
index fbfc6e9..68a81f0 100644
--- a/README.md
+++ b/README.md
@@ -1,327 +1,46 @@
 # soteria
-Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:
+Soteria is the backup and restore console for Atlas PVCs.
-- Namespace-grouped PVC inventory for backup and restore selection.
-- On-demand backup creation for Longhorn volumes.
-- Namespace-wide backup and restore batch execution.
-- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
-- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
-- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
-- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.
+Right now it is mainly built around Longhorn. It lists bound PVCs, starts
+backups, restores a backup into a new PVC, runs namespace-wide backup/restore
+jobs, and exposes backup health metrics for Grafana. It also has a small React
+UI so the common restore path does not require remembering the API by hand.
-For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.
+Soteria never overwrites an existing target PVC. Restore work is meant to be
+explicit and reversible.
-## Endpoints
+## How it works
-Public endpoints:
+The service runs in-cluster and talks to Kubernetes plus the Longhorn backend.
+For each PVC it resolves the backing volume, asks Longhorn to snapshot/backup
+it, and records enough inventory for humans and dashboards to see whether the
+backup is fresh.
-- `GET /healthz`
-- `GET /readyz`
-- `GET /metrics`
+Policies are stored in a Kubernetes secret and evaluated on a timer. Metrics are
+published at `/metrics`; the UI and API share the same backend.
-Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:
+Main endpoints:
-- `GET /` UI console
-- `GET /v1/whoami`
+- `GET /healthz`, `GET /readyz`, `GET /metrics`
 - `GET /v1/inventory`
 - `GET /v1/backups?namespace=&pvc=`
 - `POST /v1/backup`
 - `POST /v1/backup/namespace`
 - `POST /v1/restores`
 - `POST /v1/restores/namespace`
-- `POST /v1/restore-test` legacy alias for `/v1/restores`
-- `GET /v1/policies`
-- `POST /v1/policies`
-- `DELETE /v1/policies/`
+- `GET|POST|DELETE /v1/policies`
 - `GET /v1/b2`
-## API examples
+When auth is enabled, Soteria expects trusted headers from the fronting proxy and
+checks `SOTERIA_ALLOWED_GROUPS`.
-### POST /v1/backup
+## Development
-```json
-{
-  "namespace": "ai",
-  "pvc": "llm-cache",
-  "tags": ["namespace=ai", "service=llm"],
-  "dry_run": false
-}
+```bash
+go test ./...
+./scripts/check.sh
 ```
-Longhorn response:
-
-```json
-{
-  "driver": "longhorn",
-  "volume": "pvc-1234abcd",
-  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
-  "namespace": "ai",
-  "requested_by": "brad",
-  "dry_run": false
-}
-```
-
-### GET /v1/inventory
-
-Response shape:
-
-```json
-{
-  "generated_at": "2026-04-12T15:30:00Z",
-  "namespaces": [
-    {
-      "name": "ai",
-      "pvcs": [
-        {
-          "namespace": "ai",
-          "pvc": "llm-cache",
-          "volume": "pvc-1234abcd",
-          "storage_class": "longhorn",
-          "capacity": "50Gi",
-          "driver": "longhorn",
-          "last_backup_at": "2026-04-12T14:55:00Z",
-          "last_backup_age_hours": 0.58,
-          "backup_count": 14,
-          "healthy": true,
-          "health_reason": "fresh"
-        }
-      ]
-    }
-  ]
-}
-```
-
-### GET /v1/backups
-
-```text
-/v1/backups?namespace=ai&pvc=llm-cache
-```
-
-Returns the resolved volume name and backup records so the UI or automation can select a restore source.
-
-### POST /v1/restores
-
-```json
-{
-  "namespace": "ai",
-  "pvc": "llm-cache",
-  "snapshot": "latest",
-  "target_namespace": "ai",
-  "target_pvc": "restore-llm-cache",
-  "dry_run": false
-}
-```
-
-Notes:
-
-- `namespace` and `pvc` identify the source PVC.
-- `target_pvc` is required.
-- `target_namespace` defaults to `namespace`.
-- Soteria refuses to overwrite an existing target PVC.
-- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
-- You may provide `backup_url` directly instead of `snapshot`.
-
-### POST /v1/backup/namespace
-
-```json
-{
-  "namespace": "ai",
-  "dry_run": false
-}
-```
-
-Runs backup for every currently bound PVC in the namespace and returns a per-PVC result list.
-
-### POST /v1/restores/namespace
-
-```json
-{
-  "namespace": "ai",
-  "target_namespace": "ai-restore",
-  "target_prefix": "restore-20260412-",
-  "snapshot": "",
-  "dry_run": true
-}
-```
-
-Runs restore planning/execution for every bound PVC in the source namespace. `snapshot` is optional and blank means latest completed backup per PVC.
-
-### Policy API
-
-Create or update a policy:
-
-```json
-POST /v1/policies
-{
-  "namespace": "ai",
-  "pvc": "llm-cache",
-  "interval_hours": 6,
-  "enabled": true
-}
-```
-
-- Leave `pvc` empty to target all PVCs in that namespace.
-- Policies are stored in secret `SOTERIA_POLICY_SECRET_NAME` under key `policies.json`.
-
-### GET /v1/b2
-
-Returns B2 account/bucket consumption based on S3-compatible object scans.
-
-```json
-{
-  "enabled": true,
-  "available": true,
-  "endpoint": "https://s3.us-west-004.backblazeb2.com",
-  "region": "us-west-004",
-  "scanned_at": "2026-04-12T16:00:00Z",
-  "scan_duration_ms": 824,
-  "total_objects": 1324,
-  "total_bytes": 18407542931,
-  "recent_objects_24h": 18,
-  "recent_bytes_24h": 12245812,
-  "buckets": [
-    {
-      "name": "atlas-backups",
-      "object_count": 1240,
-      "total_bytes": 18288473811,
-      "recent_objects_24h": 12,
-      "recent_bytes_24h": 8542198,
-      "last_modified_at": "2026-04-12T15:43:19Z"
-    }
-  ]
-}
-```
-
-Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.
-
-## Authentication and authorization
-
-When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:
-
-- `X-Auth-Request-User`
-- `X-Auth-Request-Email`
-- `X-Auth-Request-Groups`
-- `X-Forwarded-User` (fallback)
-- `X-Forwarded-Email` (fallback)
-- `X-Forwarded-Groups` (fallback)
-
-Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.
-
-Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.
-
-## Prometheus metrics
-
-Soteria exports Prometheus-format metrics at `GET /metrics`.
-
-Implemented metrics:
-
-- `soteria_backup_requests_total{driver,result}`
-- `soteria_restore_requests_total{driver,result}`
-- `soteria_policy_backups_total{result}`
-- `soteria_namespace_backup_requests_total{driver,result}`
-- `soteria_namespace_restore_requests_total{driver,result}`
-- `soteria_authz_denials_total{reason}`
-- `soteria_inventory_refresh_failures_total`
-- `soteria_inventory_refresh_timestamp_seconds`
-- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
-- `pvc_backup_health{namespace,pvc,volume,driver}`
-- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
-- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
-- `pvc_backup_count{namespace,pvc,volume,driver}`
-- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
-- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
-- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
-- `soteria_b2_scan_success`
-- `soteria_b2_scan_timestamp_seconds`
-- `soteria_b2_scan_duration_seconds`
-- `soteria_b2_account_objects`
-- `soteria_b2_account_bytes`
-- `soteria_b2_account_recent_objects_24h`
-- `soteria_b2_account_recent_bytes_24h`
-- `soteria_b2_bucket_objects{bucket}`
-- `soteria_b2_bucket_bytes{bucket}`
-- `soteria_b2_bucket_recent_objects_24h{bucket}`
-- `soteria_b2_bucket_recent_bytes_24h{bucket}`
-- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`
-
-`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.
-
-## Configuration
-
-Environment variables:
-
-- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
-- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
-- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
-- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
-- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
-- `SOTERIA_SECRET_NAMESPACE` default service namespace
-- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
-- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
-- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
-- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
-- `SOTERIA_S3_REGION` optional region
-- `SOTERIA_JOB_TTL_SECONDS` default `86400`
-- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
-- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
-- `SOTERIA_LISTEN_ADDR` default `:8080`
-- `SOTERIA_AUTH_REQUIRED` default `false`
-- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
-- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
-- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
-- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
-- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
-- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
-- `SOTERIA_USAGE_SECRET_NAME` default `soteria-backup-usage` (stores persisted restic size estimates)
-- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
-- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3..backblazeb2.com`)
-- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
-- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
-- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
-- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
-- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
-- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
-- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
-- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
-- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
-- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
-- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`
-
-## Secrets
-
-Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver. Required keys:
-
-- `AWS_ACCESS_KEY_ID`
-- `AWS_SECRET_ACCESS_KEY`
-- `RESTIC_PASSWORD`
-
-The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.
-
-For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:
-
-- `AWS_ACCESS_KEY_ID`
-- `AWS_SECRET_ACCESS_KEY`
-- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)
-
-A template is in `deploy/secret-example.yaml`. Do not commit real credentials.
-
-## Deployment
-
-The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.
-
-Apply with:
-
-```sh
-kubectl apply -k deploy
-```
-
-The example Service is annotated for Prometheus scraping of `/metrics`.
-
-## Notes
-
-- Longhorn inventory and metrics are based on discovered backup records per PVC.
-- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
-- Scheduled backup policies apply to both Longhorn and restic drivers.
-- Restic size telemetry is estimated from per-job upload summaries; with shared dedupe repositories those values are per-PVC attributions, not exact physical B2 ownership.
-- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.
+The local deploy manifests live in `deploy/`. Production wiring should still go
+through the Flux repo, not one-off cluster edits.
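
Review note: the README text removed by this patch says allowed groups are compared after normalizing leading `/` prefixes and that group lists may be comma- or semicolon-separated. A minimal Go sketch of that rule follows, for readers who lose the detail with the shortened README. The function names (`normalizeGroup`, `splitGroups`, `isAllowed`) are hypothetical and not taken from the Soteria source, and case-sensitive matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeGroup trims surrounding whitespace and any leading "/" so that
// "/maintenance" and "maintenance" compare equal. (Hypothetical helper.)
func normalizeGroup(g string) string {
	return strings.TrimLeft(strings.TrimSpace(g), "/")
}

// splitGroups splits a comma- or semicolon-separated list into normalized,
// non-empty group names.
func splitGroups(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || r == ';'
	})
	out := make([]string, 0, len(fields))
	for _, f := range fields {
		if g := normalizeGroup(f); g != "" {
			out = append(out, g)
		}
	}
	return out
}

// isAllowed reports whether any group from the proxy header appears in the
// allowed list. Matching is case-sensitive here, which is an assumption.
func isAllowed(headerGroups, allowedGroups string) bool {
	allowed := make(map[string]struct{})
	for _, g := range splitGroups(allowedGroups) {
		allowed[g] = struct{}{}
	}
	for _, g := range splitGroups(headerGroups) {
		if _, ok := allowed[g]; ok {
			return true
		}
	}
	return false
}

func main() {
	// "admin,maintenance" is the documented default of SOTERIA_ALLOWED_GROUPS.
	allowed := "admin,maintenance"
	fmt.Println(isAllowed("/maintenance;dev", allowed)) // true
	fmt.Println(isAllowed("dev,ops", allowed))          // false
}
```

The header value would come from `X-Auth-Request-Groups` (or the `X-Forwarded-Groups` fallback) as listed in the removed section; how Soteria actually reads and caches it is not shown in this patch.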