docs: shorten soteria README

codex 2026-06-19 15:46:22 -03:00
parent cb476165b5
commit 105f88d89c

331
README.md

@@ -1,327 +1,46 @@
# soteria
Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:
Soteria is the backup and restore console for Atlas PVCs.
- Namespace-grouped PVC inventory for backup and restore selection.
- On-demand backup creation for Longhorn volumes.
- Namespace-wide backup and restore batch execution.
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.
Right now it is mainly built around Longhorn. It lists bound PVCs, starts
backups, restores a backup into a new PVC, runs namespace-wide backup/restore
jobs, and exposes backup health metrics for Grafana. It also has a small React
UI so the common restore path does not require remembering the API by hand.
For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.
Soteria never overwrites an existing target PVC. Restore work is meant to be
explicit and reversible.
## Endpoints
## How it works
Public endpoints:
The service runs in-cluster and talks to Kubernetes plus the Longhorn backend.
For each PVC it resolves the backing volume, asks Longhorn to snapshot/backup
it, and records enough inventory for humans and dashboards to see whether the
backup is fresh.
- `GET /healthz`
- `GET /readyz`
- `GET /metrics`
Policies are stored in a Kubernetes secret and evaluated on a timer. Metrics are
published at `/metrics`; the UI and API share the same backend.
Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:
Main endpoints:
- `GET /` UI console
- `GET /v1/whoami`
- `GET /healthz`, `GET /readyz`, `GET /metrics`
- `GET /v1/inventory`
- `GET /v1/backups?namespace=<ns>&pvc=<name>`
- `POST /v1/backup`
- `POST /v1/backup/namespace`
- `POST /v1/restores`
- `POST /v1/restores/namespace`
- `POST /v1/restore-test` legacy alias for `/v1/restores`
- `GET /v1/policies`
- `POST /v1/policies`
- `DELETE /v1/policies/<policy-id>`
- `GET|POST|DELETE /v1/policies`
- `GET /v1/b2`
## API examples
When auth is enabled, Soteria expects trusted headers from the fronting proxy and
checks `SOTERIA_ALLOWED_GROUPS`.
### POST /v1/backup
## Development
```json
{
"namespace": "ai",
"pvc": "llm-cache",
"tags": ["namespace=ai", "service=llm"],
"dry_run": false
}
```bash
go test ./...
./scripts/check.sh
```
Longhorn response:
```json
{
"driver": "longhorn",
"volume": "pvc-1234abcd",
"backup": "soteria-backup-ai-llm-cache-20260412-153000",
"namespace": "ai",
"requested_by": "brad",
"dry_run": false
}
```
### GET /v1/inventory
Response shape:
```json
{
"generated_at": "2026-04-12T15:30:00Z",
"namespaces": [
{
"name": "ai",
"pvcs": [
{
"namespace": "ai",
"pvc": "llm-cache",
"volume": "pvc-1234abcd",
"storage_class": "longhorn",
"capacity": "50Gi",
"driver": "longhorn",
"last_backup_at": "2026-04-12T14:55:00Z",
"last_backup_age_hours": 0.58,
"backup_count": 14,
"healthy": true,
"health_reason": "fresh"
}
]
}
]
}
```
### GET /v1/backups
```text
/v1/backups?namespace=ai&pvc=llm-cache
```
Returns the resolved volume name and backup records so the UI or automation can select a restore source.
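A client can assemble that query string programmatically instead of by hand. The following is a minimal sketch; the helper name `backupsURL` and the base address passed to it are hypothetical and not part of Soteria.

```go
package main

import "net/url"

// backupsURL builds the GET /v1/backups URL for one PVC.
// baseURL is a hypothetical address for the Soteria service.
func backupsURL(baseURL, namespace, pvc string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/backups"

	q := url.Values{}
	q.Set("namespace", namespace)
	q.Set("pvc", pvc)
	// Encode escapes the values and emits keys in sorted order.
	u.RawQuery = q.Encode()

	return u.String(), nil
}
```

Using `url.Values` keeps namespace or PVC names with unusual characters correctly escaped.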
### POST /v1/restores
```json
{
"namespace": "ai",
"pvc": "llm-cache",
"snapshot": "latest",
"target_namespace": "ai",
"target_pvc": "restore-llm-cache",
"dry_run": false
}
```
Notes:
- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.
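The notes above can be mirrored on the client side before a request is sent. This is a sketch only: the `RestoreRequest` type and `buildRestoreBody` helper are hypothetical names, the JSON field names are taken from the example body, and treating an empty source `namespace` or `pvc` as an error is an assumption rather than documented server behavior.

```go
package main

import (
	"encoding/json"
	"errors"
)

// RestoreRequest mirrors the POST /v1/restores body shown above.
// The Go type itself is hypothetical; only the JSON keys come from the README.
type RestoreRequest struct {
	Namespace       string `json:"namespace"`
	PVC             string `json:"pvc"`
	Snapshot        string `json:"snapshot,omitempty"`
	BackupURL       string `json:"backup_url,omitempty"`
	TargetNamespace string `json:"target_namespace"`
	TargetPVC       string `json:"target_pvc"`
	DryRun          bool   `json:"dry_run"`
}

// buildRestoreBody validates the request, applies the documented default
// for target_namespace, and returns the JSON body.
func buildRestoreBody(req RestoreRequest) ([]byte, error) {
	// Assumption: the source PVC must be fully identified.
	if req.Namespace == "" || req.PVC == "" {
		return nil, errors.New("namespace and pvc are required")
	}
	// Documented: target_pvc is required.
	if req.TargetPVC == "" {
		return nil, errors.New("target_pvc is required")
	}
	// Documented: target_namespace defaults to namespace.
	if req.TargetNamespace == "" {
		req.TargetNamespace = req.Namespace
	}
	return json.Marshal(req)
}
```

Setting `DryRun` to `true` first is a cheap way to check a request before running the real restore.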
### POST /v1/backup/namespace
```json
{
"namespace": "ai",
"dry_run": false
}
```
Runs a backup for every currently bound PVC in the namespace and returns a per-PVC result list.
### POST /v1/restores/namespace
```json
{
"namespace": "ai",
"target_namespace": "ai-restore",
"target_prefix": "restore-20260412-",
"snapshot": "",
"dry_run": true
}
```
Runs restore planning/execution for every bound PVC in the source namespace. `snapshot` is optional; a blank value selects the latest completed backup for each PVC.
### Policy API
Create or update a policy:
```json
POST /v1/policies
{
"namespace": "ai",
"pvc": "llm-cache",
"interval_hours": 6,
"enabled": true
}
```
- Leave `pvc` empty to target all PVCs in that namespace.
- Policies are stored in secret `SOTERIA_POLICY_SECRET_NAME` under key `policies.json`.
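To illustrate how a policy with an empty `pvc` covers a whole namespace, here is a rough sketch. The `Policy` type and both methods are hypothetical. The `due` rule (a backup is due once `interval_hours` has elapsed since the last one, or immediately if none exists) is an assumption about the scheduler, not something this README specifies.

```go
package main

// Policy mirrors the POST /v1/policies body shown above.
// The Go type is hypothetical; only the JSON keys come from the README.
type Policy struct {
	Namespace     string  `json:"namespace"`
	PVC           string  `json:"pvc"`
	IntervalHours float64 `json:"interval_hours"`
	Enabled       bool    `json:"enabled"`
}

// appliesTo reports whether the policy covers the given PVC.
// An empty PVC field targets every PVC in the policy's namespace.
func (p Policy) appliesTo(namespace, pvc string) bool {
	if !p.Enabled || p.Namespace != namespace {
		return false
	}
	return p.PVC == "" || p.PVC == pvc
}

// due reports whether a backup should run now (assumed rule, see lead-in).
// Timestamps are Unix seconds; lastBackupUnix <= 0 means "never backed up".
func (p Policy) due(lastBackupUnix, nowUnix int64) bool {
	if lastBackupUnix <= 0 {
		return true
	}
	return float64(nowUnix-lastBackupUnix) >= p.IntervalHours*3600
}
```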
### GET /v1/b2
Returns B2 account/bucket consumption based on S3-compatible object scans.
```json
{
"enabled": true,
"available": true,
"endpoint": "https://s3.us-west-004.backblazeb2.com",
"region": "us-west-004",
"scanned_at": "2026-04-12T16:00:00Z",
"scan_duration_ms": 824,
"total_objects": 1324,
"total_bytes": 18407542931,
"recent_objects_24h": 18,
"recent_bytes_24h": 12245812,
"buckets": [
{
"name": "atlas-backups",
"object_count": 1240,
"total_bytes": 18288473811,
"recent_objects_24h": 12,
"recent_bytes_24h": 8542198,
"last_modified_at": "2026-04-12T15:43:19Z"
}
]
}
```
Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.
## Authentication and authorization
When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:
- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)
Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.
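The normalization described above can be sketched as follows. The function names are hypothetical, and the case-sensitive comparison and whitespace trimming are assumptions; Soteria's actual matching rules may differ.

```go
package main

import "strings"

// splitGroups splits a comma- or semicolon-separated group list, trims
// surrounding whitespace, and strips leading "/" prefixes.
func splitGroups(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || r == ';'
	})
	out := make([]string, 0, len(fields))
	for _, f := range fields {
		g := strings.TrimLeft(strings.TrimSpace(f), "/")
		if g != "" {
			out = append(out, g)
		}
	}
	return out
}

// groupAllowed reports whether any of the user's groups appears in the
// allowed list after normalization (case-sensitive, by assumption).
func groupAllowed(allowedRaw, userRaw string) bool {
	allowed := make(map[string]struct{})
	for _, g := range splitGroups(allowedRaw) {
		allowed[g] = struct{}{}
	}
	for _, g := range splitGroups(userRaw) {
		if _, ok := allowed[g]; ok {
			return true
		}
	}
	return false
}
```

With `SOTERIA_ALLOWED_GROUPS=admin,maintenance`, a groups header of `/maintenance;dev` would pass this check and `dev,/guests` would not.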
Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.
## Prometheus metrics
Soteria exports Prometheus-format metrics at `GET /metrics`.
Implemented metrics:
- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`
- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
- `soteria_b2_scan_success`
- `soteria_b2_scan_timestamp_seconds`
- `soteria_b2_scan_duration_seconds`
- `soteria_b2_account_objects`
- `soteria_b2_account_bytes`
- `soteria_b2_account_recent_objects_24h`
- `soteria_b2_account_recent_bytes_24h`
- `soteria_b2_bucket_objects{bucket}`
- `soteria_b2_bucket_bytes{bucket}`
- `soteria_b2_bucket_recent_objects_24h{bucket}`
- `soteria_b2_bucket_recent_bytes_24h{bucket}`
- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`
`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.
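As an illustration of that rule, the sketch below derives the gauge value from Unix timestamps such as the one exported in `pvc_backup_last_success_timestamp_seconds`. The function name is hypothetical, and treating the boundary as inclusive and a missing backup as unhealthy are assumptions.

```go
package main

// backupHealth returns 1 when the last successful backup is no older than
// maxAgeHours, otherwise 0. Timestamps are Unix seconds.
// Assumptions: the age limit is inclusive, and lastSuccessUnix <= 0
// (no successful backup recorded) counts as unhealthy.
func backupHealth(lastSuccessUnix, nowUnix int64, maxAgeHours float64) int {
	if lastSuccessUnix <= 0 {
		return 0
	}
	ageHours := float64(nowUnix-lastSuccessUnix) / 3600.0
	if ageHours <= maxAgeHours {
		return 1
	}
	return 0
}
```

For the inventory example earlier, a backup taken 35 minutes before the refresh has an age of about 0.58 hours, which is healthy under the default 24-hour limit.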
## Configuration
Environment variables:
- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
- `SOTERIA_USAGE_SECRET_NAME` default `soteria-backup-usage` (stores persisted restic size estimates)
- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3.<region>.backblazeb2.com`)
- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`
## Secrets
Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver. Required keys:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`
The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.
For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)
A template is in `deploy/secret-example.yaml`. Do not commit real credentials.
## Deployment
The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.
Apply with:
```sh
kubectl apply -k deploy
```
The example Service is annotated for Prometheus scraping of `/metrics`.
## Notes
- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
- Scheduled backup policies apply to both Longhorn and restic drivers.
- Restic size telemetry is estimated from per-job upload summaries; with shared dedupe repositories those values are per-PVC attributions, not exact physical B2 ownership.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.
The local deploy manifests live in `deploy/`. Production wiring should still go
through the Flux repo, not one-off cluster edits.