docs: shorten soteria README
parent
cb476165b5
commit
105f88d89c
331
README.md
@ -1,327 +1,46 @@
|
||||
# soteria
|
||||
|
||||
Soteria is an in-cluster service for PVC backup and restore operations. The current production baseline focuses on Longhorn-backed PVCs and provides:
|
||||
Soteria is the backup and restore console for Atlas PVCs.
|
||||
|
||||
- Namespace-grouped PVC inventory for backup and restore selection.
|
||||
- On-demand backup creation for Longhorn volumes.
|
||||
- Namespace-wide backup and restore batch execution.
|
||||
- Restore into a new target PVC with conflict checks and best-effort cleanup on failure.
|
||||
- Policy-based scheduled backups (per PVC or all PVCs in a namespace), persisted in-cluster.
|
||||
- A built-in React + TypeScript UI (dark-mode default) suitable for publishing behind an authenticated ingress.
|
||||
- Prometheus-format backup freshness and B2 consumption telemetry for Grafana rollups.
|
||||
Right now it is mainly built around Longhorn. It lists bound PVCs, starts
|
||||
backups, restores a backup into a new PVC, runs namespace-wide backup/restore
|
||||
jobs, and exposes backup health metrics for Grafana. It also has a small React
|
||||
UI so the common restore path does not require remembering the API by hand.
|
||||
|
||||
For Longhorn, backups are crash-consistent at the volume level and delegated to the Longhorn control plane.
|
||||
Soteria never overwrites an existing target PVC. Restore work is meant to be
|
||||
explicit and reversible.
|
||||
|
||||
## Endpoints
|
||||
## How it works
|
||||
|
||||
Public endpoints:
|
||||
The service runs in-cluster and talks to Kubernetes plus the Longhorn backend.
|
||||
For each PVC it resolves the backing volume, asks Longhorn to snapshot/backup
|
||||
it, and records enough inventory for humans and dashboards to see whether the
|
||||
backup is fresh.
|
||||
|
||||
- `GET /healthz`
|
||||
- `GET /readyz`
|
||||
- `GET /metrics`
|
||||
Policies are stored in a Kubernetes secret and evaluated on a timer. Metrics are
|
||||
published at `/metrics`; the UI and API share the same backend.
|
||||
|
||||
Protected endpoints when `SOTERIA_AUTH_REQUIRED=true`:
|
||||
Main endpoints:
|
||||
|
||||
- `GET /` UI console
|
||||
- `GET /v1/whoami`
|
||||
- `GET /healthz`, `GET /readyz`, `GET /metrics`
|
||||
- `GET /v1/inventory`
|
||||
- `GET /v1/backups?namespace=<ns>&pvc=<name>`
|
||||
- `POST /v1/backup`
|
||||
- `POST /v1/backup/namespace`
|
||||
- `POST /v1/restores`
|
||||
- `POST /v1/restores/namespace`
|
||||
- `POST /v1/restore-test` legacy alias for `/v1/restores`
|
||||
- `GET /v1/policies`
|
||||
- `POST /v1/policies`
|
||||
- `DELETE /v1/policies/<policy-id>`
|
||||
- `GET|POST|DELETE /v1/policies`
|
||||
- `GET /v1/b2`
|
||||
|
||||
## API examples
|
||||
When auth is enabled, Soteria expects trusted headers from the fronting proxy and
|
||||
checks `SOTERIA_ALLOWED_GROUPS`.
|
||||
|
||||
## Development

```bash
go test ./...
./scripts/check.sh
```

### POST /v1/backup

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "tags": ["namespace=ai", "service=llm"],
  "dry_run": false
}
```
|
||||
|
||||
Longhorn response:
|
||||
|
||||
```json
{
  "driver": "longhorn",
  "volume": "pvc-1234abcd",
  "backup": "soteria-backup-ai-llm-cache-20260412-153000",
  "namespace": "ai",
  "requested_by": "brad",
  "dry_run": false
}
```
|
||||
|
||||
### GET /v1/inventory
|
||||
|
||||
Response shape:
|
||||
|
||||
```json
{
  "generated_at": "2026-04-12T15:30:00Z",
  "namespaces": [
    {
      "name": "ai",
      "pvcs": [
        {
          "namespace": "ai",
          "pvc": "llm-cache",
          "volume": "pvc-1234abcd",
          "storage_class": "longhorn",
          "capacity": "50Gi",
          "driver": "longhorn",
          "last_backup_at": "2026-04-12T14:55:00Z",
          "last_backup_age_hours": 0.58,
          "backup_count": 14,
          "healthy": true,
          "health_reason": "fresh"
        }
      ]
    }
  ]
}
```
|
||||
|
||||
### GET /v1/backups
|
||||
|
||||
```text
/v1/backups?namespace=ai&pvc=llm-cache
```
|
||||
|
||||
Returns the resolved volume name and backup records so the UI or automation can select a restore source.
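For automation, the query above can be assembled with the standard library rather than by string concatenation. This is a minimal sketch: the `backupsURL` helper and the base URL are hypothetical, and only the path and the `namespace` and `pvc` parameters come from the README.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// backupsURL builds the /v1/backups listing URL for one PVC.
// The helper name and baseURL are illustrative, not part of Soteria.
func backupsURL(baseURL, namespace, pvc string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	// Append the documented path, tolerating a trailing slash on the base.
	u.Path = strings.TrimSuffix(u.Path, "/") + "/v1/backups"

	// url.Values escapes the parameters and encodes them sorted by key.
	q := url.Values{}
	q.Set("namespace", namespace)
	q.Set("pvc", pvc)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Hypothetical in-cluster address; substitute the real service URL.
	target, err := backupsURL("http://soteria.example.internal:8080", "ai", "llm-cache")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(target)
}
```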
|
||||
|
||||
### POST /v1/restores
|
||||
|
||||
```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "snapshot": "latest",
  "target_namespace": "ai",
  "target_pvc": "restore-llm-cache",
  "dry_run": false
}
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `namespace` and `pvc` identify the source PVC.
- `target_pvc` is required.
- `target_namespace` defaults to `namespace`.
- Soteria refuses to overwrite an existing target PVC.
- If Longhorn volume creation succeeds but PVC creation fails, Soteria attempts to delete the just-created restore volume.
- You may provide `backup_url` directly instead of `snapshot`.
|
||||
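The restore request rules listed above can be sketched as a small Go type. The `RestoreRequest` struct and its `Normalize` method are hypothetical and are not taken from the Soteria source; only the JSON field names come from the README. Treating an empty source `namespace` or `pvc` as an error is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// RestoreRequest mirrors the documented JSON body. The type is illustrative.
type RestoreRequest struct {
	Namespace       string `json:"namespace"`
	PVC             string `json:"pvc"`
	Snapshot        string `json:"snapshot,omitempty"`
	BackupURL       string `json:"backup_url,omitempty"`
	TargetNamespace string `json:"target_namespace,omitempty"`
	TargetPVC       string `json:"target_pvc"`
	DryRun          bool   `json:"dry_run"`
}

// Normalize checks the required fields and fills the documented default.
func (r *RestoreRequest) Normalize() error {
	// Assumption: the source PVC must be fully identified.
	if r.Namespace == "" || r.PVC == "" {
		return errors.New("namespace and pvc are required")
	}
	// Documented: target_pvc is required.
	if r.TargetPVC == "" {
		return errors.New("target_pvc is required")
	}
	// Documented: target_namespace defaults to namespace.
	if r.TargetNamespace == "" {
		r.TargetNamespace = r.Namespace
	}
	return nil
}

func main() {
	req := RestoreRequest{
		Namespace: "ai",
		PVC:       "llm-cache",
		Snapshot:  "latest",
		TargetPVC: "restore-llm-cache",
	}
	if err := req.Normalize(); err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	body, err := json.Marshal(req)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Validating on the client side like this only catches malformed requests early; the refusal to overwrite an existing target PVC is still enforced by the service.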
|
||||
### POST /v1/backup/namespace
|
||||
|
||||
```json
{
  "namespace": "ai",
  "dry_run": false
}
```
|
||||
|
||||
Runs backup for every currently bound PVC in the namespace and returns a per-PVC result list.
|
||||
|
||||
### POST /v1/restores/namespace
|
||||
|
||||
```json
{
  "namespace": "ai",
  "target_namespace": "ai-restore",
  "target_prefix": "restore-20260412-",
  "snapshot": "",
  "dry_run": true
}
```
|
||||
|
||||
Runs restore planning/execution for every bound PVC in the source namespace. `snapshot` is optional and blank means latest completed backup per PVC.
|
||||
|
||||
### Policy API
|
||||
|
||||
Create or update a policy with `POST /v1/policies`:

```json
{
  "namespace": "ai",
  "pvc": "llm-cache",
  "interval_hours": 6,
  "enabled": true
}
```
|
||||
|
||||
- Leave `pvc` empty to target all PVCs in that namespace.
|
||||
- Policies are stored in secret `SOTERIA_POLICY_SECRET_NAME` under key `policies.json`.
|
||||
|
||||
### GET /v1/b2
|
||||
|
||||
Returns B2 account/bucket consumption based on S3-compatible object scans.
|
||||
|
||||
```json
{
  "enabled": true,
  "available": true,
  "endpoint": "https://s3.us-west-004.backblazeb2.com",
  "region": "us-west-004",
  "scanned_at": "2026-04-12T16:00:00Z",
  "scan_duration_ms": 824,
  "total_objects": 1324,
  "total_bytes": 18407542931,
  "recent_objects_24h": 18,
  "recent_bytes_24h": 12245812,
  "buckets": [
    {
      "name": "atlas-backups",
      "object_count": 1240,
      "total_bytes": 18288473811,
      "recent_objects_24h": 12,
      "recent_bytes_24h": 8542198,
      "last_modified_at": "2026-04-12T15:43:19Z"
    }
  ]
}
```
|
||||
|
||||
Recent 24h values are an object-change proxy and do not represent full B2 billing egress totals.
|
||||
|
||||
## Authentication and authorization
|
||||
|
||||
When `SOTERIA_AUTH_REQUIRED=true`, Soteria expects trusted auth headers from a fronting proxy such as `oauth2-proxy`:
|
||||
|
||||
- `X-Auth-Request-User`
- `X-Auth-Request-Email`
- `X-Auth-Request-Groups`
- `X-Forwarded-User` (fallback)
- `X-Forwarded-Email` (fallback)
- `X-Forwarded-Groups` (fallback)
|
||||
|
||||
Allowed groups are configured with `SOTERIA_ALLOWED_GROUPS` and compared after normalizing leading `/` prefixes, so both `maintenance` and `/maintenance` are accepted. Group lists may be comma- or semicolon-separated.
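The group comparison described above can be sketched in Go as follows. The function names are hypothetical and this is not the Soteria implementation; trimming whitespace, stripping a single leading `/`, and matching case-sensitively are assumptions beyond what the README states.

```go
package main

import (
	"fmt"
	"strings"
)

// splitGroups splits a comma- or semicolon-separated list and normalizes
// each entry by trimming whitespace and one leading "/".
func splitGroups(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || r == ';'
	})
	groups := make([]string, 0, len(fields))
	for _, f := range fields {
		g := strings.TrimPrefix(strings.TrimSpace(f), "/")
		if g != "" {
			groups = append(groups, g)
		}
	}
	return groups
}

// isAllowed reports whether any group from the proxy header matches one of
// the allowed groups after both sides are normalized.
func isAllowed(allowedRaw, headerRaw string) bool {
	allowed := make(map[string]struct{})
	for _, g := range splitGroups(allowedRaw) {
		allowed[g] = struct{}{}
	}
	for _, g := range splitGroups(headerRaw) {
		if _, ok := allowed[g]; ok {
			return true
		}
	}
	return false
}

func main() {
	// Example values: the allowed list matches the documented default,
	// the header values are made up for illustration.
	fmt.Println(isAllowed("admin,maintenance", "/maintenance;dev")) // true
	fmt.Println(isAllowed("admin,maintenance", "dev"))              // false
}
```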
|
||||
|
||||
Optional machine-to-machine access can be enabled with `SOTERIA_AUTH_BEARER_TOKENS`, which accepts a comma-separated list of bearer tokens.
|
||||
|
||||
## Prometheus metrics
|
||||
|
||||
Soteria exports Prometheus-format metrics at `GET /metrics`.
|
||||
|
||||
Implemented metrics:
|
||||
|
||||
- `soteria_backup_requests_total{driver,result}`
- `soteria_restore_requests_total{driver,result}`
- `soteria_policy_backups_total{result}`
- `soteria_namespace_backup_requests_total{driver,result}`
- `soteria_namespace_restore_requests_total{driver,result}`
- `soteria_authz_denials_total{reason}`
- `soteria_inventory_refresh_failures_total`
- `soteria_inventory_refresh_timestamp_seconds`
- `pvc_backup_age_hours{namespace,pvc,volume,driver}`
- `pvc_backup_health{namespace,pvc,volume,driver}`
- `pvc_backup_health_reason{namespace,pvc,volume,driver,reason}`
- `pvc_backup_last_success_timestamp_seconds{namespace,pvc,volume,driver}`
- `pvc_backup_count{namespace,pvc,volume,driver}`
- `pvc_backup_completed_count{namespace,pvc,volume,driver}`
- `pvc_backup_last_size_bytes{namespace,pvc,volume,driver}`
- `pvc_backup_total_size_bytes{namespace,pvc,volume,driver}`
- `soteria_b2_scan_success`
- `soteria_b2_scan_timestamp_seconds`
- `soteria_b2_scan_duration_seconds`
- `soteria_b2_account_objects`
- `soteria_b2_account_bytes`
- `soteria_b2_account_recent_objects_24h`
- `soteria_b2_account_recent_bytes_24h`
- `soteria_b2_bucket_objects{bucket}`
- `soteria_b2_bucket_bytes{bucket}`
- `soteria_b2_bucket_recent_objects_24h{bucket}`
- `soteria_b2_bucket_recent_bytes_24h{bucket}`
- `soteria_b2_bucket_last_modified_timestamp_seconds{bucket}`
|
||||
|
||||
`pvc_backup_health` is `1` when the most recent successful backup is within `SOTERIA_BACKUP_MAX_AGE_HOURS`, otherwise `0`.
|
||||
|
||||
## Configuration
|
||||
|
||||
Environment variables:
|
||||
|
||||
- `SOTERIA_BACKUP_DRIVER` default `longhorn`, allowed `longhorn`, `restic`
- `SOTERIA_LONGHORN_URL` default `http://longhorn-backend.longhorn-system.svc:9500`
- `SOTERIA_LONGHORN_BACKUP_MODE` default `incremental`, allowed `incremental`, `full`
- `SOTERIA_RESTIC_REPOSITORY` required for restic driver
- `SOTERIA_RESTIC_SECRET_NAME` default `soteria-restic`
- `SOTERIA_SECRET_NAMESPACE` default service namespace
- `SOTERIA_RESTIC_IMAGE` default `restic/restic:0.16.4`
- `SOTERIA_RESTIC_BACKUP_ARGS` optional extra args for `restic backup`
- `SOTERIA_RESTIC_FORGET_ARGS` optional extra args for `restic forget`
- `SOTERIA_S3_ENDPOINT` optional S3-compatible endpoint
- `SOTERIA_S3_REGION` optional region
- `SOTERIA_JOB_TTL_SECONDS` default `86400`
- `SOTERIA_JOB_NODE_SELECTOR` optional comma-separated `key=value` list
- `SOTERIA_JOB_SERVICE_ACCOUNT` optional ServiceAccount for restic Jobs
- `SOTERIA_LISTEN_ADDR` default `:8080`
- `SOTERIA_AUTH_REQUIRED` default `false`
- `SOTERIA_ALLOWED_GROUPS` default `admin,maintenance`
- `SOTERIA_AUTH_BEARER_TOKENS` optional comma-separated bearer tokens
- `SOTERIA_BACKUP_MAX_AGE_HOURS` default `24`
- `SOTERIA_METRICS_REFRESH_SECONDS` default `300`
- `SOTERIA_POLICY_EVAL_SECONDS` default `300`
- `SOTERIA_POLICY_SECRET_NAME` default `soteria-policies`
- `SOTERIA_USAGE_SECRET_NAME` default `soteria-backup-usage` (stores persisted restic size estimates)
- `SOTERIA_B2_ENABLED` default `false` (auto-enabled if endpoint/secret are set)
- `SOTERIA_B2_ENDPOINT` optional S3-compatible endpoint (for B2, usually `https://s3.<region>.backblazeb2.com`)
- `SOTERIA_B2_REGION` optional region override (auto-inferred for Backblaze endpoint patterns)
- `SOTERIA_B2_BUCKETS` optional comma-separated bucket allowlist (defaults to scanning all accessible buckets)
- `SOTERIA_B2_ACCESS_KEY_ID` optional static key (can come from secret instead)
- `SOTERIA_B2_SECRET_ACCESS_KEY` optional static secret key (can come from secret instead)
- `SOTERIA_B2_SECRET_NAMESPACE` optional secret namespace (defaults to service namespace when secret name is set)
- `SOTERIA_B2_SECRET_NAME` optional secret containing B2 keys
- `SOTERIA_B2_ACCESS_KEY_FIELD` default `AWS_ACCESS_KEY_ID`
- `SOTERIA_B2_SECRET_KEY_FIELD` default `AWS_SECRET_ACCESS_KEY`
- `SOTERIA_B2_ENDPOINT_FIELD` default `AWS_ENDPOINTS`
- `SOTERIA_B2_SCAN_INTERVAL_SECONDS` default `900`
- `SOTERIA_B2_SCAN_TIMEOUT_SECONDS` default `120`
|
||||
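Several of the variables above take comma-separated values. As one example, a `key=value` list such as `SOTERIA_JOB_NODE_SELECTOR` could be parsed along these lines. This is a sketch under assumptions: `parseNodeSelector` is a hypothetical helper, and skipping empty entries, trimming whitespace, and rejecting entries without `=` are choices made here rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeSelector turns "k1=v1,k2=v2" into a map suitable for a Pod
// nodeSelector. Empty entries are skipped; malformed entries are errors.
func parseNodeSelector(raw string) (map[string]string, error) {
	selector := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		// Split on the first "=" only, so values may contain "=".
		key, value, ok := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid node selector entry %q", pair)
		}
		selector[key] = strings.TrimSpace(value)
	}
	return selector, nil
}

func main() {
	// Example labels only; the second label is made up for illustration.
	selector, err := parseNodeSelector("kubernetes.io/arch=arm64, hardware=rpi5")
	if err != nil {
		fmt.Println("bad selector:", err)
		return
	}
	fmt.Println(selector["kubernetes.io/arch"], selector["hardware"])
}
```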
|
||||
## Secrets
|
||||
|
||||
Create a secret named `soteria-restic` in the Soteria namespace, or set `SOTERIA_RESTIC_SECRET_NAME`, when using the restic driver. Required keys:
|
||||
|
||||
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `RESTIC_PASSWORD`
|
||||
|
||||
The service copies this secret into the target namespace per job and attaches an owner reference so it is cleaned up with the Job.
|
||||
|
||||
For B2 scanning, you can point Soteria at a secret via `SOTERIA_B2_SECRET_NAME`. Expected keys by default:
|
||||
|
||||
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_ENDPOINTS` (optional if `SOTERIA_B2_ENDPOINT` is set)
|
||||
|
||||
A template is in `deploy/secret-example.yaml`. Do not commit real credentials.
|
||||
|
||||
## Deployment
|
||||
|
||||
The `deploy/` folder includes Kustomize-ready manifests for namespace, RBAC, config, deployment, and service.
|
||||
|
||||
Apply with:
|
||||
|
||||
```sh
kubectl apply -k deploy
```
|
||||
|
||||
The example Service is annotated for Prometheus scraping of `/metrics`.
|
||||
|
||||
## Notes
|
||||
|
||||
- Longhorn inventory and metrics are based on discovered backup records per PVC.
- Inventory `Restore` buttons load source context into the restore planner; restore execution happens from the planner panel.
- Scheduled backup policies apply to both Longhorn and restic drivers.
- Restic size telemetry is estimated from per-job upload summaries; with shared dedupe repositories those values are per-PVC attributions, not exact physical B2 ownership.
- For Atlas production, place Soteria behind an authenticated ingress and trust only proxy-injected auth headers.
|
||||
The local deploy manifests live in `deploy/`. Production wiring should still go
|
||||
through the Flux repo, not one-off cluster edits.
|
||||
|
||||