# Metis

Metis produces fully configured recovery SD cards for any node in the lab (RPi 4/5 workers, control-plane Pis, amd64 nodes like tethys, titan-db, titan-jh, future titan-20/21, and non-cluster hosts). Goal: one command plus inserting the SD card, and the node rejoins with identical identity, network, k3s role/labels/taints, and pre-baked log/GC drop-ins.
## Objectives

- Cross-platform (Linux + Windows) CLI/GUI with dead-simple UX.
- Pull class-specific golden images from Harbor (or another artifact store), inject per-node config, and write/verify SD cards.
- Minimal image set via node classes; inject per-node deltas at burn time.
- Idempotent bootstraps: hostname/IP, k3s server/agent setup, labels/taints, journald/log GC drop-ins, Longhorn mount validation, SSH keys/users.
- Works offline once artifacts are cached; verifies hashes/signatures before writing.
## Planned high-level workflow

1) Select target node (from inventory) + target disk.
2) Tool downloads/caches the right golden image for that node class.
3) Injects per-node config (net, k3s tokens/roles/labels/taints, SSH keys, runtime drop-ins, Longhorn mount metadata) and writes the SD.
4) Verifies the write; prints the next step: "insert and power on." No manual follow-up.
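
Step 3's write path depends on the target: per the burn mode described under "Current modes," block devices under `/dev/` get a raw dd-style write while plain paths get a file copy. A minimal sketch of that dispatch (illustrative, not the real implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// writeMode picks how the image is written: raw block writes for /dev/*
// targets, plain file copy otherwise (e.g. when producing an .img artifact).
func writeMode(target string) string {
	if strings.HasPrefix(target, "/dev/") {
		return "raw-block-write" // dd-style
	}
	return "file-copy"
}

func main() {
	fmt.Println(writeMode("/dev/sdz"))               // removable SD card
	fmt.Println(writeMode("artifacts/titan-13.img")) // image artifact on disk
}
```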
## Early design notes

- Implemented in Go for easy static builds and a lightweight GUI (e.g., Fyne or Wails) plus CLI.
- Inventory-driven: node classes (rpi5-ubuntu, rpi4-armbian-longhorn, rpi4-armbian-std, control-plane, amd64-agents, external hosts).
- Extensible per-node hooks for special hardware (Longhorn HDD UUIDs on titan-13/15/17/19; future titan-20/21; oceanus/titan-23; tethys/titan-jh/titan-db).
- Secure defaults: hash checking for downloaded images; never prints secrets; prepares k3s tokens/certs/keys via a sealed source.
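
The inventory-driven class lookup might look like the following Go sketch. The types and the Harbor hostname are assumptions for illustration; the real schema lives in `inv.yaml`:

```go
package main

import "fmt"

// NodeClass and Node are illustrative shapes, not Metis's actual types.
type NodeClass struct {
	Name     string
	ImageRef string // golden image in the artifact store
}

type Node struct {
	Hostname string
	Class    string
	// Per-node deltas injected at burn time.
	LonghornUUIDs []string
}

// classFor resolves a node's class entry from the inventory's class map.
func classFor(classes map[string]NodeClass, n Node) (NodeClass, bool) {
	c, ok := classes[n.Class]
	return c, ok
}

func main() {
	classes := map[string]NodeClass{
		// harbor.example is a placeholder registry host.
		"rpi4-armbian-longhorn": {Name: "rpi4-armbian-longhorn", ImageRef: "harbor.example/metis/rpi4-armbian-longhorn"},
	}
	n := Node{Hostname: "titan-13", Class: "rpi4-armbian-longhorn"}
	if c, ok := classFor(classes, n); ok {
		fmt.Println(n.Hostname, "uses image", c.ImageRef)
	}
}
```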
## Repo layout (initial)

- `cmd/` – CLI/GUI entrypoints
- `pkg/` – shared lib (inventory, imaging, injectors, platform abstraction)
- `docs/` – user/operator docs (kept light; working notes live in the untracked AGENTS.md)
- `AGENTS.md` – local, untracked working notes (do not add to git)
## Current modes

- `metis plan --inventory inv.yaml --node titan-13 --device /dev/sdz --cache /tmp/metis-cache` prints the burn plan (respects `--boot`/`--root` or `METIS_*` env vars for injection steps).
- `metis burn ... --yes` downloads and verifies the golden image, writes it (dd for `/dev/*`, file copy otherwise), and injects node config when mounts are provided.
- Pass `--boot /mnt/boot --root /mnt/root` (or set `METIS_BOOT_PATH`/`METIS_ROOT_PATH`) to drop the hostname, k3s config, SSH keys, NoCloud user-data, and a debug `etc/metis/node.json` onto the mounted card. If unset, injection is skipped (write-only).
- `--auto-mount` attempts to mount `/dev/*` partitions (or loop images) automatically for injection on Linux (requires privileges).
- `metis image --inventory inv.yaml --node titan-13 --output artifacts/titan-13.img` produces a fully injected raw image artifact without writing to removable media.
- `metis serve` runs the operator-facing Metis service:
  - web UI for build/flash workflows
  - Prometheus metrics on `/metrics`
  - internal sentinel snapshot + watch endpoints
- Container images are split for gentler cluster operation:
  - `metis` carries the flash/build toolchain and is intended to run on `titan-22`
  - `metis-sentinel` stays slim for the DaemonSet that samples node facts
- Class overlays: define `boot_overlay`/`root_overlay` on a class to merge static files into boot/root at burn time (e.g., cloud-init/netplan drop-ins, GPU driver configs). Per-node config still injects hostname/IP/k3s/SSH/Longhorn.
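
As an illustration, a class entry with overlays might look like the fragment below. Only the `boot_overlay`/`root_overlay` field names come from the notes above; the surrounding inventory shape and paths are assumptions:

```yaml
# Hypothetical inv.yaml fragment (illustrative shape)
classes:
  rpi4-armbian-longhorn:
    boot_overlay: overlays/rpi4-longhorn/boot   # e.g. cloud-init/netplan drop-ins
    root_overlay: overlays/rpi4-longhorn/root   # e.g. GPU driver configs
```
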
- A Linux loop-mount helper (losetup/mount) exists for automation; wiring it into CLI burn is next. A Windows writer/GUI stub is forthcoming.
- Vault: Metis can read per-node secrets from `secret/data/nodes/<hostname>` using `VAULT_ADDR` plus either `VAULT_TOKEN` or AppRole (`VAULT_ROLE_ID`/`VAULT_SECRET_ID`). Expected fields: `ssh_password`, `k3s_token`, `cloud_init`, and an `extra` map.
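
The Vault read above targets the KV v2 engine, whose responses nest the secret under `data.data`. A Go sketch of the path construction and response parsing (the types here are illustrative; only the path and field names come from the notes above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeSecretPath builds the KV v2 read path for a node.
func nodeSecretPath(hostname string) string {
	return "secret/data/nodes/" + hostname
}

// NodeSecrets mirrors the expected per-node fields.
type NodeSecrets struct {
	SSHPassword string            `json:"ssh_password"`
	K3sToken    string            `json:"k3s_token"`
	CloudInit   string            `json:"cloud_init"`
	Extra       map[string]string `json:"extra"`
}

// parseKV2 extracts the inner data.data object of a Vault KV v2 response.
func parseKV2(body []byte) (NodeSecrets, error) {
	var resp struct {
		Data struct {
			Data NodeSecrets `json:"data"`
		} `json:"data"`
	}
	err := json.Unmarshal(body, &resp)
	return resp.Data.Data, err
}

func main() {
	fmt.Println(nodeSecretPath("titan-13"))
	sample := []byte(`{"data":{"data":{"ssh_password":"redacted","k3s_token":"tok","cloud_init":"","extra":{}}}}`)
	s, _ := parseKV2(sample)
	// Print presence only, never the secret itself.
	fmt.Println("k3s token present:", s.K3sToken != "")
}
```

Note the example reports only whether a token is present, matching the "never prints secrets" default.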
- Sentinel: `metis-sentinel` collects host facts and can print them, write local history, or push them into the Metis service. The intended deployment shape is a DaemonSet on cluster nodes plus an Ariadne-triggered Metis watch that recomputes recommended class targets and drift history.
- Facts aggregation: `metis facts --inventory inv.yaml --snapshots ./snapshots` reads sentinel snapshot JSON files and prints a per-class drift summary (kernels, containerd, k3s, package samples). Use exported ConfigMaps or `METIS_SENTINEL_OUT` history as input.
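
The drift summary reduces to grouping fact values by class; more than one distinct value per class indicates drift. A pared-down Go sketch (the snapshot fields and kernel versions are illustrative):

```go
package main

import "fmt"

// Snapshot is a pared-down stand-in for a sentinel snapshot; the real
// files carry more facts (containerd, k3s, package samples).
type Snapshot struct {
	Node   string
	Class  string
	Kernel string
}

// driftByClass collects the distinct kernel versions seen per class.
func driftByClass(snaps []Snapshot) map[string]map[string]bool {
	out := map[string]map[string]bool{}
	for _, s := range snaps {
		if out[s.Class] == nil {
			out[s.Class] = map[string]bool{}
		}
		out[s.Class][s.Kernel] = true
	}
	return out
}

func main() {
	snaps := []Snapshot{
		{Node: "titan-13", Class: "rpi4-armbian-longhorn", Kernel: "6.6.31"},
		{Node: "titan-15", Class: "rpi4-armbian-longhorn", Kernel: "6.6.28"},
	}
	d := driftByClass(snaps)
	// Two kernels in one class -> the class has drifted.
	fmt.Println("kernels in class:", len(d["rpi4-armbian-longhorn"]))
}
```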
- `metis config --inventory inv.yaml --node titan-13` prints the merged node config (hostname/IP/k3s labels/taints/Longhorn UUIDs).
## Service direction

- Deployed UI protected by Atlas SSO headers (`admin` / `maintenance`)
- Default flash host support for `titan-22`
- Recent build / flash / sentinel change history
- Ariadne-driven sentinel watch cadence
- Prometheus/Grafana visibility for Metis runs and tests
- CI test metrics share the `ariadne_ci_*` series and are distinguished by `repo="metis"`
Current deployment note: the service can fetch and verify the rpi4 base image from an official URL via `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN` and `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256`, then cache it locally on the flash host. A mirrored Harbor-backed base image is still preferable long term, but it is no longer a prerequisite for Texas-side builds.
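
As a config sketch, the two variables might be set like this on the flash host; the URL and digest are placeholders, not real values:

```sh
# Placeholders: substitute the real upstream URL and its published digest.
export METIS_IMAGE_RPI4_ARMBIAN_LONGHORN="https://example.org/armbian-rpi4.img.xz"
export METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256="<sha256-of-the-download>"
```
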
Next steps: publish the service images, add the SCM remote/repo for Metis, and broaden inventory coverage beyond the current Titan recovery classes.