# Metis

Metis produces fully configured recovery SD cards for any node in the lab (RPi 4/5 workers, control-plane Pis, amd64 nodes like tethys, titan-db, titan-jh, the future titan-20/21, and non-cluster hosts). Goal: one command plus an inserted SD card, and the node rejoins with identical identity, network, k3s role/labels/taints, and pre-baked log/GC drop-ins.

## Objectives

- Cross-platform (Linux + Windows) CLI/GUI with a dead-simple UX.
- Pull class-specific golden images from Harbor (or another artifact store), inject per-node config, and write/verify SD cards.
- Minimal image set via node classes; per-node deltas are injected at burn time.
- Idempotent bootstraps: hostname/IP, k3s server/agent setup, labels/taints, journald/log GC drop-ins, Longhorn and USB scratch mount validation, SSH keys/users.
- Works offline once artifacts are cached; verifies hashes/signatures before writing.

## Planned high-level workflow

1) Select the target node (from inventory) and the target disk.
2) The tool downloads/caches the right golden image for that node class.
3) It injects per-node config (network, k3s tokens/roles/labels/taints, SSH keys, runtime drop-ins, Longhorn mount metadata, USB scratch bind layout) and writes the SD card.
4) It verifies the write and prints the next step: "insert and power on." No manual follow-up.

## Early design notes

- Implemented in Go for easy static builds and a lightweight GUI (e.g., Fyne or Wails) plus CLI.
- Inventory-driven: node classes (rpi5-ubuntu, rpi4-armbian-longhorn, rpi4-armbian-std, control-plane, amd64-agents, external hosts).
- Extensible per-node hooks for special hardware (Longhorn HDD UUIDs on titan-13/15/17/19; future titan-20/21; oceanus/titan-23; tethys/titan-jh/titan-db).
- Secure defaults: hash checking for downloaded images; never prints secrets; prepares k3s tokens/certs/keys via a sealed source.

## Repo layout (initial)

- `cmd/` – CLI/GUI entrypoints
- `pkg/` – shared library (inventory, imaging, injectors, platform abstraction)
- `docs/` – user/operator docs (kept light; working notes live in AGENTS.md, untracked)
- `AGENTS.md` – local, untracked working notes (do not add to git)

## Current modes

- `metis plan --inventory inv.yaml --node titan-13 --device /dev/sdz --cache /tmp/metis-cache` prints the burn plan (respects `--boot/--root` or `METIS_*` envs for injection steps).
- `metis burn ... --yes` downloads/verifies the golden image, writes it (dd for `/dev/*`, file copy otherwise), and injects node config when mounts are provided.
- Pass `--boot /mnt/boot --root /mnt/root` (or set `METIS_BOOT_PATH`/`METIS_ROOT_PATH`) to drop the hostname, k3s config, SSH keys, NoCloud user-data, and a debug `etc/metis/node.json` onto the mounted card. If unset, injection is skipped (write-only). A sketch of this injection step follows this list.
- `--auto-mount` attempts to mount `/dev/*` partitions (or loop images) automatically for injection on Linux (requires privileges).
- `metis image --inventory inv.yaml --node titan-13 --output artifacts/titan-13.img` produces a fully injected raw image artifact without writing to removable media.
- `metis serve` runs the operator-facing Metis service:
  - web UI for build/flash workflows
  - Prometheus metrics on `/metrics`
  - internal sentinel snapshot + watch endpoints
- Container images are split for gentler cluster operation:
  - `metis` carries the flash/build toolchain and is intended to run on `titan-22`
  - `metis-sentinel` stays slim for the DaemonSet that samples node facts
- Class overlays: define `boot_overlay`/`root_overlay` on a class to merge static files into boot/root at burn time (e.g., cloud-init/netplan drop-ins, GPU driver configs). Per-node config still injects hostname/IP/k3s/SSH/Longhorn.
- A Linux loop-mount helper (losetup/mount) exists for automation; wiring it into the CLI burn is next. A Windows writer/GUI stub is forthcoming.
- Vault: Metis can read per-node secrets from `kv/data/atlas/nodes/` using VAULT_ADDR plus either VAULT_TOKEN or AppRole (VAULT_ROLE_ID/VAULT_SECRET_ID). Expected fields: ssh_password, k3s_token, cloud_init, and an extra map.
- Sentinel: `metis-sentinel` collects host facts and can print them, write local history, or push them into the Metis service. The intended deployment shape is a DaemonSet on cluster nodes plus an Ariadne-triggered Metis watch that recomputes recommended class targets and drift history.
- Facts aggregation: `metis facts --inventory inv.yaml --snapshots ./snapshots` reads sentinel snapshot JSON files and prints a per-class drift summary (kernels, containerd, k3s, package samples). Use exported ConfigMaps or `METIS_SENTINEL_OUT` history as input.
- `metis config --inventory inv.yaml --node titan-13` prints the merged node config (hostname/IP/k3s labels/taints/Longhorn UUIDs and optional USB scratch metadata).
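For illustration, here is a minimal sketch of the per-node injection step referenced above, assuming a root partition mounted at `METIS_ROOT_PATH`. The `NodeConfig` type, the `injectNode` helper, the `etc/hostname` target, and the sample values are hypothetical; only the `etc/metis/node.json` debug path comes from the modes above, and the real injectors live in `pkg/`.

```go
// Hypothetical sketch of injecting per-node config into a mounted root partition.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// NodeConfig is a stand-in for the merged per-node config (real types live in pkg/).
type NodeConfig struct {
	Hostname string            `json:"hostname"`
	IP       string            `json:"ip"`
	K3sRole  string            `json:"k3s_role"`
	Labels   map[string]string `json:"labels,omitempty"`
}

// injectNode writes the hostname and a debug node.json onto the mounted root
// partition; the real injector also handles k3s config, SSH keys, and NoCloud
// user-data on the boot partition.
func injectNode(rootMount string, cfg NodeConfig) error {
	// Standard /etc/hostname location; individual image classes may differ.
	hostnamePath := filepath.Join(rootMount, "etc", "hostname")
	if err := os.WriteFile(hostnamePath, []byte(cfg.Hostname+"\n"), 0o644); err != nil {
		return fmt.Errorf("write hostname: %w", err)
	}
	// Debug snapshot of the merged config, matching etc/metis/node.json above.
	metisDir := filepath.Join(rootMount, "etc", "metis")
	if err := os.MkdirAll(metisDir, 0o755); err != nil {
		return err
	}
	raw, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(metisDir, "node.json"), raw, 0o600)
}

func main() {
	// Sample values only; the real config is merged from the inventory.
	cfg := NodeConfig{Hostname: "titan-13", IP: "10.0.0.13", K3sRole: "agent"}
	if err := injectNode(os.Getenv("METIS_ROOT_PATH"), cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

In the real flow this only runs when `--boot/--root` or the `METIS_*` mount envs are supplied; otherwise the burn stays write-only, as noted above.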
## Service direction

- Deployed UI protected by Atlas SSO headers (`admin` / `maintenance`)
- Default flash host support for `titan-22`
- Recent build / flash / sentinel change history
- Ariadne-driven sentinel watch cadence
- Prometheus/Grafana visibility for Metis runs and tests
- CI test metrics share the `ariadne_ci_*` series and are distinguished by `repo="metis"`

Current deployment note: the service can fetch and verify the rpi4 base image from an official URL via `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN` and `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256`, then cache it locally on the flash host. A mirrored Harbor-backed base image is still preferable long term, but it is no longer a prerequisite for Texas-side builds.

Next steps: publish the service images, add the SCM remote/repo for Metis, and broaden inventory coverage beyond the current Titan recovery classes.
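As a companion to the deployment note above, here is a minimal sketch of an env-var-driven fetch-and-verify flow, assuming the two `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN*` variables are set. The `fetchVerified` helper and the `/var/cache/metis` cache location are illustrative, not the service's actual implementation.

```go
// Hypothetical sketch: download the rpi4 base image named by
// METIS_IMAGE_RPI4_ARMBIAN_LONGHORN, verify it against
// METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256, and keep it in a local cache.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// fetchVerified streams the image into cacheDir and fails if the digest mismatches.
func fetchVerified(url, wantSHA256, cacheDir string) (string, error) {
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return "", err
	}
	dest := filepath.Join(cacheDir, filepath.Base(url))

	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("fetch %s: %s", url, resp.Status)
	}

	out, err := os.Create(dest)
	if err != nil {
		return "", err
	}
	defer out.Close()

	// Hash while writing so the image is only read once.
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		return "", err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
		os.Remove(dest) // never leave an unverified image in the cache
		return "", fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}
	return dest, nil
}

func main() {
	path, err := fetchVerified(
		os.Getenv("METIS_IMAGE_RPI4_ARMBIAN_LONGHORN"),
		os.Getenv("METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256"),
		"/var/cache/metis", // illustrative cache location on the flash host
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("verified image cached at", path)
}
```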