# Metis
Metis produces fully configured recovery SD cards for any node in the lab (RPi 4/5 workers, control-plane Pis, amd64 nodes like tethys, titan-db, titan-jh, the future titan-20/21, and non-cluster hosts). Goal: one command plus an inserted SD card, and the node rejoins with identical identity, network, k3s role/labels/taints, and pre-baked log/GC drop-ins.
## Objectives
- Cross-platform (Linux + Windows) CLI/GUI with dead-simple UX.
- Pull class-specific golden images from Harbor (or other artifact store), inject per-node config, and write/verify SD cards.
- Minimal image set via node classes; inject per-node deltas at burn time.
- Idempotent bootstraps: hostname/IP, k3s server/agent setup, labels/taints, journald/log GC drop-ins, Longhorn and USB scratch mount validation, SSH keys/users.
- Works offline once artifacts are cached; verifies hashes/signatures before writing.
## Planned high-level workflow
1) Select target node (from inventory) + target disk.
2) Tool downloads/caches the right golden image for that node class.
3) Injects per-node config (net, k3s tokens/roles/labels/taints, SSH keys, runtime drop-ins, Longhorn mount metadata, USB scratch bind layout) and writes SD.
4) Verifies the write; prints the next step: "insert and power on." No manual follow-up.
## Early design notes
- Implemented in Go for easy static builds and a lightweight GUI (e.g., Fyne or Wails) plus CLI.
- Inventory-driven: node classes (rpi5-ubuntu, rpi4-armbian-longhorn, rpi4-armbian-std, control-plane, amd64-agents, external hosts).
- Extensible per-node hooks for special hardware (Longhorn HDD UUIDs on titan-13/15/17/19; future titan-20/21; oceanus/titan-23; tethys/titan-jh/titan-db).
- Secure defaults: hash checking for downloaded images; avoids ever printing secrets; prepares k3s tokens/certs/keys via sealed source.
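To make the inventory-driven model concrete, an `inv.yaml` for one class and one node might look like the sketch below. All field names here are assumptions for illustration, not the shipped schema:

```yaml
# Hypothetical inventory shape — field names are illustrative only.
classes:
  rpi4-armbian-longhorn:
    image: harbor.example.internal/metis/rpi4-armbian-longhorn
    boot_overlay: overlays/rpi4-longhorn/boot
    root_overlay: overlays/rpi4-longhorn/root
nodes:
  titan-13:
    class: rpi4-armbian-longhorn
    ip: 10.0.0.13/24
    k3s:
      role: agent
      labels:
        - longhorn=true
```

The class supplies the golden image and static overlays; the node entry supplies only the per-node deltas injected at burn time.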
## Repo layout (initial)
- `cmd/` CLI/GUI entrypoints
- `pkg/` shared lib (inventory, imaging, injectors, platform abstraction)
- `docs/` user/operator docs (kept light; working notes live in the untracked `AGENTS.md`)
- `AGENTS.md` local, untracked working notes (do not add to git)
## Current modes
- `metis plan --inventory inv.yaml --node titan-13 --device /dev/sdz --cache /tmp/metis-cache` prints the burn plan (respects `--boot/--root` or `METIS_*` envs for injection steps).
- `metis burn ... --yes` downloads/verifies the golden image, writes it (dd for `/dev/*`, file copy otherwise), and injects node config when mounts are provided.
- Pass `--boot /mnt/boot --root /mnt/root` (or set `METIS_BOOT_PATH`/`METIS_ROOT_PATH`) to drop hostname, k3s config, ssh keys, NoCloud user-data, and a debug `etc/metis/node.json` into the mounted card. If unset, injection is skipped (write-only).
- `--auto-mount` attempts to mount `/dev/*` partitions (or loop images) automatically for injection on Linux (requires privileges).
- `metis image --inventory inv.yaml --node titan-13 --output artifacts/titan-13.img` produces a fully injected raw image artifact without writing to removable media.
- `metis serve` runs the operator-facing Metis service:
  - web UI for build/flash workflows
  - Prometheus metrics on `/metrics`
  - internal sentinel snapshot + watch endpoints
- Container images are split for gentler cluster operation:
  - `metis` carries the flash/build toolchain and is intended to run on `titan-22`
  - `metis-sentinel` stays slim for the DaemonSet that samples node facts
- Class overlays: define `boot_overlay`/`root_overlay` on a class to merge static files into boot/root at burn time (e.g., cloud-init/netplan drop-ins, GPU driver configs). Per-node config still injects hostname/IP/k3s/SSH/Longhorn.
- Linux loop-mount helper (losetup/mount) exists for automation; wiring into CLI burn is next. Windows writer/GUI stub forthcoming.
- Vault: Metis can read per-node secrets from `kv/data/atlas/nodes/<hostname>` using VAULT_ADDR plus either VAULT_TOKEN or AppRole (VAULT_ROLE_ID/VAULT_SECRET_ID). Expected fields: atlas_password, root_password, k3s_token, cloud_init, extra map.
- Sentinel: `metis-sentinel` collects host facts and can either print them, write local history, or push them into the Metis service. The intended deployment shape is a DaemonSet on cluster nodes plus an Ariadne-triggered Metis watch that recomputes recommended class targets and drift history.
- Facts aggregation: `metis facts --inventory inv.yaml --snapshots ./snapshots` reads sentinel snapshot JSON files and prints per-class drift summary (kernels, containerd, k3s, package samples). Use exported ConfigMaps or `METIS_SENTINEL_OUT` history as input.
- `metis config --inventory inv.yaml --node titan-13` prints the merged node config (hostname/IP/k3s labels/taints/Longhorn UUIDs and optional USB scratch metadata).
## Service direction
- Deployed UI protected by Atlas SSO headers (`admin` / `maintenance`)
- Default flash host support for `titan-22`
- Recent build / flash / sentinel change history
- Ariadne-driven sentinel watch cadence
- Prometheus/Grafana visibility for Metis runs and tests
- CI test metrics share the `ariadne_ci_*` series and are distinguished by `repo="metis"`
Current deployment note: the service can fetch and verify the rpi4 base image from an official URL via `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN` and `METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256`, then cache it locally on the flash host. A mirrored Harbor-backed base image is still preferable long term, but it is no longer a prerequisite for Texas-side builds.
Next steps: publish the service images, add the SCM remote/repo for Metis, and broaden inventory coverage beyond the current Titan recovery classes.