
Metis

Metis produces fully configured recovery SD cards for any node in the lab (RPi 4/5 workers, control-plane Pis, amd64 nodes such as tethys, titan-db, titan-jh, the future titan-20/21, and non-cluster hosts). Goal: one command plus an inserted SD card, and the node rejoins with identical identity, network, k3s role/labels/taints, and pre-baked log/GC drop-ins.

Objectives

  • Cross-platform (Linux + Windows) CLI/GUI with dead-simple UX.
  • Pull class-specific golden images from Harbor (or other artifact store), inject per-node config, and write/verify SD cards.
  • Minimal image set via node classes; inject per-node deltas at burn time.
  • Idempotent bootstraps: hostname/IP, k3s server/agent setup, labels/taints, journald/log GC drop-ins, Longhorn mount validation, SSH keys/users.
  • Works offline once artifacts are cached; verifies hashes/signatures before writing.

Planned high-level workflow

  1. Select target node (from inventory) + target disk.
  2. Tool downloads/caches the right golden image for that node class.
  3. Injects per-node config (net, k3s tokens/roles/labels/taints, SSH keys, runtime drop-ins, Longhorn mount metadata) and writes SD.
  4. Verifies the write and prints the next step: "insert and power on." No manual follow-up is needed.

Early design notes

  • Implemented in Go for easy static builds and a lightweight GUI (e.g., Fyne or Wails) plus CLI.
  • Inventory-driven: node classes (rpi5-ubuntu, rpi4-armbian-longhorn, rpi4-armbian-std, control-plane, amd64-agents, external hosts).
  • Extensible per-node hooks for special hardware (Longhorn HDD UUIDs on titan-13/15/17/19; future titan-20/21; oceanus/titan-23; tethys/titan-jh/titan-db).
  • Secure defaults: hash checking for downloaded images; never prints secrets; prepares k3s tokens/certs/keys via a sealed source.
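A minimal inventory entry for the class/node split might look like the following; every field name here is an illustrative guess at the schema, not Metis's actual format:

```yaml
classes:
  rpi4-armbian-longhorn:
    image: harbor.example.lab/metis/rpi4-armbian-longhorn:latest

nodes:
  titan-13:
    class: rpi4-armbian-longhorn
    ip: 10.0.0.13/24
    k3s:
      role: agent
      labels: [storage=longhorn]
    longhorn_disk_uuid: 0000-HYPOTHETICAL
```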

Repo layout (initial)

  • cmd/ CLI/GUI entrypoints
  • pkg/ shared lib (inventory, imaging, injectors, platform abstraction)
  • docs/ user/operator docs (kept light; day-to-day working notes live in AGENTS.md)
  • AGENTS.md local, untracked working notes (do not add to git)

Current modes

  • metis plan --inventory inv.yaml --node titan-13 --device /dev/sdz --cache /tmp/metis-cache prints the burn plan (respects --boot/--root or METIS_* envs for injection steps).
  • metis burn ... --yes downloads/verifies the golden image, writes it (dd for /dev/*, file copy otherwise), and injects node config when mounts are provided.
    • Pass --boot /mnt/boot --root /mnt/root (or set METIS_BOOT_PATH/METIS_ROOT_PATH) to drop hostname, k3s config, ssh keys, NoCloud user-data, and a debug etc/metis/node.json into the mounted card. If unset, injection is skipped (write-only).
    • --auto-mount attempts to mount /dev/* partitions (or loop images) automatically for injection on Linux (requires privileges).
  • metis image --inventory inv.yaml --node titan-13 --output artifacts/titan-13.img produces a fully injected raw image artifact without writing to removable media.
  • metis serve runs the operator-facing Metis service:
    • web UI for build/flash workflows
    • Prometheus metrics on /metrics
    • internal sentinel snapshot + watch endpoints
  • Container images are split for gentler cluster operation:
    • metis carries the flash/build toolchain and is intended to run on titan-22
    • metis-sentinel stays slim for the DaemonSet that samples node facts
  • Class overlays: define boot_overlay/root_overlay on a class to merge static files into boot/root at burn time (e.g., cloud-init/netplan drop-ins, GPU driver configs). Per-node config still injects hostname/IP/k3s/SSH/Longhorn.
  • A Linux loop-mount helper (losetup/mount) exists for automation; wiring it into the CLI burn path is next. A Windows writer/GUI stub is forthcoming.
  • Vault: Metis can read per-node secrets from secret/data/nodes/<hostname> using VAULT_ADDR plus either VAULT_TOKEN or AppRole (VAULT_ROLE_ID/VAULT_SECRET_ID). Expected fields: ssh_password, k3s_token, cloud_init, extra map.
  • Sentinel: metis-sentinel collects host facts and can either print them, write local history, or push them into the Metis service. The intended deployment shape is a DaemonSet on cluster nodes plus an Ariadne-triggered Metis watch that recomputes recommended class targets and drift history.
  • Facts aggregation: metis facts --inventory inv.yaml --snapshots ./snapshots reads sentinel snapshot JSON files and prints per-class drift summary (kernels, containerd, k3s, package samples). Use exported ConfigMaps or METIS_SENTINEL_OUT history as input.
  • metis config --inventory inv.yaml --node titan-13 prints the merged node config (hostname/IP/k3s labels/taints/Longhorn UUIDs).

Service direction

  • Deployed UI protected by Atlas SSO headers (admin / maintenance)
  • Default flash host support for titan-22
  • Recent build / flash / sentinel change history
  • Ariadne-driven sentinel watch cadence
  • Prometheus/Grafana visibility for Metis runs and tests
    • CI test metrics share the ariadne_ci_* series and are distinguished by repo="metis"
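Filtering the shared CI series down to Metis runs in Grafana/PromQL could then look like this; the metric name after the ariadne_ci_ prefix is a hypothetical example, since only the prefix is specified:

```promql
ariadne_ci_test_failures_total{repo="metis"}
```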

Current deployment note: the service can fetch and verify the rpi4 base image from an official URL via METIS_IMAGE_RPI4_ARMBIAN_LONGHORN and METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256, then cache it locally on the flash host. A mirrored Harbor-backed base image is still preferable long term, but it is no longer a prerequisite for Texas-side builds.
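Configuring those two variables on the flash host might look like this; the URL and digest are placeholders to be replaced with the real release values:

```shell
# Point the service at the official Armbian image and pin its checksum.
# Both values below are placeholders, not a real release URL or digest.
export METIS_IMAGE_RPI4_ARMBIAN_LONGHORN="https://example.org/armbian-rpi4.img.xz"
export METIS_IMAGE_RPI4_ARMBIAN_LONGHORN_SHA256="0000000000000000000000000000000000000000000000000000000000000000"
```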

Next steps: publish the service images, add the SCM remote/repo for Metis, and broaden inventory coverage beyond the current Titan recovery classes.