Titan rpi4 Remote Replacement

This is the low-touch replacement flow for titan-13 and titan-19 when the person onsite can only:

  1. insert an SD card into the flashing machine
  2. swap the card into the Pi
  3. power-cycle the Pi

The remote operator does everything else.

What the image does by itself

After the stale Kubernetes node object is deleted and the replacement image is flashed, the booted Pi is expected to do the rest automatically:

  • bring up SSH on port 2277
  • set the node hostname
  • bring up the node's static 192.168.22.x address on end0
  • mount /mnt/astreae and /mnt/asteria
  • start open-iscsi
  • start k3s-agent
  • rejoin the cluster with the baked-in node token and server URL
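
The list above can be turned into a quick post-boot check run on the Pi itself. This is a minimal sketch, not part of the metis repo: the names check, has_lan_addr, and verify_node are made up for illustration, and the systemd unit names open-iscsi and k3s-agent are assumed from the bullet list and may differ on the actual image.

```shell
# check LABEL CMD...: run CMD quietly, then print PASS or FAIL plus the label.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Succeeds when end0 carries an IPv4 address in 192.168.22.x.
has_lan_addr() {
  ip -4 -o addr show dev end0 | grep -q 'inet 192\.168\.22\.'
}

# verify_node HOSTNAME: run every check and return non-zero if any failed.
verify_node() {
  local expected="$1" failed=0
  check "hostname is $expected" test "$(hostname)" = "$expected" || failed=1
  check "end0 has a 192.168.22.x address" has_lan_addr || failed=1
  check "/mnt/astreae is mounted" mountpoint -q /mnt/astreae || failed=1
  check "/mnt/asteria is mounted" mountpoint -q /mnt/asteria || failed=1
  check "open-iscsi is active" systemctl is-active --quiet open-iscsi || failed=1
  check "k3s-agent is active" systemctl is-active --quiet k3s-agent || failed=1
  return "$failed"
}
```

After loading these functions in a shell on the node, `verify_node titan-13` prints one PASS or FAIL line per item. SSH on port 2277 is not checked separately, since reaching the node over SSH already proves it.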

Version clarification

As of March 31, 2026, the live cluster reports:

  • control plane: k3s v1.33.3+k3s1
  • healthy rpi4 Longhorn workers (titan-15, titan-17): k3s v1.31.5+k3s1

The 6.6.63 and 6.12.41 numbers are Linux kernel versions, not Kubernetes versions.

Kubernetes' official version skew policy says a kubelet may be up to three minor versions older than the kube-apiserver (and never newer), so 1.31 workers against a 1.33 control plane are supported today.

The replacement images intentionally keep the rpi4 worker k3s version aligned with the healthy HDD-backed rpi4 workers to avoid introducing a Kubernetes minor change during node recovery.
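
The skew rule can be checked mechanically. The sketch below assumes both versions share the same major number and uses two hypothetical helpers, minor_of and skew_ok; it encodes the three-minor-version window described above and nothing else.

```shell
# minor_of VERSION: print the minor number of a version like "v1.33.3+k3s1".
minor_of() {
  local v="${1#v}"   # drop the leading "v"
  v="${v#*.}"        # drop the major number and the first dot
  printf '%s\n' "${v%%.*}"
}

# skew_ok SERVER KUBELET: succeed when the kubelet is no newer than the
# API server and at most three minor versions older.
skew_ok() {
  local diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le 3 ]
}

if skew_ok "v1.33.3+k3s1" "v1.31.5+k3s1"; then
  echo "skew ok"
else
  echo "skew out of policy"
fi
```

With the versions reported above, the difference is two minor versions, so this prints `skew ok`.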

Remote flashing flow

Run these commands from the machine that has the metis repo checked out and SSH access to titan-0a and titan-22.

1. Build the image and delete the stale node object

cd ~/Development/metis
./scripts/prepare_titan_rpi4_replacement.sh titan-13 titan-22
./scripts/prepare_titan_rpi4_replacement.sh titan-19 titan-22

This does all of the following:

  • fetches the current cluster node token from titan-0a
  • deletes the stale Kubernetes Node object
  • builds the replacement image under artifacts/
  • copies it to titan-22:/tmp/metis-images/
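
The internals of prepare_titan_rpi4_replacement.sh are not shown here. If the Node object ever has to be deleted by hand instead, a guard like the following avoids removing a node that is still healthy. It is only a sketch: node_is_ready and delete_if_stale are hypothetical names, and it assumes kubectl is already pointed at the cluster.

```shell
# node_is_ready NODE: succeed only when the node's Ready condition is "True".
node_is_ready() {
  local status
  status="$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}

# delete_if_stale NODE: refuse to delete a Ready node, otherwise delete it.
delete_if_stale() {
  if node_is_ready "$1"; then
    echo "refusing: $1 is still Ready" >&2
    return 1
  fi
  kubectl delete node "$1"
}
```

For example, `delete_if_stale titan-13` deletes the object only while the node reports NotReady or Unknown.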

2. Ask the onsite helper to insert the SD card into titan-22

Once the card is inserted, list the candidate devices on titan-22 to identify the target:

./scripts/remote_sd_candidates.sh titan-22
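
How remote_sd_candidates.sh selects candidates is not shown here. As a rough sketch of the same idea, assuming lsblk from util-linux is available on titan-22, removable whole disks can be filtered like this. The function name removable_disks is hypothetical.

```shell
# removable_disks: read "NAME RM SIZE" rows on stdin and print only the
# removable devices (RM column equal to 1) as "/dev/NAME SIZE".
removable_disks() {
  awk '$2 == 1 { print "/dev/" $1, $3 }'
}

# On the flashing machine (-d hides partitions, -n drops the header row):
#   lsblk -dn -o NAME,RM,SIZE | removable_disks
```

Comparing the output before and after the card goes in is a useful cross-check, since some built-in card readers may not report their media as removable.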

3. Flash the card remotely

./scripts/remote_flash_titan_image.sh titan-22 titan-13 /dev/sdX
./scripts/remote_flash_titan_image.sh titan-22 titan-19 /dev/sdY

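The remote machine will ask for its sudo password during the flash. Replace /dev/sdX and /dev/sdY with the device path identified in step 2; flashing overwrites everything on the target device.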
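
Because the flash is destructive, it can help to sanity-check the device argument before running it. The guard below is a minimal sketch, independent of whatever checks remote_flash_titan_image.sh already performs. looks_like_whole_disk is a hypothetical helper that only recognizes the sdX and mmcblkN naming schemes and rejects partitions.

```shell
# looks_like_whole_disk PATH: succeed for whole-disk names such as /dev/sdb
# or /dev/mmcblk0, fail for partitions (/dev/sdb1) and anything else.
looks_like_whole_disk() {
  case "$1" in
    /dev/sd[a-z] | /dev/mmcblk[0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use:
#   dev=/dev/sdb
#   looks_like_whole_disk "$dev" &&
#     ./scripts/remote_flash_titan_image.sh titan-22 titan-13 "$dev"
```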

4. Ask the onsite helper to swap the card and power-cycle the Pi

That should be the end of the onsite work.

5. Validate remotely

kubectl get nodes -w
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep -E 'titan-13|titan-19'
ssh titan-13
ssh titan-19
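
Instead of watching the node list by hand, the wait can be scripted. The sketch below polls with kubectl wait. Because the stale Node object was deleted, kubectl wait fails immediately until the Pi re-registers, so the call is retried in a loop. wait_ready and the WAIT_PAUSE variable are hypothetical names, and the retry counts are arbitrary.

```shell
# wait_ready NODE [TRIES]: poll until the node is Ready, giving up after
# TRIES attempts (default 30). Each attempt waits up to 30s, then pauses.
wait_ready() {
  local node="$1" tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if kubectl wait --for=condition=Ready "node/$node" --timeout=30s \
      >/dev/null 2>&1; then
      echo "$node is Ready"
      return 0
    fi
    tries=$(( tries - 1 ))
    sleep "${WAIT_PAUSE:-10}"
  done
  echo "$node did not become Ready" >&2
  return 1
}

# Example use:
#   for n in titan-13 titan-19; do wait_ready "$n"; done
```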

USB boot

Raspberry Pi 4 supports USB mass storage boot via its EEPROM bootloader.

That means the same general recovery image approach can be used on a USB device instead of an SD card.

For this cluster, the safer rollout is:

  1. first recover titan-13 and titan-19 to known-good SD cards
  2. pilot USB boot on one non-critical rpi4
  3. only then migrate the Longhorn HDD-backed rpi4s

USB boot is attractive for wear reduction, but it adds EEPROM boot-order, adapter, and power-delivery variables. The emergency replacement process above should stay SD-based until the USB path has been tested on your actual hardware.
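
On the boot-order variable mentioned above: the Pi 4 bootloader reads its BOOT_ORDER setting as hex digits from right to left. The decoder below is a small sketch for reading such a value, assuming the digit meanings from the Raspberry Pi bootloader documentation (1 = SD card, 2 = network, 4 = USB mass storage, 6 = NVMe, e = stop, f = restart). decode_boot_order is a hypothetical helper and labels any other digit as "other".

```shell
# decode_boot_order VALUE: print the boot modes in the order they are tried.
decode_boot_order() {
  local rest="${1#0x}" out="" digit name
  while [ -n "$rest" ]; do
    digit="${rest#"${rest%?}"}"   # last hex digit is tried first
    rest="${rest%?}"
    case "$digit" in
      1) name="sd" ;;
      2) name="network" ;;
      4) name="usb-msd" ;;
      6) name="nvme" ;;
      e | E) name="stop" ;;
      f | F) name="restart" ;;
      *) name="other($digit)" ;;
    esac
    out="${out:+$out }$name"
  done
  printf '%s\n' "$out"
}

decode_boot_order 0xf41   # prints: sd usb-msd restart
```

A value of 0xf41 therefore keeps the SD card first with USB as fallback, while 0xf14 would try USB first. On the Pi, rpi-eeprom-config should show the value currently in effect.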