
# Titan rpi4 Remote Replacement
This is the low-touch replacement flow for `titan-13` and `titan-19` when the
person onsite can only:
1. insert an SD card into the flashing machine
2. swap the card into the Pi
3. power-cycle the Pi

The remote operator does everything else.
## What the image does by itself
After the stale Kubernetes node object is deleted and the replacement image is
flashed, the booted Pi is expected to do the rest automatically:
- bring up SSH on port `2277`
- set the node hostname
- bring up the node's static `192.168.22.x` address on `end0`
- mount `/mnt/astreae` and `/mnt/asteria`
- start `open-iscsi`
- start `k3s-agent`
- rejoin the cluster with the baked-in node token and server URL
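
Once the Pi is reachable, each of those behaviors can be spot-checked over SSH. A minimal sketch: the hostname, port, interface, and mount points come from this document, while the systemd unit names are assumptions about the image:

```shell
# Hypothetical post-boot smoke check for a re-imaged node (titan-13 here).
# Port 2277, end0, and the mount points are from this runbook; the unit
# names (open-iscsi, k3s-agent) are assumed to match the packages above.
ssh -p 2277 titan-13 '
  hostname
  ip -4 addr show dev end0
  findmnt /mnt/astreae /mnt/asteria
  systemctl is-active open-iscsi k3s-agent
'
```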
## Version clarification
As of **March 31, 2026**, the live cluster reports:
- control plane: `k3s v1.33.3+k3s1`
- healthy rpi4 Longhorn workers (`titan-15`, `titan-17`): `k3s v1.31.5+k3s1`

The `6.6.63` and `6.12.41` numbers are Linux kernel versions, not Kubernetes
versions.
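
Both numbers are visible side by side in the node status: `kubectl get nodes -o wide` prints a `VERSION` column (Kubernetes) and a `KERNEL-VERSION` column (Linux).

```shell
# VERSION = Kubernetes (k3s) version, KERNEL-VERSION = Linux kernel version.
kubectl get nodes -o wide

# Or just the two columns that matter for this comparison:
kubectl get nodes -o custom-columns='NAME:.metadata.name,K8S:.status.nodeInfo.kubeletVersion,KERNEL:.status.nodeInfo.kernelVersion'
```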
Kubernetes' official version skew policy says a `kubelet` may be up to three
minor versions older than the `kube-apiserver`, so `1.31` workers against a
`1.33` control plane are supported today:
- https://kubernetes.io/releases/version-skew-policy/

The replacement images intentionally keep the rpi4 worker `k3s` version aligned
with the healthy HDD-backed rpi4 workers to avoid introducing a Kubernetes minor
change during node recovery.
## Remote flashing flow
Run these commands from the machine that has the `metis` repo and your SSH
access.
### 1. Build the image and delete the stale node object
```bash
cd ~/Development/metis
./scripts/prepare_titan_rpi4_replacement.sh titan-13 titan-22
./scripts/prepare_titan_rpi4_replacement.sh titan-19 titan-22
```
This does all of the following:
- fetches the current cluster node token from `titan-0a`
- deletes the stale Kubernetes `Node` object
- builds the replacement image under `artifacts/`
- copies it to `titan-22:/tmp/metis-images/`
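
Each of those effects can be verified independently before moving on. A quick sketch; the exact image filename under `artifacts/` is not specified here, so only directory listings are shown:

```shell
# The stale Node object should be gone:
kubectl get node titan-13    # expect "Error from server (NotFound)"

# The built image should exist locally and on the flashing machine:
ls -lh artifacts/
ssh titan-22 'ls -lh /tmp/metis-images/'
```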
### 2. Ask the onsite helper to insert the SD card into `titan-22`
When the card is inserted, identify the target device:
```bash
./scripts/remote_sd_candidates.sh titan-22
```
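
To cross-check the script's answer, removable block devices on `titan-22` can also be listed directly, assuming standard util-linux tooling:

```shell
ssh titan-22 'lsblk -d -o NAME,SIZE,RM,TRAN,MODEL'
# RM=1 plus TRAN=usb usually indicates a card reader; confirm SIZE matches
# the SD card before passing /dev/sdX to the flash script.
```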
### 3. Flash the card remotely
```bash
./scripts/remote_flash_titan_image.sh titan-22 titan-13 /dev/sdX
./scripts/remote_flash_titan_image.sh titan-22 titan-19 /dev/sdY
```
The flash step prompts for the remote machine's `sudo` password.
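
Conceptually, the flash step amounts to streaming the image onto the raw device. A hedged sketch of the manual equivalent; the image filename is an assumption, and the script's actual implementation may differ:

```shell
# Manual equivalent only; do NOT run this alongside the flash script.
ssh -t titan-22 \
  'sudo dd if=/tmp/metis-images/titan-13.img of=/dev/sdX bs=4M conv=fsync status=progress'
```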
### 4. Ask the onsite helper to swap the card and power-cycle the Pi
That should be the end of the onsite work.
### 5. Validate remotely
```bash
kubectl get nodes -w
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep 'titan-13\|titan-19'
ssh titan-13
ssh titan-19
```
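
Instead of watching interactively, the rejoin can also be gated on the nodes' `Ready` condition:

```shell
# Blocks until both replaced nodes report Ready, or fails after 10 minutes.
kubectl wait --for=condition=Ready node/titan-13 node/titan-19 --timeout=600s
```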
## USB boot
Raspberry Pi 4 supports USB mass storage boot via its EEPROM bootloader:
- https://www.raspberrypi.com/documentation/computers/raspberry-pi.html#usb-mass-storage-boot

That means the same general recovery image approach can be used on a USB device
instead of an SD card.
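
Whether a given Pi even attempts USB boot depends on the EEPROM's `BOOT_ORDER` setting, which can be inspected on a running node (assuming the `rpi-eeprom` package is installed):

```shell
ssh -p 2277 titan-15 'sudo rpi-eeprom-config'
# BOOT_ORDER is read right to left: 0xf41 tries SD (1) then USB (4) then
# restarts (f); 0xf14 tries USB first.
```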
For this cluster, the safer rollout is:
1. first recover `titan-13` and `titan-19` to known-good SD cards
2. pilot USB boot on one non-critical rpi4
3. only then migrate the Longhorn HDD-backed rpi4s

USB boot is attractive for wear reduction, but it adds EEPROM boot-order,
adapter, and power-delivery variables. The emergency replacement process above
should stay SD-based until the USB path has been tested on your actual hardware.