# Titan rpi4 Remote Replacement

This is the low-touch replacement flow for `titan-13` and `titan-19` when the
person onsite can only:

1. insert an SD card into the flashing machine
2. swap the card into the Pi
3. power-cycle the Pi

The remote operator does everything else.

## What the image does by itself

After the stale Kubernetes node object is deleted and the replacement image is
flashed, the booted Pi is expected to do the rest automatically:

- bring up SSH on port `2277`
- set the node hostname
- bring up the node's static `192.168.22.x` address on `end0`
- mount `/mnt/astreae` and `/mnt/asteria`
- start `open-iscsi`
- start `k3s-agent`
- rejoin the cluster with the baked-in node token and server URL

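The expectations above can be spot-checked once the Pi answers on the network. The following is a minimal sketch and not part of the `metis` scripts: the function name `check_titan_boot` and the `SSH` override variable are hypothetical, and the `open-iscsi` and `k3s-agent` systemd unit names are assumed from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical post-boot check; not part of the metis repo.
# SSH can be overridden, for example to point at a wrapper script.
SSH="${SSH:-ssh}"

check_titan_boot() {
  local node="$1" failed=0 check
  # Each entry is a command run on the Pi; a non-zero exit marks it failed.
  local checks=(
    "test \"\$(hostname)\" = \"$node\""
    "ip -4 -o addr show dev end0 | grep -q 'inet 192\\.168\\.22\\.'"
    "mountpoint -q /mnt/astreae"
    "mountpoint -q /mnt/asteria"
    "systemctl is-active --quiet open-iscsi"
    "systemctl is-active --quiet k3s-agent"
  )
  for check in "${checks[@]}"; do
    if $SSH -p 2277 "$node" "$check"; then
      echo "ok:   $check"
    else
      echo "FAIL: $check"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_titan_boot titan-13` prints one line per check and returns non-zero if any of them failed. Cluster membership is not covered here; that is what the `kubectl` validation later in this document is for.
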
## Version clarification

As of **March 31, 2026**, the live cluster reports:

- control plane: `k3s v1.33.3+k3s1`
- healthy rpi4 Longhorn workers (`titan-15`, `titan-17`): `k3s v1.31.5+k3s1`

The `6.6.63` and `6.12.41` numbers are Linux kernel versions, not Kubernetes
versions.

Kubernetes' official version skew policy says a `kubelet` may be up to three
minor versions older than the `kube-apiserver`, so `1.31` workers against a
`1.33` control plane are supported today:

- https://kubernetes.io/releases/version-skew-policy/

The replacement images intentionally keep the rpi4 worker `k3s` version aligned
with the healthy HDD-backed rpi4 workers, so that node recovery does not also
introduce a Kubernetes minor-version change.

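The skew arithmetic above is simple enough to check mechanically. The sketch below is illustrative only; the helper names `k8s_minor`, `kubelet_skew`, and `skew_supported` are hypothetical. It takes the API server version first and the kubelet version second, in the form shown in the `VERSION` column of `kubectl get nodes`.

```bash
#!/usr/bin/env bash
# Extract the minor version from a k3s/Kubernetes version string,
# for example "v1.33.3+k3s1" -> 33.
k8s_minor() {
  local version="${1#v}"   # drop the leading "v"
  version="${version#*.}"  # drop the major component
  echo "${version%%.*}"    # keep only the minor component
}

# Print how many minor versions the kubelet (arg 2) trails the
# API server (arg 1) by.
kubelet_skew() {
  echo $(( $(k8s_minor "$1") - $(k8s_minor "$2") ))
}

# Succeed when the kubelet is no newer than the API server and at most
# three minor versions older, as described in the skew policy.
skew_supported() {
  local skew
  skew="$(kubelet_skew "$1" "$2")"
  [ "$skew" -ge 0 ] && [ "$skew" -le 3 ]
}
```

With the versions reported above, `kubelet_skew v1.33.3+k3s1 v1.31.5+k3s1` prints `2`, so `skew_supported` succeeds.
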
## Remote flashing flow

Run these commands from the machine that has the `metis` repo and your SSH
access.

### 1. Build the image and delete the stale node object

```bash
cd ~/Development/metis
./scripts/prepare_titan_rpi4_replacement.sh titan-13 titan-22
./scripts/prepare_titan_rpi4_replacement.sh titan-19 titan-22
```

This does all of the following:

- fetches the current cluster node token from `titan-0a`
- deletes the stale Kubernetes `Node` object
- builds the replacement image under `artifacts/`
- copies it to `titan-22:/tmp/metis-images/`

### 2. Ask the onsite helper to insert the SD card into `titan-22`

When the card is inserted, identify the target device:

```bash
./scripts/remote_sd_candidates.sh titan-22
```

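Before trusting any device path, it can help to cross-check it by hand. The sketch below is one assumed way to do that and says nothing about what `remote_sd_candidates.sh` does internally: it filters `lsblk` output down to whole disks that the kernel flags as removable. The helper name `removable_disks` is hypothetical.

```bash
#!/usr/bin/env bash
# Read `lsblk -dn -o NAME,RM,SIZE` output on stdin and print the device
# path and size of every disk whose RM (removable) flag is 1.
removable_disks() {
  awk '$2 == 1 { print "/dev/" $1, $3 }'
}

# Usage on the flashing machine (not run here):
#   lsblk -dn -o NAME,RM,SIZE | removable_disks
```

Some USB adapters report their media as non-removable, so an empty result is inconclusive rather than proof that no card is present. Comparing the reported size against the card's label is a useful second check.
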
### 3. Flash the card remotely

```bash
./scripts/remote_flash_titan_image.sh titan-22 titan-13 /dev/sdX
./scripts/remote_flash_titan_image.sh titan-22 titan-19 /dev/sdY
```

The remote machine will ask for its `sudo` password during the flash.

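Because flashing overwrites the whole target device, a guard in front of the flash command is cheap insurance. The following is a hypothetical operator-side check, not a description of `remote_flash_titan_image.sh`: it accepts only whole-disk paths of the form `/dev/sd<letter>` or `/dev/mmcblk<digit>`, so the literal placeholders `/dev/sdX` and `/dev/sdY` and partition paths such as `/dev/sdb1` are rejected.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only for whole-disk device paths.
# POSIX character classes are used so the match does not depend on locale.
is_whole_disk_path() {
  case "$1" in
    /dev/sd[[:lower:]])      return 0 ;;  # SCSI/USB disk, no partition suffix
    /dev/mmcblk[[:digit:]])  return 0 ;;  # native SD/MMC slot, no "pN" suffix
    *)                       return 1 ;;  # placeholders, partitions, anything else
  esac
}
```

One way to use it is to chain it in front of the flash command, for example `is_whole_disk_path "$dev" && ./scripts/remote_flash_titan_image.sh titan-22 titan-13 "$dev"`. The check validates only the shape of the path; it does not confirm that the device is the SD card.
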
### 4. Ask the onsite helper to swap the card and power-cycle the Pi

That should be the end of the onsite work.

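### 5. Validate remotely

```bash
# Watch the nodes rejoin (Ctrl-C to stop).
kubectl get nodes -w
# Confirm Longhorn sees both nodes and their replicas.
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep -E 'titan-13|titan-19'
# The image serves SSH on port 2277; add `-p 2277` if your SSH config does not set it.
ssh titan-13
ssh titan-19
```
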
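For an unattended check, the watch can be replaced by a bounded wait. This is a minimal sketch under stated assumptions: the function name `wait_for_node`, the `KUBECTL` override variable, and the default timings (90 polls at 10 seconds, then a 10-minute readiness timeout) are all hypothetical choices. It polls first because the stale `Node` object was deleted earlier, and `kubectl wait` fails immediately on an object that does not exist yet.

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden, for example with a wrapper that sets --context.
KUBECTL="${KUBECTL:-kubectl}"

# Wait until the node object exists again, then until it reports Ready.
# Arguments: node name, max polls (default 90), seconds between polls (default 10).
wait_for_node() {
  local node="$1" tries="${2:-90}" delay="${3:-10}"
  until $KUBECTL get "node/$node" >/dev/null 2>&1; do
    tries=$(( tries - 1 ))
    if [ "$tries" -le 0 ]; then
      echo "node $node never registered" >&2
      return 1
    fi
    sleep "$delay"
  done
  $KUBECTL wait --for=condition=Ready "node/$node" --timeout=10m
}
```

Running `wait_for_node titan-13 && wait_for_node titan-19` returns zero only when both nodes have registered and become Ready.
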
## USB boot

Raspberry Pi 4 supports USB mass storage boot via its EEPROM bootloader:

- https://www.raspberrypi.com/documentation/computers/raspberry-pi.html#usb-mass-storage-boot

That means the same general recovery image approach can be used on a USB device
instead of an SD card.

For this cluster, the safer rollout is:

1. first recover `titan-13` and `titan-19` to known-good SD cards
2. pilot USB boot on one non-critical rpi4
3. only then migrate the Longhorn HDD-backed rpi4s

USB boot is attractive for wear reduction, but it adds EEPROM boot-order,
adapter, and power-delivery variables. The emergency replacement process above
should stay SD-based until the USB path has been tested on your actual hardware.
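
To make the EEPROM boot-order variable concrete: the bootloader's `BOOT_ORDER` setting is a hexadecimal value whose digits are read from right to left, one boot mode per digit. The sketch below decodes such a value into a readable sequence. The function name `decode_boot_order` is hypothetical, only a handful of modes are named, and the digit meanings should be confirmed against the Raspberry Pi bootloader documentation for your firmware before relying on them.

```bash
#!/usr/bin/env bash
# Decode a BOOT_ORDER value such as 0xf41 into the order in which the
# bootloader tries each mode. Digits are read from right to left.
decode_boot_order() {
  local hex="${1#0x}" out="" i digit name
  for (( i = ${#hex} - 1; i >= 0; i-- )); do
    digit="${hex:i:1}"
    case "$digit" in
      1)   name="sd" ;;
      2)   name="network" ;;
      4)   name="usb-msd" ;;
      6)   name="nvme" ;;
      f|F) name="restart" ;;
      *)   name="mode-0x$digit" ;;  # modes this sketch does not name
    esac
    out="${out:+$out }$name"
  done
  echo "$out"
}
```

For example, `decode_boot_order 0xf41` prints `sd usb-msd restart` (SD card first, USB second), while `decode_boot_order 0xf14` prints `usb-msd sd restart`. On the Pi itself, the current value is typically visible in the output of `rpi-eeprom-config`.
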