
# Titan rpi4 Remote Replacement
This is the low-touch replacement flow for `titan-13` and `titan-19` when the
person onsite can only:
1. insert an SD card into the flashing machine
2. swap the card into the Pi
3. power-cycle the Pi

The remote operator does everything else.
## What the image does by itself
After the stale Kubernetes node object is deleted and the replacement image is
flashed, the booted Pi is expected to do the rest automatically:
- bring up SSH on port `2277`
- set the node hostname
- bring up the node's static `192.168.22.x` address on `end0`
- mount `/mnt/astreae` and `/mnt/asteria`
- start `open-iscsi`
- start `k3s-agent`
- rejoin the cluster with the baked-in node token and server URL
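
Once the Pi is reachable, each of those behaviors can be spot-checked over SSH. A minimal sketch: the hostname, port, interface, and mount points come from this document, while the systemd unit names are assumptions about the image:

```shell
# Hypothetical post-boot smoke check for a re-imaged node (titan-13 here).
# Port 2277, end0, and the mount points are from this runbook; the unit
# names (open-iscsi, k3s-agent) are assumed to match the packages above.
ssh -p 2277 titan-13 '
  hostname
  ip -4 addr show dev end0
  findmnt /mnt/astreae /mnt/asteria
  systemctl is-active open-iscsi k3s-agent
'
```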
## Version clarification
As of **March 31, 2026**, the live cluster reports:
- control plane: `k3s v1.33.3+k3s1`
- healthy rpi4 Longhorn workers (`titan-15`, `titan-17`): `k3s v1.31.5+k3s1`

The `6.6.63` and `6.12.41` numbers are Linux kernel versions, not Kubernetes
versions.
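
Both numbers are visible side by side in the node status: `kubectl get nodes -o wide` prints a `VERSION` column (Kubernetes) and a `KERNEL-VERSION` column (Linux).

```shell
# VERSION = Kubernetes (k3s) version, KERNEL-VERSION = Linux kernel version.
kubectl get nodes -o wide

# Or just the two columns that matter for this comparison:
kubectl get nodes -o custom-columns='NAME:.metadata.name,K8S:.status.nodeInfo.kubeletVersion,KERNEL:.status.nodeInfo.kernelVersion'
```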
Kubernetes' official version skew policy says a `kubelet` may be up to three
minor versions older than the `kube-apiserver`, so `1.31` workers against a
`1.33` control plane are supported today:
- https://kubernetes.io/releases/version-skew-policy/

The replacement images intentionally keep the rpi4 worker `k3s` version aligned
with the healthy HDD-backed rpi4 workers to avoid introducing a Kubernetes minor
change during node recovery.
## Remote flashing flow
Run these commands from the machine that has the `metis` repo and your SSH
access.
### 1. Build the image and delete the stale node object
```bash
cd ~/Development/metis
./scripts/prepare_titan_rpi4_replacement.sh titan-13 titan-22
./scripts/prepare_titan_rpi4_replacement.sh titan-19 titan-22
```
This does all of the following:
- fetches the current cluster node token from `titan-0a`
- deletes the stale Kubernetes `Node` object
- builds the replacement image under `artifacts/`
- copies it to `titan-22:/tmp/metis-images/`
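
Each of those effects can be verified independently before moving on. A quick sketch; the exact image filename under `artifacts/` is not specified here, so only directory listings are shown:

```shell
# The stale Node object should be gone:
kubectl get node titan-13    # expect "Error from server (NotFound)"

# The built image should exist locally and on the flashing machine:
ls -lh artifacts/
ssh titan-22 'ls -lh /tmp/metis-images/'
```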
### 2. Ask the onsite helper to insert the SD card into `titan-22`
When the card is inserted, identify the target device:
```bash
./scripts/remote_sd_candidates.sh titan-22
```
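
To cross-check the script's answer, removable block devices on `titan-22` can also be listed directly, assuming standard util-linux tooling:

```shell
ssh titan-22 'lsblk -d -o NAME,SIZE,RM,TRAN,MODEL'
# RM=1 plus TRAN=usb usually indicates a card reader; confirm SIZE matches
# the SD card before passing /dev/sdX to the flash script.
```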
### 3. Flash the card remotely
```bash
./scripts/remote_flash_titan_image.sh titan-22 titan-13 /dev/sdX
./scripts/remote_flash_titan_image.sh titan-22 titan-19 /dev/sdY
```
The flash step prompts for the remote machine's `sudo` password.
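
Conceptually, the flash step amounts to streaming the image onto the raw device. A hedged sketch of the manual equivalent; the image filename is an assumption, and the script's actual implementation may differ:

```shell
# Manual equivalent only; do NOT run this alongside the flash script.
ssh -t titan-22 \
  'sudo dd if=/tmp/metis-images/titan-13.img of=/dev/sdX bs=4M conv=fsync status=progress'
```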
### 4. Ask the onsite helper to swap the card and power-cycle the Pi
That should be the end of the onsite work.
### 5. Validate remotely
```bash
kubectl get nodes -w
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep 'titan-13\|titan-19'
ssh titan-13
ssh titan-19
```
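
Instead of watching interactively, the rejoin can also be gated on the nodes' `Ready` condition:

```shell
# Blocks until both replaced nodes report Ready, or fails after 10 minutes.
kubectl wait --for=condition=Ready node/titan-13 node/titan-19 --timeout=600s
```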
## USB boot
Raspberry Pi 4 supports USB mass storage boot via its EEPROM bootloader:
- https://www.raspberrypi.com/documentation/computers/raspberry-pi.html#usb-mass-storage-boot

That means the same general recovery image approach can be used on a USB device
instead of an SD card.
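
Whether a given Pi even attempts USB boot depends on the EEPROM's `BOOT_ORDER` setting, which can be inspected on a running node (assuming the `rpi-eeprom` package is installed):

```shell
ssh -p 2277 titan-15 'sudo rpi-eeprom-config'
# BOOT_ORDER is read right to left: 0xf41 tries SD (1) then USB (4) then
# restarts (f); 0xf14 tries USB first.
```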
For this cluster, the safer rollout is:
1. first recover `titan-13` and `titan-19` to known-good SD cards
2. pilot USB boot on one non-critical rpi4
3. only then migrate the Longhorn HDD-backed rpi4s

USB boot is attractive for wear reduction, but it adds EEPROM boot-order,
adapter, and power-delivery variables. The emergency replacement process above
should stay SD-based until the USB path has been tested on your actual hardware.