PSA: Forbidden Router - Make Two

Not a question, blog maybe?

I don’t exactly run a variation of “The Forbidden Router” (yt: 1, 2, 3), but if you do, or if you’re considering it, these musings might be of interest - you might have the same concerns.

At the end of the day, you’re free (not forbidden) to make your own personal hell out of all the things your router needs or is involved with, and you’re free to enjoy your own suffering.

TL;DR:

  • Make two boxes - VRRP is good. By itself, VRRP is only somewhat useful, but it allows for much easier rebuilding / testing, which is confidence boosting.
  • Test prolonged power off (not just reboot).
  • Test/Do backups / restores / reinstalls.
  • Expect issues, expect that resolving them takes more time, more effort.
  • Expect complications.
  • Do as I say, not as I do (or do it anyway).

The Setup

Hardware I use for routing (… or routing++ - more below) is an ASRock N3150 ITX + 8 GB RAM + 256GB SATA SSD + a USB 3.0 RTL8153 Gigabit NIC for WAN (… + ZigBee dongle + ancient 4GB USB flash drive for LUKS and stuff…). This is all connected to my ISP-provided cable modem (Virgin Media Ireland, 1Gbps / 50Mbps), and to a pair of UniFi switches + some UniFi U6-LR/U6-Mesh for wifi.

I procured this hardware back in 2015, but its latest incarnation / use for routing started in 2020, when I upgraded from 500Mbps to 1Gbps … and it has been running mostly untouched since.

Bare metal OS installed is Debian + Docker, with some twists.

Debian has been upgraded to testing and is running unattended upgrades, incl. the kernel. The storage was partitioned with GPT such that there’s an ESP/EFI partition (/boot/efi), XBOOTLDR (i.e. a /boot partition for kernels and initramfs) and a LUKS-encrypted partition with LVM on top, and BTRFS on that. There are very few packages installed overall: kernel, systemd, bash, git, tmux, vim, tailscale, docker (w/compose). There’s no “virtualization”, unless you count Docker.
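For the curious, that layout boils down to something like the commands below - a sketch from memory, not the actual provisioning steps; the device name, sizes and labels are assumptions:

# device name and sizes are assumptions, for illustration only
sgdisk -n1:0:+512M -t1:ef00 -c1:esp      /dev/sda   # ESP, mounted at /boot/efi
sgdisk -n2:0:+1G   -t2:ea00 -c2:xbootldr /dev/sda   # XBOOTLDR, mounted at /boot
sgdisk -n3:0:0     -t3:8309 -c3:cryptsys /dev/sda   # the LUKS partition
cryptsetup luksFormat /dev/sda3
cryptsetup open /dev/sda3 cryptroot
pvcreate /dev/mapper/cryptroot && vgcreate vg0 /dev/mapper/cryptroot
lvcreate -l 100%FREE -n root vg0
mkfs.btrfs /dev/vg0/root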

The key for unlocking the LUKS / root partition is stored on the USB stick and used on boot. There are other keys I store elsewhere - this means the local SATA SSD that holds some credentials can be thrown away safely, even if malfunctioning, although a $5 wrench attack is always going to be a security risk. Additionally, I can unlock the router on boot over SSH through WAN and LAN (from the initramfs), in case the rarely used ancient 4GB flash stick with the unlock key dies.
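On Debian this sort of thing is typically wired up via crypttab’s passdev keyscript plus dropbear-initramfs for the remote unlock; a minimal sketch (the label, path and UUID are placeholders, not my actual config):

# /etc/crypttab - pull the key from a file on a labelled USB stick at boot
cryptroot UUID=<luks-uuid> /dev/disk/by-label/KEYSTICK:/router.key luks,keyscript=passdev,initramfs

# remote unlock from the initramfs over ssh
apt install dropbear-initramfs   # then ssh in during early boot and run: cryptroot-unlock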

WAN DHCP + VLAN / bridge / LAN / MACVLAN static addressing are handled by systemd-networkd. My ISP gives me a publicly routable IPv4 only.
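A rough idea of what that looks like in systemd-networkd terms - a trimmed-down sketch; interface names and addresses are placeholders rather than my real config:

# /etc/systemd/network/10-wan.network - DHCP on the USB RTL8153 WAN NIC
[Match]
Name=enx*

[Network]
DHCP=ipv4

# /etc/systemd/network/20-lan.netdev - the LAN bridge
[NetDev]
Name=br0
Kind=bridge

# /etc/systemd/network/20-lan.network - static addressing on the bridge
[Match]
Name=br0

[Network]
Address=192.168.1.1/24

# (plus a .network per physical LAN port with Bridge=br0, and VLAN=/MACVLAN= where needed)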

NAT and firewalling are handled by iptables rules (I’m old school, and my ISP doesn’t do ipv6+routable ipv4 - they only do DS-Lite or no ipv6 – I picked no ipv6).
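The gist of the iptables side, reduced to the essentials (interface names are placeholders, and the real ruleset is longer):

sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o wan0 -j MASQUERADE          # NAT everything leaving WAN
iptables -A FORWARD -i br0 -o wan0 -j ACCEPT                  # LAN -> WAN allowed
iptables -A FORWARD -i wan0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -i wan0 -j DROP                           # everything else from WAN dropped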

Tailscale runs with ssh enabled, and there’s traditional openssh exposed on LAN only.

I use snapper to take snapshots of the filesystem, and I wrote some shell scripts, run by systemd, that look at recent snapper snapshots and btrfs-send them through zstd to Google Drive.
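The backup scripts boil down to something like this - a sketch of the idea only; the snapshot path and rclone as the Google Drive uploader are assumptions, not necessarily what actually runs:

#!/bin/bash
# grab the newest snapper snapshot of / and ship it compressed to Google Drive
latest=$(ls -1 /.snapshots | sort -n | tail -1)
btrfs send "/.snapshots/$latest/snapshot" \
  | zstd -T0 \
  | rclone rcat "gdrive:router-backups/root-${latest}.btrfs.zst"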

Unattended upgrades run all the time, and they restart some things along the way. I use systemd-boot and kexec to upgrade kernels. About twice a month, at around 4am on a weekend, I get 10-30 seconds of downtime as the old kernel shuts down and the new kernel starts up.
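The kexec kernel swap is roughly this (versions and paths are illustrative):

# load the freshly installed kernel and jump into it without a firmware reboot
kexec -l /boot/vmlinuz-<new-version> --initrd=/boot/initrd.img-<new-version> --reuse-cmdline
systemctl kexec   # clean shutdown of services, then boots straight into the loaded kernel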

In Docker land, I have a directory /root/services/ with a docker-compose.yml and some other stuff next to it, and I’m running dnsmasq (used for DHCP only, on a MACVLAN interface), adguard (for DNS filtering), nginx (for TLS proxying), caddy (yay auto cert renewal), zigbee2mqtt…, thelounge (irc client).
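For a flavour of what /root/services/docker-compose.yml looks like, here’s a heavily trimmed sketch - the dnsmasq image, the macvlan parent and the subnet are placeholders, not my real file:

services:
  adguard:
    image: adguard/adguardhome
    restart: unless-stopped
    ports: ["53:53/tcp", "53:53/udp", "3000:3000/tcp"]
    volumes:
      - ./adguard/work:/opt/adguardhome/work
      - ./adguard/conf:/opt/adguardhome/conf
  dnsmasq:
    image: andyshinn/dnsmasq        # placeholder image; DHCP only
    restart: unless-stopped
    cap_add: [NET_ADMIN]
    networks: [lan]
    volumes:
      - ./dnsmasq/dnsmasq.conf:/etc/dnsmasq.conf:ro
networks:
  lan:
    driver: macvlan
    driver_opts:
      parent: br0
    ipam:
      config:
        - subnet: 192.168.1.0/24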

On the host there’s basically only the root user + mostly unprivileged containers.

The Experience

It worked flawlessly, minor touchups only, for about a year. Once in a while the ISP would cut out, we had some brief power cuts, and I made some changes along the way as one does, but nothing major.

I highly recommend the setup.

I suffered a 45-minute power cut yesterday and it was a “major” event - the router didn’t come back on its own.

We had brief power cuts earlier this year, which passed “normally”, but we haven’t had one in a while. We have some construction/digging happening nearby - they’re widening the street and introducing new dedicated cycling lanes - and moving manholes and pipes around.

This outage started at around 2pm; both my significant other and I work from home (opposite sides of the house), and we were in the middle of a meeting when video froze for both of us.

What ensued was 10 minutes of pandemonium. I got yelled at for “breaking stuff and internet not working when power goes out, and my stupid computer stuff”, barked at by the dog, and beeped at by various home fire/security alarms letting me know the power was out. The beeping made the dog bark, which made my significant other yell at the dog to keep quiet and at me for letting the dog bark, while the dog kept joining in because of all the beeping and yelling - he was helpfully trying to let us know things aren’t OK.

So, I put my phone into hotspot mode (5G yay), handed it to my girlfriend, told her where the USB battery is, and then got yelled at more because there was no power or internet, and ran off with the dog and coffee (and no phone?!?) into the garden. At least there’s no beep every minute.

45 minutes later, I noticed the kitchen lights came on. … Not all WiFi came back, and what did come back didn’t work; the fixed network didn’t work either. I survived more yelling and retreated to have corporate meetings from the dining room table (phone hotspot range), while only being yelled at periodically for “being incompetent and my perfect internet setup still not working” - despite me not really trying to fix it in the interim.

Girlfriend finished her work around 5pm and took the dog for a looong walk; I finished around 6pm (work / meetings over) … and started my second day of work, fixing things … and stayed at it until about 9pm, when I got most of the stuff working again.

The Fixing Experience

The motherboard has an HDMI port, and I have an HDMI->USB capture dongle. It’s still a PC - it spews boot output onto a framebuffer - so I plugged HDMI into the board, USB into a laptop (chromebook), started the Camera app/extension … it didn’t work.

… the computer wasn’t dead. The USB-A port on the Chromebook is barely used (this might have been the first time ever) and needed wiggling; then the “change camera” button appeared and I could switch to the HDMI dongle input.

I saw a sharpened but blurry, mirrored image of a cropped screen (4:3 mirrored-photo Camera app weirdness) that said UEFI defaults loaded, press F1 to continue, F2 to enter setup. I figured out I could change the “photo aspect ratio” to 16:9 to uncrop the text, and disable mirroring…

… I then got a USB 3.0 mechanical keyboard from my standing desk and plugged it into the extension cable that had my USB ZigBee adapter (I use an extension cable because ZigBee doesn’t like USB 3.0 – interference – but it works great on an extension cable), … and the keyboard didn’t work, … I did a power off/on and it started working. (Turns out the USB 3.0 mechanical keyboard, the Durgod “Taurus”, is temperamental and uses lots of power; it sometimes needs to be plugged in twice when cold booting.)

I pressed F1 and it started wanting to boot off the network … WTF!!! Reboot / press F11 … it offered a bunch of partitions that made no sense, and an entry called “debian”. I picked debian, … and it started doing PXE/RARP again, but looking different now… WTF!!!

Consumer motherboards / BIOS-es / Firmwares rely on small batteries to keep state and happily scrub themselves to factory defaults randomly - you don’t want this for your server/router.

There’s rarely any KVM functionality (this is why I like ASRock Rack boards - price and built-in KVM).

There’s a complicated boot process for getting into the OS (unlike e.g. a Pi or many home routers, where U-Boot sets up PLLs and RAM, throws a device tree at a kernel, jumps into the kernel, and mounts a squashfs root filesystem from the read-only flash chip).

… at this point, I figured I needed to boot off a flash stick, log in, and repair the install.

I’m basically a backend/cloud developer/worker; I work using Chromebooks for browser/terminal, and my code runs on a server and in the cloud anyway. With a chromebook, I downloaded the Arch ISO, zipped it, and passed it to the “recovery extension” to write the image onto a 32G flash drive I had lying around (there’s no local shell on a chromebook).

Next, I put a small shell script on a webserver (much easier than imaging a flash drive, in my world).

The shell script goes:

#!/bin/bash

# refresh package databases and install tailscale on the live Arch ISO
pacman -Sy
pacman -S --noconfirm tailscale
# bring tailscale up with SSH enabled so I can pop a shell from anywhere
systemctl start tailscaled
tailscale up --ssh --accept-dns=false --auth-key=tskey-somethingsomething

I then modified the Arch ISO for tailscale and passed script=https://my.cloud.server/installer_script.sh on the kernel command line, … to force Arch to download and execute the shell script above off the internet on boot.
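The ISO modification itself is basically a oneshot unit baked into the live image that reads script= off /proc/cmdline and pipes it to a shell - something along these lines (the unit name and details are my reconstruction, not the exact files):

# /etc/systemd/system/cmdline-script.service (enabled inside the live ISO)
[Unit]
Description=Fetch and run the script= URL from the kernel command line
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'url=$(grep -o "script=[^ ]*" /proc/cmdline | cut -d= -f2-); [ -n "$url" ] && curl -fsSL "$url" | sh'

[Install]
WantedBy=multi-user.target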

The plan was perfect: plug the stick in, wait a bit, pop a web shell.

Turns out the 32G stick got flashed corrupted. … I rarely use USB flash sticks, and I’m always suspicious of them… I flashed it again, and it was corrupted again at a different offset, hmmm. It didn’t boot; I tried again and got a UEFI error at a different offset.
I tried booting from the partition via syslinux (i.e. the non-UEFI mode) and it worked! … because syslinux retries on error, and I guess the UEFI doesn’t deal well with a flaky USB stick / flaky USB connections … network booting tends to be more reliable.

Before decrypting the partitions, I looked at the ESP … and found GRUB!!! WTF!!! I never installed grub!! I did find traces of systemd-boot, which I did set up, but it wasn’t showing up in the F11 menu… or so I thought.

Turns out (this is a theory btw, unclear), I think Debian unattended upgrades somehow swapped and broke my bootloader … I don’t know when/how that happened exactly; I’d have to go back through the logs. The system had been kexec-ing into new kernels without fully rebooting / without losing power for months, so this was never an issue… not sure why, I forgot how I set this up.

Deleting the traces of grub and setting up systemd-boot again got it booting, and everything is working again. I’ll still have to follow up more over the weekend. I’m too scared to reboot now.
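The repair amounted to roughly this, from a chroot into the installed system (the exact package list is a reconstruction from memory):

# get rid of grub and its files on the ESP
apt purge --autoremove grub-efi-amd64 grub-efi-amd64-signed grub-common shim-signed
rm -rf /boot/efi/EFI/debian
# put systemd-boot back and let Debian keep it updated
bootctl install
apt install systemd-boot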

The Plan

  1. Blacklist GRUB
  2. Email me when the host package list changes (healthchecks.io is free for 20 checks)
  3. Make dnsmasq serve a netbootable, tailscale-enabled Arch ISO
  4. Dust off / setup the J3160 / 8G / 128G SSD from a drawer (hope it hasn’t rotted)
  5. Run some cables and set up keepalived (like I used to have on OpenWrt at some point) - see the config sketch after this list
  6. Move more things into containers, ideally incl. routing too (e.g. use macvlan for WAN too; I’m too lazy to write a docker network driver)… because running 5 things on 2 machines is easier than running 2 machines with 5 things.
  7. Once this government relaxes the rules on solar somewhat, fill up the roof and buy a Powerwall or a myenergi libbi or an LG Chem RESU (… or something along those lines that could run my modem/router/switches/wifi for a few hours in the meantime).
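For item 5, the keepalived side is small; a minimal VRRP sketch (interface, VRID, priority and VIP are placeholders):

# /etc/keepalived/keepalived.conf on the primary box
vrrp_instance LAN {
    state MASTER
    interface br0
    virtual_router_id 51
    priority 150              # the second box runs state BACKUP with a lower priority
    advert_int 1
    virtual_ipaddress {
        192.168.1.1/24        # the gateway address clients actually point at
    }
}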

Quite the story. I did the forbidden router once, I can’t not recommend it highly enough. :rofl: OPNsense on old hardware for me.

I have opinions about how your SO is treating you…


Same, that treatment is not okay. Especially for something outside of your control.

In the meantime, I would suggest getting a tower UPS. I have them on my critical electronics just in case. Eventually we will go solar with a power storage solution.


OMG I know it’s a year later but seriously this story really brightened my day and I nearly spit water on my keyboard here as I laughed reading it.
