Cursed server build vent

I’m working on the most cursed server build of my life. It’s a mix of my own mistakes, coincidences, and bad luck; I just want to vent.

I had an ASRock N3150-ITX board gathering dust and some old drives, so I decided to turn it into a dedicated remote backup box. Plug in the drives, install an OS, do a bit of configuration, ship it somewhere remote: what could go wrong?

  • I couldn’t decide for a long time whether to run Proxmox, Fedora, Debian, or something else entirely, so I tested a lot of stuff, reinstalling over and over, and that ate up a lot of my limited free time.
  • Finally decided to install Fedora as usual; the SSD died shortly after, as if it couldn’t have happened while I was still just playing around.
  • Replaced the SSD, did a second install; then the CMOS battery died, but I thought I had broken something during the installation.
  • Third OS install, and I realize it’s the battery. Obviously all the shops had just closed. I’m starting to get annoyed.
  • I had recently updated my basic setup scripts, and they deleted a lot more in /etc than I expected. I decide to leave the install USB stick plugged in forever.
  • Fourth OS install: zfs-dkms fails, and the system randomly loses DNS and drops off the network. My sanity slowly fades away as I add a btrfs snapshot of the root filesystem as the first action in my install scripts (see the snapshot sketch after this list).
  • I decide the network problem is Tailscale’s fault (I’ve only started using it recently); even uninstalling doesn’t help. You won’t guess what comes next.
  • Fifth install: I skipped Tailscale, but the network still craps out. Turns out NetworkManager refuses to persist any settings on the connection, from both nmcli and Cockpit. I give up on further debugging for now.
  • Building the ZFS module still fails, so I spend the evening debugging that; it turns out OpenZFS isn’t yet compatible with kernel 6.7. I scramble to block kernel updates on my other boxes and reset the default kernel on those that had updated but not yet rebooted (see the kernel-pinning sketch after this list). It’s the first even slightly positive development, but I’m slowly losing my mind anyway.
  • DKMS works! Except all ZFS commands still fail with a vague error. I need to stock up on beer.
  • Remember the CMOS battery? While redoing the BIOS settings I had a brain fart and enabled Secure Boot, which refuses to load unsigned DKMS modules. My keyboard suffers abuse as I type queries into a search engine, but at least I learned something about configuring Secure Boot and kernel module signing (see the MOK sketch after this list).
  • I set up the ZFS pool and sanoid. Another evening gone just getting the systemd services and timers right, until I discover that drop-in overrides on timers don’t replace the OnCalendar entry, they add another one, which is why it kept running every minute anyway (I only wanted sanoid to prune the snapshots pulled by syncoid hourly; see the override snippet after this list). All is automated and I finally see the light at the end of the tunnel.
  • It was actually a train: after pulling 4 TB of data, ZFS starts reporting data errors on every scrub. Fortunately I have a spare drive of the same size and get the opportunity to do my first ever resilver. What I lack is a spare beer.
  • I had all the ZFS system services set up and working, but when I rebooted a few days later, zfs-mount.service just fails. It turns out it isn’t even the entire pool: a single dataset refuses to mount, with nothing but an “Invalid argument” error! Violence seems more and more attractive.
  • I think I know what to check, but then new fans and a mini-PCIe Google Coral board arrive. The box runs fine passively cooled, but it’ll be working unattended and remote, so some cooling is a good idea. And I can finally have decent motion detection for the camera at that place. Surely taking a small break from software will be refreshing, won’t it?
  • Two hours of looking for a damned screw to mount that Coral board in place! I’ve misplaced my bag of M.2 mounting hardware, and only that bag. Somehow even the M.2 to PCIe adapter I had uses a non-standard screw and standoff to hold the SSD! I talked myself out of throwing this cursed box out the window.
  • That day is today. I’ve found a random screw that fits. I put it back together, plug in the cables, power on and… three beeps. I’m on the edge.
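The snapshot-before-anything step is roughly this, as a sketch (paths assume Fedora’s default btrfs layout; the snapshot directory name is just an example):

```sh
# take a read-only snapshot of the root subvolume before the scripts touch anything
sudo mkdir -p /.snapshots
sudo btrfs subvolume snapshot -r / "/.snapshots/pre-setup-$(date +%F)"
```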
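The kernel pinning looked roughly like this (the kernel path and version are just examples from my setup):

```sh
# keep dnf from pulling in the not-yet-supported kernel series
echo 'exclude=kernel*' | sudo tee -a /etc/dnf/dnf.conf

# on boxes that already updated but haven't rebooted yet, point GRUB
# back at the still-compatible kernel, then verify the default
sudo grubby --set-default /boot/vmlinuz-6.6.13-200.fc39.x86_64
sudo grubby --default-kernel
```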
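And what I learned about Secure Boot, as a sketch (file names are examples; mokutil asks for a one-time password and the key is confirmed in the MOK manager on the next boot):

```sh
# check whether Secure Boot is actually enabled
mokutil --sb-state

# generate a signing key and enroll it as a Machine Owner Key,
# so DKMS modules signed with it are allowed to load
openssl req -new -x509 -newkey rsa:2048 -nodes \
  -keyout MOK.key -out MOK.der -outform DER \
  -days 36500 -subj '/CN=Local DKMS signing/'
sudo mokutil --import MOK.der
```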
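The timer gotcha in one snippet, assuming the sanoid.timer unit shipped with sanoid: OnCalendar is a list setting, so a drop-in override appends to the stock schedule unless you reset it with an empty assignment first (`systemctl edit sanoid.timer` writes the same file interactively):

```sh
# drop-in override for sanoid.timer; the empty OnCalendar= clears the
# inherited schedule, otherwise "hourly" would just be added alongside it
sudo mkdir -p /etc/systemd/system/sanoid.timer.d
sudo tee /etc/systemd/system/sanoid.timer.d/override.conf <<'EOF' >/dev/null
[Timer]
OnCalendar=
OnCalendar=hourly
EOF
sudo systemctl daemon-reload
sudo systemctl restart sanoid.timer
```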

It’s been 3 or 4 weeks of making a simple remote backup box, and it’s still neither remote nor a backup. Even the cats seem to look at me with more disgust than usual.

I’m researching European forests for a place to disappear into and never see a computer again.


Three beeps suggests a memory issue, and if ZFS has started reporting errors, I wouldn’t be surprised if the two were related.

I gathered some strength to power it back on, and it booted. The drive in question had a history of SMART errors, but I stress-tested it and it seemed fine. ZFS is just the better test. So the only explanation for the beeps is that this box is cursed.

I wouldn’t be surprised if one day I popped the cover and Cenobites started ringing their bell. Now back to debugging.

Oh, I love complaining; just the thread I needed!

I haven’t used nmcli in what I’d call ages now, and I’m so glad. /etc/rc.local with ip commands is just a much better way of handling IP configuration. It’s easy, and you know ip will work on any system, even ones that use nmcli or ifupdown. If you need DHCP, just enable the dhclient or dhcpcd service.
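A minimal sketch of what I mean, with made-up interface name and addresses (on systemd distros, /etc/rc.local also has to be executable and the rc-local service enabled):

```sh
#!/bin/sh
# /etc/rc.local -- static network setup with plain ip(8)
ip link set eth0 up
ip addr add 192.168.1.10/24 dev eth0
ip route add default via 192.168.1.1
exit 0
```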

Part of the reason I don’t trust old drives anymore is that they tend to fail at the worst possible moment. When I moved over the pond, my zpool was working; a week after I got here, it was degraded with a failed disk. I was hoping to use that remote server to do a bunch of stuff remotely, but that seriously backfired. I had to scatter stuff around here and transfer a few hundred GB over a few-MB pipe (took me about 2 weeks IIRC). Everything else had to be moved to another zpool.

And then there’s the other mistake I made: configuring a Proxmox cluster (to be fair, that was done a while ago). After a power outage, two of the servers wouldn’t power back on, so I had no quorum and had to force the remaining node into standalone mode every time I rebooted it. One server was failing to boot because it was looking for a drive that never existed (something USB-attached); the other I’ve yet to figure out.
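For anyone who hits the same thing: as far as I remember, "forcing standalone" just meant lowering the expected vote count after every boot, something like:

```sh
# let the lone node regain quorum; not persistent, so it has to be
# re-run after every reboot
pvecm expected 1
```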

If I were to repeat the experiment, I’d attach a PiKVM to all the systems (or at least put a 4-way HDMI KVM behind the PiKVM to switch between them; that’d be cheaper). That only works because my VPN and pfSense router have never really failed me yet, but the pfSense box is on its way out. The web GUI broke after an update and I’m welcomed by a blank page, probably some stupid PHP thing. And I don’t know how to handle pfSense from the terminal. I wish I had put FreeBSD or OpenBSD on it before I left, but I wasn’t going to test something new in the short timespan I had to prepare.

There wouldn’t have been enough time or testing for me to trust it; I need at least a few months to make sure nothing breaks across reboots and updates. Thankfully I tend to keep most of my OS installs fairly standard, so I rarely run into unexpected behavior. The simpler, the better.

You might want to look into FreeBSD, which seems like it would be a lot less painful, but the Realtek NIC makes me a bit hesitant…

I agree. I run OpenWrt on my router and love the simplicity; you can actually have a good understanding of everything happening in the system. On the other hand, NetworkManager and nmcli are sometimes nicer and more user-friendly. And it’s been years since NetworkManager last caused me any problems; that was certainly back in the Fedora Core days.

I had all the parts except the Google Coral already gathering dust in my apartment, so it’s an additional backup tier for “free”. Especially since I have a place with good Internet and a camera that’s been disconnected since I moved and took my server with me. So doubly “free”.

Aren’t you by any chance hosting a podcast? I have a strong sense of déjà vu.

It’s on my “one day I should” list. Computers sense when you’re away and then die at the most inconvenient time.

This box would be coming full circle: FreeNAS was the first OS I installed on it when it was brand new. Despite my original post, I’m quite happy running Fedora with OpenZFS; it’s mostly my configuration errors and some unfortunate timing that cause the problems. I’ve already solved all the ZFS issues, and I just need to figure out why NetworkManager won’t cooperate.

I don’t, but I’d like to hear about it. Which one are you talking about?

I have my own “blog” thread on the forum; I think I documented it back around August or September 2021, when I was zfs-sending stuff around and trying to download a 200 GB disk image through my VPN at extremely slow speeds (the download would fail after 2 days and have to be restarted, so I ended up splitting the image into multiple parts; if one part failed, I could retry it without redownloading the whole thing).
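Roughly like this (chunk size and URL are just examples):

```sh
# on the server: split the image so a dropped connection only costs one chunk
split -b 1G disk.img disk.img.part.
sha256sum disk.img.part.* > disk.img.sha256

# on the slow end: pull each part with resume, then verify and reassemble
wget -c https://example.com/disk.img.part.aa   # ...repeat for each part
sha256sum -c disk.img.sha256
cat disk.img.part.* > disk.img
```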

One of the hosts of the Self-Hosted Show (https://selfhosted.show/) is a British guy living in the US, and I could swear they talked about Proxmox cluster quorum problems in one of the episodes.


Heh, lol, I’ve had selfhosted.show in my RSS feed but never really checked it out. I’m from the US, lived in Eastern Europe, and am now back in burgerland. The infra was set up in Europe at my family’s place.

Luxury……

I’ve always hated NetworkManager. I’m running Debian on that same board, with initramfs-tools and systemd-networkd.

I spent a lot of time tuning initramfs scripts through QEMU to make sure Tailscale comes up at boot, so I can unlock the drives remotely.
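The test loop was roughly this sketch, assuming Debian’s initramfs-tools paths (getting the Tailscale binary itself into the initramfs needs a custom hook, which I’m leaving out here):

```sh
# rebuild the initramfs, then boot kernel + initramfs in qemu instead of
# rebooting the real box; break=premount drops to a shell before root mounts
sudo update-initramfs -u -k "$(uname -r)"
qemu-system-x86_64 -m 1024 -nographic \
  -kernel "/boot/vmlinuz-$(uname -r)" \
  -initrd "/boot/initrd.img-$(uname -r)" \
  -append 'console=ttyS0 break=premount'
```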

Every time I put a computer together, set up a software stack, and it all works at the end without any weirdness, I think: thank f*ck.

There’s a reason the project timeline in software dev is, nine times out of ten, a work of fiction. No one knows what might go wrong, or how long it might take, even for the simplest of things.