Marandil's Homelab evolution

Story time

Over 3 years ago, I built my first “home NAS”, based on a Raspberry Pi 4B (or 3B, I’m not sure anymore) and USB Attached SCSI (UAS) enclosures for laptop HDDs I had replaced (by that point I had accumulated a small collection of 512GB-1TB HDDs swapped out for SSDs in my/friends’/families’ laptops). It wasn’t anything fancy, just a PiHole + Samba server.

And thus, the pihost was born.

Now that I think about it, I’m pretty sure it was a 3B Pi, because its 100Base-T could have been one of the major reasons for its replacement, which came in December 2020 (yay, pandemic!). Enter Banana Pi R2 and pihost2.

There were several reasons why I replaced the Raspberry Pi solution with something else, but I can hardly remember all of them now (come on, it’s been almost 37 months at this point). Access to PCIe lanes without modding was definitely up there (although I never ended up using them), as was raw SATA connectivity. The built-in switch functionality was also a nice-to-have, and I remember I was even using it to extend my regular home network with its additional (GbE) ports.

The transition itself was almost painless. There was some initial struggle on the software/firmware side (MediaTek…), but as the Raspberry Pi didn’t have 64-bit support at that time (does it now?), there was almost no difference in userland. The drives felt more stable now that they were plugged directly into SATA instead of going through UAS and USB (which could become unstable after running for weeks).

On the software-storage side, I experimented with btrfs, but got severely disappointed with its performance on the platform. I didn’t even dare to touch ZFS or anything more complex and just ended up with two ext4 drives exported as two separate Samba shares.

On the software-networking side, there were two major advancements, though neither of them had much to do with the platform itself (although I would never have done one of them on 100Mb Ethernet).
The first one was a VPN gateway service, which essentially bridged my home LAN with the work lab VPN, so that my wife and I could access the work lab network from our home computers without setting up VPN clients on them. We could reach specific subnets as if they were local to us, but not vice versa, i.e. the work lab would only ever see one VPN client. The setup is defunct by now, but at the time (pandemic) it was really helpful to have access to those resources without routing an unnecessary amount of traffic :wink:
(Also IIRC OpenVPN on Windows has/had some issues with TAP adapters that our lab was using).
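Stripped of details, the gateway part boiled down to IP forwarding plus NAT on the Pi, something along these lines (a sketch from memory; interface names and subnets are made up):

$ sudo sysctl -w net.ipv4.ip_forward=1
# masquerade everything heading into the VPN tunnel, so the lab only ever sees the one client
$ sudo iptables -t nat -A POSTROUTING -o tun0 -j MASQUERADE
$ sudo iptables -A FORWARD -i eth0 -o tun0 -s 192.168.1.0/24 -j ACCEPT
$ sudo iptables -A FORWARD -i tun0 -o eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
# plus static routes for the lab subnets (via the Pi) on the home router or the individual PCs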
The second one was that I got TLS certificates for my LAN services using a small custom tool, Let’s Encrypt, and certbot with the OVH DNS plugin. I bought myself the mdl.computer domain (funny, it’s not detected as a URL by Markdown xD) and set up internal DNS for it. The local addresses and services are never exposed, but as you can see, I’m getting up-to-date certificates for stuff such as my OctoPrint server.
The tool I have for this is currently private (I didn’t consider it mature enough), but if there’s an interest, I can try to polish it up and share it.
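The certbot side of it is nothing exotic; something along these lines (assuming the certbot-dns-ovh plugin; the hostname is just an example, and this glosses over whatever the custom tool adds on top):

$ sudo certbot certonly --dns-ovh \
      --dns-ovh-credentials /etc/letsencrypt/ovh.ini \
      -d octoprint.mdl.computer
# DNS-01 validation means the cert gets issued without the host ever being reachable from the internet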

On the hardware/firmware side the board wasn’t too bad, but it wasn’t too good either. 2GB of RAM is/was enough, but not for anything more ambitious. I planned to add a small sensor panel (my 3d-printed enclosure even has a cutout for it), but it turned out that the SPI drivers… don’t really work ¯\_(ツ)_/¯. Or rather, they should work, but you have to XYZ… Recalling the pain of working with the DTS and rebuilding it to get the switch part working, I decided to forgo that idea.

Nevertheless, after about two years, stuff started to get flaky. At first one of the drives started having ATA problems and was getting disconnected. I ended up merging both partitions onto a single 1TB SSD and removing HDDs altogether. It was around that time that I started eyeing a future replacement for pihost2, this time possibly something more… robust.
Almost a year passed and the single SSD worked without problems, until very recently, when I started getting similar ATA errors on it. In the previous setup I suspected power delivery issues, but since these started popping up under a much smaller load, I got a bit worried. The drive is very lightly used, not even a full TB written. The previous drives, when inserted into other systems, also report no SMART/filesystem issues, so I heavily suspect the board itself. Using the other SATA/power port I got it working again for a few weeks, before the issues started popping up again :frowning:. Still, no FS errors (when it works), but:

[2404559.613068] JBD2: Error -5 detected when updating journal superblock for sda1-8.
[2404559.632644] sd 1:0:0:0: [sda] tag#27 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[2404559.647922] sd 1:0:0:0: [sda] tag#27 CDB: opcode=0x2a 2a 00 00 00 08 00 00 00 08 00
[2404559.662545] blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[2404559.679957] blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[2404559.697240] Buffer I/O error on dev sda1, logical block 0, lost sync page write
[2404559.711617] EXT4-fs (sda1): I/O error while writing superblock
[2404819.847063] sd 1:0:0:0: [sda] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[2404819.862296] sd 1:0:0:0: [sda] tag#28 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[2406619.879328] sd 1:0:0:0: [sda] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[2406619.894759] sd 1:0:0:0: [sda] tag#29 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[2408419.912725] sd 1:0:0:0: [sda] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[2408419.928167] sd 1:0:0:0: [sda] tag#30 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00

Once it starts, it keeps going. A reboot usually helps, but that’s not a long-term solution, especially since I sometimes need to access it remotely.

Next time: current state of affairs

4 Likes

Current state of affairs (beginning of 2024)

For the past year I’ve been slowly gathering hardware for the next homelab iteration. This time I want to do this at least somewhat properly.
Apart from the NAS problems I’m experiencing, I’m also being tempted by the upcoming move to a new house (although it won’t be happening for the next few months) with proper Cat 7 infrastructure for a 10GBASE-T LAN. And that calls for a proper router and content server :wink:

Currently working inventory

  • Hardware
    • Banana Pi R2 (trash?)
    • 1TB GOODRAM(*) CL100 SATA3 SSD
    • 5TB 2.5" Barracuda HDD with off-site backup (can’t remember exact model)
  • Software/Services
    • PiHole with local DNS
    • Cloudflared
    • “Internal Certbot” (see story part)
    • Samba
    • SSH Reverse Proxy (**)

(*) I wouldn’t be surprised if you’ve never heard of them; they’re a local (Polish) manufacturer with a very good price/quality ratio.
(**) I have an always-on SSH connection from pihost2 to an external server, which exposes local port 22 over a remote unix socket.
This allows access to the internal network from the outside, but requires two authentication steps (first to get onto the external server, second to get through the reverse proxy). I personally consider this much safer than exposing (forwarding) local ports.
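In practice it’s just OpenSSH’s remote forwarding onto a unix socket; roughly like this (paths and usernames are examples, and the remote sshd wants StreamLocalBindUnlink yes so the socket can be re-created after a drop):

# on pihost2, kept alive 24/7 (autossh, a systemd unit, whatever):
$ ssh -N -R /home/proxy/pihost2-ssh.sock:localhost:22 proxy@external-server

# from a client: auth #1 gets you onto the external server, auth #2 goes through the socket to pihost2
$ ssh -o ProxyCommand='ssh proxy@external-server socat - UNIX-CLIENT:/home/proxy/pihost2-ssh.sock' marandil@pihost2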

Except for the SSH proxy, all the services are neatly containerized with Docker and can be easily migrated to any other host. When doing this inventory, though, I realized there is one more “secret sauce” component: bind-mounting the NAS drives for Samba with masked users. The config for this is all in docker-compose, but it uses my own easyfuse docker plugin (GitHub, PyPI). For the next iteration, I guess I’ll have to do without it due to the unacceptable FUSE overhead.
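To illustrate the idea (this is not easyfuse itself, just the same concept expressed with plain bindfs and made-up paths/users): the data on disk belongs to assorted real users, while the Samba container only ever sees it squashed to a single masked user:

$ sudo mkdir -p /srv/share
$ sudo bindfs --force-user=smbguest --force-group=smbguest /srv/nas /srv/share
# /srv/share is what gets bind-mounted into the Samba container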

What I got for the next iteration

I started working on the next iteration with some discarded lab equipment from my work. In particular we had some “mining” boards (we’re doing a lot of PCIe passthrough at CloudVA and some of these boards had decent enough IOMMU groups), which I tried to mod to support E3-1245v5 but with moderate success. Well, I got it working…


… for a single boot. And just that one time. Welp ¯\_(ツ)_/¯.

I started looking for good deals on used motherboards and won an auction for a used X299 board, an ASUS ROG Rampage VI Apex, with a delidded i9-7900X (yes, that i9-7900X, not the R9 7900X, I know it’s easy to confuse the two :wink: )


I considered many more modern solutions, but at the time I just couldn’t find any decent deals on anything with more than 2 PCIe slots, and I consider the availability of PCIe slots even more important than Gen 4/5 support. Why? Because most of the NICs and HBAs I found are x4 or x8 PCIe Gen 3 anyway, so an x16 Gen 5 slot gets me nothing ;).

With this CPU the board exposes x16/x8/x8/x8 direct CPU slots (all physically x16), or x16/x8/x8/x4 when one of the M.2s is in use, plus 2x x4 CPU M.2s, and an x4 PCH slot + 2x x4 PCH M.2s, for a total of 44 CPU lanes and 12 PCH lanes (56 PCIe lanes total). Even with the PCH lanes gimped, this is almost half of what’s available with EPYC.

At first, I was primarily concerned with SATA connectivity and, trying to get the best bang for the buck, stupidly got myself one of those 6-port PCIe-to-SATA cards from AliExpress that was supposed to have an ASM1064 (1x PCIe Gen 3 to 4x SATA 3). It came with an ASM1061 instead, which is PCIe Gen 2 and only 2x SATA3, so all the ports besides the first one were behind a SATA port multiplier.

This experience again pushed me towards the used market, and I got myself an HPE P420 HBA/RAID card. There was an issue with its supercapacitor, but I simply turned on HBA mode and disconnected the capacitor unit. Linux’s “hpsa” driver sees the controller no problem, and “ssacli” let me turn hbamode=on. Good experience so far. I quickly got myself another unit, this time a P822.
The main difference (for me) is that the P420 has 2 internal mini-SAS (SFF-8087) ports, while the P822 has 2 internal and 4 external (SFF-8088).
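For reference, the ssacli part boils down to something like this (the slot number is whatever your controller enumerates as, and newer firmware may also want the forced keyword):

$ sudo ssacli ctrl all show status
$ sudo ssacli ctrl slot=1 modify hbamode=on forced
# after a reboot the attached disks show up as plain /dev/sdX through the hpsa driver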

At that time, I also gathered all my 1TB+ HDDs for experiments. I won’t share any performance figures yet, but everything I got was +/- what I expected. The drives started getting a bit tightly packed in the temporary case (I mentioned it in this topic), which was essentially a modded Chinese case designed, I think, for crypto mining, with non-standard expansion slot spacing that I had to work around. The inside of the case could hold quite a few 2.5" drives (I’m not even counting) using 3d-printed “shelves”, but 3.5" drives quickly stopped fitting, and so I started looking for a disk shelf.
Long story short, I didn’t find what I was looking for, and some time last month I decided to finally pull the plug on the temporary case and move the lab into a proper server chassis, an SC846, which arrived today.
(TODO: add nice photos :smiley: )

Actually, I bought an entire server to gut for this chassis. I believe it’s a 6047R-E1R24L, but it’s definitely based around an X9DRD-7LN4F-JBOD motherboard and a variant of the SC846 chassis that I still have to fully identify, because I’m pretty sure it’s not an SC846E16-R920B but rather an SC846BE16-R920B.
The board and CPUs are now over a decade old and I’ll definitely swap them for something more modern, but first I’ll run some benchmarks and energy consumption tests just so I know “what could have been”.


I’ll write my impressions about the case and what needs to be done with it later, but for now, let me summarize what hardware is here or inbound.

“New” hardware list

Not counting:

  • Cables (various SFF-8087 and 8088 including Mini-SAS to SATA, Mini-SAS to SAS, ton of different SATAs)
  • Risers, mostly NVMe universal x4/x8/x16 and a couple of x1; I need to get a passive x4/x4/x4/x4 bifurcation card to test for motherboard support.

Next time: either storage inventory or “new software”

4 Likes

Minor update:

For now I’m fighting Arch and presumably encountered a kernel bug; not sure if it’s limited to Arch builds only though. Will check later.

Cards are installed on the old motherboard and I’m printing 2.5"-to-3.5" adapters for all the 2.5" drives.

Minor update:

Barely had time to focus on the project last week. Mostly did the storage inventory and looked at upgrade options.

I believe an understated metric when computing $/TB is the upkeep cost of older, smaller drives. A 1TB 3.5" HDD can draw as much as ~7W at idle, which adds up to 168Wh per day per TB, or 5.124kWh per month per TB (at ~30.5 days/month). Assuming 28.3 c€/kWh, this amounts to 1.45 €/TB/month. Not that much, but a 12TB EXOS X14 idles at 5.4W, which amounts to 3.953kWh/mo and 1.11 €/month, yet only 0.09 €/TB/month.

For perspective, I compared raw prices of used 1TB and 12TB drives at local resellers. A refurbished ST12000NM0538 (12TB EXOS X14) can be found for 142€, which is 11.83 €/TB. The cheapest 1TB Seagate drive I could find was an ST1000NM0023 (Constellation ES.3, 4.45W at idle, which works out to only 0.92 €/TB/month) for ~11.23€ (per TB). So even after 1 month of running, the totals are 11.92 €/TB (12TB X14) vs 12.15 €/TB (1TB ES.3). After one year it’s 12.91 vs 22.27, which means that within a year you’re paying almost double per TB for the smaller, older drives. This doesn’t include the cost of replacing drives upon failure.
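If you want to play with the numbers yourself, the per-TB math above is just this (a throwaway helper; assumes 30.5 days/month and 28.3 c€/kWh):

$ upkeep() { awk -v w="$1" -v tb="$2" 'BEGIN { printf "%.2f €/TB/month\n", w*24*30.5/1000*0.283/tb }'; }
$ upkeep 5.4 12     # EXOS X14:           0.09 €/TB/month
$ upkeep 4.45 1     # Constellation ES.3: 0.92 €/TB/month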

While researching this I was actually surprised by the idle power consumption of various drives. E.g. the ST8000NM003B has an idle power rating of 7.24W (SAS; the SATA variant is 7.06W) vs 5.3W for the HUH728080ALE600, both 8TB drives, with the Seagate being noticeably newer (2021 vs 2016, I believe). At the same time there are the X14 drives with sub-5W idle power. Another interesting data point is the HDWG21CUZSVA/HDWG11AUZSVA/HDWG480UZSVA family, where the 12TB/10TB/8TB models idle at 4.28W/7.22W/5.61W respectively; you can easily see that the 10TB variant must be built differently, because it’s one of a kind here (higher mass, higher acoustics, different temperature ratings…)


On the software side, I’m trying to figure out inter-vm networking on Xen, but the more I research, the more sceptical I am.

4 Likes

I love reading people’s homelab / homeprod blogs on the forum!

Yes, it has had it, technically since the 3B, some rev 1.3 or something (rev 1.2 was only armv7, so it didn’t have 64-bit at all), but officially only since the 4B (with Raspberry Pi OS). I think Armbian had a 64-bit variant for the 4B and the newer 3B revisions.

Since I used to have full servers running 24/7 (and lived in Europe), I’d avoid buying one now (no matter where I’d be living). I love my rockpro64 NAS (2x hdd and 2x ssd in 2 separate zpools) running freebsd. 4GB of RAM is plenty for just a NAS (NFS and iSCSI server).

Barely doing anything. And ZFS runs just fine on 2 pools.

Personally I advise people to have a separate NAS from the host running their containers or VMs. An odroid n2+ is fast enough, but if you need even more oomph, something like a radxa rockpi 5 (I don’t have experience with this board, I’m waiting for the pine64 quartzpro64) should be enough for a lot of containers, with video encoding being maybe the only thing you couldn’t do on it and would need something like an odroid h3 for (I got the h3+).

For homelabs, having old servers and being thrifty is fine, as long as you don’t keep the things powered 24/7. But ARM64 support has come really far and makes for a good homeprod platform.

Good luck on your lab, I want to read more of your journeys.

1 Like

This interim (X9DRD-7LN4F-JBOD) motherboard is terrible.
Not only are all the slots physical x8, they’re not even open-ended x8s. Am I too spoiled by all my other motherboards offering either physical x16 or at least open-ended slots?


Can’t even test 100Gb NIC

NBD, I should still be able to plug in my M.2 risers, right? They can fit in any of x4/x8/x16…
WRONG! I mean, I did manage to fit 3, spread across both CPU1 and CPU2 slots, after moving the 10GbE NICs from CPU2 to the CPU1 slots.

Of all the slots, the bottom 2 (CPU1) have anything longer than x8 blocked by the PCH heat sink, so that’s where the NICs went. I thought: maybe I’ll manage to put them all in the CPU2 domain? No, as slot 5 (2nd from the top) is similarly blocked by the RTC battery.

Ignore the Optanes in the risers; out of the 3 of them only one seems to be detected (the 512GB part, if you’re interested).
What’s interesting, it appears to be the first one from the bottom, at least the one running in a CPU2 slot:

 +-[0000:80]-+-01.0-[81]----00.0  Intel Corporation Optane NVME SSD H10 with Solid State Storage [Teton Glacier]
 |           +-02.0-[82]--
 |           +-02.2-[83]--

It’s 4AM, I’m done for today yesterday tonight? :frowning:

Oh, and one of the 1TB HDDs seems to have died; I can’t decide if that’s good news or bad (no data has been lost).

4 Likes

I’m pretty sure that the last time I was configuring anything on a 3B it was “technically 64-bit, not practically 64-bit”. The SoC obviously was 64-bit, but the bootloader wasn’t, and people were trying to figure out hacky ways to get it into 64-bit mode.

RE: power usage, I’m planning on measuring it and planning around it; but first I want some baseline benchmarks.

1 Like

Yeah, I recall it took a while since launch for 64 bit raspbian to show up.

Oh, and…

Thanks :slight_smile:

I’m a bit sad that the cable channel/PSU backplane seem to heavily interfere with the plan to put a 360 rad in place of the fan wall:




I need to recheck with the Enermax radiator, as it has slightly different clearances than the Alphacool one (394 x 120 x 27 mm vs 400 x 124 x 30 mm), but a fill port at the other end:

I should get enough clearance for either of them if I fix the SAS cable routing and put the rad flush against the backplane cage, but haven’t tested that yet.

I hate my life so very very much…


As I guessed the fill port interferes a bit on the right…

But I didn’t see that coming:

The best fit I could get with the PDB shroud on falls short by just a few mm:

I can still try going at a slight angle, after ensuring it doesn’t interfere with anything…

In the meantime, the board happily detects other NVMe drives in other slots, as long as they are not H10s. Any H10 in any other slot: not detected. Huh…

Anyway, got the X299 board home, so I’ll try putting that in, although with my luck it probably won’t fit by a mm or something…

Update: it does fit :slight_smile:


And I only had to remount 7/9 standoffs! :rofl:

Now I’m quickly remembering how fast you run out of PCIe slots on those consumer motherboards…

2 Likes

Got it mostly populated tonight:


Even the rad “fits” with cover popped open :laughing:

I’m leaving the top slot (x16) unused for now, all the other lanes are occupied.

I’m debating whether to put the HBA or the GPU in the middle or bottom slot. For now I went with the GPU in the middle (PCH) and the HBA in the bottom (CPU). With the M.2s in use both slots are x4, but I think the HBA will benefit more from the lower latency, while the GPU is there just because the motherboard complains if it’s run without any output :frowning:. I might have to do something about that.

Once again it’s 4AM and I’m not sleeping yet. But it’s time. I’ll deal with the rad placement tomorrow.

1 Like

Knowing the angle’s there will haunt me forever :laughing:


Now I need to 3D-print some supports to keep it in place.

I still need to have a long and noisy session with the UEFI to configure the fans. I put all of them in “Silent” mode and the two at the rear (above the IO) still run at over 3600 RPM with no load, and ramp up audibly every time the CPU picks up any task: they go to ~5000 RPM for a second and then back down to 3400-3600 immediately afterwards.

I also need to monitor temps, as this may be caused by a bad mount and the CPU getting thermal spikes.

In other news, having the original fan wall running at 100% PWM directly onto the radiator was the first time I saw the CPU (reminder: i9-7900X) running at 22ºC idle (the mobo panicked because I didn’t plug anything into the CPU fan header and put all the fans at full speed). It was literally at room temp, or 1-2ºC above (I’m not sure what the actual room temp was at the time).

Got the rad mounted with 3d-printed supports:

I also had to put a 120mm fan on top of the AICs because the NICs were getting hot; my guess is the original 80mm fans above the CPU were pulling too much air, so there was almost no airflow over the PCIe area. This is something I need to look into. I may also need to replace the Eiswind fans with something with a bit more oomph, or tweak the RPMs: I read somewhere that some “noisy” higher-airflow fans worked better and quieter at lower RPM than Noctuas at full speed, so that’s worth investigating.

Here’s the setup without the RGB puke from the motherboard:

1 Like

I had this experience recently with a passively cooled CPU in an ITX NAS build. The quiet fan that came with the case (Arctic F8), running at max speed (2000 rpm), could only keep the CPU around 80C. I put a ‘louder’ fan in (Arctic P8 Max, 5000 rpm max), but it’s only running at about half its rated speed and keeping the CPU in the mid-60s now. I can’t say I’ve noticed a difference in the noise, either (it’s sitting on a shelf in my office). It has the headroom if it needs it once summer rolls around, too.

1 Like

Thanks for sharing! The P8 Max would actually fit perfectly as the rear exhaust, but I’d need to compare it to the San Ace 80s that came with the case. I may actually go for P12 Maxes at the rad as well.
Both options added straight to the wishlist :smiley:

I kinda do the blogs for myself, to have a known location for the stuff I otherwise tend to write into a text file saved in a “known location”, only to forget what that location was when I need it back.

So, today I’m gonna reinstall the experimental configuration (more on that later), as I’m still deciding between a virtualized and a monolithic approach to the system, but for now, the long-awaited:

Flash storage inventory (2024-01-19)

For now I’m only gonna list flash-based storage, as that’s somewhat constant. As for the rust, yesterday I went through a batch of 2nd-hand drives and found 3 of them more or less damaged, so I’m gonna have a chat with the seller once I finalize my findings. Meanwhile, here it goes:

M.2 NVMe

  • 3x Intel “Optane” H10 512G+32G; I can’t get the board to reliably recognize the 32G optane devices so I’m gonna stick to the 512G bits. In terms of GiBs that’s 476.9GiB.
    Additional note: I can only fit 2 in the system at the same time, because I need to use the PCH M.2 slots.
  • 2x Samsung SSD 970 EVO Plus 250G; lightly used. 232.9GiB.
  • 1x Samsung OEM 256G; harvested from a laptop that decided to incinerate itself [a sad story for another day]. 238.5GiB. Under the hood it appears to be the same controller as the 970 EVO+, just provisioned for 256G instead of 250G and configured slightly differently.
nvme id-ctrl diff
$ sudo nvme id-ctrl -H /dev/nvme5 > samsung-oem
$ sudo nvme id-ctrl -H /dev/nvme2 > samsung-evo
$ diff samsung-oem samsung-evo
4,6c4,6
< sn        : S4DXN*********
< mn        : SAMSUNG MZVLB256HBHQ-000L2
< fr        : 3L1QEXH7
---
> sn        : S4EUN*********
> mn        : Samsung SSD 970 EVO Plus 250GB
> fr        : 2B2QEXM7
112,113c112,113
< wctemp    : 357
<  [15:0] : 84 °C (357 K)       Warning Composite Temperature Threshold (WCTEMP)
---
> wctemp    : 358
>  [15:0] : 85 °C (358 K)       Warning Composite Temperature Threshold (WCTEMP)
121,122c121,122
< tnvmcap   : 256,060,514,304
< [127:0] : 256,060,514,304
---
> tnvmcap   : 250,059,350,016
> [127:0] : 250,059,350,016
142,143c142,143
< mntmt     : 321
<  [15:0] : 48 °C (321 K)       Minimum Thermal Management Temperature (MNTMT)
---
> mntmt     : 356
>  [15:0] : 83 °C (356 K)       Minimum Thermal Management Temperature (MNTMT)
148c148
< sanicap   : 0x2
---
> sanicap   : 0
152c152
<     [1:1] : 0x1       Block Erase Sanitize Operation Supported
---
>     [1:1] : 0 Block Erase Sanitize Operation Not Supported
200c200
< fna       : 0
---
> fna       : 0x5
202c202
<   [2:2] : 0   Crypto Erase Not Supported as part of Secure Erase
---
>   [2:2] : 0x1 Crypto Erase Supported as part of Secure Erase
204c204
<   [0:0] : 0   Format Applies to Single Namespace(s)
---
>   [0:0] : 0x1 Format Applies to All Namespace(s)
246c246
< ps      0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
---
> ps      0 : mp:7.80W operational enlat:0 exlat:0 rrt:0 rrl:0
249c249
< ps      1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1
---
> ps      1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
252c252
< ps      2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2
---
> ps      2 : mp:3.40W operational enlat:0 exlat:0 rrt:2 rrl:2
255c255
< ps      3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
---
> ps      3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
258c258
< ps      4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
---
> ps      4 : mp:0.0100W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
  • 1x Samsung SSD 980 1TB; not PRO, unfortunately. I have 2 PROs, but they are currently in use in other systems. Lightly used, for write-once data. 931.5GiB.

SATA SSD

  • 2x Intel DC S4600 240G; different wear levels. 223.6GiB.
  • 1x Samsung SSD 860 QVO 2TB; not tortured. Also lived in my laptop. 1863GiB.
  • 1x SSDPR-CL100-960-G3; or the trusty old GOODRAM. 894.3GiB.

NVMe formatting

Unsurprisingly, none of the drives supports more than one namespace, but I should still be able to underprovision within n1. For benchmarks:

$ sudo blkdiscard /dev/nvmeXn1

should suffice.
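One way to do the underprovisioning within n1 (a sketch; the split is arbitrary) is to simply leave part of the namespace unpartitioned after the discard, e.g. for the 980:

$ sudo sgdisk -n 1:0:+800G -t 1:8300 /dev/nvme4n1   # ~130GiB of the 931.5GiB left unallocated as extra spare area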

Assignments

I’m not yet sure what to do with all the drives. I’ll likely keep one H10 as a spare, unless I find a reliable way to have it enumerate, e.g. this time it decided to pop up:

nvme0n1             259:0    0 476.9G  0 disk                                INTEL HBRPEKNX0202AL   PHxxxx-1
nvme1n1             259:1    0 476.9G  0 disk                                INTEL HBRPEKNX0202AL   PHxxxx-1
nvme3n1             259:2    0  27.3G  0 disk              isw_raid_member   INTEL HBRPEKNX0202ALO  PHxxxx-2
└─md126               9:126  0     0B  0 md
nvme2n1             259:3    0 232.9G  0 disk                                Samsung SSD 970 EVO Pl 
nvme6n1             259:4    0 232.9G  0 disk                                Samsung SSD 970 EVO Pl 
nvme5n1             259:5    0 238.5G  0 disk                                SAMSUNG MZVLB256HBHQ-0 
nvme4n1             259:10   0 931.5G  0 disk                                Samsung SSD 980 1TB    

At some point I wanted to use the S4600s for some ZFS special vdev (either metadata, ZIL or L2ARC), but I found a better use for them as the boot & VM drives in an MD RAID mirror. For now it works remarkably well (in testing).
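The mirror itself is completely vanilla mdadm, roughly (device names are examples; the ESPs live outside the array):

$ sudo mdadm --create /dev/md0 --level=1 --metadata=1.2 --raid-devices=2 /dev/sdd2 /dev/sde2
$ sudo mkfs.ext4 /dev/md0
$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf   # so the array assembles at boot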

The current setup has a total of 8 M.2 slots, with a x16 bifurcation card occupying the first x16 slot, so:

  • 6x CPU (x4 lanes, limit: x24)
  • 2x PCH (x4 lanes, limit: x4)

The PCH slots I occupy with the H10s, so they are not even limited by the width currently (each H10 half is x2), excluding other traffic through the PCH (e.g. SATA, VGA).
This leaves me with 6 CPU slots and 4-5 sticks to occupy them with. So for now I just populated all the slots.

Next time maybe: ZFS Sacrilege

1 Like

Before I dive into that rabbit hole and get myself cancelled from this forum for my ZFS heresies, I need to vent my frustration at the Xen hypervisor.

The beginnings were really nice. Initial setup, ability to run dom0 directly or virtualized. EFI integrations. All seemed very nice.

Some problems started when I tried installing OPNsense in an EFI HVM. I did everything by the book, could get into the UEFI and even into the bootloader. But no matter how hard I tried, I couldn’t get it to boot properly; it would always stop at the same spot. After some troubleshooting, I could get it to boot a bit further, but for some reason with no inputs. The same would happen after swapping the ISO to archiso: inputs in the EFI were OK, but not in the booted Linux. Well, sucks.

It turned out I could get it to boot properly by changing the machine type from EFI to the default (BIOS). So for my first “look around” I just went with it and installed it in a BIOS HVM. I needed another VM for tests, which ended up being a PV Arch instance. Cool: I had brought up an HVM and a PV, and had some experience with debugging the setup.

Now, yesterday I started testing my recorded procedure for (re)setting up the whole software stack on the host. Everything was running smoothly, until at one point qemu-xen started throwing unknown opcode errors out of the blue. It took me a long time to figure out what was going on, until I finally realized that my “host” (dom0 in Xen speak)… doesn’t list AVX as supported when running under Xen. That was obviously not the case when running without the hypervisor, so I started digging. After a fair number of useless leads I finally stumbled upon the cause and the solution at the same time:
https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html#spec-ctrl-arm

Specifically this fragment:

On all hardware, the gds-mit= option can be used to force or prevent Xen from mitigating the GDS (Gather Data Sampling) vulnerability. By default, Xen will mitigate GDS on hardware believed to be vulnerable. On hardware supporting GDS_CTRL (requires the August 2023 microcode), and where firmware has elected not to lock the configuration, Xen will use GDS_CTRL to mitigate GDS with. Otherwise, Xen will mitigate by disabling AVX, which blocks the use of the AVX2 Gather instructions.

“Well, sucks”, I thought, adding the proper disable option to the command line. I can accept some performance hit in the name of security, but not in the form of outright disabling AVX.
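For anyone hitting the same thing: the gds-mit knob lives under spec-ctrl= on the Xen command line, so e.g. via GRUB it’s something like this (with xen.efi booting directly, the same option goes into the options= line of xen.cfg instead):

# /etc/default/grub (appended to whatever is already in there):
GRUB_CMDLINE_XEN_DEFAULT="spec-ctrl=gds-mit=no"

$ sudo grub-mkconfig -o /boot/grub/grub.cfg
$ grep -m1 -o avx2 /proc/cpuinfo    # sanity check after reboot: AVX is visible in dom0 again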

Right after fixing that, I stumbled upon another issue, where trying to activate an SR-IOV Virtual Function on a NIC would fail with a very generic error. This time it turned out to plague both bare-metal Linux and the Xen-virtualized dom0, so the fix was not necessarily Xen-centric. It turned out I had to add a specific Linux kernel option that I found in one thread, on one forum. Not only did it take me a fair amount of time, the euphoria of solving the riddle was rather short-lived when it turned out I couldn’t even pass the virtual function NIC through to the target VM, which was supposed to be… the new version of the HVM OPNsense.

I spent another hour or two trying to figure out why I couldn’t pass the card, before realizing I couldn’t actually pass any PCI device to the HVM. I spun up a Linux PV and verified that passthrough was working there. OK, progress. I finished setting up the Linux VM and went back to OPNsense (which is FreeBSD under the hood, BTW). I started working towards running it as a PV, but for some reason I couldn’t get pygrub to recognize the root and find the kernel. After wasting another unspecified amount of time I settled for running it as a PVH with an extracted kernel (n.b. I can currently only extract it because apparently my host kernel has been compiled with read-only support for UFS…). Which worked!
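The kernel-extraction dance amounts to something like this on the dom0 side (assuming the guest lives in a raw image; loop device and partition number are whatever your layout gives you, and read-only UFS support is enough):

$ sudo losetup -Pf --show /srv/xen/opnsense.img     # say it comes back as /dev/loop0
$ sudo mount -t ufs -o ro,ufstype=ufs2 /dev/loop0p4 /mnt
$ sudo cp /mnt/boot/kernel/kernel /srv/xen/opnsense-kernel
$ sudo umount /mnt && sudo losetup -d /dev/loop0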

But then I got an error that PCIe passthrough is not supported on PVHs… (⁠╯⁠°⁠□⁠°⁠)⁠╯⁠︵⁠ ⁠┻⁠━⁠┻

The PV kernel panicked on me twice, I think; I don’t remember anymore. After reading the FreeBSD docs I’m no longer sure whether it’s even supposed to work or not.

I need to sleep on it, it’s past 4AM again and I tried really hard to be done by 2. Writing this rant for another hour doesn’t help.

4 Likes

A new set of fans arrived this week:

Thanks @Molly for recommending the Pn MAX series, really like the blade design on those.

I ended up going with their server fans for some reason instead of P8s though. For now they seem to be better at being quiet than the San Aces, but in an A/B test at “normal” RPM I wasn’t sure which is which.

Didn’t test them too much yet, barely finished replacing 3x Eiswind with 5x P12 MAX:

The P14 Slim is going to help cool the AICs, I still need to print a mounting bracket though:


On the software side I’m still in a bind, because the original plan (detailed below) is not going to work, sadly. I think. I didn’t manage to get PCIe passthrough to an HVM working, and couldn’t get OPNsense to run as a PV, even after installing the Xen “additions”. Tough luck.

The original plan was to get a Xen hypervisor and a lean Arch distribution (or something different, but highly customizable; none of that Ubuntu crap) to serve as a minimal dom0 (for those unfamiliar with Xen speak: the privileged VM), plugged only into a “management LAN” with dedicated physical connections and only on-demand internet access. Working hostname for the dom0: “vmserver”.

Along the dom0 there were supposed to be at least 3 different virtual machines (domUs) running on the server:

  • A dedicated router OS, most likely OPNsense, to manage the 10G NICs and switching on them; working hostname: “opnsense”
  • A storage server, likely another Arch installation but I wasn’t hell-bent on that. ZFS management, storage passthrough (HBA + NVMes), NAS servers (SMB, NFS, iSCSI), this kind of stuff; working hostname “zfserver”
  • A services server. All the other junk that I want/need to run, like the internal certbot, pihole, lancache, etc. The only vm allowed to run somewhat unvetted software in containers (like the pihole or lancache); working hostname: “svcshost”

As you might have noticed, all the hostnames are 8-letters long. The word “hostname” also has 8 letters. Coincidence? ( ͡° ͜ʖ ͡° )

Initially I wanted the domUs to have direct access to the router VM, bypassing dom0, for additional separation, but that seems to be impossible, at least for now. The closest thing I have achieved is VF passthrough from the NICs, since dom0 can stay disconnected from those, and this is likely going to be the way forward.

Ideal network separation diagram

+------------------------------------------------------------+
| Xen Hypervisor                                             |
|                                                            |
|+----------------+                                          |
|| opnsense [nic0]+-------------------------------------[nic0]
||          [nic1]+-------------------------------------[nic1]
||          [mgmt]+---------+                                |
||          [vif0]+----+    |                                |
||          [vif1]+--+ |    |                                |
|+----------------+  | |    |                                |
|                    | |    |                                |
|+----------------+  | |    |                                |
|| zfserver [vif0]+--+ |    |                                |
||          [mgmt]+----)--+ |                                |
|+----------------+    |  | |                                |
|                      |  | |+------------------------------+|
|+----------------+    |  | ++[vif0]-----+        vmserver  ||
|| svcshost [vif0]+----+  +--+[vif1]-----+-[mgmt-lan]       ||
||          [mgmt]+----------+[vif2]-----+--------------[eth0]
|+----------------+          +------------------------------+|
+------------------------------------------------------------+

Non-ideal (SR-IOV based) network separation

+------------------------------------------------------------+
| Xen Hypervisor                                             |
|                                                            |
|+------------------+                                        |
|| opnsense [n0p0v0]+--------[passthrough]------------[n0p0v0]
||          [n0p1v0]+--------[passthrough]------------[n0p1v0]
||          [n0p2v0]+--------[passthrough]------------[n0p2v0]
||          [n0p3v0]+--------[passthrough]------------[n0p3v0]
||          [n1p0v0]+--------[passthrough]------------[n1p0v0]
||          [n1p1v0]+--------[passthrough]------------[n1p1v0]
||          [n1p2v0]+--------[passthrough]------------[n1p2v0]
||          [n1p3v0]+--------[passthrough]------------[n1p3v0]
||            [mgmt]+-------+                                |
|+------------------+       |                                |
|                           |                                |
|+----------------+         |                                |
|| zfserver [vif0]+---------)----------[passthrough]--[n1p3v1]
||          [mgmt]+-------+ |                                |
|+----------------+       | |                                |
|                         | |+------------------------------+|
|+----------------+       | ++[vif0]-----+        vmserver  ||
|| svcshost [vif0]+--+    +--+[vif1]-----+-[mgmt-lan]       ||
||          [mgmt]+--)-------+[vif2]-----+--------------[eth0]
|+----------------+  |       +------------------------------+|
|                    +-----------------[passthrough]--[n1p3v2]
+------------------------------------------------------------+

The ideal layout assumes a full NIC can be passed; I’m not 100% sure that’s the case, but I didn’t try passing all 7 functions at once (4x PF for the VPs, 1x general NIC at .4, and 2 storage offloading functions). The non-ideal one assumes SR-IOV passthrough of all the ports to the router, plus an inter-domain connection via the last Virtual Port: all n1p3v* act as if they were connected to the same network.
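For completeness, the SR-IOV variant for one of the Linux domUs boils down to an xl config along these lines (a sketch: BDF, bridge name, disk path and sizes are all placeholders; PV because that’s where passthrough actually worked for me):

$ sudo tee /etc/xen/zfserver.cfg >/dev/null <<'EOF'
name       = "zfserver"
type       = "pv"
memory     = 16384
vcpus      = 8
bootloader = "pygrub"
disk       = [ 'phy:/dev/vg0/zfserver,xvda,w' ]
vif        = [ 'bridge=xenbr-mgmt' ]     # management LAN only
pci        = [ '41:02.1' ]               # the n1p3v1 VF passed through
EOF
$ sudo xl pci-assignable-add 41:02.1
$ sudo xl create /etc/xen/zfserver.cfg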

Except for OPNsense, all the other VMs could be running as PVs, as they are Linux. In theory I could just go with some Linux-based router OS like VyOS; the list is long. Which prompts the…

Solution #1 - Linux-based Router OS in PV

Since PVs work so far, this solution seems like the obvious choice. I do, however, have some reservations. What if that’s not the only broken thing that I’m about to encounter in Xen? What if it’s another piece of the puzzle that’s broken? We’ve seen pfSense running in xcp-ng and that’s also Xen, right?

If I go that route I’ll probably choose VyOS. Going with Linux has the additional benefit of drivers - while the kernel is a mess at times, the community and corporate support here seems to be above what FreeBSD can offer.

Solution #2 - Dedicated hypervisor distribution

See the previous point: we’ve seen pfSense running as an xcp-ng VM, at least on the Son of the Forbidden Router, IIRC. With passthrough. Which means it should be possible. I should probably at least try to set it up and check whether similar problems persist.
I’m not a huge fan of those dedicated hypervisor distributions, because they all hide the details and do things “their way” as opposed to doing them “my way”. Even something as dumb-simple as virt-manager can be a pain to work with the moment you have to do something outside the box.

Just so we’re clear, I have never used xcp-ng and I’m not dissing on it; I just have a feeling it’s not going to be my cup of tea. And if you think about suggesting Proxmox…

Solution #3 - Just go KVM

… I’d rather just stay with the original plan, but switch to KVM and libvirt, as I already have a ton of experience with them. I just really wanted to go with Xen initially for the added separation of dom0, but the more I work with it, the fewer differences from KVM I see. For instance, I assumed I wouldn’t even have to deal with “passthrough” and that I’d be able to just assign hardware to VMs sort of like I do vCPUs or memory; but at least from the tutorials and documentation I went through, I see no way to do it properly.

Side note: It appears to still be possible though, see e.g. this slideshow.

Solution #4 - Ah, screw it! (go monolithic)

If all else fails, I could just go monolithic instead. All the separation and security is gone, but at least it works, right? Right?


For now I’m pretty undecided, so I’m ready to receive feedback.

Edit: sorry for the typos, my eyes hurt already :disappointed:. Just re-read the post on my phone and corrected some, but there may still be more left.

4 Likes