Marandil's Homelab evolution

Well, F*

admin@zfserver:~ $ sudo zpool replace tank /dev/disk/by-partuuid/55b4b54a-efe1-497f-8dc6-c231bfe2aaa8 /dev/disk/by-partuuid/62cdfe6f-f488-4932-876c-1cc851c423fd
[sudo] password for admin:
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-partuuid/62cdfe6f-f488-4932-876c-1cc851c423fd is part of active pool 'tank'

No shit, Sherlock!

EDIT: I think I fixed it, but had to blkdiscard -f /dev/disk/by-partuuid/62cdfe6f-f488-4932-876c-1cc851c423fd first. Kinda annoying, since now there’s a lot more than 5 MiB to resilver.

I’m late to this thread but it struck a chord. I used Xen for home virtualisation of services about 8 years ago and I kept on running into weird issues.

I have no problem troubleshooting stuff, but I’d gone to the trouble (for various reasons) of running it on new Supermicro gear (Xeon L3426-based) and I wasn’t happy that this enterprise gear kept on crashing or having weird issues.

Long story short, I ended up running ESXi because I needed VMs running and working for work reasons.

All I can say is, I feel your pain!


The project’s been on hold for some time, primarily due to lack of time and secondarily due to more pressing issues - I gotta build a house to have a server room to put the server in, after all :wink: .

In the meantime I got 4x DC S4600 480GB super cheap and had some time to rethink the architecture, and some more stuff.

Consolidation

I’ve decided to forego virtualizing the storage server for several reasons:

  • To get some of the features I want, like having snapshots available through Samba, the “serving service” and the filesystem need to be a part of the same OS
  • To fully utilize some offloading, like iSCSI, I’d need the NIC and the storage on the same OS. Maybe SR-IOV would work, but…
  • … as tested before, SR-IOV for inter-VM communication is inferior to a dumb bridge.
  • I initially planned to pass through the full HBA to “ZFServer” to have native access to the drives from the VM, instead of passing through individual drives. However, this limits the storage available to the main system and other VMs.

Therefore I decided to remove “ZFServer” from the architecture and serve storage directly from the host system.
Additional services that don’t require this level of integration will likely be hosted in containers or minimalistic VMs, TBD.
OPNsense still remains a VM.

Storage changes - Root-on-RAID

I decided to change the previous root-on-raid1 scheme slightly.
The 2x 240GB DC S4600 that were previously meant to store the mirrored root partition, VM images and the like can now be extended with other SATA SSDs, since those are now also visible in the host system (no HBA passthrough).
The new 4x 480GB DC S4600 will be added to the LVM2 volume group that previously consisted only of the smaller drives. This means I’ll have access to 1200GB of mirrored flash, with about 720 GB of potential triple-striped LVs (and an additional 480 GB that can only be double-striped).
The smaller drives will still host the mirrored EFI partition and potentially I’ll have to deal with initializing the HBA from initramfs. Hopefully that won’t be an issue.

Storage changes - ZFS

Instead of RAIDZn, I’ll likely change to simple striped mirrors. It’s much easier with a non-uniform drive collection, and I don’t have to play around with weird combinations.
Two cons though:

  • much more wasted space
  • identical drives in the same mirror are much more likely to fail at the same (or a similar) time

Should be much easier to upgrade later though. New mirrors can be added, smaller drives replaced one-by-one, etc. Growing RAIDZ is still not supported AFAIK.
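Roughly, the layout and the upgrade path look like this (just a sketch; pool name and device paths are made up):

    # two-wide mirrors, striped at the pool level
    zpool create tank \
        mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B \
        mirror /dev/disk/by-id/ssd-C /dev/disk/by-id/ssd-D
    # growing later is just another vdev...
    zpool add tank mirror /dev/disk/by-id/ssd-E /dev/disk/by-id/ssd-F
    # ...or swapping the members of one mirror for bigger drives one-by-one
    zpool set autoexpand=on tank
    zpool replace tank /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-A-new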


I believe it should be - just replace the disks one by one, wait for each resilver, and at the end your whole pool will be limited by the capacity of the smallest drive in it.

I much prefer building a new pool and zfs-sending the heck out of the old one. That way, the data also gets reshuffled / reordered / balanced.

I don’t remember what you were trying to do.

If your data’s so important to you that you’re willing to go 3-way mirror, then use ZFS, not LVM. But IMHO you’re better off building a backup server than doing a triple mirror. Do you need all that performance and redundancy? I doubt it. Use your SSDs wisely and take serious backups of the data you actually need.

You can back up 720 GB of flash with just 2x 2TB mirrored drives. Always make the backup server larger than the main one, because you’re going to keep the latest backup + older backups, meaning more space utilized. But on the plus side, your backup server can be dog slow, as long as you temper your RPO (maybe weekly backups).

My own backup server is an Odroid HC4 with 2x 20TB IronWolf Pro drives. To put it into context, I have a 1 TB NVMe pool, a 256GB single gum stick, a 2TB SATA flash pool, an 8TB SATA flash pool and a 10TB spinning rust pool that all need some backups - I don’t use nearly as much data; some of them are barely at 20% utilization (if even 5%). Most of the 8TB flash pool is testing, with the 2TB small flash pool and the 10TB spinner holding the somewhat important data. Only a very small portion of all this is important enough to back up, but since I have so much capacity, I might as well back up everything in one go.

My home folder on the 256GB NVMe drive is the one that gets the most attention. Everything else is just “nice backups to have” (like a disk image of my router’s eMMC drive, in case the router borks itself, or I mess with it and break it).


Again, I’m asking: do you need a 3-way mirror flash pool? If the data’s so important, why not do a 2-way striped mirror and use the other 2 drives as a single mirror for a backup pool? (Did I calculate that right? 3x mirror x2 for the stripe is 6 drives… reducing it to 4 means 2x mirror x2 for the main pool and a 2x mirror for the backup pool, meaning you’ll have a 2:1 size discrepancy between the pools. Unless you need to back up the entire pool, this should work for the important data; but if you do need to back up the whole pool, it’s best to go all-in on performance with 3x striped 2-way mirrors - the total usable capacity of 3 of your 480GB disks combined - and build another large-capacity backup pool, which can be just a mirrored vdev, to which you zfs-send data every now and then.)

Note: you can have your backup pool on the same system. I prefer having it separate, because I can just shut down the backup server when it’s not needed (to save power) and it’s technically more secure (theoretically nothing runs on the backup server, so even if you pull crap / ransomwared data via zfs-send over ssh, you still have the previous snapshots locally - this becomes a problem if you push zfs-send snapshots from main to backup, instead of doing a pull from the backup server, connecting to main and grabbing them from there).

You either read that wrong, or I wrote that wrong. It’s not a 3-way mirror, it’s a 2-way mirror and 3-way stripe (6 drives total) :wink:

And that flash is mostly OS, VM storage, and the like - separate from the typical data.
Oh, and it must/should be accessible from a rescue ISO, so ZFS is likely off-limits (there are more reasons, but let’s leave it at that; root-on-ZFS is not a concept I like).

hrmpf or a simple proxmox installer should allow you to mount a zfs root fs in a /mnt location. I did that a few times. If it’s something people are interested in, I can write a quick wiki entry on the forum on how to recover from a broken install that uses root-on-zfs.

I personally don’t like root-on-zfs that much either (despite actually running it on my main system, on a single disk, with encryption enabled! lmao), but it has its value when you don’t have the luxury of adding another disk for the OS (well, it’s more than that - how about taking a snapshot, updating your system, and then your system is broken on reboot? A simple zfs-boot-menu can roll back to that snapshot, but I never took the time to set that up).

I still much prefer the OS on ext4 (or ufs / ffs), but only because my idea of an OS is that it can always be wiped and reconfigured (or you can just mount the OS volumes, remove all the OS files, copy an old backup of the rootfs and you’re off to the races, assuming your partitions or volumes aren’t corrupted).

… and that’s another reason why I’m considering this LVM-on-RAID stuff: LVM also has fully-fledged COW snapshots, albeit at the block-device level.
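For illustration, the kind of block-level snapshot workflow I mean (a sketch; VG/LV names are hypothetical):

    # classic COW snapshot of the root LV before an upgrade
    sudo lvcreate -s -n root_pre_upgrade -L 10G vgroot/root
    # upgrade went fine? drop it:
    sudo lvremove vgroot/root_pre_upgrade
    # upgrade broke something? merge the snapshot back (applied on the next activation/reboot):
    sudo lvconvert --merge vgroot/root_pre_upgrade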

While I’m nearly done reading through @ThatGuyB’s rants (literally), currently at -26 days or so, I’ve started having second thoughts about my init system of choice.

I initially didn’t even notice when distributions went from sysvinit to systemd, except that things were breaking during major upgrades - normally no big deal; I wouldn’t do a do-release-upgrade or oldstable → stable bump if I didn’t have the time and means to fix things.

Then, circa 2020, I encountered some systemd quirks that really got me worked up, primarily in the realm of initializing networks and network-based resources (I recall we were considering pivoting from Ubuntu servers to something non-systemd-based at work for these reasons).

However, this mostly pushed me into learning how systemd units interact with each other, and it turned out those dependencies were actually pretty easy to model. I believe that was around the time I first started actually writing service files instead of just starting services manually after each server reboot (on my private servers) like I used to before*.

In the past 4 years I’ve learned to live with systemd and its quirks. Maybe I even like it somewhat. Or at least some parts of it, like systemd-networkd. That last one bit me in the butt once or twice, but that was kinda my fault (or the package maintainers’) for not specifying default wired DHCP behavior. I also kinda like the minimalistic systemd-boot, and in general I’m in favor of having a somewhat consistent configuration style between system components. Stuff like mounts and automounts that you can set up for Kerberos-authenticated NFS shares (with dependencies!) or socket-activated services is just icing on the cake.
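As an illustration of the mount/automount bit, a minimal sketch of such a unit pair (share path, hostname and Kerberos flavor are made up):

    # /etc/systemd/system/mnt-data.mount
    [Unit]
    Description=Kerberos-authenticated NFS share
    Requires=rpc-gssd.service
    After=network-online.target rpc-gssd.service

    [Mount]
    What=fileserver.lan:/export/data
    Where=/mnt/data
    Type=nfs4
    Options=sec=krb5p

    # /etc/systemd/system/mnt-data.automount (this is the unit to enable)
    [Unit]
    Description=Automount for /mnt/data

    [Automount]
    Where=/mnt/data

    [Install]
    WantedBy=multi-user.target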

However, there are many valid reasons not to use systemd. I won’t go into details, because most of you know them already, and if you don’t, you can find them pretty easily. I don’t like the leadership style (reminds me of the Microsoft of yore: embrace, extend, extinguish); I don’t like how it’s somewhat “forcing itself” into other packages (systemd dependencies; the OpenSSH / liblzma vulnerability, for instance). I’m not comfortable with how many CVEs have been assigned. I don’t like that udev is getting more and more integrated into systemd. I don’t like how it breaks stuff and then demands changes from others (yeah, that one bit me too).

So I started considering an alternative for my server, like s6. I’ll probably give Artix a try instead of Arch, but reading the Artix installation wiki I was brutally reminded of the non-systemd/udevd network interface names: eth0, eth1, eth…8… How do you even know which is which? How do you filter them in init scripts? I didn’t like the predictable-but-long names (enp1, ens5f0d3, and the like) at first, but they grew on me once I learned what they mean. Also, with systemd-networkd I can just filter interfaces by their default MAC addresses and rename them if needed.

So I’m a bit torn, frustrated and sad at the same time. I guess I’ll need to give it a go and see for myself. End of rant for today tonight.

How do you guys cope with that? xD

*) Some things I still start manually because I keep forgetting to write units for them, and uptime on the server is usually pretty long. Oh, well.


The rants are cool and all; it’s more like a journal to me. But think of it like reading a whole website that you just randomly found online (idk, UnixSheikh?). You wouldn’t go from the beginning to the end (although you could - just skip the parts that don’t seem interesting). I screw around there with SBCs, the lab, random internet stuff, life experiences, different OSes, the solar conversion and more.

If you’re interested in the s6 suite, read the 2 wikis on the forum and its adjacent comments. Maybe help me improve the wiki, comment there and let me know what I should improve.

I wish I had a personal git (github sucks and I’m there too), to have bugs open for my wikis and track them (it’s easy to lose track of stuff here, which is why I kinda added [WIP] everywhere on the wiki that needs improvements, but I need to read it again for context, then remember what I was supposed to add there).

With the exception of systemd-boot, most of the systemd-somethings bit me (networkd, timed and mount come to mind immediately).

gummi-boot is nice; I used it on arch eons ago. I use it on proxmox now (because grub on my hardware is broken if you don’t have a keyboard plugged in). I also have the gummiboot-efistub (not the full gummi-boot) on my system (which isn’t even running systemd) to start zfs-boot-menu.

I prefer abduco + dvtm, but I’m not so high up on my horse that I’d suggest you change your workflow; I’m just stating my preference. But I love how the tmux folks didn’t compromise in that thread.

Artix is my least favorite s6 implementation, because everything is literally started daemon-tools style. There’s the dependencies.d definition, but nobody in Artix thought it was a good idea to use it, which means, for instance, that sshd starts, there’s no network, it crashes, starts, there’s no network, crashes, then there’s network and it finally starts - instead of just declaring a dependency and waiting until networking is available. You can do that yourself, but come on, at least ship some sane defaults.

That’s why I’m working on my opinionated s6_services git repo and the 2 wikis on the forum (the one on how to install and the other, more generic one, on how to use s6 and s6-rc).
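For reference, declaring such a dependency in an s6-rc source directory is about this much work (a sketch; the service and dependency names are hypothetical, and it assumes an s6-rc version with dependencies.d support):

    # ./sshd is a longrun that hard-depends on a "network" oneshot being up
    mkdir -p sshd/dependencies.d
    echo longrun > sshd/type
    touch sshd/dependencies.d/network
    cat > sshd/run <<'EOF'
    #!/bin/execlineb -P
    /usr/sbin/sshd -D -e
    EOF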

I’ve got eudev on my system. Works fine; I get enp-something-something as interface names. The s6 suite provides mdevd, which is minimal. I haven’t tried it and idk what it does (besides being a hotplug daemon like udev); I stuck with eudev. I might give it a shot if I ever work on a minimal system. skarnet.org runs on a custom distro built on busybox with s6, with the whole website hosted on the smallest gandi.net offering: 1 core, 512MB of RAM (it serves http and git - what more could you want?).

Why do you need to filter interfaces, out of curiosity? With the exception of devices not coming up in the same order at boot, I don’t see any reason not to use the classic ethX. If your interface names are randomized, then you’ve got a problem with static configurations (unless you check for the MAC and apply them to the interface with the correct MAC, which sounds like a nightmare to maintain anyway - I’m glad I don’t have to go through that and that my interface names stay the same across reboots).

The reason I moved to s6-rc is that my uptime was so short. I wanted to ensure that a) my startup services always respect the same dependencies / order and b) no services are stopped in the wrong order. I solved a) pretty easily in the runit service files (the run script that launches the daemon), but I couldn’t get b) to work (everything is killed in parallel in runit).

I was thinking of moving to NixOS (which uses systemd, but is so removed from systemd in its own way), but then I (re-)found s6-rc. I’d still love to have a declarative OS like NixOS, but for what I’m doing, as long as I back up my configurations and my actual data, I don’t need to deploy the same thing over and over (although it was really nice when I did that on 2 or 3 Raspberry Pis).

IDK how well s6-rc would work for systemd veterans. I love that it just gets out of your way, but the service files are scripts. Worse, they are execline scripts (which is awesome in its own way, but you need to learn execline). Well, you can literally write the run scripts in shell, perl, python or anything else, really, as long as you specify the shebang, but ideally you use execline to just call the shell run scripts from another location.

Unlike systemd, you don’t have soft dependencies and ordering; everything is a hard dependency. And there are a few concepts that go way beyond systemd’s design, like bundles, which are so flexible yet so simple, it’s incredible.

And like its daemon-tools predecessor, the context for a service always remains the same for every execution, so you always have a clean environment for your programs to run in.


Thank you for your insights. I had to wait until I had a couple of free hours to process them and find answers.

I know, but I hate reading forum topics “from the middle”. One more reason not to touch the lounge or other long-living threads… I still have 1804 posts of “Gaming on my Tesla” to go through xD.

Thanks, I’ll have a look.

:man_facepalming:

They are not randomized - I hope - but I have 9 of them (2x4 on discrete NICs, 1 on the motherboard), not counting VFs. Some of them have dedicated functions, some might in the future be used for different subnets.
At least one of them will be used for WAN; most of the rest should be switched. Now imagine one of the NICs doesn’t enumerate during POST (for whatever reason, shit may happen) and the WAN interface gets assigned eth5 instead of eth9 and ends up added to a bridge instead of going to the firewall VM.

In networkd I’d just filter by MAC or PCIe location (PCIe tree appears to be super stable on my board).
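Something along these lines (a sketch; the MAC, PCI path and name are made up):

    # /etc/systemd/network/10-wan.link
    [Match]
    MACAddress=aa:bb:cc:dd:ee:ff
    # or match by PCIe location instead:
    # Path=pci-0000:03:00.0

    [Link]
    Name=wan0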

So far I’ve only checked out s6-rc/examples at master · skarnet/s6-rc · GitHub and I have to say… It’s workable, but I’m not sure it’s what I’m looking for. I’m not a fan of having a separate file for every piece of information and I’d prefer a more… compact (?) configuration file format. Well, I guess it’s time to write yet another init system. With blackjack Python and hookers YAML.

JK. But maybe…

systemd.target?

For now I think I’m staying with systemd. I have my reservations, but it works well for me.

So it’s kinda what I was thinking of with randomization, except it’s the NIC not getting enumerated at all, or not getting enumerated in time, and the order changing. So, yeah, we were thinking of the same scenario.

Check out 66 and Obarun. 66 is probably what you’re looking for: a wrapper around s6. I didn’t check it out that in-depth, because I wanted to first understand the core concepts of s6 before going into what the top layer does to make it work. But I became a fan of the s6-rc way of doing things, because it’s infinitely automatable. The reason it’s written so simply is exactly to allow extensibility.

Looking at the latest page, something seems odd to me.
https://web.obarun.org/software/66/latest/

Previously 66 was the result of the combination of the former s6 and s6-rc.

That’s what I knew of it.

With time and code improvement the s6-rc program was dropped. 66 is now a fully independent service manager, although the name has been retained.

It seems it has evolved way past the original scope. I guess 66 isn’t a wrapper around s6-rc anymore. Oops. Well, even more reason for me to finish the wiki and have people understand s6-rc and make wrappers around it.

I recall 66 used to have a service file definition that was completely offline and modified the s6-rc source files automatically. You had a single file, similar to a systemd unit file, where you defined the service start command, the service stop command and more.

File content example

    [main]
    @type = classic
    @description = "ntpd daemon"
    @version = 0.1.0
    @user = ( root )

    [start]
    @execute = (
        foreground { mkdir -p  -m 0755 ${RUNDIR} }
        execl-cmdline -s { ntpd ${CMD_ARGS} }
    )

    [environment]
    RUNDIR=!/run/openntpd
    CMD_ARGS=!-d -s

Even with this latest frontend, it seems 66 still maintains some ties to s6 (in particular, it utilizes s6-svscan and seems to like execline scripts). I personally don’t like this approach, but I can see why some people would prefer it to s6-rc. You automate a single file, instead of automating around folders.

I rather agree with Laurent Bercot: parsing is difficult and prone to bugs, so avoid parsing if you can. But it seems like 66 is doing parsing in a weird way. I’m not even sure what that definition style is; it doesn’t look like YAML or JSON, and it looks reminiscent of systemd unit files.

Systemd.target is more like runlevels in openrc. Bundles are literally collections of atomic services and / or other bundles. You can have a bundle acting as a systemd.target, and you can have a bundle acting like an individual service (e.g. “systemctl start www”, which would then activate, according to the dependency tree: mysql, nginx and php). You can even pass a cmdline argument at boot to boot into a custom mode instead of the default system target (the name of the bundle or service as the argument). That will launch only the needed target (kinda like single-user mode).

Imagine doing a custom systemd unit file that’s defined as a oneshot whose single purpose / definition is “systemctl start X && systemctl start Y && systemctl start Z” (and the stop command in reverse order). With bundles, you don’t need to think about it: just define the contents of the bundles and you’re done (assuming your services have properly defined dependencies).
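A quick sketch of what that looks like on disk (names are hypothetical, and it assumes an s6-rc version with contents.d support):

    # a "www" bundle: starting it pulls in nginx, php-fpm and mysql via the dependency tree
    mkdir -p www/contents.d
    echo bundle > www/type
    touch www/contents.d/nginx www/contents.d/php-fpm www/contents.d/mysql
    # recompile the service database, make it live, then bring the bundle up
    s6-rc-compile /etc/s6-rc/compiled-new /etc/s6-rc/source
    s6-rc-update /etc/s6-rc/compiled-new
    s6-rc -u change www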

That’s the most important thing. If it works for you, then that’s great. It didn’t work for me, so I looked for alternatives (more like, I was pulling my hair out and had to switch for my own sanity).

One more scenario I thought up: some interfaces might be hotplug, e.g. on USB. Not applicable here, but possible. Those can mess up the ordering pretty well :wink:

Will do.

I’m pretty sure you can use systemd.targets that way as well:

[Unit]
Description=My test bundle
Requires=sshd.service systemd-networkd.service systemd-resolved.service

[Install]
WantedBy=multi-user.target
# systemctl enable my.target
Created symlink '/etc/systemd/system/multi-user.target.wants/my.target' → '/etc/systemd/system/my.target'

Start also works, I think…

Ah, so there’s the difference. systemctl stop my.target doesn’t stop the services listed. I need PartOf= instead (systemd - How to stop all units belonging to the same target? - Unix & Linux Stack Exchange). ConsistsOf= would be nice, but AFAICT it can’t be specified (WHY NOT, LENNART?)

So this works, but is needlessly verbose:

$ cat /etc/systemd/system/my.target

[Unit]
Description=My test bundle
Requires=my-a.service my-b.service

[Install]
WantedBy=multi-user.target

$ cat /etc/systemd/system/my-a.service  # same with my-b.service
[Unit]
Description=My Service A
PartOf=my.target

[Service]
Type=oneshot
ExecStart=echo "Starting service A"
ExecStop=echo "Stopping service A"
RemainAfterExit=yes
$ sudo systemctl start my.target
$ sudo systemctl status my-a.service
● my-a.service - My Service A
     Loaded: loaded (/etc/systemd/system/my-a.service; static)
     Active: active (exited) since Wed 2024-07-03 19:02:12 CEST; 4s ago
 Invocation: c0d325035c0f48169e1cdb84a9752665
    Process: 1262 ExecStart=echo Starting service A (code=exited, status=0/SUCCESS)
   Main PID: 1262 (code=exited, status=0/SUCCESS)

Jul 03 19:02:12 vmserver systemd[1]: Starting My Service A...
Jul 03 19:02:12 vmserver echo[1262]: Starting service A
Jul 03 19:02:12 vmserver systemd[1]: Finished My Service A.

$ sudo systemctl stop my.target
$ sudo systemctl status my-a.service
○ my-a.service - My Service A
     Loaded: loaded (/etc/systemd/system/my-a.service; static)
     Active: inactive (dead) since Wed 2024-07-03 19:02:38 CEST; 2s ago
   Duration: 26.300s
 Invocation: c0d325035c0f48169e1cdb84a9752665
    Process: 1262 ExecStart=echo Starting service A (code=exited, status=0/SUCCESS)
    Process: 1277 ExecStop=echo Stopping service A (code=exited, status=0/SUCCESS)
   Main PID: 1262 (code=exited, status=0/SUCCESS)

Jul 03 19:02:12 vmserver systemd[1]: Starting My Service A...
Jul 03 19:02:12 vmserver echo[1262]: Starting service A
Jul 03 19:02:12 vmserver systemd[1]: Finished My Service A.
Jul 03 19:02:38 vmserver systemd[1]: Stopping My Service A...
Jul 03 19:02:38 vmserver echo[1277]: Stopping service A
Jul 03 19:02:38 vmserver systemd[1]: my-a.service: Deactivated successfully.
Jul 03 19:02:38 vmserver systemd[1]: Stopped My Service A.

I have tested 4 different scenarios for system partitions and volume manager.
C.f. “Root-on-RAID” here.

All 6 DC S4600 drives are used in each of the scenarios. It is important to me that recovering the array is as simple as possible; working out-of-the-box and being plug-and-play are mandatory.

Scenarios

  1. LVM-on-MDRAID1 – SSDs split into RAID1 pairs (sd[ab], sd[gh], sd[ij]), each pair made into /dev/md/pv[012] → pvcreate → vgcreate. LVs created either with -i2 on pv[12] or with -i3 (using 4 or 6 drives, respectively); see the rough command sketch after this list.
    This is +/- the original setup, but with 6 drives instead of 2.
    PARAMS=-i3
  2. LVM-RAID10 - All SSDs used to create the LVM volume group; individual volumes created as RAID10 with --type raid10 -m1 -i2 /dev/sd[ghij]2 or --type raid10 -m1 -i3.
    PARAMS=--type raid10 -m1 -i3
  3. LVM-RAID10 with integrity - same as LVM-RAID10, but with option of --raidintegrity on a per-LV basis to detect corruption.
    PARAMS=--type raid10 -m1 -i3 --raidintegrity y
  4. BTRFS - All SSDs used to create a single filesystem, setup with -d raid10 -m raid10 -csum xxhash.
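For reference, the rough assembly for scenarios 1 and 2 (a sketch; partition names and the LV size are illustrative):

    # Scenario 1: LVM-on-MDRAID1
    mdadm --create /dev/md/pv0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mdadm --create /dev/md/pv1 --level=1 --raid-devices=2 /dev/sdg2 /dev/sdh2
    mdadm --create /dev/md/pv2 --level=1 --raid-devices=2 /dev/sdi2 /dev/sdj2
    pvcreate /dev/md/pv0 /dev/md/pv1 /dev/md/pv2
    vgcreate vgroot /dev/md/pv0 /dev/md/pv1 /dev/md/pv2
    lvcreate -n benchmark -L 32G -i3 vgroot        # striped across all three mirrors

    # Scenario 2: LVM-RAID10 (scenario 3 adds --raidintegrity y)
    pvcreate /dev/sd[abghij]2
    vgcreate vgroot /dev/sd[abghij]2
    lvcreate -n benchmark -L 32G --type raid10 -m1 -i3 vgroot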

Benchmarks

I did one real-life benchmark and three synthetics based on fio.
/var/lib/aurbuild resides on a dedicated LV in the LVM scenarios and on a subvolume in the BTRFS scenario.
In fio tests, $BENCHMARK is either:

  • sudo lvcreate -n benchmark -L 32G vgroot $PARAMS for LVM cases;
  • truncate -s 32GiB ~/benchmark for BTRFS case.

  1. Linux build - building Arch package linux-lts using Arch Package Build mechanism. Presented results are real times from:
    • mkdir -p ~/build; cd ~/build
    • pkgctl repo clone --protocol=https linux-lts; cd linux-lts
    • time makechrootpkg -r /var/lib/aurbuild/x86_64
  2. fio randrw – sudo fio --rw=randrw --bs=8M --threads=32 --iodepth=32 --ioengine=libaio --size=32G --name=$BENCHMARK
  3. fio rw – sudo fio --rw=rw --bs=8M --threads=32 --iodepth=32 --ioengine=libaio --size=32G --name=$BENCHMARK
  4. fio write – sudo fio --rw=write --bs=8M --threads=4 --iodepth=32 --ioengine=libaio --size=32G --name=$BENCHMARK

Results

Bench type | LVM-on-MDRAID1 | LVM-RAID10 | LVM-RAID10 + RAID integrity | BTRFS RAID10
Linux build | 39m48.154s | 40m0.934s | 40m0.281s | 40m13.435s
fio randrw - READ | 330 MiB/s | 365 MiB/s | 170 MiB/s | 367 MiB/s
fio randrw - WRITE | 349 MiB/s | 386 MiB/s | 179 MiB/s | 388 MiB/s
fio rw - READ | 1575 MiB/s | 1656 MiB/s | 1636 MiB/s | 1487 MiB/s
fio rw - WRITE | 1664 MiB/s | 1749 MiB/s | 1728 MiB/s | 1570 MiB/s
fio write | 1603 MiB/s | 921 MiB/s | 326 MiB/s | 1407 MiB/s

The differences in the “real-life” benchmark are… negligible, to say the least. The test is a bit too long for me to do a proper run-to-run variance analysis and the like, so I’m sticking with “roughly the same” here.

The fio tests tell a different story, and I’ll need to revisit them, because the difference in fio write between scenarios (2 and 3) and (1 and 4) is tremendous. I expected a drop in write performance with RAID integrity, but there appears to be no difference in linear RW, yet over a 1/2 drop in random RW and a 5x drop in linear write compared to mdraid.

Too bad it takes a while to reconfigure the test environment between the scenarios (only the change between 2 and 3 is painless).

The BTRFS performance seems nice (acceptable), but I don’t like that it doesn’t have an option for zvol-like volumes that expose block devices. I used to use thin LVs for VM images (with snapshots); with BTRFS that would have to be replaced with either qcow2 or raw image files. qcow2 seems like a waste on a CoW filesystem. Raw image files are the closest, but they incur filesystem penalties, require additional steps to mount in the host system, and cannot be snapshotted individually (AFAICT they need to be put in a separate subvolume → directory).
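For completeness, the raw-image workaround would look roughly like this (paths are made up; note that chattr +C has to be set before the image file is created):

    # one subvolume per VM so the image can be snapshotted on its own
    btrfs subvolume create /data/vm1
    chattr +C /data/vm1                       # new files inside skip CoW, avoiding fragmentation
    truncate -s 64G /data/vm1/disk.img
    losetup --find --show /data/vm1/disk.img  # expose it as a block device
    btrfs subvolume snapshot /data/vm1 /data/vm1-snap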


I redid fio tests and added synthetic dd tests as well:

  • dd if=/dev/urandom of=/tmp/pattern bs=32MiB count=1024
  • dd if=/tmp/pattern of=$BENCHMARK status=progress bs=8MiB count=4096
  • dd if=/tmp/pattern of=$BENCHMARK status=progress bs=4KiB count=$((8*1024*1024))
  • dd if=$BENCHMARK status=progress bs=8MiB count=4096 | xxhsum
  • dd if=$BENCHMARK status=progress bs=4KiB count=$((8*1024*1024)) | xxhsum

This time on an archiso, so that I don’t need to reinstall the system after each change xD

Just to preface the updated results: the theoretical linear R/W speed for each drive is around 400-500 MiB/s. That means a 6-drive RAID10 should top out around 6x 400-500 MiB/s for READs (2400-3000 MiB/s) and half that for WRITEs (1200-1500 MiB/s).
Meanwhile…

(fio results are min/median/max)
UPDATE: added ZFS numbers, see below.
UPDATE2: added MDRAID10 numbers, see next post

Bench type | LVM-on-MDRAID1 | LVM-RAID10 | LVM-RAID10 + RAID integrity | BTRFS RAID10 | ZFS striped mirrors | MDRAID10-6 | MDRAID10-4 | LVM-on-MDRAID10
Linux build | 39m48.154s | 40m0.934s | 40m0.281s | 40m13.435s | - | - | - | -
fio randrw - READ (MiB/s) | 278 / 282 / 330 | 263 / 269 / 365 | 153 / 154 / 170 | 318 / 367 / 374 | 291 / 311 / 347 | 357 / 369 / 369 | 230 / 231 / 234 | 303 / 303 / 317
fio randrw - WRITE (MiB/s) | 293 / 298 / 349 | 278 / 284 / 386 | 162 / 163 / 179 | 336 / 388 / 395 | 307 / 329 / 367 | 377 / 389 / 390 | 243 / 244 / 247 | 320 / 320 / 335
fio rw - READ (MiB/s) | 1549 / 1575 / 1595 | 1574 / 1596 / 1656 | 1610 / 1623 / 1636 | 583 / 1298 / 1487 | 1268 / 1285 / 1306 | 1557 / 1576 / 1633 | 1419 / 1491 / 1542 | 1409 / 1568 / 1602
fio rw - WRITE (MiB/s) | 1637 / 1664 / 1685 | 1663 / 1685 / 1749 | 1701 / 1715 / 1728 | 616 / 1371 / 1570 | 1339 / 1357 / 1380 | 1645 / 1665 / 1725 | 1499 / 1574 / 1629 | 1488 / 1656 / 1692
fio write (MiB/s) | 1585 / 1603 / 1629 | 910 / 912 / 921 | 326 / 351 / 353 | 731 / 1407 / 1436 | 1292 / 1293 / 1297 | 885 / 947 / 1008 | 894 / 927 / 929 | 824 / 845 / 858
dd bs=8MiB WRITE | 495 MB/s | 372 MB/s | 135 MB/s | 652 MB/s | 834 MB/s | 435 MB/s | 529 MB/s | 482 MB/s
dd bs=4KiB WRITE | 489 MB/s | 355 MB/s | 133 MB/s | 617 MB/s | 752 MB/s | 481 MB/s | 513 MB/s | 481 MB/s
dd bs=8MiB READ | 450 MB/s | 492 MB/s | 410 MB/s | 2.2 GB/s | 648 MB/s | 660 MB/s | 396 MB/s | 606 MB/s
dd bs=4KiB READ | 891 MB/s | 541 MB/s | 451 MB/s | 1.7 GB/s | 648 MB/s | 636 MB/s | 519 MB/s | 662 MB/s

Takeaways:

  • MDRAID-1 single-thread performance is borked. It doesn’t stripe reads across the mirror, even if the underlying drives are SSDs.
  • LVM2 performance is weird. It can be on par with or better than MDRAID, but then it drops suddenly in the simplest cases.
  • BTRFS perf is all over the place
  • Values that are too high might be indicative of caching artifacts

Will test ZFS next, but I’m not too optimistic about it.
Edit/Update: I’ve quickly added a ZFS striped-mirrors test without configuring the full root-on-ZFS stuff. $BENCHMARK is a ZVOL with volblocksize=16K. The fio numbers are +/- in line with what I expected from the previous tests. The dd results are somewhat weird: way better WRITE performance than anything else, but I’m sure it’s lying about READ performance (everything was definitely in ARC; I saw no LEDs blinking during the READ dd tests), and even with that handicap the performance is… mediocre.
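Roughly what the quick test setup looks like (a sketch; pool name and partition numbers are illustrative):

    zpool create testpool \
        mirror /dev/sda2 /dev/sdb2 \
        mirror /dev/sdg2 /dev/sdh2 \
        mirror /dev/sdi2 /dev/sdj2
    # $BENCHMARK: a 32G zvol with 16K volblocksize
    zfs create -V 32G -b 16K testpool/benchmark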

And for reference, the consumer-side limits for the READ tests (reading the pattern file through the same dd | xxhsum pipe) are:

root@archiso ~ # dd if=/tmp/pattern bs=4KiB count=$((8*1024*1024)) | xxhsum
8388608+0 records in
8388608+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 20.2804 s, 1.7 GB/s
c6b0ee636032b6d6  stdin
dd if=/tmp/pattern bs=4KiB count=$((8*1024*1024))  5.21s user 15.05s system 99% cpu 20.287 total
xxhsum  4.20s user 9.94s system 69% (nice) cpu 20.286 total

root@archiso ~ # dd if=/tmp/pattern bs=8MiB count=$((4*1024)) | xxhsum
4096+0 records in
4096+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 17.2292 s, 2.0 GB/s
c6b0ee636032b6d6  stdin
dd if=/tmp/pattern bs=8MiB count=$((4*1024))  0.03s user 11.82s system 68% cpu 17.236 total
xxhsum  2.33s user 9.92s system 71% cpu 17.235 total

It came to me in a dream that I should probably check whether I can get more performance out of mdraid by using it directly to RAID10 the 6 drives. At first I didn’t like the idea because of the size disparity between the 240 and 480 GB drives, but then I realized I can split the bigger drives into more partitions, giving two separate RAID10 arrays, 6 and 4 drives wide.

sda [4GB ESP] [~236GB raid 10 pv0]
sdb [4GB ESP] [~236GB raid 10 pv0]
sdg [4GB ESP] [~236GB raid 10 pv0] [~240 raid 10 pv1]
sdh [4GB ESP] [~236GB raid 10 pv0] [~240 raid 10 pv1]
sdi [4GB ESP] [~236GB raid 10 pv0] [~240 raid 10 pv1]
sdj [4GB ESP] [~236GB raid 10 pv0] [~240 raid 10 pv1]

Then I can lay out LVM over the two arrays, but specify which LVs should use pv0, which pv1, and which can be mixed once one or the other gets depleted.
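In commands, roughly (a sketch; partition numbers and the LV are illustrative):

    # 6-wide RAID10 over the small partitions, 4-wide over the extra space on the bigger drives
    mdadm --create /dev/md/pv0 --level=10 --raid-devices=6 /dev/sd[abghij]2
    mdadm --create /dev/md/pv1 --level=10 --raid-devices=4 /dev/sd[ghij]3
    pvcreate /dev/md/pv0 /dev/md/pv1
    vgcreate vgroot /dev/md/pv0 /dev/md/pv1
    # pin an LV to a specific PV by listing the PV explicitly at the end
    lvcreate -n vmstore -L 100G vgroot /dev/md/pv1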

Today I ran the same benchmarks as before (except Linux build) and updated the table in the previous post. I also ran them on individual mdraid arrays (pv0 and pv1 in the schematic above).

Overall this solution appears to be as good as or better than raw LVM RAID10, although it sacrifices the possibility of running the dm-integrity layer via LVM*.

*) I know it’s possible to go raw partition → dm-integrity → md-raid → lvm, but:

  1. It’s not possible to have per-LV integrity protection; for some LVs the write penalty of dm-integrity is not justifiable, while for others it’s acceptable and welcome.
  2. It’s not plug-and-play. Starting dm-integrity AFAICT requires manually running integritysetup for each device (rough sketch below), c.f. GitHub - tomato42/mkinitcpio-dm-integrity: dm-integrity module for the mkinitcpio initramfs used in Archlinux
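Rough sketch (device names illustrative) - something along these lines for every member, before the array can even assemble:

    integritysetup format /dev/sda2
    integritysetup open /dev/sda2 int-sda2
    # ...repeat for each member, then build the array on top of the dm-integrity devices:
    mdadm --create /dev/md/pv0 --level=10 --raid-devices=6 /dev/mapper/int-*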

I was wondering why you didn’t try this, lol.