Linux at the bleeding edge, ZFS, and CUDA

I can wholeheartedly recommend Void Linux, which is a rolling distro, but with much more emphasis on stability (they even call it a “stable rolling release”). When it comes to ZFS, root on ZFS is officially supported and the whole procedure is well documented, and on top of that ZFSBootMenu treats Void as a first-class distribution.

Void uses runit as its init system, so if you don’t like systemd, that’s one thing you won’t have to worry about.
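
For anyone unfamiliar with runit, enabling and checking a service on Void is just a symlink and an sv call. A minimal example, using sshd:

ln -s /etc/sv/sshd /var/service/
sv status sshd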

Because Void Linux was created by a former NetBSD developer, some feel that it’s a hybrid between Linux and BSD, which to some is an advantage (I love the BSDs, and Slackware was my first Linux, a distribution that was influenced by the BSDs as well).

One thing that should perhaps be mentioned is that Void is generally aimed at more advanced users: it gives you more control over the system, but also requires you to know a little bit more.

There unfortunately aren’t that many packages available for Void Linux, but there are some workarounds. For one, there’s xbps-src, which builds packages from source (somewhat similar to the AUR); then there’s xdeb, which converts Debian packages to the native Void package format; some people supplement Void with nixpkgs; and finally you can build what you need directly from source.
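
To give a flavour of the xbps-src route, building and installing a package from the official void-packages tree looks roughly like this (substitute the package name you actually need):

git clone https://github.com/void-linux/void-packages.git
cd void-packages
./xbps-src binary-bootstrap
./xbps-src pkg <pkgname>
sudo xbps-install --repository=hostdir/binpkgs <pkgname>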

I would also like to mention that it really impressed me when I updated a Void Linux VM that I hadn’t used for two years, and after the update everything worked perfectly. Honestly, I was expecting breakage, and the fact that I could bring it up to date so easily was surprising.

Thanks to everyone for the replies.

I hear the points about supposedly stable distros, but the thing is, at least the ones I really spent time with (Fedora and Pop OS 20.04 LTS) do push out changes in between the 6-month releases. Some of them are important, for security and so on, but some of them aren’t, and it was surprising how often I’d lose ZFS or the NVIDIA driver to a dnf or apt upgrade.

Noted. Good points.

Also good points. I’ll think on this a bit more.

It is, with a few "but"s. First of all, you need to be running FreeBSD CURRENT. I haven’t checked 13.1, but the required functionality hadn’t made it into bhyve (the hypervisor) by 13.0. When 14 rolls around, it’ll be in the “stable” release line for sure. Second caveat is that I don’t think it fully resets the card properly, and you need to power cycle the host after shutting down a VM if you want to allocate a GPU to a different VM.
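
For reference, the basic bhyve PCI passthrough plumbing is fairly short; roughly like this (the PCI address 2/0/0 is just an example, substitute your GPU’s bus/slot/function):

Reserve the device for passthrough in /boot/loader.conf:
pptdevs="2/0/0"
vmm_load="YES"

Then hand it to the guest with an extra bhyve slot argument, e.g.:
-s 6,passthru,2/0/0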

There’s definitely some proprietary kernel stuff, but I don’t know the details. Yes it was a pain. Maybe you guys have a point, that virtualization might be a good route to go anyways…

I looked into this and it seems very interesting. I do very much like the sound of it.

Aren’t there cross-distro package repositories that might work as well, like homebrew?

Yes there are! Of course you can use homebrew! But I happened to mention nixpkgs for two reasons. First, nixpkgs is somewhat officially supported (the instructions from 2014 are out of date though; they were written when Void used OpenRC), people use it and it works. Secondly, I happen to agree with Chris here: I think that nixpkgs is simply better than homebrew (but, of course, you can use either one).
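
If anyone wants to try the nixpkgs route without NixOS, the upstream single-user installer works on distros without systemd (the package here is just an example):

sh <(curl -L https://nixos.org/nix/install) --no-daemon
nix-env -iA nixpkgs.htop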

Thanks for the info, this is interesting! When I checked it some years ago it was very far from anything usable. Proper reset support needs to be in place (at least for cards that aren’t too buggy) before it can be used in production, though.

I’ve noticed too that most distros nowadays seem to allow quite a lot of updates between releases, not only security fixes. In the past, software was left alone more between releases. But there is an additional variable when proprietary software is involved, which could affect you: which distros the proprietary stuff is tested with.

My experience is that most companies will officially support Ubuntu LTS, RHEL, perhaps something more, and call it a day. Even a slightly downstream distro such as Pop won’t get the same support in practice, as nVidia won’t care to match their release cycle. Meanwhile, it’s impossible for distro devs to work in support from their end for code they can’t review. I haven’t checked the official support list for CUDA, but I would guess this logic applies there. The bottom line is that I would only trust the most mainstream paths when it comes to most proprietary kernel blobs. (The regular consumer-focused video drivers from nVidia and AMD are less problematic though, as their use is so widespread that they receive a lot of testing in practice.)

That leaves me in an unfortunate position, as I rather dislike Ubuntu… RHEL might be better, I don’t know.

But fair enough. It’s sounding like a workflow centered around VMs and GPU passthrough might be ideal… which sort of makes me wonder if Windows and WSL2 might be a good choice… WSL with a supported distro for CUDA, maybe a Hyper-V FreeBSD VM for ZFS, and Windows itself as the UI.

Thanks again, I appreciate everyone’s input.

Actually, as I see it, the logic I laid out above applies to proprietary drivers, but not to ZFS. ZoL is quite mature nowadays, and it doesn’t have the problem I mentioned with secret proprietary code, only a slightly incompatible license. If I were in your position, I would definitely dare to run Linux with ZFS on the host with something other than Ubuntu/RHEL, and only virtualize the CUDA workflow (with an officially supported distro).

Also, if you run CUDA in a VM, you could NFS-mount everything but the virtual OS drive from within the VM. Then if an update breaks the VM, you can always roll back its OS drive to an earlier snapshot without risking the loss of any completed or ongoing work (I’m assuming the VM’s OS drive is backed by ZFS here). This makes the choice of distro for the guest less of an issue too!
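
As a concrete sketch of that flow (dataset names and addresses are made up; adjust to your pool layout):

Snapshot the VM’s OS dataset before letting the guest update itself:
zfs snapshot tank/vms/cuda-vm@pre-update

If the update breaks the guest, shut it down and roll back:
zfs rollback tank/vms/cuda-vm@pre-update

Inside the guest, keep the actual work on the NFS mount rather than the virtual OS disk:
mount -t nfs 192.168.1.10:/tank/work /mnt/work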

What I describe is very close to my setup, actually. I recently left (X)Ubuntu for Pop 22.04 on my host system, as I got too annoyed with snap but still wanted something Debian-based, more bleeding edge than Debian yet still LTS. I run ZFS for everything but the root disk, and virtualize Windows and some other Linuxes. I haven’t used CUDA recently, but the need may emerge in the near future; in that case I would put it in a VM.

I can’t say I’ve used this setup long enough to verify that it “typically” won’t break, but I don’t really worry about ZFS on Linux at this point the way I do about proprietary kernel blobs. ZoL is getting more and more established in the general Linux ecosystem every year.

Btw, I tried this in a production system (my home work/play/everything-station) around 2011/2012. It worked quite well; I passed through an entire LSI2008 SAS controller to host the disks, but I never felt comfortable with the complexity it entailed. Also, the FreeBSD guest experienced some kind of reset bug that I could not solve, so whenever I rebooted FreeBSD I had to reboot the host. Not that it was much of an issue, as FreeBSD needed far fewer reboots than the Linux/XEN host, but still.

The reason for this setup was that it was before VFIO was a thing, so I had to use XEN for GPU passthrough. XEN was incompatible with ZFS on the host, and ZoL was also quite embryonic at the time. When VFIO matured a couple of years later, I moved to Xubuntu with ZoL and VFIO, and never looked back.

I had no notable breakage as far as I remember, with Xubuntu LTS, ZFS on the host, and VFIO, across three different hardware platforms between 2013 and early 2022 (when I switched to Pop OS). No proprietary blobs in the host’s kernel though 🙂

No matter what you choose there will always be some drawbacks. No solution is perfect, and going with Windows and WSL2 means trading one set of problems for another.

So in the end I think that you should choose the host operating system that you will feel comfortable using.

Going down the VM + host-on-ZFS route will not only grant you the flexibility and resilience mentioned by @cowphrase, but will also allow you to put your VMs on a separate ZFS pool/filesystem and then use raw images instead of qcow2. Qcow2 supports snapshots (as well as compression), but raw is more performant, and because you’ll be using ZFS anyway, you can let ZFS handle the snapshots and the compression instead.
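
A minimal sketch of that layout, assuming a pool called tank and QEMU as the hypervisor (names and sizes are illustrative):

Dedicated dataset for VM images, with compression on:
zfs create -o compression=lz4 tank/vms

Raw image instead of qcow2:
qemu-img create -f raw /tank/vms/cuda-vm.img 80G

Snapshots handled at the ZFS level:
zfs snapshot tank/vms@clean-install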

Thanks very much to @cowphrase @oegat and @chain

After much consideration, I ended up installing Rocky Linux 9. Ultimately, where I landed was that I’d have the least headache with an officially CUDA-supported distro. I was able to get ZFS and the NVIDIA drivers installed fairly easily. Time will tell whether they constantly break.
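
For anyone following along, the rough shape of that install on a RHEL 9 clone, per the OpenZFS and NVIDIA repo instructions (not necessarily the exact commands used here, and the zfs-release package version drifts over time, so double-check the current docs):

dnf install -y epel-release
dnf install -y https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
dnf install -y kernel-devel zfs
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf module install -y nvidia-driver:latest-dkms
dnf install -y cuda-toolkit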

Participating in this discussion actually got me thinking about my own setup, and gave me the kick necessary to start making it a reality. I decided to document the setup process for my workstation on the community blogs section of the forum. Maybe you’ll find that interesting.

Either way I’m glad you’ve found your solution, and hopefully it’ll work great for you!

@hsnyder Just checking in 1 year later. How are things going for you? Did you stay on Rocky Linux 9? Are you using the DKMS or kABI-tracking method for ZFS? How have update cycles on Rocky been for you?

Hey there. No, I didn’t stay on Rocky. The system in question was being used as a workstation, and wanting more recent versions of GUI programs, I eventually tried out Arch. That worked for a while (surprisingly), even with ZFS (via archzfs). But eventually I had a non-ZFS-related breakage of pacman and I couldn’t figure out what was wrong. I’m now on Void doing ZFS on root, which is decidedly my preference. I’m cautiously optimistic, but we’ll see…

I’ve been using ZFS with DKMS on CentOS and Alma for many years. I had some issues in the very early days, but it’s been solid for me ever since. So I can highly recommend it.

Also I’d choose DKMS over kABI tracking.

Thanks for the feedback, @hsnyder. Sounds like you found a combo that is working for you!

Thank you @cowphrase! Glad to hear DKMS on RHEL clones has been solid for you! After a good amount of forum and GitHub parsing I went with DKMS as well. It looked like the kABI-tracking method did not always account for minor-release kernel updates. And I figured there was a solid reason 45drives went with DKMS in their automated ZFS setup scripts on commercially supported rigs. It says something when you are willing to be on the other end of a support hotline for a given setup.

For what it’s worth, I’ll mention that another thing I’ve been looking into recently is alternatives to ZFS! ZFS is wonderful, but if there are specific features that you’re after, you may be able to get them with built-in tools.

And of course there’s also btrfs. At the time of this writing I haven’t personally used any of the aforementioned, but I’m considering giving them a try. I do love ZFS, but it does have its downsides (including Linux compatibility), and ultimately what I’m really after is OS-to-disk data integrity checking, and perhaps to some extent snapshots. I’m open to trying ZFS alternatives.

I’d still use ZFS on a storage server (granted, I’d be using FreeBSD unless I had a compelling reason not to), but on a development workstation that I want to constantly update without thinking too hard about it, maybe an alternative is worth considering.

IMO ZFS is basically native at this point. I’ve seen far more problematic things on Linux (NVIDIA drivers, anyone?). I’ve even got DKMS signing set up on one system so that I can use Secure Boot with ZFS. If you run Arch or Gentoo, there are even kernels that include ZFS directly.
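
In case it helps anyone, the manual version of that signing setup looks roughly like this (paths and module locations vary by distro, and newer DKMS versions can also be configured to sign automatically):

Generate a machine owner key and enroll it (confirmed in the MOK manager at next boot):
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 -subj "/CN=Local module signing/" -keyout /root/mok.priv -outform DER -out /root/mok.der
mokutil --import /root/mok.der

Sign the built ZFS module with the kernel’s sign-file helper:
/usr/src/kernels/$(uname -r)/scripts/sign-file sha256 /root/mok.priv /root/mok.der /lib/modules/$(uname -r)/extra/zfs.ko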

Maybe BTRFS RAID1 could work for you? “Two copies” would be a better name, since it stores two copies of your data across N disks. You lose throughput compared to RAIDZ, though, and you lose IOPS compared to RAID10; BTRFS RAID1 isn’t something you choose for performance. But if you’re using NVMe drives, who cares.
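
For reference, setting up that two-copy arrangement on a pair of NVMe drives is about this short (device names illustrative):

mkfs.btrfs -m raid1 -d raid1 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt/data

Periodic scrubs are what actually verify the checksums against both copies:
btrfs scrub start /mnt/data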

I’ve been looking at these non-ZFS alternatives too. If performance needs are not super high, or if it’s an all-flash rig, LVM integrity RAID is not overly complex to set up.

If it’s not all flash, I’ve found LVM integrity RAID (and dm-integrity) has a big write performance penalty.

  • In 7k SATA HDD testing (RAID1 mirror of 2 drives) I would see a 90% decrease in write performance but only about a 15% decrease in read performance.
  • In NVMe testing (RAID1 mirror of 2 drives) I would see a 73% decrease in write performance and about an 18% decrease in read performance.

But with blazing-fast NVMe those performance hits are tolerable, and not all that different from what you might see on ZFS.

On HDD that turns a 200 MB/s LVM or MDADM mirror write speed into around 20 MB/s which is not so great.
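
For context (and not necessarily the methodology used above), sequential-throughput numbers like these typically come from something along the lines of:

fio --name=seqwrite --filename=/mnt/test/fio.bin --rw=write --bs=1M --size=10G --direct=1 --ioengine=libaio

run against each configuration in turn.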

I tried experimenting with layered caching, which greatly complicated the setup, and found out why so many people have just stayed with the relative simplicity of ZFS.

You can layer read caching via Stratis with very high performance on the read side, and the setup is about as simple as a comparable ZFS pool and filesystems. But Stratis doesn’t offer write caching yet, at least in my testing. When it does, that’s what I would look at.
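
For anyone curious, the Stratis read-cache setup really is short (pool, filesystem, and device names are made up):

stratis pool create mypool /dev/sda /dev/sdb
stratis pool init-cache mypool /dev/nvme0n1
stratis filesystem create mypool data
mount /dev/stratis/mypool/data /mnt/data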

In the meantime, to get write caching you need to jump through some significant hoops, but it is possible with a custom recompile of LVM2 (bypassing some safety controls).

More info here from some really helpful devs on the LVM GitHub:

I’m worried the custom patch / recompile might need to be redone after subsequent OS updates. But this functionality will eventually come to native LVM as more users test successfully, and no doubt to Stratis as well on the write caching side.

On a side note, if anyone is debating between dm-table-based caching and LVM caching, there is a noticeable performance boost on the LVM side in my testing, and an even more significant boost with Stratis-based caching. I’m not sure why, as they all leverage the same underlying tools. But food for thought as you build and test.
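
For comparison, the LVM-level way of attaching an NVMe cache to an existing volume goes roughly like this (volume and device names are made up; writethrough mode accelerates reads only):

vgextend data_vg /dev/nvme0n1
lvcreate --type cache-pool -L 100G -n cache_pool data_vg /dev/nvme0n1
lvconvert --type cache --cachepool data_vg/cache_pool --cachemode writethrough data_vg/data_lv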

Thanks for sharing those experiences. One thing I do want to remark on: why is the write performance impact 90%? That sounds way higher than it should be; shouldn’t it be more like 50%? Perhaps there’s some block size tuning that needs to happen between the various layers (e.g. RAID, integrity, LVM, etc.)?

EDIT: feel free to move the discussion of dm-integrity slowness over to here: Experiments with dm-integrity and T10-PI on Linux if you want to - that thread is more focused on dm-integrity and t10-pi

Thanks for that new topic link, @hsnyder. Yes, I will check that out!

Yeah, I agree the 90% write performance drop is shocking. The drives are awesome without integrity RAID and they perform fairly well on ZFS. I think what is hurting the LVM / dm-integrity approach is the “emulation of a block device” described here:
https://docs.kernel.org/admin-guide/device-mapper/dm-integrity.html

Emulation and high performance aren’t usually close friends. However, as Stratis matures I think we’re going to get new and better options that either write-cache around the issue or begin to address the integrity RAID performance challenges at the device-mapper level.


Here’s a rundown of the workflow I used to get to that performance-testing point, aligned on a 4K sector/block size to match the Seagate EXOS drives (ST10000NM0146).

View supported Block sizes:

./SeaChest_Lite_x86_64-alpine-linux-musl_static -d /dev/sg4 --showSupportedFormats

If 4096 is listed, proceed with erasing the drive and changing to 4K:

./SeaChest_Lite_x86_64-alpine-linux-musl_static -d /dev/sg4 --seagateQuickFormat --setSectorSize=4096 --confirm this-will-erase-data

Create backing-disk HDD partitions, using 100% of the space (the whole 10TB drive).

This will be partition 1: “set 1” below refers to p1 when assigning the “lvm” flag.

parted -s /dev/sda mklabel gpt
parted -s -a optimal /dev/sda mkpart "BackingDisk" 0% 100%
parted -s /dev/sda set 1 lvm on

parted -s /dev/sdb mklabel gpt
parted -s -a optimal /dev/sdb mkpart "BackingDisk" 0% 100%
parted -s /dev/sdb set 1 lvm on

Create a single-partition PV (partition 1) on each backing HDD:

pvcreate /dev/sda1 /dev/sdb1
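
The lvcreate below references a volume group named writecache_vg; presumably one has to be created at this point, with something like:

vgcreate writecache_vg /dev/sda1 /dev/sdb1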

Create a logical volume with 1 mirrored copy across the HDDs, using a 4K integrity block size:

lvcreate --type raid1 --mirrors 1 --raidintegrity y --raidintegrityblocksize 4096 -l 2365291 --name hybridmirror_lv writecache_vg /dev/sda1 /dev/sdb1 --nosync
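
To check the resulting layout afterwards (the integrity images show up as hidden sub-LVs) and put a matching 4K-block filesystem on top, something like:

lvs -a -o name,segtype,devices writecache_vg
mkfs.xfs -b size=4096 /dev/writecache_vg/hybridmirror_lv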
