On X870, a ZFS pool of 3x M.2 drives, each PCIe 5.0 x4 - doable?

Hi,

I’m speccing out a new system, likely a Ryzen 9 9950X on an X870/X870E motherboard. I don’t intend to install a GPU, so the PCIe 5.0 lanes will be put toward high performance storage.

I’m referring to this very helpful spreadsheet to try to make sense of my options:

Several of the more expensive X870/X870E boards show for M.2 (M): 3*5x4, 2*4x4 – which I translate to: three M.2 slots at PCIe 5.0 x4 and two M.2 slots at PCIe 4.0 x4, with all of the 5x4 slots wired to the CPU.

I want to maximize read & write performance in a ZFS pool. I’m thinking a raidz1 or mirror vdev utilizing the 3*5x4 (i.e. three M.2 SSDs at PCIe 5.0x4) would accomplish that, but I’m not sure.
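For concreteness, the two layouts I'm weighing would look something like this (device names are placeholders, and `tank` is just an example pool name):

```shell
# Option A: raidz1 across the three Gen5 drives (capacity of two, parity of one)
zpool create tank raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# Option B: a three-way mirror (capacity of one, best read scaling and redundancy)
zpool create tank mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
```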

A few questions:

  • Is there any obvious issue in the scenario I’ve described? E.g. is it fine to purpose the 5.0x4 lanes from the CPU for storage, given I won’t be installing a GPU?
  • Is there a more performant vdev configuration that incorporates additionally the two M.2 slots at PCI 4.0x4? Could a pool of five disks (three very fast M.2 SSD + two regular fast M.2 SSD) outperform a pool of three very fast M.2 SSD, or could the slower SSDs hinder overall performance? Would it matter that some of the SSDs use CPU lanes and others use motherboard lanes?
  • Is there a more appropriate architecture for my build (fast many-core CPU, fast storage, no GPU)?

Thanks.

1 Like

Mainstream AM5 CPUs have 28 PCIe lanes. As I understand it, none of these are used for integrated graphics, so you don’t lose any of them if you only use the integrated graphics.

Four of these lanes are reserved for the chipset link and not available to you.

That leaves 24 lanes direct to the CPU that you can use.

In typical configurations, 16 of those are used for the main GPU slot (though this can be dropped down to 8x if a second 16x slot is used, in which case that slot gets the other 8).

Then there are 8 left. 4 are usually used for the primary M.2 slot, and 4 for the USB4 (40Gbps) feature AMD now requires on higher-end boards.

Some boards, however, allow you to disable the 40Gbps USB and instead use those 4 CPU lanes for a second M.2 slot.

If you get the right x870 board, the right CPU and you don’t care about a GPU (or only want to use the integrated GPU) you should be able to use up to 6 4x NVMe drives direct to the CPU.
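The lane arithmetic above can be sanity-checked in a few lines of shell:

```shell
# Tallying the AM5 lane budget described above:
total=28
chipset_link=4                      # reserved for the chipset uplink
usable=$((total - chipset_link))    # 24 lanes direct from the CPU
peg=16                              # main GPU slot (can bifurcate to 4x4x4x4)
m2_primary=4                        # primary M.2 slot
usb4=4                              # USB4 controller, reclaimable on some boards
max_drives=$(( (peg + m2_primary + usb4) / 4 ))   # NVMe drives at x4 each
echo "usable=$usable max_drives=$max_drives"
```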

Drive 1 - Dedicated M.2 CPU slot on board
Drive 2 - Secondary optional M.2 slot on board gained by disabling 40Gbps USB
Drives 3-6 - 4-way riser in 16x slot that fits four 4x drives

(Alternatively, if you get a better deal on them, you could use two 8x two-way risers, one in each slot, instead of one 16x riser.)

All of these will run on CPU lanes, and be Gen5 capable, but there are some caveats.

1.) To use the 40Gbps USB’s 4 lanes for M.2 instead, your motherboard must support this capability. Only a minority of motherboards do.

2.) Your motherboard must support PCIe bifurcation on the 16x slot(s) (often called NVMe RAID mode in the PCIe settings). This splits the 16x slot up into four 4x slots for use with an adapter. If your motherboard does not support bifurcation, you need to use an adapter that has a PCIe switch (often referred to as a PLX switch) in it. These are more expensive, use more power, and add latency, making the storage slower; they are generally not considered a good idea other than as a last resort.

3.) In order to get Gen5 performance out of the riser, you must buy one that explicitly claims Gen5 speeds. The higher the generation, the trickier it is to maintain signal integrity over longer signal paths, so unless it specifically claims Gen5, chances are the adapter suffers too much signal degradation and won’t work at that speed. Sometimes these can be made to work by manually forcing the port to Gen4 or lower in the BIOS, but sometimes they just outright refuse to work.

I have never cross-shopped AM5 boards for exactly this feature set, so I don’t know which ones support both disabling the 40Gbps USB to gain an M.2 and bifurcation on the 16x slot.

@lemma put together this nice list of boards that support a second m.2 slot to the CPU in the last thread I discussed this topic in, here.

I’m sure at least a few of those support bifurcation, but you’ll need to do your research and confirm it or it won’t work.

As for risers, I have five Asus Hyper 4-way 16x to 4x m.2 risers in various systems. (Three in my server, one in my main workstation and one in my backup workstation). They have worked perfectly for me in both server and workstation use, with ZFS in all cases, so I do recommend them. Mine are the Gen4 versions though. Since you want Gen5 speeds, you are going to want the Gen5 versions.

I cannot speak to how well the Gen5 risers work, but if they perform as well as the Gen4 ones do, they should be great.

Once again, these risers DO require that your motherboard supports bifurcation, and has it enabled, otherwise you will only see the first drive. Sometimes in the bios this is called PCIe raid mode (or something like that) so you may need to poke around in order to find the right setting. It isn’t always apparent. And not all motherboards support it.
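If it helps, a quick way to confirm the riser’s drives all actually enumerated after enabling bifurcation (a sketch; run from any Linux environment, and the PCI address at the end is a placeholder for your drive’s):

```shell
# One line per NVMe controller that enumerated:
lspci -nn | grep -i 'non-volatile memory'

# One row per namespace (requires nvme-cli):
nvme list

# Negotiated link speed/width per drive; Gen5 x4 should show "32GT/s, Width x4":
lspci -vv -s 01:00.0 | grep LnkSta
```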

This is going to depend very much on your use case. Can you share any more details?

You say you want to maximize read and write performance. Are you more interested in sequential reads/writes for quickly copying large files back and forth, or in IOPS/latency/4K random performance that makes many databases and applications more responsive?

The general wisdom is that if you are gunning for performance in ZFS, nothing beats a mirror. And you can just add more of them, and ZFS will stripe the mirrors together and (in theory) benefit from them performance-wise.

In my experience this wisdom checks out perfectly with hard drives (SATA or SAS) and SATA SSDs, but with NVMe, something else becomes the bottleneck in a hurry. While I absolutely love ZFS, I haven’t found its performance to be terribly impressive on late-gen NVMe drives in any configuration where they are used for main pool data storage.

(Using them for “special” devices is something different altogether and can help a lot, though. And if you need sync writes, nothing beats using Optanes, which are also NVMe drives.)

I mean, sure, it’s way faster than any SATA SSD, but it just doesn’t scale as well as you add faster and faster NVMe drives, with each generation giving you less and less.

For instance, on my main workstation I have a storage pool consisting of four 1TB Samsung 990 drives, configured as two mirrors (so, the ZFS equivalent of RAID10).

I also added two mirrored 280GB Optane 900Ps as SLOG/LOG/ZIL drives in order to speed up sync writes.
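For reference, recreating that layout would look roughly like this (device paths and the pool name are placeholders):

```shell
# Two striped mirrors (ZFS's RAID10 equivalent) from the four 990s:
zpool create tank \
  mirror /dev/disk/by-id/nvme-990-a /dev/disk/by-id/nvme-990-b \
  mirror /dev/disk/by-id/nvme-990-c /dev/disk/by-id/nvme-990-d

# Mirrored Optane 900Ps as the SLOG, to accelerate sync writes:
zpool add tank log mirror /dev/disk/by-id/nvme-900p-a /dev/disk/by-id/nvme-900p-b
```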

So this is all Gen4, not Gen5.

The intent of this storage pool was to run VirtualBox VMs off of it by creating ZVOLs (virtual block devices on top of ZFS) and passing them through to each VM to use as a raw block device instead of a disk file.
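That setup is roughly this (dataset names, sizes, and the volblocksize are hypothetical; if memory serves, VirtualBox 6.0+ can wrap a raw device via `createmedium`):

```shell
# Create a zvol to back the VM:
zfs create -V 100G -o volblocksize=16k tank/vms/win10

# Wrap the raw zvol device in a VMDK descriptor VirtualBox can attach:
VBoxManage createmedium disk --filename win10.vmdk \
  --variant RawDisk --property RawDrive=/dev/zvol/tank/vms/win10
```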

They perform OK, but nowhere near what I would have expected.

I ran some disk benchmarks inside a Windows 10 VM after spending some time selecting the fastest virtual disk controller I could (which wound up being LSI SCSI emulation; there was an NVMe emulation driver, but it was very inconsistent).

These were the results:

------------------------------------------------------------------------------
CrystalDiskMark 8.0.6 x64 (C) 2007-2024 hiyohiyo
                                  Crystal Dew World: https://crystalmark.info/
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
  SEQ    1MiB (Q=  8, T= 1):  4727.213 MB/s [   4508.2 IOPS] <  1766.81 us>
  SEQ    1MiB (Q=  1, T= 1):  3643.352 MB/s [   3474.6 IOPS] <   287.46 us>
  RND    4KiB (Q= 32, T= 1):   232.559 MB/s [  56777.1 IOPS] <   545.78 us>
  RND    4KiB (Q=  1, T= 1):    78.210 MB/s [  19094.2 IOPS] <    52.15 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):  1965.707 MB/s [   1874.6 IOPS] <  4189.91 us>
  SEQ    1MiB (Q=  1, T= 1):  2040.791 MB/s [   1946.2 IOPS] <   513.17 us>
  RND    4KiB (Q= 32, T= 1):   227.947 MB/s [  55651.1 IOPS] <   562.85 us>
  RND    4KiB (Q=  1, T= 1):    64.503 MB/s [  15747.8 IOPS] <    63.27 us>

Profile: Default
   Test: 1 GiB (x5) [E: 0% (0/50GiB)]
   Mode: [Admin]
   Time: Measure 5 sec / Interval 5 sec 
   Date: 2025/05/10 21:33:46
     OS: Windows 10 Pro 22H2 [10.0 Build 19045] (x64)

I mean, it’s not bad, but in a pool of four drives (two striped mirrors) where each drive can read at 7,450 MB/s sequential, and which in theory should scale almost linearly, it is also less than 16% of the theoretical max…

I’m essentially getting a little bit more than single Gen3 drive performance out of four Gen4 drives in two striped mirror sets.

I knew not to get my hopes up too much, what with virtualization having some performance losses, a virtual block device on top of ZFS not being the most performant solution, and potential scaling issues due to CPU limitations both in the VM and on the host, etc., but this was still kind of disappointing to me.

This is on a Threadripper 3960x with 128GB DDR4-3200 ECC RAM, btw.

I still use it for the other benefits of ZFS, but I was definitely hoping to get a little bit better performance out of it.

My take when it comes to ZFS and performance is that it can be a very well performing file system if used on systems that are highly storage-bottlenecked. ZFS does a lot of CPU computation to optimize things, on the understanding that any given CPU core can do millions of computations in the time it takes a hard drive to respond. Thus, performance on hard drives is absolutely stellar (for hard drives). Performance on SATA SSDs is quite good. But the faster your storage devices get, the less true the foundational assumption of ZFS (that the CPU is several orders of magnitude faster than the storage) becomes, and ZFS starts to get bogged down in its own overhead.

I still love it and use it for its many features and high level of protection against corruption, but on NVMe drives it just isn’t going to be a high performance file system like it is on hard drives.

Oh this is going to be extremely dependent on your workload. Why don’t you tell us a little bit more about what you are trying to do?

NAS? Enormous video editing scratch disks? AI? Something else?

I may not be familiar with your exact workload, but at least it can give me some ideas.

4 Likes

I just recently bought an EPYC4000D4U board, AM5 with 24 CPU lanes flexibly available, all Gen5.

I chose it for 4x NVMe and dual 25Gbit Networking via x8 PCIe card, no chipset lanes required and dirt cheap compared to a lot of other stuff that requires all kinds of adapters, cards and compromises.

Haven’t built the server yet (board and half the parts waiting for rest to arrive), but this is great stuff for a pure and focused NVMe storage server on a (Core/Ryzen) budget.

2 Likes

Thanks so much for the in-depth response. I really appreciate the insight, especially the note about how ZFS’s performance benefit from mirroring etc. naturally drops off as the underlying storage gets faster, because there is less interstitial opportunity for ZFS-related computation and less benefit to access-optimized data placement. That makes sense.

The workstation will be used to juggle tens (30-50) of lightly utilized VMs under Xen, most VMs running a Linux variant. So a fast, many-core CPU is most helpful but I don’t expect the system will be stressed in general use. The wish for fast storage is mostly to ameliorate a specific sporadic/daily pain point: the IO-intensive cloning of a VM (10-100 GB).
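One thing I’ll also explore: if each VM ends up living on its own zvol or dataset, I gather that cloning pain point could mostly disappear, since ZFS clones are copy-on-write (dataset names here are hypothetical):

```shell
# Snapshot the source VM's zvol, then clone it; no bulk data copy happens up front:
zfs snapshot tank/vms/template@golden
zfs clone tank/vms/template@golden tank/vms/new-vm
```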

It’ll take me some time to absorb all you wrote, just wanted to express my thanks and say a little more about the goals of the new build.

At a glance, this does indeed look compelling for my use case. I’ll take a closer look. Thank you for the reference!

It’s not a gaming board, so availability is more restricted. And it is very new on top of that… I only found two retailers selling it in the EU two weeks ago. ASRock Rack boards just don’t ship in volumes like gaming boards.

And I use it as a server, so I’m limited by dual-channel memory and networking anyway, and in my case Ryzen can’t feed much more than 4x NVMe. So for me this was the sweet-spot board for what I can possibly do on a Ryzen while keeping board costs low.
I can probably upgrade to a single-port 100Gbit NIC and larger drives later on… but that is years ahead.

1 Like

Just to be clear, 2x NVMe via a PCIe card too? Otherwise, what are you doing to get the extra 2x NVMe?

There are 2x MCIO 8i ports for 2x NVMe each. You need the PCIe slot for networking; otherwise you are stuck with 1Gbit networking and might as well use HDDs instead of NVMe, because even HDDs are faster than 1Gbit :wink:

1 Like

Yeah. Since @Blurk’s looking for 3x 5.0x4, though, the right x8+x4+x4 PEG switching will work without needing risers or going to the few MSI boards with ASM4242 switching. Offhand I think that’s just the X870E Aorus Elite and Master.

That compares poorly with 5x 5.0x4 from a PEG riser in the B850 Steel Legend/Riptide, though, so I’d do that unless an x8 slot’s important for future-proofing (or the X870 Steel Legend/Riptide if USB4 down’s more attractive than losing the 4.0x4 CPU M.2).

Me neither. Something like half of the kit that claims PCIe 4.0 actually does 3.0, so I’d expect even more bogosity with 5.0.

ASRock Blazing Quad and Asus Hyper are the two probably legit options I know.

PEG x8/x4/x4 and x4/x4/x4/x4 have been an AGESA feature since somewhere in AM4, at least. Not sure to what extent Asus, Gigabyte, and MSI might cripple that out. Every ASRock board I’ve checked has them.

1 Like

Did you ever try a non-ZVOL configuration, with virtual disk files in a regular ZFS file system? I’ve seen comments here and there about ZVOLs underperforming.

I’ve seen little difference in my experience, and I settled on zvols eventually. VM => SAN => block storage as far as I’m concerned.

I think people saying that zvols underperform have just messed up their volblocksize and other dataset properties, which can make block storage really painful. But so can qcow2 in a ZFS dataset with shoddy properties.
And for some it’s just more familiar to work with files than with the more abstract concept of block storage. Whatever floats your boat.

Getting correct and optimal tuning, especially for recordsize and volblocksize, IS KEY when working with VMs or other initiators.

And grow/truncate/etc. are built-in features with ZFS, where with files you have to find other tools to do the same, if they are available at all. I always liked doing most stuff with the zfs and zpool commands.
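A sketch of the knobs in question (dataset names and values are examples, not recommendations):

```shell
# volblocksize is fixed at creation time; choose it to match the guest workload:
zfs create -V 50G -o volblocksize=16k tank/vms/disk0

# recordsize applies to file-backed images (qcow2/vmdk) in a regular dataset:
zfs set recordsize=64k tank/vms/images

# Growing a zvol later is a one-liner; the guest then sees a bigger disk:
zfs set volsize=80G tank/vms/disk0
```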

Funny you should ask.

I actually did some benchmarking of both configurations, and the ZVOL performed better.

Let me see if I can find the data, and I’ll post it here too.

  1. Asus calls this “RAID MODE”, which splits a single PCIe 16x slot into 4x/4x/4x/4x.
  2. Depends on your workload, but it might matter depending on bandwidth used.
  3. It depends on how you’re going to use the storage, OS etc.