Critique My ZFS Setup for a VM/Dev Workload Using 3.2TB PCIe 4.0 Optane

So I’m building a new machine for VM/dev workloads. I’ve already spent a couple of months reading the existing material on Optane and ZFS (mostly Optane and bcachefs/btrfs, before I realized ZFS is a lot more solid), and started buying parts as the plan came together.

What I have now is a 16-core Zen 4 system with I/O for 6 × SATA and 16 × PCIe 5.0 lanes intended for storage. I have a half dozen 118GB P1600X Optanes intended for use as boot devices and 3 × 380GB 905P Optanes (purchased simply because they were the highest-density Optanes I could get in an M.2 form factor). The 3.2TB P5800X Optane was delivered last week.

There’s a setup I’m eventually aiming for, but because the PCIe 5.0 ecosystem is still poorly supported right now, I’m considering an alternative that could work in the meantime:

  • 1 × 3.2TB P5800X connected to an M.2 slot via a passive M.2 adapter and active PCIe 4.0 OCuLink cable (just 25cm long)
  • 1 × 118GB P1600X installed directly into the primary M.2 slot
  • 6 × (some very large number)TB SATA SSDs for ZFS RAIDz2

The 118GB P1600X would be a boot/OS drive for Proxmox. The 3.2TB P5800X and SATA drives would host a ZFS pool for the VMs. From what I gather so far, the Optane ought to be allocated as follows:

  • 11GB SLOG (computed from 4 data drives × 550MB/s × 5 seconds; a quick sanity check of this follows after the list)
  • Remainder for L2ARC
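
A minimal sketch of that sizing and the split I have in mind; the pool name, device path, and partition numbers below are placeholders, not a final layout:

    # 4 data drives x 550 MB/s x 5 s (default zfs_txg_timeout) = 11,000 MB ~= 11 GB
    echo "$((4 * 550 * 5)) MB"

    # Placeholder layout: an 11 GB partition for the SLOG, the rest for L2ARC
    sgdisk -n 1:0:+11G -n 2:0:0 /dev/nvme1n1      # device path is hypothetical
    zpool add tank log   /dev/nvme1n1p1           # pool name "tank" is hypothetical
    zpool add tank cache /dev/nvme1n1p2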

Lack of redundancy aside, I would not create a metadata (special) device, since ZFS currently doesn’t let the L2ARC be configured to skip metadata (it would be wasteful to cache metadata that already lives on the same Optane media). So the metadata would live on the QLC NAND RAIDz2 vdev, and the SLOG and L2ARC devices on Optane would hide the latency.
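
To illustrate what I mean: secondarycache only accepts three values, and none of them is “data but not metadata” (the dataset name here is just an example):

    # Only all | none | metadata are accepted; there is no data-only option
    zfs set secondarycache=all tank/vms
    zfs set secondarycache=metadata tank/vms
    zfs set secondarycache=none tank/vms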

Then, when I get a second P5800X, both the SLOG and L2ARC would be striped across the two drives (the ZFS equivalent of RAID-0).
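
If I understand the zpool syntax right, that would just mean adding the second drive’s partitions as another log and cache device, since striping is the default when no mirror is specified (device names are placeholders):

    # Second P5800X, partitioned the same way (hypothetical device paths)
    zpool add tank log   /dev/nvme2n1p1    # a second non-mirrored log vdev is load-balanced with the first
    zpool add tank cache /dev/nvme2n1p2    # cache devices are always striped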

Couple of thoughts:

  • Congrats on the Optane P5800X. It’s by far the most expensive device in your setup. What exactly is your use case? Your 905P or even your P1600X Optanes might suffice (for a lot less money).
  • Check out these observations about connecting an Optane P5800X via M.2. Make sure to test your connection under load and check for errors.
  • Consider using your SSDs as a ZFS special device. Look at ZFS Metadata Special Device: Z and ZFS: Metadata Special Device - Real World Perf Demo for explanations and details (a rough sketch follows after this list).
  • I have been running an Optane SSD as a single SLOG and L2ARC device for a while, but using two without redundancy does introduce risks I would not be willing to take.
  • Do you have a plan for using your 16x PCIe Gen 5 lanes? Review your mobo’s capabilities and schematics; I assume those 16 Gen 5 lanes may be very hard to use. Look at which PCIe lanes are CPU-connected vs chipset-connected. I assume yours are connected directly to the CPU (because they’re Gen 5); if they were connected to the chipset, they would share bandwidth with the 6x SATA ports on their way to the CPU/memory. Does the mobo support bifurcation? If not, do you have a PLX card to split the x16 slot into usable PCIe lanes (e.g. 4x or 8x M.2, or a similar number of PCIe slots)?
  • Your 6x SATA drives are intended as the main storage medium and will determine the speed of your ZFS pool. In a RAIDz2 configuration that would range from ~10MB/s up to a theoretical max of ~800MB/s. SSDs used as SLOG, L2ARC, or special device can smooth over the performance drops of the main vdev by keeping it reading/writing mostly large, sequential chunks of data. With an Optane P5800X in the mix, I have the feeling this is not the performance level you’re looking for.
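
For reference, a rough sketch of what a special vdev could look like; pool and device names are just examples, and it should be at least a mirror because losing the special vdev loses the pool:

    # Example only: mirrored special vdev for metadata (hypothetical devices)
    zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1
    # Optionally also send small records to it (per-dataset; the size is just an example)
    zfs set special_small_blocks=16K tank/vms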

I appreciate the prompt feedback! I had already read many of the threads but seem to have missed some critical posts. Just to make sure I have a good foundation to work with, I went ahead and put the whole M.2-OCuLink-enclosure setup together for testing. This thread about PCIe bifurcation cast a lot of doubt on whether I could ever get it to work: A Neverending Story: PCIe 3.0/4.0/5.0 Bifurcation, Adapters, Switches, HBAs, Cables, NVMe Backplanes, Risers & Extensions - The Good, the Bad & the Ugly - Hardware / Motherboards - Level1Techs Forums.

This combination finally worked for me, with 0 errors while keeping the drive under load for 10 hours:

  • Asus ROG Crosshair X670E Extreme
  • Micro SATA Cables Passive PCIe Gen 4 M.2-to-OCuLink adapter on the Asus-provided Gen-Z.2 riser card (PCIe 5.0 side)
  • LINKUP 25cm active OCuLink cable (OCU-8611N-025)
  • ICY DOCK U.2-to-OCuLink enclosure (MB699VP-B V3)

This was after some trial and error with different M.2 slots and a 50cm version of the active OCuLink cable. The 50cm cable gave errors no matter which M.2 slot I used; 42 WHEA error events in 30 minutes was the best I could get. The M.2 slot at the far end of the motherboard gave the most errors: 27 events in 60 seconds.
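
For the record, the kind of soak test I mean looks roughly like this on the Linux side (device path, runtime, and queue depth are just placeholders, not my exact commands):

    # Keep the drive busy for ~10 hours (read-only job against the raw device)
    fio --name=soak --filename=/dev/nvme1n1 --rw=randread --bs=4k \
        --iodepth=32 --ioengine=libaio --direct=1 --time_based --runtime=36000
    # In another shell, watch the kernel log for PCIe link/AER complaints
    dmesg -wT | grep -iE 'aer|corrected|pcieport|nvme'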

This is my first time sinking so much money into experimenting with this stuff, and I was surprised how much of a difference a few centimeters of cable could make.

I had still been intending to test bcachefs well into ordering parts, and planned to get 2 × P5800X as a mirrored write cache and metadata device. I only started looking at ZFS a couple of weeks ago, and unfortunately some further reading explained that the SLOG is not the equivalent of a write cache, which I had thought would let me get the best of both worlds: high-capacity RAIDz2 plus the speed/latency of Optane for the bursty workloads I’d be throwing at it. An actual writeback cache feature for ZFS doesn’t seem to have gone anywhere. I’ll still test ZFS, but cutting 96TB of storage in half with RAID1 is something I would like to avoid if I can help it.

I know the SLOG only helps synchronous writes (and asynchronous writes only if they’re forced through it with sync=always, at the cost of latency), but would a persistent L2ARC at least mask the read latency of QLC NAND over SATA for VM/dev workloads?
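
For context, the knobs I’d be leaning on look roughly like this; the dataset name is a placeholder, and l2arc_rebuild_enabled is my understanding of how persistent L2ARC is controlled in OpenZFS 2.x (where it defaults to on):

    # Persistent L2ARC: the cache survives reboot if rebuild is enabled
    cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
    # Force all writes on the VM dataset through the ZIL/SLOG (latency cost applies)
    zfs set sync=always tank/vms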

Yeah, they are hard to use. I have been monitoring news of PCIe 5.0 HBAs/RAID cards, which I plan to get for hooking up the Optane SSDs. PCIe 5.0 Optanes are never going to be made, so the next best thing is to aggregate a whole bunch of them behind an HBA/RAID controller to better utilize the bandwidth. In the meantime, those M.2-to-OCuLink adapters will do.

For the Asus ROG Crosshair X670E Extreme, the chipset gets 4 × PCIe 5.0 lanes worth of bandwidth, and after subtracting the bandwidth consumed by 2 Thunderbolt ports, that leaves 2 lanes worth shared by 2 × PCIe 4.0 x4 expansion slots and presumably a bunch of other things on the motherboard. I’m definitely not using those PCIe 4.0 slots, especially since, having tested 4K random read performance of the 905Ps on both PCIe 4.0 (chipset) and PCIe 5.0 (CPU) slots, I found that MB/s dropped 7.89% on PCIe 4.0.
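
Roughly the kind of A/B test I mean (device path and job parameters are placeholders, not my exact command): the same fio job run once with the 905P in a CPU-attached slot and once behind the chipset, comparing the reported bandwidth.

    # QD1 4K random reads; compare bw/IOPS between the two slot locations
    fio --name=4krr --filename=/dev/nvme2n1 --rw=randread --bs=4k \
        --iodepth=1 --ioengine=libaio --direct=1 --time_based --runtime=60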

