Tips for NVMe-based ZFS on EPYC Rome?

Hi,
I’m planning to build a Proxmox machine with a Windows VM that will have a GPU passed through for video editing.

For storage, I’m planning to use something like Asus’s Hyper M.2 card with four 2TB Seagate FireCuda 510 drives in RAID 0 (hey, if Apple can do it, so can I, and besides, I do daily snapshots and weekly backups). The CPU will be an EPYC Rome 7302 (16 cores).

So, I was wondering if anyone has any tips for the ZFS setup (it’s only for storing the virtual disk for the video content). I’ve seen the issues LTT ran into with NVMe-only ZFS pools, but those involved a large number of drives; in this case it’s only four.

So far I gathered:

  • No cache (L2ARC) drive is needed, nor a SLOG
  • Ashift should be 12
  • Sync off (rough command sketch below)
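
A minimal sketch of what I have in mind, assuming a placeholder pool name and example by-id paths (not my actual drive IDs):

```bash
# Plain stripe (RAID 0) across the four NVMe drives -- ashift forced to 12
zpool create -o ashift=12 \
    -O compression=lz4 -O atime=off \
    nvtank \
    /dev/disk/by-id/nvme-FireCuda_510_1 \
    /dev/disk/by-id/nvme-FireCuda_510_2 \
    /dev/disk/by-id/nvme-FireCuda_510_3 \
    /dev/disk/by-id/nvme-FireCuda_510_4

# Dataset that will hold the Windows VM's virtual disk, sync writes off
zfs create nvtank/vmdata
zfs set sync=disabled nvtank/vmdata
```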

Anything else?


I’d consider a single Kioxia U.2 unit instead of the Asus card and 4 NVMe drives in RAID 0…
What performance figures are you looking for/expecting, and how much time do you want to spend troubleshooting?

I have a similar setup: 4x WD SN850 2TB on a Gigabyte Aorus AIC adaptor (basically exactly the same as the Hyper M.2 card) on an EPYC Milan. I used to have two of these cards with the same SSDs and it worked great, but I moved one of them to another box.

Currently each SSD is partitioned the same way: a 1GB ESP for boot, plus two ~1TB partitions for ZFS. The first 1TB partitions go into a striped mirror pool, and the second 1TB partitions into a plain striped pool with no redundancy.
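
As a sketch, the layout looks roughly like this (disk, partition and pool names here are made up for illustration):

```bash
# Same GPT layout on every SSD (shown for one disk)
sgdisk -n1:0:+1G -t1:ef00 /dev/nvme0n1   # ESP
sgdisk -n2:0:+1T -t2:bf00 /dev/nvme0n1   # ZFS, redundant pool
sgdisk -n3:0:0   -t3:bf00 /dev/nvme0n1   # ZFS, scratch pool (rest of disk)

# Striped mirrors across the first ZFS partitions
zpool create fast mirror nvme0n1p2 nvme1n1p2 mirror nvme2n1p2 nvme3n1p2

# Plain stripe across the second ZFS partitions, no redundancy
zpool create scratch nvme0n1p3 nvme1n1p3 nvme2n1p3 nvme3n1p3
```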

I put my root fs dataset and project files on the pool with redundancy, and scratch files on the striped pool. All ESPs have the exact same files (rsync’d by a package manager hook on every kernel upgrade). That way if an SSD suddenly dies, I can still boot and work, and only lose the scratch files.

If you use the standard fallback EFI loader name on all ESPs (/EFI/Boot/bootx64.efi), then you don’t even need to create EFI loader entries; the firmware will just pick the first one available (in whichever order it finds them) at boot.
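
The hook itself is basically just an rsync loop, something like this (mount points are assumptions, adjust to your own layout):

```bash
#!/bin/sh
# Keep the secondary ESPs identical to the primary one, including the
# fallback loader at /EFI/Boot/bootx64.efi, so any remaining disk can boot.
for esp in /efi1 /efi2 /efi3; do
    rsync -a --delete /boot/efi/ "$esp"/
done
```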

I’m using zstd-1 compression and encryption (aes-256-gcm) on everything; I benchmarked it against no encryption and saw little difference. No other special settings. I played around with the tips from Wendell in the big setup videos (rcu_nocbs and some other kernel params I’ve forgotten), but with so few SSDs they made no difference that I could tell.
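
Dataset creation then looks something like this (dataset names are placeholders):

```bash
# Encrypted + lightly compressed project dataset; prompts for a passphrase
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
           -o compression=zstd-1 fast/projects

# Same settings on the scratch pool
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase \
           -o compression=zstd-1 scratch/work
```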

Picking a RAID configuration is hard with 4 drives. A RAID 0 stripe is a classic and I use it myself, but it disables the self-healing properties of ZFS, which are a major selling point for many users. ZFS also struggles to scale NVMe performance because of overhead in the ARC (mainly because arrays of NVMe drives approach the bandwidth and latency of main memory); I’ve seen multiple figures suggesting that ZFS more or less caps out at 12-13 GB/s. On the other hand, you get the ARC serving your reads from DRAM, which often more than makes up for it.

I have two striped PCIe 3.0 drives running with ZFS and I don’t see any noticeable performance limitations compared to the vendors’ spec maximums. But ZFS always has some toll to pay in disk performance; that’s the nature of a CoW filesystem.

Good :slight_smile:

I never set any ashift on my pools; the default is auto-detect, which has always worked fine for me.
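
If you want to see what auto-detection actually picked, something like this works (pool name is a placeholder):

```bash
# Per-vdev ashift as stored in the pool's on-disk config
zdb -C yourpool | grep ashift

# Pool-level property (may show 0 when left at the auto default)
zpool get ashift yourpool
```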

LTT is hardly a reference for proper information on how to set up or use ZFS. Do set up replication and scrub jobs on your Proxmox storage, right after you create the pool :wink:
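
Something like this, to give an idea (pool name, backup host and schedule are placeholders; Debian/Proxmox usually ships a monthly scrub cron job with zfsutils-linux already):

```bash
# Kick off a scrub by hand, or wire this into a cron/systemd timer
zpool scrub yourpool

# Minimal replication sketch: recursive snapshot, then send/recv to a backup box
zfs snapshot -r yourpool@backup-2024-06-01
zfs send -R yourpool@backup-2024-06-01 | ssh backuphost zfs recv -Fu backup/yourpool
```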

One additional note: if you want performance, pass the drives through to the VM. Native always beats virtual, both in performance (especially when virtualizing means huge single-file qcow2 images or zvols) and in troubleshooting. It restricts you to a Windows filesystem and no ZFS, but for that kind of work I’d take that hit and back the data up to Proxmox.
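
On Proxmox that means PCIe passthrough of the NVMe controllers themselves, roughly like this (VM ID and PCI addresses are just examples, look yours up with lspci):

```bash
# Find the NVMe controllers' PCI addresses
lspci -nn | grep -i nvme

# Hand each controller to the Windows VM (ID 100 here; pcie=1 needs the q35 machine type)
qm set 100 -hostpci0 0000:41:00.0,pcie=1
qm set 100 -hostpci1 0000:42:00.0,pcie=1
```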

Oh, and my setup: the server has striped mirrors of HDDs (RAID 10), and my workstation has a single NVMe boot drive and one NVMe disk for data. I will move to a cheap SATA boot drive (Btrfs) + 4x1TB RAIDZ1 (ZFS), just to get everything onto ZFS and properly checksummed, redundant and performant. So I have a different answer to "what to do with 4 drives?": if I go ZFS, I want the whole integrity package; 10+ GB/s isn’t a necessity for me, but latency is.


btw, if you do a raid 10, make the mirrors first, then stripe them.

That way, when a read occurs, ZFS finds the block and can read it from either drive of the relevant mirror. You could have only one working drive left in each mirror and still be able to read and write to the array.
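
In zpool terms that’s simply two mirror vdevs in one pool; ZFS stripes across vdevs automatically (names below are examples):

```bash
# RAID 10 equivalent: two mirror vdevs, striped together by the pool
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd
# Each mirror vdev can lose one disk and the pool stays readable/writable.
```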

I am planning something similar, but with an Optane drive for my virtual memory, so that the temporary write events don’t put load on ZFS.
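
A sketch of that, keeping swap completely off ZFS and on the Optane device (device name is hypothetical):

```bash
# Dedicated swap on the Optane drive, outside ZFS
mkswap /dev/nvme4n1
swapon /dev/nvme4n1

# Make it persistent; get the UUID from `blkid /dev/nvme4n1`
echo 'UUID=<optane-swap-uuid> none swap sw 0 0' >> /etc/fstab
```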