Building a dual Xeon 6442Y system with 8 x DC5800X Optane NVMe drives, and ZFS IOPS/throughput is horrendous. RAID 10 (striped mirrors) is getting ~9K IOPS and ~2GB/s sequential read across all 8 drives. Can someone possibly point me to a resource for getting this tuned?
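For reference, the numbers come from fio; a 4K random-read job along these lines (file path, size, and job counts are illustrative, not my exact job file) is the kind of test I mean:

# hypothetical 4K random-read job against a file on the pool
fio --name=randread --filename=/tank/fio.test --size=16G \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=8 --runtime=60 --time_based --group_reporting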
Also depends on your pool settings: recordsize/blocksize, compression, atime, etc.
And especially ashift. I don't know which ashift is best for Optane; 12 always works.
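A quick way to sanity-check this (assuming nvme-cli is installed; device names are placeholders) is to look at the LBA format the namespaces expose and then set ashift explicitly at pool creation, since it can't be changed afterwards:

# show which LBA format the namespace is using (512 vs 4K)
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
# ashift=12 = 4K sectors; add the remaining mirror pairs as needed
zpool create -o ashift=12 tank mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1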
ZFS has trouble capitalizing on fast storage because it relies so heavily on main memory to do the job. Pure NVMe pools sacrifice a lot of performance with ZFS regardless. But I've seen 12-13GB/s sequential with ZFS, so something is odd about those 2GB/s.
In any case you will leave a lot of performance on the table. That's why people don't use Optane (and, to a lesser degree, NAND) as main pool storage with ZFS.
That seems too slow even for ZFS. Can you verify that your processor clocks are ramping up during the fio run? I know Sapphire Rapids had some weird clock-ramping behavior early on.
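One quick way to check (tool names assumed to be available on the box): watch the core clocks while fio is running, and pin the governor to performance if they aren't ramping:

# watch effective core frequencies during the benchmark
watch -n1 'grep "cpu MHz" /proc/cpuinfo | sort -nr -k4 | head'
# with linux-cpupower installed, force the performance governor
cpupower frequency-set -g performance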
zfs set atime=off tank
zfs set compress=lz4 tank
zfs set primarycache=metadata tank
zfs set sync=disabled tank
zfs set relatime=off tank
zfs set recordsize=16K tank
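For completeness, a zfs get afterwards shows whether those properties actually took on the dataset:

zfs get atime,compression,primarycache,sync,relatime,recordsize tank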
Installed tuned and changed the profile to "throughput-performance".
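For anyone following along, the tuned side of that is just (assuming tuned-adm is installed):

tuned-adm profile throughput-performance
tuned-adm active    # verify the profile is applied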
I have not; however, I wouldn't expect such a massive delta between a single drive (ext4) and RAID 10 with 8 drives (ZFS). I could probably get really good performance with md, but ZFS snapshots are so useful for database backups.
Be careful with the logbias setting. It results in serious fragmentation over time, and there is nothing worse for performance than a fragmented pool. It may not be that problematic with large blocks, but at 16K and below it's a nightmare even for NVMe/Optane. Writing the data out in order a second time keeps the pool healthy, but requires more work. I never found the benefits to outweigh the resulting fragmentation.
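To see where a pool currently sits and to go back to the default (a sketch; tank is the pool from the earlier posts):

zfs get logbias tank
# latency is the default; throughput avoids writing data twice via the ZIL but fragments small blocks
zfs set logbias=latency tank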
What are you running? FreeBSD, Linux, something else? I'm going to guess Linux since you mentioned ext4. Since you seem to have this as a test box, do you see the same behavior when trying a newer kernel and/or FreeBSD 14.0? Are you sure lz4 compression doesn't cause a bottleneck?
It's Debian 12, kernel 6.1.0, and compression does slow things down a bit: with it off the IOPS test increases to 17K vs ~4.5K with LZ4 on. I might try loading FreeBSD on the box. The box is essentially a playground for now until I can figure out how to get the storage to not suck this badly.
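For the record, checking the versions in play and whether LZ4 is actually gaining anything here is just (tank as above):

uname -r
zfs version                  # ZFS module/userland version shipped with Debian 12
zfs get compressratio tank   # a ratio near 1.00x means LZ4 isn't saving anything on this data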
Edit:
Created an md RAID10 set with all 8 drives; sequential reads are between 30 and 53GB/s, and IOPS are 300-500K-ish (which still seems pretty low when each drive is rated for 1.5M).
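The md array was built roughly along these lines (device names and options are illustrative, not the exact command):

# 8-drive md RAID10 across the Optanes; fio then runs against /dev/md0
mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/nvme[0-7]n1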
Try it as a raidz instead. You still get a single drive of redundancy, and the data is striped across the rest. Small transactions will only go to 2 drives, so you potentially get 4x the performance.
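Recreating the pool as a single raidz vdev would look something like this (device names are placeholders, and it means destroying the existing pool first):

zpool destroy tank
zpool create -o ashift=12 tank raidz nvme0n1 nvme1n1 nvme2n1 nvme3n1 \
    nvme4n1 nvme5n1 nvme6n1 nvme7n1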
Also, can you post a "zpool status" so we have the right visual?
Day #2 of this adventure. Noticed the LBAs were set to 512 instead of 4K on the drives; also downloaded the Intel CLI tool and flashed new firmware on everything. We'll see.
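In case it helps anyone else, the generic nvme-cli route for switching a namespace to 4K LBAs looks like this (the format index below is an assumption; use whatever id-ns reports for the 4K entry, and note the format wipes the drive):

nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"   # find the 4K format index
nvme format /dev/nvme0n1 --lbaf=1                # destructive: reformats the namespace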
On the random read/write fio test, IOPS were bouncing between 30 (!) and ~5K. I completely disabled the primary/secondary caches and things stabilized around 5K for RAIDZ, so it looks similar to before.
Connectivity is all fine. Single-drive (xfs and ext4) performance is as expected. md RAID10 is nearly at the throughput cap at ~50GB/s and ~500K IOPS. This seems to be a ZFS thing entirely. The only thing I really need ZFS for is snapshots, and the endurance on these Optanes is such that it would take many lifetimes of expected load to even start denting them, so maybe use md and stick ZFS on the /dev/mdX? Just thinking out loud.
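That layering would look something like this (just a sketch; ZFS then only sees one device, so its checksums still detect corruption but there is no per-device redundancy for it to self-heal from):

# md provides the RAID10; ZFS sits on top purely for snapshots/checksums
zpool create -o ashift=12 tank /dev/md0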