Why is my ZFS outperforming HW RAID by so much?

I was playing around with a Dell R720 with an H710 PERC. I wanted to mess with ZFS, but sadly this controller has no passthrough, so for giggles I took 4 drives and made each one into a single-drive RAID0 array. In Proxmox I took those 4 and made a RAIDZ pool. I have a Windows VM with its boot drive on an SSD volume, and I made an additional disk for it on the HW RAID volume and another on the RAIDZ volume, those being E: and F: respectively. And I got the surprising benchmarks below. Why is ZFS so much faster on pretty much everything but the 4T write?

All the drives on the raid controller are 10k SAS rust.

Did you rule out ARC and other caching? The 3.2 GB/s sequential reads look a lot like NVMe or ARC; it's hard to compete with main memory for… well, anything really.
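
One quick way to check: watch the ARC hit counters on the Proxmox host while the benchmark runs. A minimal sketch, assuming the standard OpenZFS-on-Linux kstat file (run it before and after a pass; a big jump in hits means the reads came from RAM):

```python
#!/usr/bin/env python3
"""Quick ARC sanity check on the Proxmox host (a throwaway helper, not any
official tool). Reads the kstat file that OpenZFS on Linux exposes and prints
the ARC size and hit rate, so you can see if the benchmark is served from RAM."""

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def read_arcstats(path=ARCSTATS):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:      # first two lines are kstat headers
            parts = line.split()
            if len(parts) == 3:
                name, _kind, value = parts
                stats[name] = int(value)
    return stats

if __name__ == "__main__":
    s = read_arcstats()
    hits, misses = s["hits"], s["misses"]
    total = hits + misses or 1
    print(f"ARC size : {s['size'] / 2**30:.1f} GiB (ceiling {s['c_max'] / 2**30:.1f} GiB)")
    print(f"Hit rate : {100 * hits / total:.1f}%  ({hits} hits / {misses} misses)")
```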


Yeah, it has to be some sort of main-memory caching. I have Proxmox's disk cache turned off for this VM, so I am guessing this is something set on the ZFS side. I ran the test again and watched my CPU and memory usage, and holy crap, I have never had so much CPU usage on this server. I also got close to 5,000 MB/s sequential write, which honestly concerns me, as it means a lot of write data is hanging out in memory just waiting to be lost on a power failure.

[screenshot: Capturezfs2]
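
For what it's worth, the amount of async write data ZFS will hold in RAM before flushing is bounded by its dirty-data limit and transaction-group timeout. A small sketch to see those bounds on the host (assuming the stock OpenZFS-on-Linux module parameters):

```python
#!/usr/bin/env python3
"""Show how much async write data ZFS will buffer in RAM before it is forced
to disk (throwaway helper; zfs_dirty_data_max and zfs_txg_timeout are the
standard OpenZFS-on-Linux module parameters). Sync writes from the guest still
go through the ZIL, so only async data inside this window is at risk."""

def read_param(name):
    with open(f"/sys/module/zfs/parameters/{name}") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    dirty_max = read_param("zfs_dirty_data_max")  # bytes of dirty data allowed in RAM
    txg_timeout = read_param("zfs_txg_timeout")   # seconds between forced txg commits
    print(f"Up to {dirty_max / 2**30:.1f} GiB of async writes can sit in RAM,")
    print(f"committed at least every {txg_timeout} s (sooner under memory pressure).")
```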


I would normally say it is the ARC, but that cannot explain the write speeds - the ARC is not a write cache, and there should be no data sitting in it that you could lose on a power loss.

I don't know how to bypass the ARC other than by disabling it (not recommended) or shrinking it. Turning off caching on the VM's backing storage via libvirt does not touch the ARC, since the ARC is a layer that is completely opaque to libvirt.

If you want to test the impact of the ARC, you could resize it to about 0.5-1 GB (the default is half of system memory, unless Proxmox enforces a different default), then adjust the file size that CrystalDiskMark creates accordingly.

(I'm unsure what the best practice is for changing the ARC size on Proxmox, so I won't give a direction; it's better that you check the docs.)
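
Not Proxmox-specific advice, but on a stock OpenZFS-on-Linux box the knob is zfs_arc_max. A rough sketch of capping it at 1 GiB for the test (the paths are the usual ones; do check the Proxmox docs before persisting anything):

```python
#!/usr/bin/env python3
"""Rough sketch of capping the ARC at 1 GiB on a generic OpenZFS-on-Linux box
(run as root). zfs_arc_max is the standard OpenZFS parameter; the Proxmox docs
may recommend their own workflow, so treat this as the generic version."""

GIB = 2**30
NEW_MAX = 1 * GIB   # e.g. 1 GiB for the test described above

# Apply at runtime (no reboot needed; the ARC shrinks gradually, not instantly)
with open("/sys/module/zfs/parameters/zfs_arc_max", "w") as f:
    f.write(str(NEW_MAX))

# Persist across reboots (note: this overwrites an existing zfs.conf,
# merge by hand if you already have one)
with open("/etc/modprobe.d/zfs.conf", "w") as f:
    f.write(f"options zfs zfs_arc_max={NEW_MAX}\n")

print(f"zfs_arc_max capped at {NEW_MAX // GIB} GiB "
      "(rebuild the initramfs if your root pool is on ZFS)")
```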


Maybe this is a symptom of the benchmark. If the benchmark was just rewriting the same data on the drive, could ZFS match it to a hash already on the drive and just update its tables? Like de-duplication: you don't have to re-write data that is already there. That might explain write speeds that are inexplicably high.
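
Though dedup is off by default, so it would be easy to check whether it (or compression) is even enabled on the dataset. A quick sketch, with a placeholder dataset name of tank/vmdata:

```python
#!/usr/bin/env python3
"""Check whether dedup or compression is actually enabled on the dataset.
'tank/vmdata' is a placeholder name; substitute the real RAIDZ dataset."""
import subprocess

DATASET = "tank/vmdata"   # placeholder, not the actual pool name from the thread

out = subprocess.run(
    ["zfs", "get", "-H", "-o", "property,value",
     "compression,dedup,compressratio", DATASET],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    prop, value = line.split("\t")
    print(f"{prop:>14}: {value}")
```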


That's my suspicion as well. Does CrystalDiskMark write data that's compressible? If so, that would explain why it's running at NVMe/RAM speeds in that second test. He may have removed a bottleneck, which would also explain the higher CPU usage as it compressed everything.
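
One way to test the compression theory from inside the guest: write the same amount of zero-filled and random data to the ZFS-backed drive and compare throughput. A rough sketch (the F:\ztest path is just a placeholder for a directory on the RAIDZ-backed disk):

```python
#!/usr/bin/env python3
"""Crude compression check from inside the Windows guest: write the same
amount of zero-filled (highly compressible) and random (incompressible) data
to the ZFS-backed drive and compare throughput. F:\\ztest is a placeholder."""
import os
import time

TARGET = r"F:\ztest"     # placeholder: a directory on the RAIDZ-backed disk
SIZE_MB = 1024           # total data per test
CHUNK = 1024 * 1024      # write in 1 MiB chunks

def timed_write(path, chunk):
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())   # don't let it sit in the guest's page cache
    return SIZE_MB / (time.time() - start)

os.makedirs(TARGET, exist_ok=True)
zeros = timed_write(os.path.join(TARGET, "zeros.bin"), b"\x00" * CHUNK)
rand = timed_write(os.path.join(TARGET, "random.bin"), os.urandom(CHUNK))
print(f"zero-filled : {zeros:7.1f} MB/s")
print(f"random data : {rand:7.1f} MB/s")
```

A big gap between the two numbers points at compression (or zero-elimination) on the ZFS side. If I remember right, CrystalDiskMark also has a test-data option (random vs. all-zero fill) that would be an even quicker check.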