ZFS NVMe read performance

Hi,

I'm trying to find out why ZFS is pretty slow when it comes to read performance. I have been testing with different systems, disks and settings:

zpool create -f -o ashift=12 iotest /dev/nvme0n1
zfs create -V 8g iotest/iotest
zfs set compression=zstd-fast iotest/iotest
zfs set primarycache=metadata iotest/iotest
zfs set secondarycache=none iotest/iotest

fio --ioengine=libaio --filename=/dev/zvol/iotest/iotest --size=8G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --ramp_time=10 --rw=randread --bs=8k --numjobs=256

Testing directly on the disk I'm able to achieve reasonable numbers, not far from the spec sheet => 400-650k IOPS.
Testing on the zvol is almost always limited to about 150-170k IOPS, and it doesn't matter what CPU or disks I'm using.
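
For reference, the raw-disk baseline would be run along these lines (a sketch; I'm assuming the same fio parameters as above and that the disk is /dev/nvme0n1 — randread only reads, so it doesn't touch the data):

# same fio job, pointed at the raw NVMe namespace instead of the zvol
fio --ioengine=libaio --filename=/dev/nvme0n1 --size=8G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --ramp_time=10 --rw=randread --bs=8k --numjobs=256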

Is there a bottleneck in ZFS … or am I doing something wrong?

Any thoughts?

cheers

Why do you manipulate ARC and L2ARC?

Turning off L2ARC could be a way to get more honest IOPS numbers, but ARC?! It is an integral part of ZFS - if you screw with it, performance will go down.
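
If you want to re-test with ZFS defaults, something like this would undo the caching overrides (a sketch, using the dataset name from the first post):

# revert to the inherited defaults
zfs inherit primarycache iotest/iotest
zfs inherit secondarycache iotest/iotest
# or set them back explicitly (all is the default)
zfs set primarycache=all iotest/iotest
zfs set secondarycache=all iotest/iotest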


I'm not sure you're going to get ZFS to perform on a single disk, or at least not well.

You're really intended to have disks for storage and separate disks for the ZIL and L2ARC.

I don’t have any lab kit that I can test this on to compare right now, but my gut is telling me I would see similar results if I used a single disk like that.

@igoodman it is supposed to run KVM, so there's no point in the ARC, and it still doesn't explain this bottleneck.

@cybersplice I have tried with the same disks in a mirror: mdraid scales up to about 1.1M IOPS, while ZFS is stuck at ~150K IOPS.
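
For comparison, the mdraid side of that test would look roughly like this (a sketch; the device names and exact mdadm invocation are assumptions, not taken from the thread):

# hypothetical mdraid mirror over the same two NVMe namespaces
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
# same fio job as above, pointed at the md device
fio --ioengine=libaio --filename=/dev/md0 --size=8G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --ramp_time=10 --rw=randread --bs=8k --numjobs=256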

The only performance difference I'm seeing is when testing on desktop hardware with an i7 8700K and a ramdisk: I'm able to reach about 260K IOPS, but ZFS eats all the CPU.

tmpfs 30G 9.1G 21G 31% /test
truncate --size 10G /test/disk1.img
zpool create -f -o ashift=12 iotest /test/disk1.img
zfs create -V 8g iotest/iotest
zfs set compression=zstd-fast iotest/iotest
zfs set primarycache=metadata iotest/iotest
zfs set secondarycache=none iotest/iotest
[screenshot: fio results on the ramdisk-backed zvol]

Do you have a third?
If you do, define one as the pool, one as the ZIL, and one as the L2ARC, something like the sketch below.

See if you still get the bottleneck.
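
A minimal sketch of that layout, assuming three NVMe devices (the device names are hypothetical):

zpool create -f -o ashift=12 iotest /dev/nvme0n1   # data vdev
zpool add iotest log /dev/nvme1n1                  # SLOG / ZIL device
zpool add iotest cache /dev/nvme2n1                # L2ARC device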


Why would you omit the ARC for KVM?

This is a little bit like biking with flat tyres: sure, it will go, but do not expect high speeds.

Especially if you use KVM virtualisation, you need it!

I agree with @cybersplice

What's the point in putting the VM's VFS cache on top of the ZFS cache?

SLOG/ZIL - I'm testing a read performance issue, so it doesn't matter.
L2ARC doesn't make much sense with two quick NVMes like the P4510.

I understand what you're asking re VFS / ZFS, but you can't lump them together.

The VFS cache is there to cache I/O operations at the hypervisor level. ZFS isn't intended to be used without its cache.

If you don't want to use the features that make ZFS useful, why bother with ZFS at all?

Not trying to be belligerent here, but why not use btrfs or xfs? I would expect you’d get the performance you’re looking for.
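
If you do want to try the btrfs route, a mirrored setup with checksumming is roughly this (device names and mount point are assumptions):

mkfs.btrfs -f -d raid1 -m raid1 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt/test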


In RAID1 with two P4510s, compared to mdraid, I see about an 80-90% performance drop.
I could accept 30, maybe 40% … but 80-90% hurts.

I'm using ZFS for mirroring and checksumming; it is also the only software RAID officially supported by Proxmox.
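
For context, the mirrored ZFS pool in that comparison is presumably created along these lines (a sketch; device names are assumptions):

zpool create -f -o ashift=12 iotest mirror /dev/nvme0n1 /dev/nvme1n1
zfs create -V 8g iotest/iotest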

Unless you want to buy some more nodes and use Ceph…! :joy:

It will perform better with caching. You can probably get away without a ZIL, but it will be faster with one. I appreciate that's inconvenient when you've bought fast SSDs for storage.

Like I say, it just isn’t designed to work without it.

And yeah, you don't get a lot of freedom with Proxmox; I guess it makes sense considering their business model and target market.

I guess you could roll your own if you have nothing better to do…

Alternatively, xcp-ng uses mdraid, I believe, and works pretty nicely. Not as nicely as Proxmox in the UI department, but I would expect it will do what you want it to do.

My recommendation for your use case: set up a RAID-Z1,
do not touch any ARC settings.

Have you tried ashift=13?
What is your recordsize?
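
A minimal sketch of that suggestion, assuming three devices (the names are hypothetical); note that for a zvol the relevant property is volblocksize rather than recordsize:

zpool create -f -o ashift=13 iotest raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
# volblocksize can only be set at zvol creation time; match it to the fio block size
zfs create -V 8g -o volblocksize=8k iotest/iotest
zfs get volblocksize iotest/iotest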