Proxmox ZFS NVMe loses 80% performance

I have a pair of Dell R640 NVMe servers with Samsung PM1725 drives. They run Proxmox 8 with ZFS for both the boot mirror and the RAIDZ data pool. This is an out-of-the-box basic setup, and NVMe performance is terrible: a ZFS mirror or RAIDZ is at best 80% slower than LVM on the same server.

I first noticed something was off when our production DB was causing massive I/O wait. I reconfigured one of the servers to have an LVM volume, a ZFS mirror, and a RAIDZ pool, then took a single VM and moved its disk to each type of volume in turn. I used fio to run sequential, random, and mixed-load tests.

Every ZFS configuration I tried was roughly 6x (about 80%) slower than the LVM. I have read many threads and tried a lot of ZFS config changes; nothing made much of a difference.

I have also read that this is not uncommon, but there are no real answers, and some people say ZFS is just not performant. Is this true? Is ZFS just not made for high-speed drives? Is anyone using ZFS with NVMe without paying a huge performance penalty? What OS/filesystem is being used on big flash servers?

It’s hard to believe that 20 MB/s of DB writes can overwhelm a drive designed to sustain 2,000 MB/s of writes.

Todd

Note: I’m an absolute Proxmox/ZFS noob but am soon building something similar (NVMe mirror as boot vdev)

  • Could the write cache on the drives themselves be disabled as some ZFS default data-protection setting, since ZFS can’t know whether a used drive has power-loss protection?
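If you want to rule that out, I believe the drive’s volatile write cache state can be read with nvme-cli; a minimal sketch, with the device name as a placeholder:

# check whether the drive's volatile write cache is enabled (feature ID 0x06)
nvme get-feature /dev/nvme0 -f 6 -H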

IMO, I’d save the NVMe for the main storage and not use it for booting the OS.

How did you create the pool?
What ashift did you use?
Did you set up a special device or cache?
Post the qm config of the VM you used for tests.
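Something like the following would collect most of that (a sketch; the pool name tank and the VM ID are placeholders):

# pool layout, vdev types, and any special/cache devices
zpool status tank
# ashift and the dataset settings that matter for performance
zpool get ashift tank
zfs get recordsize,compression,sync tank
# VM disk configuration
qm config <vmid>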

You can’t just throw out a statement like ‘ZFS is 80% slower than LVM’, not follow up with data and the specs of the scenario you are running… and expect someone to have a meaningful explanation.

The pools were created using the Proxmox GUI. I tried ashift 12 and 13, and I verified the drives are 4k. I also built pools with lz4 compression and with the defaults. The hardware is a factory-spec Dell R640 with 512 GB of RAM. No special caching, though I did try a SLOG for fun; it made no real difference. I have the ARC capped at 32 GB, as I have had experience with it getting out of control and causing OOM kills on VMs.
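For anyone wondering, the ARC cap is the standard module option; a minimal sketch, assuming a 32 GiB limit:

# /etc/modprobe.d/zfs.conf — cap the ARC at 32 GiB (value is in bytes)
options zfs zfs_arc_max=34359738368

# rebuild the initramfs so the option applies at boot, then reboot
update-initramfs -u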

The VM was a privileged LXC container running Ubuntu 23.04, but I have also run this on the host against the boot zmirror and get the same kind of results.

The test command is:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=32 --size=4G --readwrite=randrw --rwmixread=25

This is write-heavy, with a 25/75 random read/write mix. I chose it because it matches my workload. I also ran a sequential test, and the disparity was the same.
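The sequential variant looked something like this (a sketch; the 1M block size is an assumption, not my exact command):

fio --name=seqtest --ioengine=libaio --direct=1 --gtod_reduce=1 --filename=seq_read_write.fio --bs=1M --iodepth=32 --size=4G --readwrite=rw --rwmixread=25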

The best I got from the random r/w test was 15 MB/s read and 46 MB/s write. Then, to simulate a DB workload, I set zfs sync=always, as databases never write async. That brought my results down to 8 MB/s read and 25 MB/s write.
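For the record, that was just the dataset property; a sketch with a placeholder dataset name:

# force every write through the ZIL, like a database's fsync-heavy pattern
zfs set sync=always rpool/data
# revert afterwards
zfs set sync=standard rpool/data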

I guess the point is not so much to have people tell me what is wrong (no problem if you can), but to find out whether anyone is getting anywhere near (even 50% of) native performance out of ZFS on NVMe drives.

I have run Proxmox clusters for 5 years in production, and these are our first NVMe-based ones, but honestly this makes me question a lot about our other ZFS SSD pools.

probably wrong about this but...

This may be an unpopular opinion, but the entire point of ZFS was that your CPU was so much faster than spinning storage that it made sense to run a much more complicated filesystem.
With high-speed flash storage, ZFS and Btrfs show much, much lower benchmark performance because these filesystems do a lot more work every time you do anything. 80% lower is a lot, but for a RAID volume with 4+ drives on a low-clock system? I could see it.

I haven’t tried ZFS on NVMe, but Btrfs definitely sees a pretty big performance drop. Trying your test on a single NVMe drive, I see fio eating about 80% of a CPU thread.
Seems like what you would expect given the difference in single-threaded CPU performance.

Do you notice any problems in real-world performance?
Also, I would definitely keep compression off if you want your NVMe drives to perform. The drive bandwidth is very high, and your CPU is going to have trouble just moving the bytes. The fewer steps you go through, the faster it will be. LVM is probably the way to go if performance is a concern.
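Turning it off is a one-liner; a sketch with a placeholder dataset name (note it only affects newly written blocks, not existing data):

# disable compression for new writes on the dataset backing the VM disks
zfs set compression=off rpool/data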


I think you might be on to something. These servers are newer, but their per-core single-thread performance is not great. To back that up, I just turned off compression and saw about a 17% improvement in performance.

I also wonder if this isn’t seen as much because many people playing with ZFS are on desktop CPUs with high single-thread performance.

As far as real-world problems: our database VM kernel-panicked when I migrated another VM to the same hypervisor. The I/O wait became so bad that the kernel panicked on a disk timeout. The sync behavior of MySQL also leaves little room for catch-up or caching.

I think I am going to move the database servers to mirrored LVM disks. It’s a shame, because I will have to do all the snapshot management manually; I don’t believe Proxmox manages those LVM features.

I run mirrored ZFS on NVMe on Proxmox on a first-gen EPYC server, with 1.88 TB PM983a drives. The performance is as it should be.

Best not to boot from your VM pool’s storage device, either.

It was the damn RAIDZ; use RAID10 (striped mirrors) on ZFS, especially with NVMe. This article gave me the depth of understanding to risk making a change to my production pool.
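For anyone following along, a striped-mirror (RAID10-style) pool looks like this (a sketch; the pool and device names are placeholders):

# two mirrored vdevs striped together: reads and writes spread across both
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1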


What’s your record size?

If you use a 128 KB record size and do a 4 KB random test, you get up to 32x read/write amplification (128 KB / 4 KB = 32).

It’s 128k.
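If that turns out to be the problem, matching the record size to the database page size is the usual fix; a sketch, assuming a placeholder dataset and InnoDB’s 16 KB pages:

# only affects blocks written after the change; existing data keeps the old record size
zfs set recordsize=16k tank/db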

With RAIDZ, the fsync calls produced by database writes essentially reduce the entire pool to sequential operation. Those syncs block all reads and writes, because nothing can be read until the parity from the fsync has been written to all drives. RAID10 does not have this problem: there is no parity, and you can read from and write to different vdevs at the same time. It turns out it’s not so much about raw throughput as the blocking behavior of RAIDZ.
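If anyone wants to see the blocking directly, a sync-heavy fio run makes the RAIDZ-vs-mirror gap obvious; a sketch:

# fsync after every write at queue depth 1: the worst case for RAIDZ parity writes
fio --name=synctest --ioengine=libaio --direct=1 --bs=4k --iodepth=1 --size=1G --rw=randwrite --fsync=1 --filename=sync_test.fio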