Help / Guidance with slow ZFS on Proxmox with NVMe RAID 1 mirror as boot

Hi all,

It looks as though I may have set up my new Proxmox home server incorrectly for the ZFS RAID 1 boot. I'm seeing some pretty odd issues with write speeds. I started digging into it and noticed I may have selected an incorrect ashift size. This is the first time I'm attempting Proxmox with ZFS as my boot drive, so I am sure mistakes were made.

I have 2x 2TB NVMe drives. When I checked the sector size it was reported as 512, and I set everything up with an ashift of 12. Could this be why I am seeing issues with write performance?

Pool:

root@proxmox:~# zpool status
pool: rpool
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0B in 00:08:33 with 0 errors on Sun Mar 14 00:32:34 2021
config:

    NAME                                         STATE     READ WRITE CKSUM
    rpool                                        ONLINE       0     0     0
      mirror-0                                   ONLINE       0     0     0
        nvme-ADATA_SX8200PNP_2K462L2G4EPE-part3  ONLINE       0     0     0
        nvme-ADATA_SX8200PNP_2K4329A897KU-part3  ONLINE       0     0     0

Disks:

root@proxmox:/mnt/backups/dump# fdisk -l
Disk /dev/nvme1n1: 1.9 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: ADATA SX8200PNP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 7D6D1FB1-67E9-4451-9761-E0B8E7F607C9

Device Start End Sectors Size Type
/dev/nvme1n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme1n1p2 2048 1050623 1048576 512M EFI System
/dev/nvme1n1p3 1050624 4000797326 3999746703 1.9T Solaris /usr & Apple ZFS

Disk /dev/nvme0n1: 1.9 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: ADATA SX8200PNP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F769EFD0-BF00-4E91-9EE8-C1C1CAFF851B

Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p3 1050624 4000797326 3999746703 1.9T Solaris /usr & Apple ZFS

An ashift of 12 should be correct. fdisk is not able to report the real sector size of SSDs. Try running stat -f /dev/nvme0n1 instead. I also want to point out that I have heard some SSDs use an 8K block size, in which case you would need an ashift of 13.
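If you want to double-check, you can also verify what ashift the pool was actually created with and ask the drive itself which LBA formats it supports. A few commands for that (this assumes the nvme-cli package is installed, and zdb needs the pool to be in the zpool cache file):

# ashift the existing pool is actually using
zdb -C rpool | grep ashift

# block sizes the kernel sees for the drive
cat /sys/block/nvme0n1/queue/logical_block_size /sys/block/nvme0n1/queue/physical_block_size

# LBA formats the NVMe namespace supports (the one marked "in use" is active)
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"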

Interesting, I didn’t realize that. So does this mean the sector size is really 4096 even though fdisk is reporting 512?

root@proxmox:~# stat -f /dev/nvme0n1
File: "/dev/nvme0n1"
ID: 0 Namelen: 255 Type: tmpfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 16473991 Free: 16473991 Available: 16473991
Inodes: Total: 16473991 Free: 16473280

What kind of speed should I be seeing in the fio test? Because for random writes this doesn't seem correct.

ex:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=8648KiB/s][w=2162 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=18363: Sun Mar 14 16:59:02 2021
write: IOPS=22.7k, BW=88.5MiB/s (92.8MB/s)(5429MiB/61330msec); 0 zone resets
slat (nsec): min=260, max=11354k, avg=951.12, stdev=15944.23
clat (nsec): min=210, max=41627k, avg=41728.29, stdev=129623.18
lat (usec): min=4, max=41628, avg=42.68, stdev=131.22
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 7], 20.00th=[ 9],
| 30.00th=[ 31], 40.00th=[ 33], 50.00th=[ 34], 60.00th=[ 35],
| 70.00th=[ 37], 80.00th=[ 43], 90.00th=[ 88], 95.00th=[ 128],
| 99.00th=[ 163], 99.50th=[ 176], 99.90th=[ 519], 99.95th=[ 1090],
| 99.99th=[ 4817]
bw ( KiB/s): min=23784, max=218968, per=100.00%, avg=92857.52, stdev=35705.39, samples=119
iops : min= 5946, max=54742, avg=23214.35, stdev=8926.35, samples=119
lat (nsec) : 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 10=22.03%, 20=2.10%, 50=59.29%, 100=7.10%, 250=9.24%
lat (usec) : 500=0.13%, 750=0.03%, 1000=0.02%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=3.54%, sys=3.53%, ctx=1426932, majf=0, minf=44
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1389930,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=88.5MiB/s (92.8MB/s), 88.5MiB/s-88.5MiB/s (92.8MB/s-92.8MB/s), io=5429MiB (5693MB), run=61330-61330msec

Your drives are lying. Most solid-state storage is 8K, 16K, or some even weirder number underneath. However, most consumer SSDs are tuned so that treating them as if they were 4K or 8K (ashift 12 or 13) is optimal (you can seriously just pick either one). There are some that are best at 512B, but those are few and far between.

Even if your ashift were way too big, the only real loss would be space efficiency for files that could be compressed to below half your ashift. Theoretically there are some potential read latency concerns, since the entirety of the larger block needs to be returned, but that's not a real-life concern unless you are running a database or otherwise doing a lot of constant tiny read operations within files.

Setting your ashift too small (ashift=9) results in massive read-write amplification.
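To put rough numbers on it: if the flash underneath really works in 8K pages, a 512 B write issued at ashift=9 can force the drive to read, modify, and rewrite a whole 8K page, which is up to ~16x write amplification in the worst case. At ashift=12 or 13 every block ZFS issues is at least half a page, so that overhead stays bounded.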

How are you determining poor performance? Did you benchmark your drives as is to determine base performance, and again under ZFS? Did you use htop to check if your CPU cores are maxing out while doing operations?


I did a few copies and was checking the status with progress to see the speed. I then dove into fio testing, and the performance seems less than ideal for NVMe drives (I pasted the output in the previous reply). I did not do any baseline performance testing; this is a new setup. I'm going to try some more tests later as well, but I'm open to any suggestions. Thanks to both of you for the replies.

  • John

Also have a look at this post, which has some settings that can have a pretty big impact on ZFS performance on Linux. Especially atime. Fuck atime.
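If you only take one thing from it, the atime change is a one-liner on the pool and is inherited by every dataset under it (rpool being your pool name from earlier in the thread):

zfs set atime=off rpool
zfs get atime rpool    # confirm it took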

@Log I’ll check that out, thanks! What do you think of the slow results from the random write test?

I ran that same test on my spinning-rust ZFS mirror, which is inside a VM (TrueNAS SCALE) on the same host, and I am seeing 45+ MiB/s. It seems odd that NVMe drives would be only double the speed on the benchmark?

I wish I was some benchmarking guru, but at best I’m a cargo-cultist when it comes to that.

The most plausible explanation, from my hazy understanding, is that you (and I) aren't correctly understanding what fio is actually doing, and that read-write amplification is happening. I've yet to seriously benchmark my own disks, because so far they've been good enough that I haven't wanted to burn time on it. Sorry I can't be more helpful in that regard.


No problem, thanks again for the help. I just wanted to do the benchmarks to make sure I’m actually getting the speeds I should be on this new build before leaving this thing alone.

In case you are on Linux, you could try KDiskMark. It is the Linux equivalent of CrystalDiskMark. When benchmarking ZFS, make sure to set the number of passes to one and choose a bigger size for the test file. If you don't, ZFS might keep the test file in cache for every pass after the first, which would falsify the results. Even though it is a great program in general, speaking from experience it does not seem to work too well with ZFS; at least I had somewhat varying results between runs. It might still be another starting point. As you can tell from my writing, I am also not an expert on ZFS benchmarking.

So I went ahead and tried various tweaks and changes last night. No luck, still seeing really slow speeds on the host and the VMs. I was just super frustrated at this point… I even ran out to Micro Center today and grabbed an IronWolf 125 SSD to add as a SLOG (I wanted this anyway for my ZFS NAS VM, but figured I'd try it on the host first to see). That made zero impact on the host, same issues.
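For reference, I attached the SLOG roughly like this (the by-id path is a placeholder, and as I understand it a SLOG only accelerates sync writes, which is probably why it didn't move the needle on an async fio run):

zpool add rpool log /dev/disk/by-id/ata-IRONWOLF-125-SERIAL    # placeholder device path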

Today I backed up all the VMs and decided to try a different filesystem so I could run some tests to compare against my ZFS struggles. I thought perhaps I had a bad drive or a config issue.

I set up RAID 1 via mdadm with XFS as the filesystem for the two NVMe drives and reinstalled Proxmox. I re-imported all the VMs and I'm up and running again. The first test was pulling data off my TrueNAS SCALE VM to another VM via NFS. Before, I'd see maybe 150-200 MiB/s; now I am seeing a consistent 350-400+, e.g.: 14.6% (9.6 GiB / 65.4 GiB) 420.6 MiB/s remaining 0:02:15. The same goes for copying a file within a VM as well as copying files on the host; everything is as fast as it should be, e.g.: 32.6% (12.0 GiB / 36.8 GiB) 915.9 MiB/s remaining 0:00:27.
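In case anyone wants to reproduce the comparison, the mdadm side is roughly this (partition names are placeholders; as far as I know the Proxmox installer won't set up mdraid itself, so the boot and EFI partitions need to be handled separately):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1pX /dev/nvme1n1pX
mkfs.xfs /dev/md0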

I re-ran some of those fio tests on the host as well.

Random write:
root@proxmox:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=53.0MiB/s][w=13.8k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=1474: Mon Mar 15 18:55:27 2021
write: IOPS=117k, BW=458MiB/s (481MB/s)(27.7GiB/61761msec); 0 zone resets
slat (nsec): min=250, max=1905.6k, avg=893.55, stdev=761.48
clat (nsec): min=550, max=1028.1k, avg=5132.49, stdev=1666.20
lat (usec): min=3, max=1907, avg= 6.03, stdev= 1.94
clat percentiles (nsec):
| 1.00th=[ 2544], 5.00th=[ 2640], 10.00th=[ 2672], 20.00th=[ 2768],
| 30.00th=[ 5472], 40.00th=[ 5792], 50.00th=[ 5920], 60.00th=[ 5984],
| 70.00th=[ 6112], 80.00th=[ 6240], 90.00th=[ 6432], 95.00th=[ 6624],
| 99.00th=[ 7136], 99.50th=[ 7520], 99.90th=[ 8896], 99.95th=[ 9664],
| 99.99th=[11328]
bw ( KiB/s): min=20792, max=1090624, per=100.00%, avg=585781.51, stdev=238168.11, samples=98
iops : min= 5198, max=272656, avg=146445.40, stdev=59542.06, samples=98
lat (nsec) : 750=0.01%
lat (usec) : 4=28.68%, 10=71.28%, 20=0.03%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%
lat (msec) : 2=0.01%
cpu : usr=14.35%, sys=25.54%, ctx=7392187, majf=0, minf=45
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7248968,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=458MiB/s (481MB/s), 458MiB/s-458MiB/s (481MB/s-481MB/s), io=27.7GiB (29.7GB), run=61761-61761msec

Read:
root@proxmox:~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=testio --filename=testio --bs=4k --iodepth=64 --size=4G --readwrite=randread
testio: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
testio: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=1153MiB/s][r=295k IOPS][eta 00m:00s]
testio: (groupid=0, jobs=1): err= 0: pid=1850: Mon Mar 15 18:56:55 2021
read: IOPS=296k, BW=1158MiB/s (1214MB/s)(4096MiB/3538msec)
bw ( MiB/s): min= 1144, max= 1166, per=100.00%, avg=1158.00, stdev= 7.91, samples=7
iops : min=293086, max=298528, avg=296448.86, stdev=2024.47, samples=7
cpu : usr=19.45%, sys=77.83%, ctx=12705, majf=0, minf=71
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=1158MiB/s (1214MB/s), 1158MiB/s-1158MiB/s (1214MB/s-1214MB/s), io=4096MiB (4295MB), run=3538-3538msec

Disk stats (read/write):
dm-1: ios=1036134/0, merge=0/0, ticks=143644/0, in_queue=143644, util=97.38%, aggrios=1048576/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md0: ios=1048576/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=527403/3117, aggrmerge=3118/3118, aggrticks=73058/186, aggrin_queue=0, aggrutil=96.12%
nvme0n1: ios=412398/6232, merge=0/6236, ticks=73881/373, in_queue=0, util=96.12%
nvme1n1: ios=642408/2, merge=6236/0, ticks=72235/0, in_queue=0, util=96.12%

Super happy with how things are now, seeing that I am getting the speeds and performance I should be. I'm not ruling out that there was something up with my config, but I feel like I gave it a solid try and something wasn't adding up.

Thanks!
John


I recently did some ZFS benchmarking with a 1TB 970 Pro…

I tested with fio, random read and random write, from 512-byte and 4K block sizes all the way up to 8M.
I ran the test on a 970 Pro freshly formatted to ext4, XFS, ZFS, and NTFS for laughs. Joke was on me though: NTFS did really well on reads. As expected, terrible on writes :laughing:

Anyway, on to the good stuff.

WRITE:
XFS was slightly faster than ext4 at most IO sizes.
ZFS was slower than either until reaching its record size, then it exceeded both XFS and ext4.

Examples (ordered ext4, XFS, ZFS 128k record, ZFS 64k record) in MiB/s:
4k IO: 468, 401, 209, 229
8k IO: 628, 733, 404, 442
32k IO: 1548, 1852, 989, 1406
64k IO: 1828, 2171, 1722, 2453
128k IO: 2033, 2263, 2746, 2489
8M IO: 2120, 2318, 2595, 2645

I have more results but I didn’t want to copy/paste a massive table.

READ:
really hard to test and get meaningful values from ZFS due to its ARC. Depending on what I did with the ARC cache and how big the files I made it read were, the results would change drastically. Leaving things at ZFS defaults gave me read results as high as 9756 MiB/s!! I basically quit trying because the results weren't meaningful.
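If I revisit it, the least-bad approach I've seen is to benchmark reads against a throwaway dataset with data caching in the ARC turned off, something like this (dataset name made up):

zfs create dpool/fiotest
zfs set primarycache=metadata dpool/fiotest    # cache metadata only, not file data
# run fio with --directory=/dpool/fiotest, then zfs destroy dpool/fiotest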

@burmjohn If you are running the test in VMs, then there is also the question of the image file's or ZVOL's performance. I have not tested it, but from my reading the best thing to do with ZFS is to match your record size to the cluster size of the qcow2 file (64K, I think) and pre-allocate your qcow2s to prevent the performance hit from allocation on writes.
The other oddball thing I read about was using a ZVOL formatted to XFS or ext4 to store your images on. What were you running for record size and VM disk images?
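As a rough sketch of the recordsize / preallocation idea (dataset and image names are made up; 64K is the default qcow2 cluster size):

zfs set recordsize=64K dpool/images
qemu-img create -f qcow2 -o cluster_size=64k,preallocation=metadata /dpool/images/vm-disk.qcow2 32G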

Some reading on VMs on ZFS with benchmarks:
https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/

https://jrs-s.net/2013/05/17/kvm-io-benchmarking/

EDIT: My testing was also with compression & encryption on for ZFS, since that is how I would use it. That probably improved the max throughput for ZFS writes.


Hi Chuck,

Interesting results. I'm just not too sure why I wasn't getting anything close to acceptable results on the host machine; I guess it could have been a number of factors. ZFS is working amazingly for the NAS setup though. I'll give it another go when I have some time to see if it was indeed a configuration issue / user error (highly likely, it's me after all :eyes: ).

John

I’ll check what all ZFS options I was using when I get a chance.

The thing that sticks out to me off the top of my head is that I was using whole disks and not partitions. Maybe your partition isn't aligned to the 4K sector size and you're getting write amplification?
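For what it's worth, that should be easy to rule out: going by the fdisk output earlier in the thread, the ZFS partition starts at sector 1050624, and 1050624 / 8 = 131328 with no remainder, so it sits on a 4K boundary. parted can confirm the same thing:

parted /dev/nvme0n1 align-check optimal 3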

FYI, this is what I ran for my testing:
zpool create -f -o ashift=12 -O encryption=aes-256-gcm -O keylocation=prompt -O keyformat=passphrase -O acltype=posixacl -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa dpool /dev/nvme0n1

I'm not sure how that compares to the default Proxmox settings.
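A quick way to line the two up would be to dump the relevant properties from the Proxmox pool and compare them against the options above, e.g.:

zfs get compression,atime,relatime,xattr,acltype,dnodesize,recordsize rpool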