Wendell, great writeup!
I’m running a TR 3960X with 4 x 1 TB Gen 4 NVMe drives and am testing RAID setups. As you noted in your writeup, the raw interface speed is great:
root@3960x:~# fio fiotest.txt
random-read: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
fio-3.12
Starting 1 process
Jobs: 1 (f=4): [r(1)][100.0%][r=15.7GiB/s][r=129k IOPS][eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=106324: Sun Feb 9 20:04:14 2020
read: IOPS=118k, BW=14.4GiB/s (15.4GB/s)(863GiB/60003msec)
slat (nsec): min=2080, max=80847, avg=2888.18, stdev=867.04
clat (usec): min=33, max=26691, avg=1083.16, stdev=1200.87
lat (usec): min=36, max=26695, avg=1086.09, stdev=1200.86
clat percentiles (usec):
| 1.00th=[ 285], 5.00th=[ 412], 10.00th=[ 453], 20.00th=[ 537],
| 30.00th=[ 635], 40.00th=[ 734], 50.00th=[ 840], 60.00th=[ 963],
| 70.00th=[ 1123], 80.00th=[ 1352], 90.00th=[ 1762], 95.00th=[ 2245],
| 99.00th=[ 5211], 99.50th=[ 8979], 99.90th=[17695], 99.95th=[19268],
| 99.99th=[21627]
bw ( MiB/s): min= 5601, max=16249, per=99.99%, avg=14725.90, stdev=3055.10, samples=120
iops : min=44812, max=129994, avg=117807.20, stdev=24440.78, samples=120
lat (usec) : 50=0.01%, 100=0.06%, 250=0.66%, 500=14.94%, 750=26.08%
lat (usec) : 1000=20.70%
lat (msec) : 2=30.71%, 4=5.42%, 10=1.01%, 20=0.39%, 50=0.03%
cpu : usr=10.26%, sys=39.59%, ctx=2814273, majf=0, minf=4105
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=7069271,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=14.4GiB/s (15.4GB/s), 14.4GiB/s-14.4GiB/s (15.4GB/s-15.4GB/s), io=863GiB (927GB), run=60003-60003msec
where fiotest.txt is copied from your example above:
root@3960x:~# cat fiotest.txt
[global]
bs=128k
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G
numjobs=1
[job1]
rw=randread
name=random-read
#filename=/tank1/test.txt
filename=/dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_redacted:/dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_redacted:/dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_redacted:/dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_redacted
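(For reference, the colon-separated filename list makes fio spread the single job’s I/O across all four raw namespaces, which is why the jobs line above shows f=4.) A quick sanity check that all four drives are being hit evenly is to watch per-device throughput in another terminal while the job runs, e.g.:
root@3960x:~# iostat -xm 1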
When running this on a striped ZFS pool:
zpool create tank1 /dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_xx0 /dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_xx1 /dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_xx2 /dev/disk/by-id/nvme-Sabrent_Rocket_4.0_1TB_xx3
I get the following results:
root@3960x:~# fio fiotest.txt
random-read: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=5302MiB/s][r=42.4k IOPS][eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=110170: Sun Feb 9 20:10:43 2020
read: IOPS=41.7k, BW=5207MiB/s (5460MB/s)(305GiB/60001msec)
slat (usec): min=10, max=112, avg=23.27, stdev= 5.29
clat (nsec): min=1910, max=4521.4k, avg=3049079.32, stdev=350086.73
lat (usec): min=26, max=4596, avg=3072.43, stdev=352.77
clat percentiles (usec):
| 1.00th=[ 2409], 5.00th=[ 2507], 10.00th=[ 2573], 20.00th=[ 2671],
| 30.00th=[ 2769], 40.00th=[ 2900], 50.00th=[ 3097], 60.00th=[ 3294],
| 70.00th=[ 3392], 80.00th=[ 3392], 90.00th=[ 3425], 95.00th=[ 3425],
| 99.00th=[ 3654], 99.50th=[ 3687], 99.90th=[ 3687], 99.95th=[ 3720],
| 99.99th=[ 3752]
bw ( MiB/s): min= 4631, max= 5842, per=100.00%, avg=5209.28, stdev=387.09, samples=119
iops : min=37050, max=46738, avg=41674.20, stdev=3096.71, samples=119
lat (usec) : 2=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=100.00%, 10=0.01%
cpu : usr=3.21%, sys=96.78%, ctx=121, majf=0, minf=4105
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=2499238,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=5207MiB/s (5460MB/s), 5207MiB/s-5207MiB/s (5460MB/s-5460MB/s), io=305GiB (328GB), run=60001-60001msec
I’ve tried adding NVMe polling, disabling ARC compression, rcu_nocbs, etc., but haven’t seen much improvement.
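For reference, this is roughly what I mean by the ZFS-side tweaks; just a sketch assuming the tank1 pool from above, not necessarily the exact values I settled on:
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled   # compressed ARC off
zfs set recordsize=128k tank1    # match the 128k block size in the fio job
zfs set compression=off tank1
zfs set atime=off tank1
zfs set primarycache=metadata tank1   # keep data out of the ARC so reads hit the disks
The kernel-side knobs are in my cmdline: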
root@3960x:~# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs mce=off iommu=pt avic=1 pcie_aspm=off rcu_nocbs=0-47 pci=noaer nvme_core.io_timeout=2 nvme.poll_queues=48 max_host_mem_size_mb=512 nvme.io_poll=1 nvme.io_poll_delay=1
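For what it’s worth, whether the polling parameters actually took effect can be checked after boot with something like (nvme0n1 being whichever namespace you care about):
root@3960x:~# cat /sys/module/nvme/parameters/poll_queues
root@3960x:~# cat /sys/block/nvme0n1/queue/io_poll
That said, as far as I understand it libaio never uses kernel polled completions anyway; polling only applies to requests submitted with RWF_HIPRI, e.g. fio’s pvsync2 engine with hipri=1. A rough sketch of that job variant (pvsync2 is synchronous, so iodepth doesn’t apply and you’d scale numjobs instead; numjobs=8 is just a placeholder):
[job1]
rw=randread
name=random-read-polled
ioengine=pvsync2
hipri=1
numjobs=8
filename=<same colon-separated device list as above>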
I didn’t have the timeout issues you’re seeing on EPYC; perhaps rcu_nocbs doesn’t do much for me here…
Is this the best we can do on ZFS NVMe RAID right now? In your final config on the EPYC setup, what fio results do you get for the md setup as well as for the best ZFS setup?
Very much appreciate this forum and enjoy the YouTube videos; keep up the great work!