What's wrong with my ZFS pool?

So I have a Threadripper Pro 3955WX running on an ASUS Pro WS WRX80E-SAGE SE WiFi. I have two Optane NVMe drives on the PCIe riser card that was included with the motherboard operating as a mirrored SLOG, and a 512GB Samsung 980 Pro in the motherboard operating as an L2ARC. (The riser is configured with proper bifurcation and both drives are seen independently.) The main storage of the pool is four 7200RPM 8TB HDDs running as striped mirrors (two mirror vdevs, RAID10-style).

This board supports insane amounts of bandwidth and I should have no bottlenecks whatsoever with the configuration I am running, but my ZFS performance is extremely slow. When I benchmark the pool I get about 11MB/s. I should be seeing higher speeds than that on a single HDD, let alone striped mirrors with all this cache. Does anyone have any theories about what the issue might be? Ashift is the default and compression is lz4; I don’t have dedup or any other esoteric settings configured. When I try to copy files from one zvol to another inside a Windows KVM, it just stops writing entirely after a few GB. What on earth is the problem here? Here are some outputs so you can get a better idea of my configuration.

Pathetic FIO results:
Run status group 0 (all jobs):
READ: bw=10.7MiB/s (11.3MB/s), 10.7MiB/s-10.7MiB/s (11.3MB/s-11.3MB/s), io=644MiB (675MB), run=60002-60002msec
WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=646MiB (677MB), run=60002-60002msec

zpool status
pool: mainpool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using ‘zpool upgrade’. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 10:26:31 with 0 errors on Wed Jun 14 09:46:55 2023
config:

NAME                                                                                                 STATE     READ WRITE CKSUM
mainpool                                                                                             ONLINE       0     0     0
  mirror-0                                                                                           ONLINE       0     0     0
    ata-ST8000VN004-2M2101_WSD2ZSL0                                                                  ONLINE       0     0     0
    ata-ST8000VN004-2M2101_WSD4A6V9                                                                  ONLINE       0     0     0
  mirror-1                                                                                           ONLINE       0     0     0
    ata-ST8000VN004-2M2101_WSD2YXXP                                                                  ONLINE       0     0     0
    ata-ST8000VN004-2M2101_WSD47QSX                                                                  ONLINE       0     0     0
logs	
  mirror-2                                                                                           ONLINE       0     0     0
    nvme-nvme.8086-504842543731373230344d4430313644-494e54454c204d454d50454b31573031364741-00000001  ONLINE       0     0     0
    nvme-nvme.8086-50484254373231353035544230313644-494e54454c204d454d50454b31573031364741-00000001  ONLINE       0     0     0
cache
  nvme4n1                                                                                            ONLINE       0     0     0

errors: No known data errors

zdb -C
mainpool:
version: 5000
name: ‘mainpool’
state: 0
txg: 632285
pool_guid: 16313834611246847220
errata: 0
hostid: 279845615
hostname: ‘nixos’
com.delphix:has_per_vdev_zaps
vdev_children: 3
vdev_tree:
type: ‘root’
id: 0
guid: 16313834611246847220
create_txg: 4
children[0]:
type: ‘mirror’
id: 0
guid: 6134229592076744602
metaslab_array: 145
metaslab_shift: 34
ashift: 12
asize: 8001548713984
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 129
children[0]:
type: ‘disk’
id: 0
guid: 7998283492972965798
path: ‘/dev/disk/by-id/ata-ST8000VN004-2M2101_WSD2ZSL0-part1’
devid: ‘ata-ST8000VN004-2M2101_WSD2ZSL0-part1’
phys_path: ‘pci-0000:2d:00.0-ata-1’
whole_disk: 1
DTL: 1287
create_txg: 4
com.delphix:vdev_zap_leaf: 130
children[1]:
type: ‘disk’
id: 1
guid: 6234571946024815914
path: ‘/dev/disk/by-id/ata-ST8000VN004-2M2101_WSD4A6V9-part1’
devid: ‘ata-ST8000VN004-2M2101_WSD4A6V9-part1’
phys_path: ‘pci-0000:2d:00.0-ata-2’
whole_disk: 1
DTL: 1286
create_txg: 4
com.delphix:vdev_zap_leaf: 131
children[1]:
type: ‘mirror’
id: 1
guid: 7588941257002706402
metaslab_array: 135
metaslab_shift: 34
ashift: 12
asize: 8001548713984
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 132
children[0]:
type: ‘disk’
id: 0
guid: 14078437244791113390
path: ‘/dev/disk/by-id/ata-ST8000VN004-2M2101_WSD2YXXP-part1’
devid: ‘ata-ST8000VN004-2M2101_WSD2YXXP-part1’
phys_path: ‘pci-0000:2d:00.0-ata-5’
whole_disk: 1
DTL: 1285
create_txg: 4
com.delphix:vdev_zap_leaf: 133
children[1]:
type: ‘disk’
id: 1
guid: 5891649818398663305
path: ‘/dev/disk/by-id/ata-ST8000VN004-2M2101_WSD47QSX-part1’
devid: ‘ata-ST8000VN004-2M2101_WSD47QSX-part1’
phys_path: ‘pci-0000:2d:00.0-ata-6’
whole_disk: 1
DTL: 1284
create_txg: 4
com.delphix:vdev_zap_leaf: 134
children[2]:
type: ‘mirror’
id: 2
guid: 14460296762837717987
metaslab_array: 1002
metaslab_shift: 29
ashift: 9
asize: 14388035584
is_log: 1
create_txg: 287884
com.delphix:vdev_zap_top: 999
children[0]:
type: ‘disk’
id: 0
guid: 6533743922685917232
path: ‘/dev/disk/by-id/nvme-nvme.8086-504842543731373230344d4430313644-494e54454c204d454d50454b31573031364741-00000001-part1’
whole_disk: 1
DTL: 970
create_txg: 287884
com.delphix:vdev_zap_leaf: 1000
children[1]:
type: ‘disk’
id: 1
guid: 13115791165044171766
path: ‘/dev/disk/by-id/nvme-nvme.8086-50484254373231353035544230313644-494e54454c204d454d50454b31573031364741-00000001-part1’
whole_disk: 1
DTL: 969
create_txg: 287884
com.delphix:vdev_zap_leaf: 1001
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data

1 Like

Are those HDDs CMR or SMR drives? If they are SMR drives, that is where I would suspect the bottleneck.

Edit: I quacked the serial numbers and those should be CMR drives, so that’s not it. An ashift of 12 should also be correct here, so two of the usual suspects for performance degradation are ruled out. To be honest I suspect a hardware issue, or something wrong in the UEFI. You can do things wrong with ZFS and degrade performance, but if you did not touch settings you did not understand, it should be performing much better than this.

Edit: Do you have data on the pool? If not, tear it down and test the devices on their own to see whether there is any bottleneck or whether they all work as intended individually. That would give a better indication of whether this is a hardware issue or a ZFS issue: in theory, if the devices perform and ZFS is not set up in a weird way, the pool should perform close to nominal. This would give us an idea where to look!

2 Likes

What’s the exact fio command you ran to get your results?

2 Likes

fio --name=benchmark --directory=/mainpool --direct=1 --rw=randrw --bs=4k --size=4G --numjobs=4 --time_based --runtime=60 --group_reporting --ioengine=libaio --iodepth=32 --randrepeat=0

The raw NVMe gets 850MB/s as it should. The zpool gets 11MB/s.

You asked fio to read and write random 4k blocks as fast as possible, with 4 jobs at queue depth 32, against a pool whose two vdevs contain a total of 4 HDDs.
What kind of performance did you expect?
Look at the spec sheets of your devices. You’ll find that your HDDs will do ~100 IOPS (input/output operations per second) each, while your NVMe will do up to 1M IOPS. Do the math.
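
Roughly (my assumptions: ~100 IOPS per 7200 rpm drive, and about one drive’s worth of random IOPS per mirror vdev): 2 vdevs × ~100 IOPS × 4 KiB ≈ 0.8 MiB/s of truly random 4k throughput. ZFS’s write aggregation and the ARC are what lift the observed number into the low tens of MB/s.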

Running the same command on a pool with 4 raidz1 vdevs of 4 HDDs each (16 HDDs total, 12 data / 4 parity), plus NVMe special, log, and cache devices, yields

Run status group 0 (all jobs):
   READ: bw=1254KiB/s (1284kB/s), 1254KiB/s-1254KiB/s (1284kB/s-1284kB/s), io=73.5MiB (77.1MB), run=60014-60014msec
  WRITE: bw=1241KiB/s (1271kB/s), 1241KiB/s-1241KiB/s (1271kB/s-1271kB/s), io=72.7MiB (76.3MB), run=60014-60014msec

Running it on a single Samsung 980 Pro:

Run status group 0 (all jobs):
   READ: bw=1003MiB/s (1051MB/s), 1003MiB/s-1003MiB/s (1051MB/s-1051MB/s), io=58.7GiB (63.1GB), run=60001-60001msec
  WRITE: bw=1003MiB/s (1051MB/s), 1003MiB/s-1003MiB/s (1051MB/s-1051MB/s), io=58.7GiB (63.1GB), run=60001-60001msec

So, I think your results are in line with what could be expected with the simulated test load.

The simulated workload is extremely heavy and artificial. Even RDBMS workloads typically read/write larger blocks than 4k.

If you increase the value in the bs parameter you’ll find you get faster results from your hard drives.
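
For example, the same command with just the block size bumped to 1M (paths and sizes copied from your original run, purely for illustration) should look much closer to what the hardware can actually do:

fio --name=benchmark --directory=/mainpool --direct=1 --rw=randrw --bs=1M --size=4G --numjobs=4 --time_based --runtime=60 --group_reporting --ioengine=libaio --iodepth=32 --randrepeat=0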

4 Likes

Additionally there’s probably a ton of artificial read-modify-write amplification, because you’re likely using the default dataset recordsize of 128K, which your benchmark is furiously trying to modify only 4K of.

Generally you want to match the recordsize of the dataset to the blocksize fio is trying to write, assuming of course that the blocksize actually represents the workload you are trying to simulate.

That said, setting a dataset recordsize to 4K is not advisable because the ZFS operational overhead becomes silly. I personally wouldn’t go below 16K, ever. If you have a database workload, you match the recordsize to its cluster size, or whatever name it uses for its block size.

In real life, for most non-database type use cases, you’ll likely want the recordsize at default or larger, regardless of what a random benchmark tells you.
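
For illustration (the dataset names below are made up, and note that recordsize only applies to blocks written after you set it):

zfs get recordsize mainpool
zfs create -o recordsize=16K mainpool/db
zfs set recordsize=1M mainpool/media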

Yeah HDDs suck hard at 4k random I/O. That’s why we invented SSDs in the first place.

I’d probably have a stroke if I had to boot a zvol from that pool :)

105k IOPS. Not bad for some rusty discounted Toshiba drives.

Nahh, just kidding… I’ve got too much tuning going on to provide reliable data. This was mostly the ARC at work; the HDDs were still working 20 seconds after fio was done ;)

11MB/s looks normal for HDDs on 4k random I/O. It has been a long time since I benchmarked the impact of QD=32 on HDDs without ZFS interfering.

But it could be normal. You don’t get HDDs if you want to do 4k stuff. If it’s 11MB/s on sequential workloads… then there is indeed a problem. A test doing just reads OR writes at 1M block size should clarify this.

I’m not sure why it’s THAT slow for both of you despite everything being async.

1 Like

OK, I think you’re right; I just misunderstood the nature of the particular benchmark I had copy-pasted from ChatGPT. I have run some more tests and am getting results closer to what’s expected, but the random 4k write test at the end still seems low to me at 44MB/s. Let me know what you think. Thank you all for the feedback!

SEQUENTIAL WRITE TEST:

COMMAND: fio --name=write_test --ioengine=libaio --iodepth=4 --rw=write --bs=1M --direct=1 --size=2G --numjobs=1 --runtime=60 --group_reporting --directory=/mainpool

write_test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
fio-3.33
Starting 1 process
write_test: Laying out IO file (1 file / 2048MiB)

write_test: (groupid=0, jobs=1): err= 0: pid=255037: Wed Nov 15 12:34:28 2023
  write: IOPS=4320, BW=4321MiB/s (4531MB/s)(2048MiB/474msec); 0 zone resets
    slat (usec): min=131, max=885, avg=230.08, stdev=49.20
    clat (nsec): min=1533, max=1460.9k, avg=693512.21, stdev=126081.62
     lat (usec): min=241, max=1895, avg=923.59, stdev=162.30
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  537], 10.00th=[  570], 20.00th=[  594],
     | 30.00th=[  611], 40.00th=[  652], 50.00th=[  693], 60.00th=[  717],
     | 70.00th=[  734], 80.00th=[  758], 90.00th=[  840], 95.00th=[  938],
     | 99.00th=[ 1106], 99.50th=[ 1188], 99.90th=[ 1369], 99.95th=[ 1450],
     | 99.99th=[ 1467]
  lat (usec)   : 2=0.05%, 250=0.05%, 500=0.83%, 750=75.98%, 1000=19.87%
  lat (msec)   : 2=3.22%
  cpu          : usr=10.99%, sys=65.96%, ctx=1801, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2048,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=4321MiB/s (4531MB/s), 4321MiB/s-4321MiB/s (4531MB/s-4531MB/s), io=2048MiB (2147MB), run=474-474msec

RANDOM WRITE TEST:

COMMAND: fio --name=randwrite_test --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting --directory=/mainpool

randwrite_test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.33
Starting 4 processes
randwrite_test: Laying out IO file (1 file / 1024MiB)
randwrite_test: Laying out IO file (1 file / 1024MiB)
randwrite_test: Laying out IO file (1 file / 1024MiB)
randwrite_test: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=4): [w(4)][100.0%][w=21.4MiB/s][w=5476 IOPS][eta 00m:00s]
randwrite_test: (groupid=0, jobs=4): err= 0: pid=271996: Wed Nov 15 12:37:43 2023
  write: IOPS=10.7k, BW=41.9MiB/s (44.0MB/s)(2517MiB/60001msec); 0 zone resets
    slat (usec): min=3, max=246620, avg=371.30, stdev=467.37
    clat (nsec): min=1032, max=252789k, avg=5587951.72, stdev=3752283.04
     lat (usec): min=131, max=253212, avg=5959.25, stdev=3979.98
    clat percentiles (usec):
     |  1.00th=[  392],  5.00th=[ 1598], 10.00th=[ 2089], 20.00th=[ 2933],
     | 30.00th=[ 3621], 40.00th=[ 4228], 50.00th=[ 4817], 60.00th=[ 5604],
     | 70.00th=[ 6456], 80.00th=[ 7635], 90.00th=[10290], 95.00th=[12780],
     | 99.00th=[15664], 99.50th=[17957], 99.90th=[25822], 99.95th=[27657],
     | 99.99th=[53216]
   bw (  KiB/s): min=11984, max=292584, per=100.00%, avg=43098.96, stdev=7770.00, samples=476
   iops        : min= 2996, max=73145, avg=10774.73, stdev=1942.48, samples=476
  lat (usec)   : 2=0.01%, 250=0.08%, 500=2.54%, 750=1.25%, 1000=0.23%
  lat (msec)   : 2=4.90%, 4=26.85%, 10=53.36%, 20=10.49%, 50=0.29%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=0.40%, sys=5.42%, ctx=614485, majf=0, minf=47
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,644308,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=41.9MiB/s (44.0MB/s), 41.9MiB/s-41.9MiB/s (44.0MB/s-44.0MB/s), io=2517MiB (2639MB), run=60001-60001msec

You’re not testing the disks, you are benchmarking the ARC (and probably the SLOG in the second run; I’m not sure about your dataset properties, but the latencies look more NVMe-ish).

The 4.5GB/s sequential result in particular gives it away, along with your dataset settings. Use a fio file size that is a multiple of your memory/ARC: for 64GB of RAM, use a 128GB fio file.
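
For example, on a box with 64GB of RAM, something like this (file size and runtime are just illustrative placeholders) takes the ARC mostly out of the picture:

fio --name=randwrite_big --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --size=128G --numjobs=1 --runtime=300 --time_based --group_reporting --directory=/mainpool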

Where is the problem exactly? All fio tests show expected values…

That’s what HDDs do and is normal.

Your SLOG won’t make the HDDs any better. HDDs aren’t good at random I/O. If you want to do random writes, get some SSDs. Or dig into the rabbit hole of tuning parameters to streamline async writes.
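
If you do go down that rabbit hole, the usual knobs are OpenZFS module parameters such as zfs_txg_timeout and zfs_dirty_data_max (the values below are placeholders, not recommendations):

cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout             # seconds between forced txg commits (placeholder)
echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max  # allow up to 8 GiB of dirty data (placeholder)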

1 Like

HDDs always involve trade-offs, though. Improving random writes on small files will make other things even slower. There are a lot of tricks to make thing X better, at the cost of thing Y.

1 Like

A low recordsize means an influx of metadata and sacrifices throughput… or you sacrifice some RAM to absorb bursty workloads, which doesn’t help with sustained 4k writes.

Big companies bought hundreds of HDDs to get some juice out of them… 15k rpm SAS drives were a good deal. Only to later say “fuck this, I’m investing in SSDs big time.” ;)

HDDs are great for large files and mostly-cold data… if you ever used an HDD as swap, you know how bad the latencies/seek times are.

1 Like

stares at a shelf full of 600GB 15k SAS drives that will NEVER be used again

yeah, no comment.

1 Like

I assume you are using 7200 rpm HDDs. Those have an access time of 8ms to 12ms; let’s assume yours are 8ms. This means each vdev can perform 1000ms / (8ms/transaction) = 125 transactions per second. If you are performing 4KB transactions, each vdev can process 4KB/transaction × 125 transactions/s = 500KB/s = 0.5 megabytes per second. With a 2-vdev pool you get about 1MB/s of raw random throughput, which is the same order of magnitude as your test results; ZFS’s write aggregation and caching account for the rest.

Fortunately, the layers in front of the disks (ZFS’s transaction groups and write aggregation) do a lot of work to make the transactions less random, so in real-world use your performance is much better than that. However, if you have a use case that requires many 4k random transactions, build an SSD-backed pool instead of an HDD-backed pool.

It looks like you are using 3 SSDs to accelerate this HDD-backed pool. You could probably reduce that to one, and build a separate mirrored SSD pool with the other two for your random-write tasks.

You can allocate 16GB of it to your SLOG and the remainder to your L2ARC. Having the SLOG or L2ARC device die will impact performance but will not cause data loss (the one exception is losing in-flight sync writes if the SLOG dies at the same moment the system crashes), so there is no real need to mirror it. Also, if you find a bottleneck in the read speed of your array, instead of getting fancy with better L2ARC caching, just give more RAM to your storage server. RAM is usually over 100 times faster than the fastest SSD, and increasing the RAM allocation of a file server from 8GB to 32GB makes a huge difference.
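
As a rough sketch (the device names below are placeholders, and you would remove the existing log and cache devices first), splitting a single SSD between SLOG and L2ARC could look like this:

zpool remove mainpool mirror-2                # drop the current SLOG mirror
zpool remove mainpool nvme4n1                 # drop the current L2ARC device
sgdisk -n 1:0:+16G -n 2:0:0 /dev/disk/by-id/nvme-Samsung_980_Pro_XXXX   # placeholder device name
zpool add mainpool log /dev/disk/by-id/nvme-Samsung_980_Pro_XXXX-part1
zpool add mainpool cache /dev/disk/by-id/nvme-Samsung_980_Pro_XXXX-part2
zpool create optanepool mirror /dev/disk/by-id/nvme-optane-A /dev/disk/by-id/nvme-optane-B   # placeholder names for the freed Optane drives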

5 Likes

In comparison, a slow SSD will have an access time of around 0.3ms, and a fast SSD around 0.015ms.

Because of this they can have much faster response times for transactions.

SSDs can also have a much higher maximum throughput for each drive.

HDDs can also deliver a very high number of transactions per second and very high throughput; you just need a lot of them. In production environments, 2PB-6PB per ZFS server is fairly common with 8TB disks. With that many disks, the HDD array often still provides sufficient performance.

ZFS stores a checksum for every data block (in the block pointer that references it), so on read it can validate the data without needing to read anything from a different disk. It only reads from another mirror member when a block fails its checksum. Because of this, write speed on mirrors is proportional to the number of vdevs, while read speed is proportional to the number of drives.

In large pools, the hard drives are often arranged as mirror vdevs, with each member of a vdev on a different SAS disk shelf. That way an entire disk shelf can go offline and every vdev still has a surviving member: the affected vdevs run degraded, but as long as no vdev loses all of its members the pool stays online, and there should be very little difference in performance.
