Slow ZFS pool performance - where do I start?

Hi guys,
I’m new to the ZFS party, and I’m not really sure how to configure the thing to get full performance.

I've got a low-power box (AMD GX-424CC) with a RAID controller attached to it and a bunch of disks. There's 8GB of RAM inside. I didn't configure any cache devices, just the spinning rust.

I have a RAID 10 pool configured with 4 HDDs, and I can't get past 100MB/s write speed. Usually it's more like 60MB/s. I'm copying from 3 HDDs set up as a RAID 0 on the controller itself. How do I figure out where the bottleneck is?

I would appreciate a few pointers, thanks.

Here’s some fio results:

sudo fio --name=random_write --filename=/tank/test.tmp --filesize=100g --rw=randwrite --bs=4k --direct=1 --overwrite=0 --numjobs=10 --iodepth=32 --time_based=1 --runtime=30 --group_reporting
random_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=32
...
fio-3.19
Starting 10 processes
Jobs: 10 (f=10): [w(10)][0.0%][eta 12d:04h:17m:35s]
random_write: (groupid=0, jobs=10): err= 0: pid=384191: Mon Jan 24 22:06:26 2022
  write: IOPS=423, BW=1692KiB/s (1733kB/s)(56.3MiB/34073msec); 0 zone resets
    clat (usec): min=21, max=8953.2k, avg=23606.55, stdev=291378.84
     lat (usec): min=21, max=8953.2k, avg=23607.25, stdev=291378.84

This isn't a ZFS problem. HDDs aren't good at random writes or reads; that's why people love SSDs. The (atrocious) random write speed looks normal to me.

If you replace --rw=randwrite with --rw=write you should see values that are more in line with “advertised” speed values, meaning ~200-500MiB/s, depending on your drives.

edit: --bs=4k is yet another worst-case scenario for an HDD. Use a block size of 1024k if you want to compare your machine's performance to manufacturer spec sheets.
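For example, a sequential run could look something like this (a sketch based on your own command, with rw, bs and numjobs changed; a single job keeps the write actually sequential, and you can adjust path and size to taste):

sudo fio --name=seq_write --filename=/tank/test.tmp --filesize=100g --rw=write --bs=1024k --numjobs=1 --direct=1 --time_based=1 --runtime=30 --group_reporting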


If you wanted to test for a particularly bad drive, you could run

zpool iostat -yv 0.5

(ctrl+c to close)

and it'll give a report of how much each drive is reading/writing; one might be more (or less) busy than the others.

As Exard mentioned, test just a read or just a write at a time if you want to check performance.

Bear in mind that ZFS batches its writes: it builds up chunks of data in memory, then flushes them out to disk when a chunk reaches a certain size (or after a timeout), so you should see waves of more and less data being transferred.
It's the peaks you might want to check.

Running it with 3-second intervals should smooth that out.
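For example:

zpool iostat -yv tank 3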

There's also compression, which might speed a transfer up (or slow it down).

fio is basically useless as a metric for drive performance if caching or compression are influencing the results. I set up a special fio dataset on my pools for testing purposes, where I disabled all the interfering ZFS magic. Even things like atime, sync or special_small_blocks may turn fio output into gibberish. ZFS is great, but it's a bitch when it comes to setting up a benchmark.
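Something along these lines, for example (the dataset name is just an example, and sync=disabled / primarycache=none are only sane on a throwaway benchmark dataset):

zfs create -o compression=off -o atime=off -o sync=disabled -o primarycache=none tank/fio-test

Then point fio's --filename at /tank/fio-test/ and destroy the dataset when you're done.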


The use of "RAID-10" is confusing. Are you running hardware RAID with a ZFS pool on top? ZFS really just wants the disks passed through as JBOD, and you surrender the advantages of the filesystem if you use a hardware RAID solution. Please include the output of "zpool status -v" when you get a chance.

Setting the right record size for your workload can help improve performance a lot.
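You can check what a dataset is currently using with something like this (using the pool name from your fio command):

zfs get recordsize tank

(the default is 128K)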

RAID 10 meaning a pool of 2 mirrors with 2 disks each; the disks are set as JBOD on the controller. This is the zpool status output:

pool: tank
 state: ONLINE
config:

	NAME                                                 STATE     READ WRITE CKSUM
	tank                                                 ONLINE       0     0     0
	  mirror-0                                           ONLINE       0     0     0
	    scsi-SAMCC_9650SE-12M_DISK_SZ900269CED07C003F94  ONLINE       0     0     0
	    scsi-SAMCC_9650SE-12M_DISK_0B906634CED0A400B4D0  ONLINE       0     0     0
	  mirror-1                                           ONLINE       0     0     0
	    scsi-SAMCC_9650SE-12M_DISK_0B600341CED0B3005032  ONLINE       0     0     0
	    scsi-SAMCC_9650SE-12M_DISK_0B600342CED0C2003B5C  ONLINE       0     0     0

I have checked the iostat output, and the disks look roughly equal.
I have a Samba share on a VM whose disk is on the pool, and writes stabilize at ~40MB/s. That should be sequential, right?

About the record size - what would be optimal for a VM storage pool?
I’m going to try compression, maybe it’ll get better.
Thanks

That reminds me…

With FIO, you always want the blocksize you are writing to match, or be a multiple of, the recordsize of the ZFS dataset you are writing to. From my indirect understanding, FIO does everything in what is essentially one big file, which means the potential for massive read-modify-write amplification, just like a database. Rather than just writing out a stream of sequential 4K blocks like you would think, you end up writing out a huge file made of 128K records, and then every random 4K write reads one of those 128K records back, changes 4K of it, writes out 128K again, and repeats.

So if you are testing 4K writes with FIO, you’ll generally want a 4K recordsize on the dataset.
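For example (tank/fio-test is just a placeholder name, and a recordsize change only applies to files written after the change):

zfs set recordsize=4K tank/fio-test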

This is more important for read tests, but you also want the test to read/write double the amount of memory that ZFS is using for ARC. Otherwise your read test speeds can end up just being a test of your RAM speed. Creating a dataset with compression set to off or zle (so long as you don't write zeros, since zle compresses those) is another possible solution. Don't ever set compression to off outside of testing and debugging though; at the very least use zle on incompressible datasets, so the slack space in the last (usually partially filled) block of a file still gets compressed.
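To see how big the ARC currently is on Linux before sizing the test file, something like this should work (these are the standard OpenZFS-on-Linux kstat paths):

grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

(or just run arc_summary), then make the fio --filesize roughly double that.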

Then you get into the really esoteric shit like indirect block writes: random writes cause high write amplification · Issue #6584 · openzfs/zfs · GitHub

ahrens commented on Apr 26, 2018

@zviratko You are seeing high write amplification when doing random writes, as well as slower performance than expected. The write amplification is likely due to indirect blocks. E.g. with recordsize=4k, every random 4k write will need to write 4KB of user data, plus at least 2x 128K (before compression) level-1 indirect blocks. Depending on your file size, level-2 indirect block writes may not “amortize out” over the txg either. So 32x inflation would be expected. In addition to this, if the indirect blocks are not all cached, then you will need to read the indirect block before making the modification, which has a very adverse effect on write latency.

…I’m not gonna pretend to know what that means.

Also make sure atime=off on the dataset (or, even better, the whole pool). This stops the completely unnecessary metadata updates every time anything even looks at a file.
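For example:

zfs set atime=off tank

(child datasets inherit the setting unless they explicitly override it)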

Matt Ahrens, the pope of ZFS.
Whenever I tune into one of the more in-depth lectures on the OpenZFS YT channel, I usually feel the same.

But regarding the OP…

It's difficult to help pinpoint the source of your problem. There are too many variables: the RAID controller, some other ominous RAID 0 array that is on the same server but not ZFS (?), a Samba share (?) in an unknown VM configuration, and an unknown network speed.

Might as well be that the VM is limited to 1GBit/s bandwidth due to networking. That would also explain the ~100MiB/s max write speeds. Or your embedded AMD processor can't keep up with gzip-9 compression on your dataset. Or your hardware RAID controller is messing everything up. Or you are expecting too much from async random writes (and possible write amplification, as mentioned by @Log).
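(For the maths: 1GBit/s works out to 125MB/s on paper, and after TCP/SMB overhead a gigabit link usually tops out somewhere around 100-115MB/s, so a ceiling in that range almost always points at the network.)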

Sometimes it's as simple as the problem being the program that measures the speeds. If I just check my KDE file transfer window (not sure if that's a KDE Plasma, Dolphin or NFS problem/error) on my laptop, it always tells me about 50-70MiB/s transfer speed when copying a big file.
But I know for a fact that the file is copied at ~90MiB/s (which is what you can expect from an old 1GBit/s NIC writing to an NFS share on the network), backed up by htop, zpool iostat and my good old physical stopwatch + basic maths.

If you want to test things, you don’t start out complicated.

You eliminate points of failure, one at a time. If fio doesn't show anything wrong (tbd), I'd continue by eliminating possible problems/bottlenecks with the VM, starting with things like iperf, then testing Samba performance and VM disks after that.
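For iperf, something like this works (iperf3 needs to be installed on both ends; the hostname is a placeholder):

iperf3 -s (on the ZFS host)
iperf3 -c zfs-host -t 30 (from the VM or client)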

Thanks for the tips, I've investigated a little bit more. Iperf gives me ~945Mbps, so that is OK. I've disabled the compression on my dataset, and the CPU does not seem to be an issue. With fio and sequential writes directly on the host I now get around 180 MB/s; I guess that's all I'll get from old HDDs unless I add a ton of drives.

That leaves the VM or the Samba config as the issue. After enabling asynchronous writes and receivefile size, I got up to 60MB/s on the host system. Is hosting the share inside a VM even a good idea? I wanted it to be nicely contained inside a disk image, but maybe it's more trouble than it's worth.

I have my recordsize at the default, 128k. Is this appropriate for a VM storage pool? Also, is splitting the pool into many datasets and changing the recordsize per dataset a thing?

Thanks

For file sharing I found that 1M gives good performance, but for VM storage I think something like 64K might work better.

You can set different record sizes for different datasets, but a change only applies to newly written files, so the easiest thing to do is to create a new dataset and copy the old data onto it.

You may also want to check that atime is disabled on the dataset, as that can slow things down as well, especially for VM storage.
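A rough sketch of that (pool name taken from your zpool status; dataset names and paths are just placeholders):

zfs create -o recordsize=64K -o atime=off tank/vm-64k
rsync -a /tank/vms/ /tank/vm-64k/

Then point the VMs at the new dataset and destroy the old one once everything checks out.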

VM disks are stored as a single file on the host system (e.g. qcow2 or raw) and presented to the guest as virtual, but fully functioning, devices. This usually gives the worst performance, but it's also the easiest (and sometimes the only) option. My VM disks are mostly boot drives, and the VMs in need of storage get an iSCSI or NFS share from my NAS.

Hypervisors also let you add a whole drive to a VM, which bypasses the troublesome VM disk file situation. But that drive isn't available to the host anymore.

The tried and trusted option, however, is to pass through the entire controller to the VM. This is usually your RAID controller/HBA or the on-board SATA controller. The guest VM gets full control with minimal virtualization overhead, because the storage isn't virtual disks or virtual SCSI controllers but actual, real hardware.

I'm personally running a TrueNAS VM, with no performance problems whatsoever (had it running on bare metal before, didn't notice any major difference). Shares and storage in a fileserver/NAS VM are totally viable, but performance was atrocious until I passed my SSDs and SATA controllers through to the VM. NIC passthrough helped my 10GBit network throughput quite a bit too.