Poor ZFS Write Performance

I am in the process of copying data from an ext4-formatted HDD into a ZFS VDEV, and the write performance seems very bad - 2MB/s at best. The VDEV consists of three Seagate Exos 7E8 4TB 7200RPM HDDs in a mirror configuration, along with an 8GB ZIL SLOG hosted on mirrored NVMe SSDs.

At pool creation I set:

  • compression=lz4
  • sync=always
  • ashift=12
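For reference, the pool was created along these lines (the device paths below are placeholders rather than my actual disk IDs):

$ sudo zpool create -o ashift=12 -O compression=lz4 -O sync=always slowpool \
      mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
      log mirror /dev/disk/by-id/nvme-SSD1-part4 /dev/disk/by-id/nvme-SSD2-part4 /dev/disk/by-id/nvme-SSD3-part4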

The ashift value was selected based on the information given in the Arch Wiki ZFS entry dealing with advanced format disks.

I believe the ashift value is the most likely cause of the issue, but honestly I am not sure. The HDDs purport to be 512e/4Kn, so I assumed they were 4K drives, but the datasheet states they are in fact 512 bytes per sector. According to the Arch Wiki ZFS entry cited above, this should not be an issue, but still…
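For what it's worth, the sector sizes the drives actually report can be checked with something like the following, where /dev/sdX stands in for one of the Exos drives:

$ lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
$ sudo smartctl -i /dev/sdX | grep -i sector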

How do I go about determining what the issue is?

EDIT

The machine hosting the pool is a 2950X with 64GB of memory, just in case it’s relevant.

What’s the performance like if you blow away the ZIL and your pool topology is just the mirrored drives?
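Removing the log vdev should be non-destructive; assuming it shows up as its own mirror vdev under “logs” in zpool status, it would be something like this, with yourpool and mirror-1 replaced by the actual pool and log vdev names:

$ sudo zpool remove yourpool mirror-1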

Also, what kind of data are you transferring? Lots of little files or large single files?

1 Like

I currently have data in flight, so I will wait until that copy is completed before removing the ZIL and testing. This will take several hours. From my initial testing prior to adding the ZIL, I seem to recall that the performance was even worse than it is currently, which is why I added the ZIL.

I have previously been copying FLAC files (about 20MB each), and am currently copying DVD ISOs (around 5GB each).

EDIT

Further reading makes me think it might be the record size, along with the compression, that is causing the bottleneck. The problem is that, according to the OpenZFS documentation, changing the record size only takes effect for new files.

I created a few test data sets and set various properties to see what happened. Here are the results:

  • Setting recordsize=1M results in an initial improvement in write speed to the pool, but after half an hour or so it drops back to ~2MB/s
  • Setting sync=standard results in an order of magnitude increase in apparent write speed to the pool.
  • Irrespective of recordsize or sync values, reading from the pool is very fast.

As a result of this I am going to destroy the pool and recreate it, only setting compression=lz4 as the default for the entire pool. Record size and sync will then be tuned per dataset.
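The per-dataset tuning will look something along these lines (the dataset name is just an example):

$ sudo zfs create -o recordsize=1M -o sync=standard slowpool/isos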

Can you post the pool topology (from zpool status)? Three disks mirrored doesn’t make a lot of sense to me. Are you doing a 3-way mirror or 3 mirrored pairs?

Don’t define ashift; ZFS automatically determines this by reading the drives.

1 Like

As requested:

$ sudo zpool status
  pool: slowpool
 state: ONLINE
  scan: none requested
config:

    NAME                                                         STATE     READ WRITE CKSUM
    slowpool                                                     ONLINE       0     0     0
      mirror-0                                                   ONLINE       0     0     0
        ata-ST4000NM002A-2HZ101_XXXXXXXX                         ONLINE       0     0     0
        ata-ST4000NM002A-2HZ101_XXXXXXXX                         ONLINE       0     0     0
        ata-ST4000NM002A-2HZ101_XXXXXXXX                         ONLINE       0     0     0
    logs	
      mirror-1                                                   ONLINE       0     0     0
        nvme-Samsung_SSD_970_EVO_Plus_1TB_XXXXXXXXXXXXXXX-part4  ONLINE       0     0     0
        nvme-Samsung_SSD_970_EVO_Plus_1TB_XXXXXXXXXXXXXXX-part4  ONLINE       0     0     0
        nvme-Samsung_SSD_970_EVO_Plus_1TB_XXXXXXXXXXXXXXX-part4  ONLINE       0     0     0

errors: No known data errors

Having destroyed and rebuilt the pool, I am now consistently getting 100MB/s transfer rates to a dataset with a 1M record size and sync set to standard.

1 Like

That’s basically in line with what I’d be expecting.

I presume you intended to have a 3-way mirror? That’s a lot of redundancy.

1 Like

That’s good to know - thanks Sarge. With regard to the three-way mirror, that is indeed what I was intending. I don’t need a lot of storage space as I don’t have a lot of data, but I want to ensure the data I do have is protected. Hence the three-way mirror, the 3-2-1 backup scheme, etc.

2 Likes

Wow, you really need that data.

Good news: if anyone’s data is going to survive, it’s yours.


To be clear, I’m not 100% sure, but I highly suspect your issue is that the ashift value was improperly configured. You should be able to do zpool get ashift slowpool and see if it’s changed.

I have just had a check, and the result is:

$ sudo zpool get ashift slowpool
NAME      PROPERTY  VALUE   SOURCE
slowpool  ashift    12      local

As you can see, the ashift value was set by me rather than by ZFS, as per your previous suggestion (I had recreated the pool prior to reading your post). My money is on the sync setting being the source of the problem, but now that I am getting okay performance from the pool, I am reluctant to destroy and recreate it just to test…

Well, if you’re getting 100MB/s, I think that’s about all you’ll get out of that configuration.

I wouldn’t recommend rebuilding it again.

1 Like

Drives lie, and so ZFS will faithfully set the ashift incorrectly to 9 instead of the much more sane 12. This has caused, and continues to cause, many people problems that require them to completely destroy the pool to fix. ZFS does have a list of known lying drives, but it’s very incomplete and will never stop being so.

For most use cases, setting ashift too big is a tiny performance problem that might show up in a benchmark. Setting it too small is a big performance problem that can be felt. Set HDDs to 12 unless you know you need something else. SSDs are usually better at 13, but also usually do fine at 12 because they are optimized for that. There can be issues removing things from the pool if there are different ashifts (like the pool set to 12 and a special vdev set to 13), so the current advice is to keep it all the same.
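If you want to confirm what ashift each vdev actually ended up with, something like this should print it per vdev:

$ sudo zdb -C slowpool | grep ashift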

His real issue is sync=always. This means ZFS will immediately commit every write, which takes time to complete. Non-sync writes are cached together and committed to disk every 5 seconds by default. Normally an SLOG is used to vomit the writes to for safe keeping, allowing ZFS to cache the writes in RAM like normal for performance reasons (though frankly most SSDs are going to have disappointing performance). It looks like for some reason his SLOG is managing to choke on writes somehow. While ZFS does currently have ongoing performance issues with NVMe drives and needs a big refactoring, I’m positive it’s something else. Frankly that “-part4” on the NVMe seems suspicious; it looks like he didn’t use the entire disk.
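One way to see whether the SLOG is actually absorbing the sync writes is to watch per-vdev activity while a copy is running, for example:

$ sudo zpool iostat -v slowpool 5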

@RBE
Consider posting here; there are more experienced people there who work with ZFS professionally and may be able to see what’s going on.

You are moving large files, so an ashift of 12 is NOT your issue.

Try using only the base NVMe device, not the “-part” partitions.

Also, are you using Samba to move files? If so, that’s another wrench thrown in that can have all manner of bizarre interactions, and you may want to test having Samba do the sync writes instead of ZFS.
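If Samba is in the path, the per-share knobs for this would be something like the following in smb.conf (the share name and path are just an illustration):

  [media]
      path = /slowpool/media
      # honour client sync requests
      strict sync = yes
      # force every write to be synchronous on the Samba side
      sync always = yes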

1 Like

Might be worth testing with dd from /dev/urandom (or /dev/zero if compression=off), or using FIO or some other benchmark, to remove the source drive as a bottleneck?
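For example, something along these lines (the target path is just illustrative):

$ dd if=/dev/urandom of=/slowpool/test/ddtest.bin bs=1M count=8192 status=progress

Note that /dev/urandom itself can bottleneck at a few hundred MB/s, so /dev/zero into a compression=off dataset is the cleaner test for a fast pool.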

1 Like

Yes and no. This used to be more of an issue, not so much today. Additionally, I know the Exos 7E8 to be faithful.

Ya know, I knew this, but that’s what I get for trying to provide support while drinking lol


All in all, excellent overview.

Or just @ freqlabs.

disclaimer: summoning freqlabs can and will set things in motion that may have dire consequences for all non-BSD users.

3 Likes

Thanks for the clarification. It appears I misunderstood how sync=always actually works. For some reason I thought sync transactions were still bundled into a TXG and committed every five seconds.

Using the base of the NVMe SSDs for my SLOG rather than a partition is not possible, because the SSDs also contain other partitions for ESP, boot, and root. I am aware that ZFS prefers entire drives, but in this case it’s just not possible.

Oops, turns out I was wrong. I had previously noticed that sync writes caused immediate writing to the hard drives, but I was mistaken as to what exactly was being written.

Turns out it does bundle sync writes into TXGs exactly the same as async writes, and also immediately starts making a copy to the ZIL area of the disk so it can recover from a crash.

1 Like

From the article linked by @Log:

The other key difference is that any asynchronous write operation returns immediately; but sync() calls don’t return until they’ve been committed to disk in the ZIL.

What’s interesting there is that it says “committed to disk in the ZIL”, rather than “committed to disk in the pool”. Given that my ZIL SLOG consists of Samsung 970 EVO Plus NVMe SSDs that have a write speed of ~3GB/s*, I should have experienced very little in the way of a performance hit when copying files with sync set to always.

The way I understand it, the HDD that I was copying data from should have been the performance bottleneck. Let’s say I copy a 5GB ISO image into the pool, and the source HDD is capable of a sustained read performance of 200MB/s. With my 8GB ZIL SLOG, it should take around 25 seconds to copy all the data to the ZIL SLOG.

This was most definitely not the case. As stated in the OP, I was seeing transfer rates of more like ~2MB/s. Note that this transfer was done using the GNOME file manager, and that the source HDD is physically located in the same machine as the zpool.

I will create a test dataset tonight after work and see if I can replicate the issue.

*Assuming sequential writes rather than random.

Also, try tests using the CLI to eliminate issues with the GUI.

Generally people use FIO. dd is single-threaded and thus cannot emulate many use cases.

  • Step 1 Figure out how read/writes are actually being performed by your intended workload. Not what you think is happening, but what is actually happening. Writes to ZFS are in practice somewhat “random” (not sequential), because it’s usually doing a variety of other stuff like changing and reading metadata.

  • Step 2 Figure out how to emulate that in FIO (see the sketch after this list)

  • Step 3 Only then start optimizing ZFS. It’s really easy to get something incorrect and hurt your performance because you misunderstood or were ignorant of something. It’s also a good idea to observe how the disks themselves are behaving with ZFS’s built-in iostat. Unfortunately that also has some pitfalls that I’ve forgotten, but I am pretty sure they involve very misleading column names, because for some reason performance monitoring tools in Unix-like systems are designed to be as arcane as possible.
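As a rough sketch of step 2, emulating a large sequential file copy into a 1M-recordsize dataset might look something like this (the /slowpool/test path and the size are just placeholders):

$ fio --name=seqwrite --directory=/slowpool/test --rw=write \
      --bs=1M --size=16G --numjobs=1 --ioengine=psync --end_fsync=1

Adding --sync=1 forces O_SYNC writes, which roughly mimics what sync=always does to every write.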

One big pitfall is that you want your FIO write blocksize to be equal to or an even multiple of your dataset blocksize. If you write 64K blocks with FIO into a 128K ZFS dataset, you’ll get artificial read/write amplification because FIO reads and writes into a single file and ZFS is a COW filesystem. What happens when FIO writes a file that requires multiple blocks is:

  • FIO writes out a 64K block, which ZFS will store in a 128K block, because all the blocks in a multi-block file must be the same size.
  • Then, when FIO tries to write out the next 64K block, ZFS must read the half-empty 128K block before it can modify/add the next available 64K.
  • Finally it writes out the whole 128K block.

If you had FIO match the dataset block and write out 128K in the first place, it would have skipped the unneeded read and a write. If FIO were to write a 256K block, then ZFS would split that perfectly into two 128K blocks. If FIO wrote out a 255K block, then holy hell, we are now dealing with partial blocks again. Whether or how much of this is smoothed over when batching into TXGs, I have no idea. I imagine that sync writes would be much worse off than async, as there’s already extra activity to the ZIL area.

The following links may be helpful. I generally don’t try to mess with this stuff so this is about all I can comment on.

Also be mindful of hardware limits (your HBA can only do so much before it hits a wall) and ZFS’s ARC cache. To deal with the ARC, your test needs to be about 2x larger than your RAM, or consider making a test dataset and setting primarycache to metadata, and be mindful of read amplification for the same reasons mentioned above.
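For instance, a throwaway benchmarking dataset could be set up with something like this (the dataset name is just an example):

$ sudo zfs create -o primarycache=metadata -o recordsize=1M slowpool/fio-test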

2 Likes