Unusable performance in photo workflow with ZFS NVMe Mirror

I am setting up a TrueNAS SCALE server that I want to use as storage for my RAW photo files while I process/edit them, instead of keeping them locally on my laptop or desktop. So I got 1 TB of RAM and four 2 TB NVMe drives in M.2-to-PCIe adapters, set up as a pair of mirrored vdevs. I also put a 10 gig card in my desktop and in the server and direct-attached the two for increased bandwidth.
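For reference, the layout is two mirrored vdevs striped into one pool, i.e. roughly this (the device names below are placeholders, not my actual ones):

# placeholder device names; two 2-way mirrors striped into one pool
zpool create tank mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
# confirm both mirror vdevs show up and are ONLINE
zpool status tank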

Even with these upgrades, I’m getting terrible performance at the first step in my photo workflow. I use software called Photo Mechanic for importing, applying metadata to, and culling my RAW photos. Import speeds were pretty slow, but I didn’t measure them. Where I really noticed the poor performance was opening a folder in Photo Mechanic to start sorting through the photos.

The test I ran was the time from opening the same photo folder until I can start working (clearing the local app cache each time), over an SMB share on TrueNAS. The folder has 2,750 RAW files (around 25 MB each), each with a corresponding XMP sidecar file (around 2 KB).

Results:
Server to Windows Desktop 10 Gig - 3:50 (3 min, 50 seconds)
Server to Windows Laptop 1 Gig - 3:40
Windows Laptop to Windows Desktop 1 Gig - 0:38
Local NVMe on Windows Desktop - 0:22

I also did Server to Windows Desktop at 1 Gig; I don’t have the exact time, but it was somewhere in the 3:40-3:50 range.

When I run iperf from the desktop to the server, I get the expected bandwidth, and if I copy a folder from the server to the desktop, it copies at several gigabits, so it doesn’t seem to be a network limitation. I did a CrystalDiskMark test from Windows against the SMB share and got 1200 MB/s read and 1100 MB/s write. And as the PM test above shows, if I go Windows to Windows over the network, the folder opens at an acceptable speed, just slightly slower than a local drive.
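For completeness, the bandwidth check was just the stock iperf client/server pair, something along these lines (exact flags are from memory):

# on the TrueNAS box
iperf3 -s
# on the Windows desktop; a few parallel streams to make sure 10 GbE is saturated
# replace <server-ip> with the server's 10 GbE address
iperf3 -c <server-ip> -P 4 -t 30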

So that seems to leave ZFS, TrueNAS, PM, or a combination of the three as the culprit. I did have a Proxmox server a long time ago, and while I didn’t end up editing off of it, I do recall PM being very slow when trying to open a folder.

What can I do to narrow down the source of the issue and resolve it?
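One way to start narrowing it down is to take SMB and the Windows client out of the picture and approximate the “open folder” pass directly on the server: it is mostly a metadata walk plus thousands of small sidecar reads, not a sequential copy. A very rough sketch, with placeholder paths; for a truly cold-cache run you would want to export/import the pool (or reboot) first:

cd /mnt/pool/photos/testfolder
# directory enumeration (metadata-heavy)
time ls -la > /dev/null
# thousands of tiny sidecar reads
time cat *.xmp > /dev/null

If that is fast locally but slow over the share, the problem is in the SMB/client path rather than ZFS itself.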

Do you have Jumbo frames enabled on your switch and devices? Moving large media files over the network would benefit from that.
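If you do want to test it, a quick end-to-end check looks something like this (interface name and address are placeholders):

# on the TrueNAS side, set MTU 9000 on the 10 GbE interface (name is a placeholder)
ip link set dev enp5s0 mtu 9000
# from the Windows desktop, verify a full-size frame passes without fragmentation
# 8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header
ping -f -l 8972 <server-ip>

Both ends (and any switch in between, if one is ever added) need the larger MTU.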

Is this a zvol or a dataset? What is your recordsize/volblocksize?
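You can dump what you currently have with:

zfs get recordsize,compression,primarycache,atime pool/dataset
# if it were a zvol instead, the equivalent knob is volblocksize
zfs get volblocksize pool/zvol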

Try disabling compression:

zfs set compression=off pool/dataset

Try only caching metadata:

zfs set primarycache=metadata pool/dataset

Try decreasing the recordsize of your dataset (then move your data off and back on again):

zfs set recordsize=32k pool/dataset

Try increasing the recordsize of your dataset (then move your data off and back on again):

zfs set recordsize=1M pool/dataset
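The reason for moving the data off and back on is that recordsize only applies to blocks written after the change; existing files keep whatever block size they were written with. One way to rewrite in place, as a sketch with made-up names:

# create a sibling dataset with the new recordsize and copy the files into it
zfs create -o recordsize=1M -o compression=off pool/photos_new
rsync -a /mnt/pool/photos/ /mnt/pool/photos_new/
# then rename/swap the datasets once you've verified the copy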

You can try disabling prefetch on your system:

vfs.zfs.prefetch_disable="1"
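(That spelling is the FreeBSD/Core tunable; on SCALE, which is Linux, the equivalent OpenZFS module parameter is, as far as I know, zfs_prefetch_disable:)

# takes effect immediately, but does not persist across reboots
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable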

If you want to color outside the lines a bit, read this (LinkedIn link).

TrueNAS Core also usually performs better than SCALE

My usual photography workflow takes the opposite approach: import photos from the camera to the laptop/desktop’s local NVMe storage, curate and edit them there and export the edits, then push all the files to the file server when done.

If I need to go back to old photos and do more work, I just pull the needed files back onto the laptop/desktop, do the work, export, then push it all back to the file server.

That saves a lot of these kinds of headaches, and it makes the file server’s performance irrelevant.
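If you want to script the push/pull, even something as simple as robocopy does the job (share name, paths, and flags below are just an example):

:: push a finished shoot from local NVMe to the server share
robocopy D:\Photos\2024-05-shoot \\truenas\photos\2024-05-shoot /E /MT:16 /R:2 /W:2
:: pull an old shoot back for more edits
robocopy \\truenas\photos\2019-wedding D:\Photos\2019-wedding /E /MT:16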

I don’t believe so, but it doesn’t seem to be a bandwidth issue, since copying files to and from the server with Windows Explorer goes pretty fast. It’s just the one program (so far; I haven’t tested Lightroom yet) that is having issues.

Dataset.

Compression is already off and recordsize is 1M.

Why do you think only caching metadata would be helpful?

I was going to try Core at some point if I can’t get SCALE to work, to see if it’s maybe a BSD vs. Debian thing, since Proxmox is also Debian-based.

Just for organization and backup automation purposes, I’m trying to avoid keeping stuff on local storage as much as possible. Plus I spent all this money on upgrades for the server, so it would be nice to get it working…

Going to recordsize=32k brought the time down to 56 seconds on my laptop. Still not as fast as Laptop to Desktop, but a definite improvement over 3:50. I’ll try some more recordsizes to see if I can get it lower.
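In case anyone wants to repeat the sweep: the dataset prep can be as simple as the loop below (pool and path names are placeholders); the timing itself is just opening the copied folder in Photo Mechanic from Windows with the app cache cleared.

# one test dataset per recordsize, same source folder copied into each
for rs in 16k 32k 64k 128k 256k 512k 1M; do
  zfs create -o recordsize=$rs -o compression=off pool/pm_test_$rs
  rsync -a /mnt/pool/photos/testfolder/ /mnt/pool/pm_test_$rs/
done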

Try going as low as 8K. That should effectively help create a situation where the NVMe drives end up seeing greater than queue depth 1 I/O. That will improve performance, at the cost of space efficiency and compression (which is off anyway).

We don’t want to compress because of how the compressed ARC works in ZFS:
Chapter 22. The Z File System (ZFS) | FreeBSD Documentation Portal

Even with frequent data updates, enabling compression often provides higher performance. One of the biggest advantages comes from the compressed ARC feature. ZFS’s Adaptive Replacement Cache (ARC) caches the compressed version of the data in RAM, decompressing it each time. This allows the same amount of RAM to store more data and metadata, increasing the cache hit ratio.

While that is ideal in normal use with spinning drives, with NVMe the system RAM isn’t fast enough to handle this compress -> decompress -> compress-again loop.

Similarly, by only caching metadata we free up memory I/O for the insanely fast transactions that NVMe drives can do.
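If you want to see what the ARC is actually holding while you test, the kstats on SCALE give a rough picture (field names from memory, so treat this as a sketch):

# how much of the ARC is data vs metadata, compressed vs uncompressed
grep -E '^(size|compressed_size|uncompressed_size|data_size|metadata_size)' /proc/spl/kstat/zfs/arcstats
# friendlier summary if the tooling is installed
arc_summary | head -40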

4k and 8k went back to being slow. 16k was 52 seconds, 32k was 56, and 64k was 45. I’ll try 128k, 256k, and 512k later, and then do multiple tests so I can get an average to figure out the best recordsize, but I think we may have found a solution!

While tinkering, take a look at this page:
Performance tuning — openzfs latest documentation

There may be some other dials and knobs worth testing.

I’m about done with my testing, and it seems that in my case primarycache=all will be best, which makes sense since the system is PCIe 3.0. With the cache set to metadata, every recordsize took a performance hit except 16k.

With the cache set to all, taking the average of 8 runs and dropping the top and bottom score, the times were all within 1.5 seconds of each other, and if you drop the worst performer, the remaining recordsizes were within 0.6 seconds.

I’m thinking I’ll go with 512k since half the files are around 30 MB in size.
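In other words, roughly this, with the dataset name as a placeholder (plus rewriting the existing files so they pick up the new recordsize):

zfs set recordsize=512k pool/photos
zfs set primarycache=all pool/photos
# compression stays off per the discussion above
zfs set compression=off pool/photos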

Do you have a chart you can share?

I have everything in an Excel doc; what would you like to see exactly?

The raw data is fine

Could the CPU be the bottleneck? I think Samba is effectively single-threaded per client connection…
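One way to check while the folder is opening (assuming the usual smbd process name):

# per-thread CPU of the newest smbd (the one serving the client), refreshed live
top -H -p "$(pgrep -n smbd)"
# or, if sysstat is installed, per-thread stats every second
pidstat -t -p "$(pgrep -d, smbd)" 1

If one smbd thread sits pinned at 100% of a core during the folder open, that points at the SMB layer rather than the disks.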