Hello all! Long time lurker, but just now getting into the forums.
I am running TrueNAS SCALE 23.10.2, 2x Xeon E5-2699 v3, 256GB ECC RAM, and a 40Gb Mellanox NIC going to 2 Proxmox hosts. The Proxmox hosts run primarily as workstations with GPU, USB, etc. passed through to a Windows guest, with the drives on an NFS share from the TrueNAS. The TrueNAS has one “Big_Data” pool with 4 RAIDZ2 VDEVs - 8x 16TB Seagate Exos, plus 8x HGST Ultrastar (x3) - and one “VM_Data” pool with 4x 500GB WD SN730, striped (backs up to Big_Data every hour). The TrueNAS is on 2 redundant UPSes and the Proxmox hosts each have their own UPS as well, all with automated startup and shutdown processes and network alerting.
I’m wanting to move all the datasets onto the Big_Data pool and run all applications off the one large array, but accelerate it as much as possible. I have a 4x NVMe bifurcation card (limited to PCIe 3 due to the platform, but with all 16 lanes) and I’m thinking 4x 2TB Samsung 970 Evo: 2 of them striped and partitioned for L2ARC and SLOG (the SLOG holding the ZIL), and 2 of them mirrored for metadata.
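In zpool terms, the plan would look something like this (device and partition names are hypothetical; partitions would come from pre-partitioning the two striped drives):

```shell
# Two drives partitioned, striped across SLOG and L2ARC:
zpool add Big_Data log /dev/nvme0n1p1 /dev/nvme1n1p1
zpool add Big_Data cache /dev/nvme0n1p2 /dev/nvme1n1p2

# Two drives mirrored as the metadata (special) vdev:
zpool add Big_Data special mirror /dev/nvme2n1 /dev/nvme3n1
```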
My goal is to make each of the Windows guests and all of my other VMs get as close to NVME speeds as possible, without having to invest in a large NVME array.
Does anyone have any suggestions on what they would do differently?
yeah I would ditch Z2 and instead opt for mirrors. With N drives in 2-way mirrors you get roughly N/2 vdevs’ worth of write IOPS and up to N drives’ worth of read IOPS, since reads can be served from either side of each mirror.
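For example, 8 drives laid out as striped 2-way mirrors instead of one raidz2 vdev would look like this (device names are hypothetical):

```shell
# 8 drives as 4 striped mirror vdevs: ~4 vdevs' worth of write IOPS,
# and reads can hit either side of each mirror.
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf \
  mirror /dev/sdg /dev/sdh
```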
NFS content as well as block storage for VMs (zvols) benefits heavily from a SLOG. Be sure to use at least 2 high-speed NVMe drives in a mirror for your SLOG to protect against failure. If a SLOG device dies you won’t tank your pool, but performance will start to suck. Ideally you’d want 4 NVMe drives in a striped mirror for the metadata, because if you suffer a failure there your entire pool is fucked. You should be beyond paranoid about its redundancy, but two in a mirror should be alright.
Another thing to consider for zvols is setting sync=always on that dataset. This ensures the VMs don’t hose themselves on a crash or power loss, and with a SLOG it has the side benefit of nicer iowait: sync writes land on the fast log device first and are then flushed to the pool with the transaction groups.
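Assuming the zvols live under a dataset like Big_Data/vmdata (name is just an example), that would be:

```shell
# Force every write on this dataset to be a sync write; with a SLOG these
# hit the fast log device first and are flushed with the transaction groups.
zfs set sync=always Big_Data/vmdata

# Verify the property took effect:
zfs get sync Big_Data/vmdata
```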
The official documentation from Oracle about ZFS says to always opt for mirrors when performance is critical.
You can skip the L2ARC; it would be a waste. You already have 256 GB of RAM, which should be more than enough for your workstations. It would just be a pointless flex.
Edit: here is a link to a PDF of best practices for use with VMware. It’s a bit dated but the knowledge is good and transferable.
And one last thing: make sure you never exceed 80% of your storage on ZFS. Past that point the internal allocator changes strategy (and heavily fragmented pools can fall back to gang-block allocation), and it becomes dog slow. Treat 80% as your new 100% out-of-space condition.
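You can keep an eye on this with zpool list, e.g.:

```shell
# cap = percent of pool capacity used, frag = free-space fragmentation.
# Stay under ~80% cap to avoid the slow allocation path.
zpool list -o name,size,alloc,free,cap,frag Big_Data
```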
Instead of following any list of “steps” or “configuration items” I would first assess your typical workload and analyze what this means in technical terms / zfs specific workload.
You’ll likely find that your workload does not max out some or all devices in your pool due to some bottleneck. zfs most likely offers configuration options to overcome these limitations.
Examples:
pool made of single HDD. Top performance for large, sequential file access when accessed without concurrency. Massive performance drop (2 orders of magnitude) when accessed in parallel, massive performance drop for random access due to massive latency of HDD hardware. Also, no redundancy - total data loss in case of device failure.
pool made of two HDDs in mirror config. Same performance for sequential access, same performance for write access, improved performance for read access (zfs is smart enough to read from both devices in parallel), improved redundancy, same capacity.
pool made of 3 HDDs in raidz1 … (I assume you’re familiar with raidz - I won’t go into detail) improved capacity, performance challenges
…
So, if you find that your software issues sync’ed writes (e.g. when using zfs dataset as NFS share), which slows down write speeds, then adding SLOG devices will mitigate this slowdown.
If you find that you experience a slowdown because your zfs pool is asked to access a larger amount of blocks than fit into RAM (meaning it cannot be cached effectively), there are three ways to mitigate that:
Add RAM (costly, but fastest option; typically capacity limited)
Add devices to pool to increase overall performance of pool
Add L2ARC vdevs.
Which option is the best is unclear and really depends on your needs and situation (wants, budget, etc.).
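Before picking one, it’s worth checking how your ARC is actually doing. A quick look, assuming the standard OpenZFS tools are installed:

```shell
# Overall ARC size, target size, and hit/miss ratios:
arc_summary | head -n 40

# Or a live view, one sample per second:
arcstat 1
```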
Using NVMe drives as special vdevs is a great tool to mitigate slowdown due to HDD latency when accessing metadata and/or small files.
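If you do add a special vdev, you can optionally route small data blocks onto it too; the threshold is set per dataset (the dataset name and 64K cutoff here are just examples):

```shell
# Send metadata plus any data blocks <= 64K to the special vdev.
# Keep this below the dataset's recordsize (128K by default); setting it
# equal to recordsize would send ALL data to the special vdev.
zfs set special_small_blocks=64K Big_Data/some_dataset
```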
Generally, performance of a zpool is based on the performance of its vdevs. More vdevs help with concurrent workload. All other (zfs performance) features help overcome performance bottlenecks of your existing pool.
In HDD based pools you need to find the “typical” performance of your brand and model of hard drives.
While my HDDs manage to hit their top bandwidth spec (~240 MB/s) during scrubbing, they provide more like 100 MB/s during typical use in their raidz config. With that in mind, you’ll need at least a couple dozen HDDs at best-case rates (and closer to fifty at the typical 100 MB/s figure) to saturate your 40 Gb (~5 GB/s in theory) network link. ZFS will scale nicely with HDD-based configs.
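Back-of-the-envelope, using the per-drive numbers above (assumed rates, ceiling division):

```shell
link=5000       # 40 Gb/s link ~= 5000 MB/s theoretical
typical=100     # MB/s per HDD under typical raidz load
best=240        # MB/s per HDD best-case sequential (scrub-like)

# Drives needed to saturate the link, rounded up:
echo "typical:   $(( (link + typical - 1) / typical )) drives"   # 50
echo "best case: $(( (link + best - 1) / best )) drives"         # 21
```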
Also consider the other components in your networked setup.
I find that copying between single Gen4 NVMe devices is not sufficient to saturate a 40 Gb network. Every software method I tried to bundle NVMe devices (md raid, ZFS) adds enough overhead to keep a low-queue-depth workload from saturating the link.
So, to max out a 40gb network connection with a HDD based zpool you’ll need dozens of HDDs likely with special devices, slog devices, and l2arc attached.
Also - keep in mind that network transfer speed depends on source performance and target performance as well as channel performance.
Thanks so much for this. It’s interesting that you suggest ditching the L2. I have lots of games and game servers that I assumed were benefiting from my l2 but I never did actual testing.
Redoing my array with mirrors would drastically improve performance, but with how much capacity I’d lose… I think I’ll stick with my 4x Z2. That’s 32 drives in total. Differing capacities (for now) but still should be enough? (let me know if I’m missing something)
I should mention that VMs also run on the TrueNAS, so my actual usable RAM for ARC is around 190GB. Also, with the storage controller, GPU (used with my Docker host VM for transcoding), network card, and bifurcation card, I’m out of PCIe lanes on this board. All I can muster on my current NAS is 4 PCIe 3 NVMe SSDs. With that, and with your suggestion here, would partitioning a striped mirror of the 4 SSDs be a good idea? That would be 4TB of usable space, more than enough for metadata, SLOG, and L2ARC.
Isn’t that generally true, regardless of the filesystem? I.e. there is a general performance loss associated with all block storage as that block storage approaches its maximum capacity?