Howdy all. Long time lurker turned first time poster.
In short I run a dev/test environment based on VM’s for many different projects, on a shoestring budget of course.
I currently run just one server (with an offsite rsync backup) with:
2x8 core last Gen Xeon
192GB DDR4 RAM
12x12TB HDD’s
2x960GB NvME SSD’s
As mentioned, this is purely a simple VM setup, so everything is just over NFS with a number of hosts on a 10g network.
As such when I built the pool quite some time ago I did:
RAID-z2 plus one hotspare for the HDD’s
SLOG mirror of the 960GB NVMe drives.
Dedup on
Compression on
So, now I’m at the point where I’ve got 100+ VM’s running through this pool, and I’m sure some of you might be shaking your heads at this point because yes, the performance is bad. On a random 4K write test I’m getting a whopping 1MB/s, terrible IOPS/latency, you get the idea. If I deploy a cluster of VM’s from template and update them all at once, everything slows to an absolute crawl.
The good news is I’m getting a 100G network upgrade and a new storage server (pair, for that offsite backup I mentioned.)
2x16 core current gen Xeon
256GB DDR5
12x16TB HDD’s
4x400GB Intel Optane P5800X
So obviously I want to fix this random write issue I configured myself into. My current thinking is:
RAID-z2 plus one hotspare HDD as before (unless someone tells me this is also bad)
SLOG mirror for 2 of the P5800X’s
Metadata special vDev mirror with the other 2 P5800x’s
Compression on
Dedup OFF (Not sure about this. My current dedup ratio is 2.09x, but if the metadata Optane drives end up just being full of dedup tables I imagine it won’t be super helpful)
Larger than default block size on HDD pool (original was 128k and I’m afraid to change it now in case it explodes)
Small block size on metadata device.
Longer time period before write transaction flushing (again I left the default 5 seconds on the original, if I change this without exploding anything let me know)
I assume that should net me better performance than my previous setup at least.
I think however what I’m really looking for is actually a write-back cache, not a SLOG or special metadata device.
Bcache (on a linux-based ZFS distro) seems to at least be a possible solution, I’ve seen some people on the forum implement it before. Seems like it would also really help my random write performance, but I’m not sure how stable it would be if I put it in front of my ZFS pool. If anyone has done this and it is long-term stable would love to hear it.
Thanks in advance for any thoughts! (Yes, I know I should be really be using a SAN not a NAS for this but I’m taking what I can get. I’m hoping I could possibly build a SAN over time with TrueNAS Scale or something)
1MB/s at 4K is about 250 IOPS, roughly 2.5 HDDs’ worth.
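To spell out the arithmetic above, here is a minimal Python sketch that converts throughput and block size into IOPS. The figure of about 100 random IOPS per HDD is an assumption implied by the "2.5 HDDs" estimate, not something measured in this thread.

```python
def iops_from_throughput(bytes_per_second: float, block_size_bytes: int) -> float:
    """Convert sustained throughput into I/O operations per second."""
    return bytes_per_second / block_size_bytes


# 1 MB/s of 4K writes, using decimal units for round numbers.
# With binary units (1 MiB/s and 4 KiB blocks) the result is 256 instead.
iops = iops_from_throughput(1_000_000, 4_000)

# ASSUMPTION: roughly 100 random IOPS per spinning disk (rule of thumb only).
HDD_IOPS = 100
hdd_equivalents = iops / HDD_IOPS

print(f"{iops:.0f} IOPS, about {hdd_equivalents:.1f} HDDs' worth")
```

The same function works in reverse for sanity-checking benchmark output: multiply the reported IOPS by the block size and compare with the reported bandwidth.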
Since you have a slog, are you running sync=always by any chance (I’m guessing hopefully no). … your host ram and slog should be able to absorb some of the writes.
Maybe partition your nvme drives. Make a couple of 100G partitions and use them as a dm-writecache, and instead of using ZFS on raw HDDs, use ZFS on dm-writecache accelerated volumes made up of SSD+HDD each.
Another thing, if you’re spending a bunch of money: look into NVDIMMs (sticks of RAM with an additional flash chip and a connector for a battery or capacitor, used to siphon the RAM contents into flash on power loss).
What vdevs are we talking about? one 12-wide RAIDZ vdev? or 2x 6-wide?
This isn’t possible with a well-configured SLOG. Unless your NVMe drops down to 1MB/s or is way overloaded like @risk suggests. 2 SLOG vdevs will balance the load if indeed the amount of sync writes are overwhelming the SLOG.
You can test this by setting sync=disabled and your memory will act as the SLOG for this time, alleviating all performance concerns regarding sync writes.
metadata special doesn’t deal with writes. It’s about accelerating reads when your memory can’t cache the metadata anymore.
And the reason why ZFS doesn’t have a writeback cache is consistency/integrity. Running dirty data on a non-ZFS device is very dubious.
Are we talking about some 3T dataset with 2.09x or pool-wide? If dedup is worth it is determined by calculating the saved amount of space and comparing it with the opportunity cost for memory. And usually it isn’t worth it. saving 50$ in HDDs for 200$ worth of memory? TB on HDDs are damn cheap which is why dedup is so uneconomical these days. May change with NVMe pools, just run the calculation.
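The calculation suggested above can be sketched in a few lines of Python. Every price here is a made-up placeholder, and the 15 GB dedup table size is just an assumed figure for a 3 TB dataset; plug in your own numbers.

```python
def dedup_break_even(stored_tb, dedup_ratio, hdd_price_per_tb,
                     ddt_ram_gb, ram_price_per_gb):
    """Return (disk_savings, ram_cost) in the same currency unit."""
    # Space the data would occupy without dedup.
    logical_tb = stored_tb * dedup_ratio
    saved_tb = logical_tb - stored_tb
    disk_savings = saved_tb * hdd_price_per_tb
    ram_cost = ddt_ram_gb * ram_price_per_gb
    return disk_savings, ram_cost


# HYPOTHETICAL inputs: 3 TB stored at 2.09x, 15 per TB of HDD,
# 15 GB of RAM for the dedup table at 4 per GB.
savings, cost = dedup_break_even(
    stored_tb=3.0, dedup_ratio=2.09, hdd_price_per_tb=15.0,
    ddt_ram_gb=15.0, ram_price_per_gb=4.0,
)

verdict = "worth it" if savings > cost else "not worth it"
print(f"disk saved: {savings:.2f}, RAM spent: {cost:.2f} -> {verdict}")
```

This only compares purchase cost. It ignores the performance penalty of dedup, which is usually the bigger argument against it for hot VM storage.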
400GB isn’t really much for this pool size. 0.5-3% of pool size, depending on your average record/block size is what you need. Get a mirror of 2TB NVMe drives and you’ll be fine.
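As a rough illustration of the 0.5-3% guideline above, this Python sketch sizes a special vdev for the planned pool. The pool layout (11 disks of 16 TB in one RAIDZ2, so 9 data disks) is an assumption taken from the plan in the first post, and RAIDZ padding overhead is ignored.

```python
def special_vdev_range_tb(pool_tb, low_pct=0.5, high_pct=3.0):
    """Return the (low, high) special vdev size in TB for a given pool size."""
    return pool_tb * low_pct / 100, pool_tb * high_pct / 100


# ASSUMPTION: 11 x 16 TB in a single RAIDZ2 vdev -> 9 data disks.
pool_tb = 9 * 16
low, high = special_vdev_range_tb(pool_tb)

# The mirrored Optane pair from the plan offers 0.4 TB usable.
optane_mirror_tb = 0.4

print(f"{pool_tb} TB pool: {low:.2f} to {high:.2f} TB of special vdev")
print("400GB mirror is below the low end:", optane_mirror_tb < low)
```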
ARC handles metadata, so you only need a special if your memory can’t handle the metadata anymore. This is strictly a memory-offloading feature. Or if you reboot and wipe the ARC all the time.
If you have a lot of small files, this will help greatly. For Zvols or large files (like disk images) it doesn’t apply at all. And you need a lot more capacity on the special because you’re storing actual data.
If you’re dealing with VMs, you probably use zvols and/or 4k-32k datasets for the VM disks. 128k or even higher is for the large stuff on your pool that’s rarely modified (e.g. media). Can increase sequential speeds and reduce metadata. And bad for random I/O.
You first have to deal with your sync write problem. But increasing the txg timeout certainly can change things. Best to try it out, you can’t break stuff but you should adapt SLOG size accordingly. And if there is a lot of traffic, you have to tune other things as well because 5 seconds is only the maximum time for a TXG and things get committed early if other thresholds apply. Question is what do you want to achieve?
When you do your write tests, are these 100 VMs running at the same time and polluting the results? I’m not surprised that the pool is struggling with that much going on.
Double-check your ashift value. I’m assuming it is 12, but make sure.
If you can, test this with only a single VM running, or do it from the host while all VMs are shut down and the pool is idle.
Then, I’d suggest creating a dataset with a smaller recordsize and then testing again using that to see what you get. With random 4k I/O, I’ve found that it tests better with a smaller recordsize. Since you are using RAIDZ2, you won’t want to go for 4k exactly. The write stripes are at least 12k in size assuming an ashift of 12 and RAIDZ2 → [(sector size (2^ashift)) * (# of parity disks + 1)], so 4k multiplied by (2 parity disks + 1) = 12k in your case. Try a recordsize of 16k since it needs to be a power of 2.
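The formula above is easy to check in Python. This is only a sketch of the arithmetic in the post; the function names are made up for illustration.

```python
def raidz_min_allocation(ashift: int, parity: int) -> int:
    """Smallest RAIDZ write in bytes: sector size times (parity + 1)."""
    sector_size = 1 << ashift          # 2^ashift, e.g. 4096 for ashift=12
    return sector_size * (parity + 1)


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (recordsize must be a power of 2)."""
    return 1 << (n - 1).bit_length()


min_alloc = raidz_min_allocation(ashift=12, parity=2)
recordsize = next_power_of_two(min_alloc)

print(f"minimum allocation: {min_alloc // 1024}k")
print(f"suggested recordsize: {recordsize // 1024}k")
```

With ashift=12 and RAIDZ2 this gives 12k and 16k, matching the numbers in the post.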
Ultimately the question I have right now is how much of the performance is due not to a necessarily poorly configured pool, but just due to how much you are throwing at it.
Dedup is also a huge performance hog. Random rule of thumb I’ve heard is 5 GB of RAM per TB of pool for the dedup table. In your case that would be 144 TB * 5 GB → 720 GB of RAM. Obviously this depends widely on a number of factors (a big one being how full your pool is) but you are way undersized to be using dedup IMO.
Also, the dedup table in RAM is capped to 1/4 of the size of the ARC. So if your ARC is half the system RAM right now (96 GB) then the dedup table can only be 24 GB max in memory before spilling to disk.
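Putting the two rules of thumb above side by side, a quick Python sketch. Both the 5 GB per TB figure and the 1/4-of-ARC cap are taken as stated in this thread, not verified against any particular ZFS version.

```python
def ddt_ram_estimate_gb(pool_tb: float, gb_per_tb: float = 5.0) -> float:
    """Rule-of-thumb RAM needed for the dedup table."""
    return pool_tb * gb_per_tb


def ddt_in_memory_cap_gb(arc_gb: float, fraction: float = 0.25) -> float:
    """Portion of the ARC the dedup table may occupy, per the post above."""
    return arc_gb * fraction


pool_tb = 12 * 12          # 12 x 12 TB raw, as used in the post
system_ram_gb = 192
arc_gb = system_ram_gb / 2  # ASSUMPTION: ARC is half of system RAM

needed = ddt_ram_estimate_gb(pool_tb)
cap = ddt_in_memory_cap_gb(arc_gb)
spill = max(0.0, needed - cap)

print(f"estimated DDT: {needed:.0f} GB, in-memory cap: {cap:.0f} GB")
print(f"worst case spilling to disk: {spill:.0f} GB")
```

The estimate is a worst case for a full pool of unique blocks, so the real table is likely smaller, but the gap between the two numbers shows why the table ends up on the HDDs.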
The dedup table can spill to L2ARC, but you have none configured, so it might be going to platter which is going to murder your performance. This blog is old, but I still reference it from time to time for ZFS deep dive details on how things work: Aaron Toponce : ZFS Administration, Part XI- Compression and Deduplication
Again, do a test with a dataset that has dedup turned off and see how it compares. If you want to continue to use deduplication going forward, consider either growing your RAM substantially, adding a dedicated dedup device, or both.
If you want IOPS, stay away from RAIDZ2 entirely. If performance is what you need then you should be using mirrors. Is that an option with your setup? Would you have enough space for what you need?
You could also consider something like a set of 4 x 3-disk RAIDZ1s striped with a global spare but that might be slightly risky…
This is true of your current setup as well, but dedicating these drives entirely to the LOG vdev is huge overkill (in storage). I would consider partitioning these to leave a small segment (maybe 30 GB but depends on your workload) for the LOG vdev and then create an L2ARC stripe with the rest.
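For a feel of how small a LOG partition can be, here is a hedged Python sketch of a common sizing heuristic: the SLOG only has to hold sync writes that have not yet been committed by a transaction group, so an upper bound is the maximum ingest rate times the txg timeout times a few txgs in flight. The factor of 3 is an assumption, and in practice the dirty data limits usually cap things well below this.

```python
def slog_upper_bound_gb(link_gbit_per_s: float,
                        txg_timeout_s: float = 5,
                        txgs_in_flight: int = 3) -> float:
    """Upper bound on useful SLOG size if the network link is saturated."""
    gbytes_per_s = link_gbit_per_s / 8   # Gbit/s -> GB/s
    return gbytes_per_s * txg_timeout_s * txgs_in_flight


# Current 10G network vs. the planned 100G upgrade, default 5 s txg timeout.
for link in (10, 100):
    size = slog_upper_bound_gb(link)
    print(f"{link}G link: at most ~{size:.2f} GB of SLOG in use")
```

Raising the txg timeout scales the bound linearly, which is why the SLOG size has to be revisited if that tunable is changed.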
If you are tuning the pool for performance on 4k writes, this will likely have the opposite effect. You want a large recordsize when you are doing large sequential writes to store large amounts of data at a time. This isn’t really conducive to a VM workload. I’m not sure what your VMs are doing, but when I do this I’m usually running a web app and a database on a VM sitting on a ZFS dataset. In that case I’m using a smaller recordsize to keep performance up.
This reminds me - are you using datasets or zvols here? I prefer datasets because snapshots don’t require a huge amount of free space on the pool like they do with zvols, but either one does work.
ZFS is specifically in opposition to this from my experience. ZFS’s primary focus is data integrity, and a write-back cache is a rather large risk to data integrity. Data the pool doesn’t know about is data that could be lost.
iops on ZFS are based on your number of vdevs (NOT disks). A vdev with 10 disks in it for example will get the iops of a single disk - every write is striped across all disks in the vdev. Increases the bandwidth but not the latency.
All the drives within a single VDEV are all acting as one. All writes and all reads are full stripe. This has some implications.
The entire VDEV can only do the iops of the slowest disk. If you want 4k off that VDEV that isn’t cached all the drives need to read the entire stripe.
ZFS will bundle writes into transaction groups and cache stuff but the underlying disks will be limited. Cache and serialising writes into transaction groups can only do so much. Eventually you have to hit the disks. Especially if the data hasn’t been read yet. Un cached random reads can hurt much more than writes (writes can be bundled and serialised)
You want more vdevs. ZFS can stripe across individual fault tolerant vdevs for more iops.
I’d run a bunch of dual disk mirror vdevs to get more underlying disk iops. Additionally due to the way ZFS uses hashes for error detection, it can (someone correct me if I’m wrong, @wendell ?) read different data from both sides of a mirror VDEV at the same time for more throughput. Ie mirrors in ZFS are special (double plus fast!).
Make sure your pool’s ashift is set to 12 for 4k blocks.
Make sure atime is off
Make sure the drives are CMR
Make sure dedupe is OFF
De dupe is a killer. If it is on then a hash for every unique block in your pool needs to be kept in ram for de dupe to work inline.
IF the de dupe table won’t fit into ram performance will tank hard. Also you’ll be burning a huge amount of ram for it that could otherwise be cache. Rule of thumb you’d be burning up to or more than 100GB of ram in dedupe hashes (1GB ish per TB stored). That’s alternatively 100 gig of ram cache. Depending how much data is in your pool of course. Edit: seems there’s also a 5g/tb rule of thumb mentioned above. It all depends on how many unique blocks in the pool.
Get stats on your pool for de dupe ratio. Make sure it’s worth it.
Edit: most lab/dev won’t have a ratio worth shit. I see yours is 2x. I wouldn’t pay the performance penalty for that. Definitely not for vm storage. Archive maybe. But not hot data.
If you’re still set on raidzX there are some magic numbers of drives for each raid level that perform better than non optimal drive counts. I forget what they are but worth googling.
With only 12 disks though you’re going to be pretty inflexible for drives per VDEV and you’d want to get at least 2 vdevs. Depending on appetite for risk maybe 3x (or 4?) raidz1. No hot spares. Use them to get more vdevs. It’s lab after all? Not prod.
3x vdevs will net you something like 450 random iops raw off the disks
6x dual mirrors will get you 900.
Before cache. Before serialisation.
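The estimate above boils down to "one disk's worth of random IOPS per vdev". A small Python sketch, assuming about 150 IOPS per disk, which is the figure implied by the 450 and 900 numbers rather than a measured one:

```python
# ASSUMPTION: ~150 random IOPS per HDD, implied by the 450/900 figures above.
PER_DISK_IOPS = 150


def pool_random_iops(vdev_count: int, per_vdev_iops: int = PER_DISK_IOPS) -> int:
    """Raw random IOPS of a pool: each vdev behaves like a single disk."""
    return vdev_count * per_vdev_iops


layouts = {
    "1 x RAIDZ2": 1,
    "3 x RAIDZ1": 3,
    "6 x 2-way mirror": 6,
}

results = {name: pool_random_iops(vdevs) for name, vdevs in layouts.items()}
for name, iops in results.items():
    print(f"{name:18s} ~{iops} IOPS")
```

These are raw disk numbers, before any help from ARC, SLOG or transaction group batching.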
Edit +1 to everything @ghan and @Exard3k said above.
That’s typical RAID1 behavior. Double the reads, but writes are single disk performance. ZFS compares the checksums on every read. Just ask @Ghan what happens if the checksums don’t match and you want to read stuff. You either get double the read speed or nothing at all.
And mirrors don’t store stripes of data (= split record/block into smaller pieces), so there is less random IO in general.
50% storage efficiency is a hard sell for most people. But RAIDZ isn’t built for random I/O. SSDs are bad at this too, but at least they’re 50x faster so this works. And we hate HDDs for their abysmal random I/O. Limiting multiple drives to a single disk worth of random I/O is certainly a huge drawback if you care about that. You can use a SLOG to get sequential write performance on TXG sync and 100-200MB/s is plenty for most people when talking random sync writes.
But even with ARC, L2ARC and special…you can’t cache and accelerate everything for reads. So you always have to account for disk and vdev performance for reads too.
This should come with a golden plaque on every ZFS machine. I’d add “Use compression” to the list, LZ4 is just too good. Atime is a ghost haunting performance for 50 years and we won’t get rid of this UNIX relic any time soon. Should be off by default judging from my latest ZFS installations. I still always check it just out of a habit.
default is ashift=0 which is auto-detect and should result in 12. Even my NVMe reporting as 512b got a default ashift of 12 in ZFS.
QFT
My bet is on too slow SLOG/sync=always or worn out NAND/ write cache overwhelmed on consumer drive. Or those 1MB/s is async writes on a slow, full and fragmented pool while overloading dirty data max and constantly flushing TXGs.
sync=disabled will provide useful information about all this. And if you are ok with bcache backing ZFS, sync=disabled is the least of your concerns
Just checked with zfs get sync, and the pool is just using the default
I’ve read before that VMWare over NFS uses synchronous writing for VM disks. So would forcing sync=disabled help here?
Will keep dm-cache in mind for the new server, I think it’s probably something like this or bcache I’d like to try in order to really make good use of the P5800X’s. I think bcache serializes writes before flushing to disk which I hope would get me closer to the sequential performance (300-500MB/s) of the array which would be nice.
If I had the budget I would just do an all flash array, alas I do not. Will keep in mind if forcing sync=off helps though, thanks!
I just went with a 12-wide RAIDz-2 (with one hot-spare) for simplicity. I could start making smaller pools and moving VM’s around based on IOPS needs but that adds an extra layer of management I’d rather not have to commit myself to if possible. Not sure if there’s any way of automating this without a SAN/vSAN.
Looks like it does actually speed up sequential writes a bit, but I would consider random writes within margin of error, as I’ll get between 1-5MB/s on RND4KQ1 depending on what the other VM’s are doing.
So I don’t think it’s a limitation of the SSD SLOG’s then, perhaps it’s just the best I can get for random IOPS using RAIDz-2? (IOPS score was below 1000 in both tests). If so I need to re-think my strategy for the new server.
Right, I figured that’s the case on the metadata device hence the idea of running a (mirrored) writeback cache. I know it’s not as safe, but I’m trying to figure out some way of having better random write IOPS without sacrificing integrity too much (and also not being too complicated to manage).
It looks like dedup probably isn’t worth it for my data despite it being VM’s so I won’t be using it going forward. Of course I can’t undo it now but I can copy everything to the new server when it arrives and recreate the volume on this server without it.
It’s VMWare over NFS writing to VMDK files, truthfully I have no idea if a larger or smaller default block size will help me more here. I’m guessing small, as the VM workload data at the moment isn’t large sequential writes but small, more frequent ones.
All in all I just need better random write performance than I’m getting right now. I think I’ve made a mistake somewhere that’s killing my random IOPS performance and would like to “fix” whatever I did for the new server such that I can get hopefully much better IOPS (and more responsive VM’s).
Yep that’s with the VM’s running so yes it will skew the results somewhat.
It is indeed ashift 12
Thanks for the info! I’m assuming I have to recreate the dataset for this? If so I’ll note it down to use for the new server.
I was not aware that the dedup table could only be 1/4 of ARC! That could indeed be causing some of my issues. I’ll refrain from using it going forward.
I was aware that needing to do parity calculations for RAIDZ would reduce my write performance, but perhaps I underestimated by just how much. I guess I could do performance/mirror and reliable/RAIDZ volumes, but I was hoping to try and just have one volume so I don’t have to waste time managing IOPS-heavy VM’s
What I’m doing here is not actually part of my job description, so I’m trying to keep any management overhead to a minimum!
Yep I understand that doing a write-back cache is risky, but my thought process is if I use bcache/dmcache to do a 3-way mirror on the Optane drives, it would really improve my random writes (and hence the responsiveness of the VM’s) without adding too much risk. ARC seems to be handling the reads just fine for now at least.
I understand that IOPS for the HDD array would not improve based on number of disks, but I thought that an SSD SLOG would help with “perceived” IOPS by the VM. The write is sent to SLOG, VM over NFS gets the ACK it wants, write operation gets committed to HDD later. Perhaps I have misunderstood something?
That also has me wondering. If serializing writes IS what the SLOG does, maybe I just set the flush timer to a longer interval so I’m not having the disk heads fight between read/write cycles as much. Random reads look ok, but maybe I should indeed be running a metadata vDev and L2ARC, and they would improve VM responsiveness.
atime is off but dedup is on. As mentioned, I won’t make that mistake again!
I get the idea of striped vdevs into a larger pool for increased IOPS, will keep that in mind for the new array, thanks!
ZFS 2.1.5-1. Just did a manual TRIM but no performance difference.
The NvME’s in the current server are supposedly enterprise-grade (though named as “read-intensive”), they should be fine.
Thanks for all the comments everyone! Really appreciate it.
So in summary, sync=disabled didn’t really net me any gains in random I/O, my atime is off, and basically all other settings seem to be reasonably OK.
That is except for dedup of course, which might be trashing my disks. Obviously I can’t undo dedup now. Maybe I could take one of the 960GB SSD’s out of the SLOG mirror, and partition it for dedup/metadata and see if that helps. I suspect it could be a combination of both constant read and write operations all over the disks that’s causing my issues. And at least I’m learning what not to do for my second attempt
I’d still like to experiment with a 3-way mirrored writeback cache, server isn’t here for another month or two but will report on my results. I’m sure it can work and get better numbers, but even if this isn’t prod I don’t really want to have to deal with data corruption events, hence why I’d still prefer to lean towards ZFS (RAIDZ or stripe/mirror) rather than say BTRFS or UnRAID. As mentioned a 50% loss in storage capacity is rough, but I might make the sacrifice and just have fast/slow NFS storage blocks going forward.
If you change the recordsize for an existing dataset, then new data written there will adhere to the new size, but existing data will stay as it is.
You would not need to recreate the entire dataset, but instead just create a new one on the same pool with the settings you want to test. That way you can test against the same hardware just with different ZFS settings to see what changes.
What you’re describing seems more like a tiered storage system, which is not really what ZFS does. If you want to have different performance tiers (with regard to the vdevs involved), then you’ll need different zpools configured accordingly.
There’s also the less drastic option to do something like use dedup on one dataset but not another, and/or a different recordsize on different datasets according to their workload.
Just to confirm in case it isn’t clear (or for others reading who may not know this also)… ZFS pools can contain multiple VDEVs. You don’t need to go to multiple pools - though re-structuring a pool is a pool re-creation event… unless you’re adding VDEVs (you can’t remove a vdev or re-create one when it’s already in a pool).
The hierarchy is as follows:
Pool → VDEVs → Disks
So you can have 1 pool with as many VDEVs as you want; the pool will stripe across all the VDEVs (i.e., raid0 across the virtual devices, but that’s fine as each virtual device should be fault tolerant itself).
Big arrays do similar stuff but for example Netapp calls it raid groups (i.e., VDEVs) and aggregates (i.e., pools).
And yeah, screw manually shifting data between multiple pools with different characteristics… better to use the devices you have in a single pool and use SSD for slog/l2arc/etc. If you can.
As above caching will help absorb writes and take some load off the disks but you’re leaving a lot of performance on the table if you’re only using one VDEV and you aren’t really gaining that much in terms of resiliency (i.e., multiple RAIDz1 will sustain multiple failures so long as they’re not in the same VDEV).
It’s a slight resiliency cost but you’ll literally triple your back end disk IOPs (say 3x raidz1 vs. a single raidz2 - or 6x at least if you were to go to 6x 2 disk mirror vdevs). You give up a bit (fault tolerance), but gain a lot. For lab/dev/test … good trade imho.
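To make the trade-off above concrete, this Python sketch compares usable capacity and vdev count for three ways of arranging 12 disks. The 16 TB disk size comes from the planned server; the exact layouts are assumptions for illustration, and RAIDZ padding and metadata overhead are ignored.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    name: str
    vdevs: int    # number of top-level vdevs (also the rough IOPS multiplier)
    width: int    # disks per vdev
    parity: int   # disks per vdev that can fail without data loss

    def usable_tb(self, disk_tb: int) -> int:
        """Usable space: data disks per vdev times vdev count times disk size."""
        return self.vdevs * (self.width - self.parity) * disk_tb


DISK_TB = 16  # drive size from the planned server

layouts = [
    Layout("1 x 11-wide RAIDZ2 + hot spare", 1, 11, 2),
    Layout("3 x 4-wide RAIDZ1", 3, 4, 1),
    Layout("6 x 2-way mirror", 6, 2, 1),
]

for layout in layouts:
    print(f"{layout.name:32s} {layout.usable_tb(DISK_TB):4d} TB usable, "
          f"{layout.vdevs}x disk IOPS, survives {layout.parity} failure(s) per vdev")
```

Under these assumptions the three RAIDZ1 vdevs give the same usable space as the single RAIDZ2 with a spare, which is the core of the "give up a bit, gain a lot" argument.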
Weird. But if you can test the random writes on the server and not some 3 software stacks later, this can narrow down why this is the case. fio is pre-installed on TrueNAS and can be installed on all OSes that use ZFS.
Test on the dataset the VM disk is running on. And also make a new dataset with like 16-32k recordsize and test again.
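A possible way to run the test described above is sketched here as a Python helper that only builds the fio command line and prints it, so nothing touches the pool until you paste it yourself. The dataset path, job name, file size and runtime are hypothetical placeholders. The test dataset could be created beforehand with something like `zfs create -o recordsize=16K <pool>/<dataset>`.

```python
import shlex


def fio_randwrite_cmd(directory, bs="4k", iodepth=32, numjobs=1,
                      size="4G", runtime_s=60):
    """Build an fio argument list for a random write test in `directory`."""
    return [
        "fio",
        "--name=randwrite-test",      # hypothetical job name
        f"--directory={directory}",   # fio creates its test file here
        "--rw=randwrite",
        f"--bs={bs}",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        f"--size={size}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--ioengine=posixaio",        # available on both Linux and FreeBSD
        "--group_reporting",
    ]


# HYPOTHETICAL mount point of the 16k test dataset.
cmd = fio_randwrite_cmd("/mnt/tank/vmtest-16k")
print(shlex.join(cmd))
```

Running the same command against the existing VM dataset and the new small-recordsize dataset gives two numbers from identical hardware, which is the comparison being asked for.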
If you want that Q32T1 @4K number to go up, (5MB/s / 4k = 1250 iops). Wrap each raidz2 constituent into dm-writecache (not bcache, not dm-cache but dm-writecache specifically).
You can give each disk anywhere from 10G to 100G of nvme as this type of cache, … and you should be getting, 150-300MB/s easily at Q32T1 (e.g. that’s only 50k+ … 4k iops, and QD32 makes it 1s+ mean per write, should be fairly easy for modern nvme using emulated slc).
As long as you don’t run out of cache space (or cache spaces, since SSDs also do their own re-arranging in the background that needs to catch-up).
You can monitor performance per device, logical or physical during your testing with atop on the host, and see if you’re seeing some pathological weirdness.
And if you like the performance after the initial testing, consider how many devices you have and what kind of redundancy you need for your “real pool”. e.g. 3 HDDs worth of dm-writecaches on same SSD mean you lose 3 disks of data when 1 SSD dies.
Maybe you should mirror your dm-writecache disks before giving them to ZFS – buying more nvme means more pcie lanes, which sucks.
Also, if you’re in a windows guest, enable writeback caching - software that cares, including the OS itself will issue "sync"s as needed, no need for every write to be synchronous.
Also, NFS is notorious for sync-ing too much … as a consequence of non-existent cache sharing or cache-coherency features in the protocol (… no, locking doesn’t count), avoid NFS if you can.
Or mount exports as async. That doesn’t make NFS awesome, but it is a workaround. But I suspect some combination of NFS, VMware, a misconfigured dataset and Windows stuff is more likely to bog things down.
Yeah that’s not what I expect with a SLOG or sync=disabled. There is something fishy out there. I really like to see that measured at the source.
Something strange going on here, i blame NFS or somewhere else other than ZFS.
ZFS (at least on the disk back end) should theoretically handle random write easier than random read (cache not taken into consideration) as it serializes them all into transaction groups which are then written as a copy-on-write stripe (i.e., random writes become sequential).
edit:
skimmed, didn’t see what the network uplink is on the original box. be aware that throughput isn’t the only concern with network speed for NFS or iSCSI - you may be getting limited on smaller IOs by the clock rate of the network. i.e., you can only fit so many Ethernet frames down it in a given time for commands to the storage. This is why higher speed ethernet is massively important for storage used for VMs over ethernet. Even if you aren’t hitting the bandwidth cap, you may be getting bottlenecked by command rate to the storage.
Not sure if the pic will upload correctly so I will describe:
Assuming I did use fio correctly, the IOPS reported was 130k on TrueNAS and 99k on the Ubuntu VM for a 4k random write test.
When testing the Ubuntu VM, I could see that the SSD SLOG was indeed working correctly, absorbing all the writes for the HDD’s.
Windows is obviously doing something completely different, giving it terrible BW and IOPS despite write-back caching being enabled. (I disabled it just to check and it made no difference)
So at least this little sanity test has relieved me somewhat. It looks like my array setup and NFS configuration is actually fine. It’s a windows problem because of course it’s a windows problem
Now I just need to figure out why? This was just a fresh Win11 install with crystal disk mark the only extra thing downloaded to it.
EDIT: So while trying to troubleshoot I decided to just do a test on Windows with 7 threads (an 8 vCPU VM) aaaaaand…
I have no idea why adding more threads fixes the problem or why performance is still way off compared to linux, but I’ll just chalk it down to windows being windows.
This thread has been super helpful however, as my original problem was slow-to-respond VM’s, and now I at least can be sure it’s actually a read issue, not a write one.
I think part of the problem was mentioned earlier, where my dedup tables might be spilling over to disk and thrashing them, so any time a lot of reads occur that aren’t in ARC it struggles to serve them.
So for the new server:
Dedup OFF
2xP5800X mirror for SLOG
2xP5800X mirror for metadata special
And maybe re-partition the SLOG drives to be half SLOG and half L2ARC. It looks like in reality I don’t have to do any write-back caching which is a relief.
Seems correct. Quick learner! Fio will serve you well and is a great tool for storage diagnostics. Micron benchmarks their datacenter SSDs with fio too.
That’s normal for storage. The more parallel work you give it, the better (usually). Same with Queue depth. But a lot of stuff isn’t multithreaded, so it isn’t always representative.
ZFS is a bitch to benchmark if you want specifics. The downside of sophisticated storage products in general.
We now know that your ZFS server is doing a great job and runs very well. And that the issue is with software somewhere further down the stack.
I would try handing out block storage to the VMs via iSCSI and zvols. NFS may be a bottleneck. Also check record/blocksize on zvol/dataset to match VMDK.
I know that you leave any cache options empty when using ZFS. Particularly with virtual machines. Can’t say how this works with VMWare as I don’t have experience with it that is younger than 20 years.
SLOG is fine if you really want more speed out of ZFS, but fio shows very good speed already. You won’t be able to fix the pipeline by improving things that are already great. Optane SLOG will perform somewhere between NVMe and sync=disabled to give you a perspective.
Metadata special: save yourself some money and get plain old standard NVMe with 2TB-4TB or a read-optimized enterprise NVMe if you like to spend money. The difference on directories with less than 100,000 files is only measurable in benchmarks after a reboot. And putting them on a large 2-4TB actually allows storing both data+metadata so you bypass HDD completely. You can’t do this with just 400G. Accessing data via SSD instead of HDD is vastly more beneficial, while metadata is a nice bonus.
you can also check on the stats regarding metadata demand and prefetches via arc_summary -s archits . Let the server run for a day or two of normal load to see how warm and trained cache performs. And after a reboot to see cold cache performance. <90% is usually a problem or bottleneck but not worrisome shortly after a reboot.
arc_summary -s arc to check how much metadata is cached in RAM already.
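If you would rather script the hit-rate check than read the arc_summary output by eye, here is a rough Python sketch. It assumes the three-column `name type data` layout that OpenZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats; the sample text below is invented for illustration and the 90% threshold is the one from the post above.

```python
def parse_arcstats(text: str) -> dict:
    """Parse 'name type data' lines into a {name: int} dict, skipping headers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_percent(stats: dict) -> float:
    """Overall ARC hit rate in percent."""
    total = stats["hits"] + stats["misses"]
    return 100 * stats["hits"] / total


# INVENTED sample data; on Linux, read /proc/spl/kstat/zfs/arcstats instead.
sample = """\
name                            type data
hits                            4    950000
misses                          4    50000
"""

stats = parse_arcstats(sample)
hit_pct = arc_hit_percent(stats)
flag = "possible bottleneck" if hit_pct < 90 else "ok"
print(f"ARC hit rate: {hit_pct:.1f}% ({flag})")
```

On FreeBSD-based systems the same counters are exposed through sysctl rather than /proc, so the reading part would need to change there.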
Put the money into more useful stuff, like memory,L2ARC,CPU or drives. It sits idle 99.5% anyway.
Oh and check on higher compression with the new CPU. This should give both extra space and higher performance, particularly for the HDDs. And free RAM.
For dedupe, never completely max out the ram for a given architecture. Leave at least one slot available, as you may need to split your array to more computers.
I worked at iXsystems for a few weeks, and one of the horror stories was a customer with a current-gen Xeon build who filled the RAM with their dedupe table, which made the array unavailable on any system for about 4 months until the next Xeon processor line was released that had a larger maximum RAM capacity. They had more than 60:1 compression ratio on a VPC hosting array with dedupe.
also, nfs allows splitting the computers providing hosting for various directories independently of the directory tree structure.
Say you had a university. /school/
each department could have a locally hosted server for their department without needing to round trip all file server requests to a centralized location that may be miles away.
ie /school/math/ or /school/psych/
In fact there are features to serve portions of the directory tree from clusters that may be geographically separated. I can’t remember the details of that portion right now though.
On my personal server I am serving all of the VMs off of a zfs array, but having independent optane non-redundant block devices to serve Virtual Memory partitions. (planning stage right now, hardware installed, but busy till the end of the month.)