There has been some talk of adding this feature, but RAIDZ, with all its disadvantages, usually isn't worth it for mostly tiny stuff like metadata or small blocks. Write amplification and general waste of space (similar to what you see with ZVOLs on RAIDZ) usually make it a poor choice, and performance is another disadvantage.
You can see double or triple the space usage because a 4k block (assuming ashift=12 for 4k sectors) can't be striped across, say, 5 data disks plus parity. It gets stored as a single data sector plus one sector per parity level, plus padding, so on RAIDZ2 you allocate 12k for every 4k of data: the space efficiency of a 3-way mirror, without a mirror's performance. The wider the RAIDZ, the further you fall short of the space efficiency you were expecting from it. Just make a 4k volblocksize ZVOL on a RAIDZ and check the space usage, it's horrible.
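If you want to see it for yourself, something along these lines works (pool and zvol names are placeholders, and the exact ratio depends on parity level and compression):

# create a small sparse test zvol with 4k volblocksize (names are placeholders)
zfs create -s -V 10G -o volblocksize=4K tank/smallblock-test
# write a few GB of incompressible data into /dev/zvol/tank/smallblock-test, then compare logical vs. allocated:
zfs get volblocksize,logicalused,used tank/smallblock-test
# on RAIDZ2 "used" typically lands around 3x "logicalused"; clean up afterwards:
zfs destroy tank/smallblock-test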
How much benefit would people expect to see from adding a special device (an Optane 905P) to a pool of SATA SSDs to speed up IO? Is this a waste of money or a benefit? I run VMs and Kubernetes off this pool in addition to general SMB.
I doubt you will see a noticeable impact. The main reason to use one is to bypass the abysmal HDD random read performance on small stuff. SATA SSDs may not have the bandwidth, but they still deliver NAND-flash-level random reads/writes.
If your SATA drives are stretched to the limit, offloading some random I/O is useful, but so is adding more drives or memory.
And the special_small_blocks setting doesn't apply to ZVOLs, so that won't be of use to any of your virtual disks either.
And you need at least two of the 905Ps, because redundancy: lose the special vdev and you lose the pool.
Thanks - I have an HDD pool of 3 vdevs of 6-drive RAIDZ2 that has mirrored Optane special metadata devices with special_small_blocks set for small XMP files, as well as a separate Optane L2ARC - and that has really helped the pool.
My SSD pool is 6 vdevs of mirrored SSDs with an Optane L2ARC - but I have been wondering whether a special metadata device would be wasted there or not. Given the small-block random IO that K8s and VMs eat up, I wonder if it could help.
You neglect the fact that ZFS uses transaction groups (txg) to gather up all writes in memory and basically convert random IO to sequential IO.
Just to give you some perspective on async writes in ZFS: I was playing with various backup methods a couple of weeks ago, streaming about 5T of user data with everything in it, millions of small files. I was transferring it from 2x striped NVMe 4.0 drives over a 10Gbit connection to my pool. It turned out that ZFS was able to write the small stuff to my HDDs faster than the NVMe drives could actually read it. That's what they call ZFS magic. The magic of transaction groups (txg).
Async writes are not a problem you have to deal with; txgs take care of everything in memory before flushing it to disk as a sequential stream.
The problem starts with sync writes, where ZFS can't gather up random stuff to optimize the write pattern for the underlying hardware. The application demands that each block be written to disk immediately, even though writing single random blocks is really bad even for flash. The workaround is a LOG vdev or sync=disabled.
An Optane LOG, or disabling sync altogether (fastest), might be a better approach than throwing money at the underlying hardware, if sync writes are indeed a thing on your pool.
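For reference, the two options look roughly like this (pool, dataset and device names are placeholders; sync=disabled trades crash safety of in-flight writes for speed):

# add a mirrored Optane SLOG
zpool add tank log mirror /dev/disk/by-id/nvme-optane-a /dev/disk/by-id/nvme-optane-b
# or drop sync guarantees for a specific dataset/zvol
zfs set sync=disabled tank/vmstore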
This is why ZFS performs so well even on HDDs. There aren't thousands of random writes happening in a properly set up pool; it's one big chunk of data going from DRAM to the disks every 5 seconds (default settings).
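On Linux that flush interval is the zfs_txg_timeout module parameter, 5 seconds by default:

cat /sys/module/zfs/parameters/zfs_txg_timeout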
That’s how you deal with writes.
You deal with (random) reads by caching stuff in memory.
Also, I have been digging into my HDD spinning-rust pool with its triple-mirrored Intel 900P special metadata vdev, and I see that the GUI created the special vdev with ashift=9, since the Intel 900P reports 512-byte sectors and can't be reformatted. What's the consensus on mixed LBA sizes between the metadata vdev and the spinning rust? I had really wanted this to be 4K to keep everything the same between vdevs.
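(For reference, this is roughly how I'm checking the per-vdev ashift, and what I would have expected to be able to force when adding the vdev; pool and device names are placeholders:)

# show the ashift each vdev actually got
zdb -C tank | grep ashift
# forcing it at add time should be possible with something like
zpool add -o ashift=12 tank special mirror /dev/disk/by-id/optane-a /dev/disk/by-id/optane-b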
Late to the party, but adding this for completeness:
According to github.com/openzfs/zfs/blob/master/module/zfs/arc.c (around line 7963), this is not the case (sorry, can't include links as a new user):
There is no eviction path from the ARC to the L2ARC. Evictions from the ARC behave as usual, freeing buffers and placing headers on ghost lists. The ARC does not send buffers to the L2ARC during eviction as this would add inflated write latencies for all ARC memory pressure.
Thanks @Impatient_Crustacean. I don't bother with L2ARC myself, so I'm interested to learn.
Same with special device.
I have since read a Klara blog post about it…
does this look more like it:
How the L2ARC receives data
Another common misconception about the L2ARC is that it’s fed by ARC cache evictions. This isn’t quite the case. When the ARC is near full, and we are evicting blocks, we cannot afford to wait while we write data to the L2ARC, as that will block forward progress reading in the new data that an application actually wants right now. Instead we feed the L2ARC with blocks when they get near the bottom, making sure to stay ahead of eviction.
Yeah, there is a dedicated feed thread that writes blocks into the L2ARC from near the tail of the MRU and MFU lists, before they're evicted. Rather efficient mechanism to avoid ARC/DRAM performance being bogged down by the L2ARC disks.
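On Linux you can watch the result of those feed writes in the kstats, e.g. the l2_* counters (l2_hits, l2_size, l2_write_bytes and friends):

grep '^l2_' /proc/spl/kstat/zfs/arcstats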
When explaining the L2ARC, most people (myself included) don't usually mention this, for the sake of simplicity.
The 0.3% figure assumes ~128-256k records. Your average zvol will produce massively more metadata, depending on the volblocksize. 5T might be correct; it depends on the rest of the pool and its record/volblocksize.
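Rough example of the scaling: metadata is mostly block pointers, so the same terabyte stored as 16k volblocks needs about 8x as many pointers, and therefore roughly 8x the metadata, as it would at 128k records.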
special vdev is made for these kinds of Zvols. You can’t really cache that amount of stuff.
Out of curiosity: how wide is your RAIDZ, and how much space does the 67T zvol actually use on the pool?
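(Something like this shows it; substitute the real zvol path:)

zfs get volsize,volblocksize,used,logicalused tank/path/to/zvol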
we were talking about the zvol, not the pool. The pool has a pool-wide recordsize which seems to be the default. But the zvol has a volblocksize that is different.
zfs get volblocksize tank/../zvolname
With RAIDZ, especially on wider pools, you waste quite a lot of space when using zvols, on top of the metadata inflation bogging down performance.
edit: I checked the output once more. Block histogram shows mostly 128k stuff, so I assume the zvol runs on 128k. Or there isn’t any zvol and you are just calling the pool a zvol, not sure
Yeah, at 67% the pool starts to slow down. You may hit the limit of what your memory can cache. 128k isn't problematic by any stretch, and the rest of the pool is basically non-existent. With 128k as we see on your pool, that's a really average amount of metadata. But with these amounts of data, TBs worth of metadata are normal.
These numbers don’t add up. 10x14 RaidZ2 is 112T raw.
These are called datasets, not vols or zvols. Zvols are totally different things. That's why I went in the wrong direction. But we're on track now.
So the entire pool is 128k datasets. That’s good news. No Zvol being a troublemaker.
As edited above, that results in “normal” amounts of metadata, maybe 0.5%. But even 0.5% is more than your RAM. The pool is reaching its capacity limit, fragmentation is probably becoming a concern, a 10-wide RAIDZ isn't good at random IO in the first place, and your memory is less and less able to cache everything as the amount of data grows.
A special vdev will help to some degree. I can't say how much metadata there actually is or how you arrived at the 5T figure. In any case, you have to back up and restore to get existing metadata onto the special vdev; ZFS doesn't move data by itself. Otherwise ZFS will only allocate new writes to the special vdev.
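Roughly what that looks like (pool, device and backup-target names are placeholders):

# add a mirrored special vdev
zpool add tank special mirror /dev/disk/by-id/ssd-a /dev/disk/by-id/ssd-b
# only new allocations land on it; migrating existing metadata means rewriting
# the data, e.g. replicating to a backup pool and back:
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F backup/tank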
With 128k recordsize across the pool (and the block histogram clearly shows that >90% is 128k), you can expect around 1G of metadata for each TB. So a mirror of 120GB SSDs will be fine if you stick with 128k. It can't fix bad random read/write performance on RAIDZ, nor does it help with the other performance concerns mentioned above, but it will spare your HDDs from reading those ~100GB over and over again when accessing those files.
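Back-of-the-envelope behind that ~1G per TB figure (assuming ~128-byte block pointers and standard indirect blocks): 1 TiB / 128 KiB is about 8.4 million records, each needing a block pointer, and 8.4M x 128 B comes to roughly 1 GiB of indirect-block metadata per TiB, plus a little extra for dnodes and higher indirect levels.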
So this isn’t quite the madness I imagined when I first heard of 67TB Zvol (which is insane, but sometimes needed) and TBs of metadata.
Yeah I was under the impression you got some customer with a 67TB VM disk or running the entire pool as a source for some VM.
500GB is a safe bet in any case if you prefer the 0.5% figure. Pretty much the smallest SATA/NVMe SSD size you can buy anyway.