ZFS Metadata Special Device: Z

Okay, that makes sense. I’ll wait a little bit for confirmation.

I saw someone say to triple mirror the metadata, and since I’m using old SSDs I thought “why not?”

1 Like

this is just ZFS' way of saying 'well, the pool is raidz1, you sure you don't also want matching redundancy for the metadata?' it correctly detected that you're asking for 2-drive fault tolerance on the metadata (3-way mirror) but only 1-drive fault tolerance on the raidz1 data vdev.

the whole pool can still only suffer one arbitrary drive failure. if a second drive fails you have to be 'lucky' enough for it to be a metadata drive; otherwise the whole pool is lost.

-f will force it to work
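
For the record, a minimal sketch of what that command looks like (pool and device names are placeholders); -f just overrides the warning about the mismatched redundancy levels:

zpool add -f tank special mirror /dev/sdx /dev/sdy /dev/sdz   # 3-way mirrored special vdev on a raidz1 pool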

mirror devices will automatically stripe reads across all 3 devices, which is maybe nice.

5 Likes

Keepin the thread alive in the new years …

Just here to confirm some stuff I … "read on the internet".

Even an aggressively set metadata special vdev is still going to require the main data vdev to store files larger than the small-block threshold (roughly the ashift size × the vdev width).

Say that's 128K with a RAIDZ2 vdev of 8x HDDs … anything over 1M is going to be written to the main pool (with some mitigation by the ZIL) …?

Which will "protect me" from the worst of the minimum-throughput speeds my main vdev would give with those tiny (metadata-sized) files … where spinning-drive performance drops off.

Can I ask for some acumen on the following stats and logic …?

Based on:
4TB - 6TB HD
Usually HGST Ultrastars
7200 rpm
~40-50% Full
100% Read or …
100% Write …

At what file-size does the throughput drop below 50MB/s per HD

AND … how would the above drive roughly perform with these file sizes…?
100% R or 100% W … MB/s of 8KB files:
100% R or 100% W … MB/s of 16KB files:
100% R or 100% W … MB/s of 32KB files:
100% R or 100% W … MB/s of 64KB files:
100% R or 100% W … MB/s of 128KB files:

(my guesses)
8KB files: … 0.6MB/s
16KB files: … 1.5MB/s
32KB files: … 4MB/s
64KB files: … 8MB/s
128KB files: 16MB/s
256KB files: 32MB/s
512K files: 60MB/s
1MB files: ~ 100MB/s

Does that sound about accurate …? Thus, with an 8-HDD 7200rpm array, and even with a metadata (fusion pool) small-block threshold set to 128KB, it'd provide a "performance asymptote" approaching ~100MB/s? (because of RAIDZ2 overhead) …?

With a good ZIL … maybe I could get the “low asymptote” up to 150MB/s …?

Does that all make sense…?
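
As a rough back-of-envelope sanity check on the guesses above (an estimate, not a measurement): a 7200rpm drive spends on the order of 8ms seeking plus ~4ms average rotational latency per random I/O, so call it roughly 80-100 random IOPS. At 8KB per file that works out to about 80 × 8KB ≈ 0.6MB/s, which lines up with the first guess, and throughput scales roughly linearly with file size until it approaches the drive's sequential rate (typically somewhere in the 150-200MB/s range for drives like these).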

Here's something that has managed to escape my attention since June 2020: as of the following pull request (9158 - Add a binning histogram of blocks to zdb by sailnfool · Pull Request #10315 · openzfs/zfs · GitHub), zdb gives you a histogram of block sizes when using -bb or greater (-bbb, -bbbb, etc.), so you can quickly determine exactly what size you can get away with setting the special vdev small-block redirection to.

Normally the output is in "human readable" numbers, like "64K", but if you use -P then the output will use full numbers, like "65536".

Using -L supposedly cuts down on the time needed to generate the data because it's not trying to walk through and verify everything is correct.
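
Putting those flags together, a minimal example (the pool name is hypothetical; expect it to take a while on a large pool):

zdb -PLbb tank   # block-size histogram, exact byte counts, skip leak checking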

My main pool’s output looks like this:

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   833K   416M   416M   833K   416M   416M      0      0      0
     1K:  1.04M  1.22G  1.62G  1.04M  1.22G  1.62G      0      0      0
     2K:   889K  2.31G  3.94G   889K  2.31G  3.94G      0      0      0
     4K:  2.63M  11.6G  15.6G   963K  5.31G  9.25G      0      0      0
     8K:  2.01M  21.8G  37.4G  1.61M  18.6G  27.8G  4.83M  50.6G  50.6G
    16K:  1.44M  30.5G  67.9G  2.15M  40.6G  68.4G  2.77M  65.0G   116G
    32K:  1.46M  65.3G   133G   802K  36.3G   105G  1.49M  67.1G   183G
    64K:  3.08M   310G   443G  1.01M  93.3G   198G  1.64M   151G   334G
   128K:   401M  50.2T  50.6T   405M  50.7T  50.9T   403M  75.6T  75.9T
   256K:   602K   214G  50.8T   798K   287G  51.1T   734K   259G  76.1T
   512K:   451K   320G  51.2T   585K   411G  51.5T   542K   385G  76.5T
     1M:   331K   485G  51.6T   264K   371G  51.9T   385K   540G  77.1T
     2M:   373K  1.03T  52.7T   153K   432G  52.3T   360K  1.03T  78.1T
     4M:  7.14M  28.6T  81.2T  7.58M  30.3T  82.6T  7.35M  43.9T   122T
     8M:      0      0  81.2T      0      0  82.6T      0      0   122T
    16M:      0      0  81.2T      0      0  82.6T      0      0   122T

I believe the asize Cum. ( ͡° ͜ʖ ͡°) column is what we are concerned with here. A few observations:

  • LSIZE is logical size, which is the original block size before compression or any funny business.

  • PSIZE is physical size, which is how much space the block itself consumes on disk (after compression).

  • ASIZE is allocation size: the total physical space consumed to write out the block plus indexing overhead. It includes padding and parity, and obeys the vdev ashift as a lower limit and the dataset recordsize as the upper limit.

  • The asize columns for 512, 1K, 2K, and 4K are zero. 512, 1K and 2K make sense, because my ashift for everything in the pool is 12. Because the 4K row is also zero, I guess that the size counts are "counts for block sizes below this number". But I'm not sure yet. It could be some sort of padding weirdness.

  • Pending better understanding of wtf these numbers are doing exactly, at the very least I can easily redirect blocks up to 32K (or 64K?) because that only takes up 334G, and my special vdevs are 1TB enterprise Intel SATA SSDs. Hopefully I'm mistaken and I can set it to 64K. Either way this is awesome, I didn't realize I could get so much of the worst-performing blocks off my HDDs. (The redirection threshold is a per-dataset property; see the sketch after this list.)

  • When blocks get above 64K (128K?) the block count shoots to the moon. This is what makes me think the 4K block column is weird or I'm reading it wrong, because what seems to be happening is that the vast majority of my blocks are at the default 128K recordsize. This means I need to refactor my pool's recordsize, because the majority of it is archived video that would benefit from larger block sizes.

  • At the end it jumps again to 122T, so I guess it’s just tacking on the rest of the free space? This is weird.
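
For reference, the threshold that controls that redirection is the special_small_blocks dataset property; a minimal sketch with a hypothetical pool/dataset name (it only affects newly written blocks, and you generally want it below the dataset's recordsize unless you really do want every block on the special vdev):

zfs set special_small_blocks=64K tank/data
zfs get special_small_blocks tank/data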

As always, I’m probably wrong on some of this so feel free to correct me, and take what I say with a grain of salt. This is just my “current working knowledge”.

2 Likes

I’m using the zdb tool too, it’s really great.

I can confirm your assumption. I was running special vdevs and special_small_blocks set to 64k across the entire pool. The asize Cum. (or was it psize cum.?) was identical to zpool list -v special vdev output for allocation.

I only have two NVMe drives and later removed the special vdev from the pool (fasten your seatbelts for immediate evacuation process!) and set up the drives as L2ARC.

All those blocks will eventually end up in the (persistent) L2ARC anyway, so chances are good you read at NVMe speed in either case.

But special vdevs contribute to pool size, their blocks can't be evicted, they deliver performance even with a cold cache, and TB written is negligible so SSD wear is minimal. And when talking about (sync) writes, ARC+L2ARC won't help you, but NVMe special vdevs will. Storing metadata certainly helps with no-cache or cold-cache scenarios, but I guess there are also people with pools out there where metadata eviction from both ARC and L2ARC is a thing.

BUT: You can't set the small blocks property on a zvol, and that's really the biggest weakness in my case. It only works with datasets.

I’m reading the same. Around 400 million blocks with 128k sitting in your pool. If you feel like mass murdering tens of millions of innocent metadata entries and block pointers, go ahead.

That 122T is the total allocated space for that pool as seen in zpool list -v poolname. At least for me it is. Don't know what you mean by free space though. You don't have any blocks larger than the ones in your 4M bucket, which are 43.9T asize; adding that to all blocks <4M (78.1T total asize), you get the 122T.

1 Like

2023 Edit: This script seemed to work with a freshly made pool, but doesn't seem to work with my pool, which is now more "mature" and has more data on the special vdevs than this script reports. As such, this script should be considered ultimately inaccurate until I have time far in the future to figure out what it's missing.

2022 Original Post: Ok so I think I’ve got something “close enough” that should tell you how much pure metadata a pool has. A special thanks to @rcxb and @oO.o for their help here: The small linux problem thread - #4956 by rcxb

//Generate zdb output file so we only have to run a slow command once:

zdb -PLbbbs kpool | tee ~/kpool_metadata_output.txt  

//Sum only the relevant numbers from the ASIZE column, removing rows with redundant numbers or things that aren’t metadata. Hopefully I got everything.

(
cat ~/kpool_metadata_output.txt \
| grep -B 9999 'L1 Total' \
| grep -A 9999 'ASIZE' \
| grep -v \
 -e 'L1 object array' -e 'L0 object array' \
 -e 'L1 bpobj' -e 'L0 bpobj' \
 -e 'L2 SPA space map' -e 'L1 SPA space map' -e 'L0 SPA space map' \
 -e 'L5 DMU dnode' -e 'L4 DMU dnode' -e 'L3 DMU dnode' -e 'L2 DMU dnode' -e 'L1 DMU dnode' -e 'L0 DMU dnode' \
 -e 'L0 ZFS plain file' -e 'ZFS plain file' \
 -e 'L2 ZFS directory' -e 'L1 ZFS directory' -e 'L0 ZFS directory' \
 -e 'L3 zvol object' -e 'L2 zvol object' -e 'L1 zvol object' -e 'L0 zvol object' \
 -e 'L1 SPA history' -e 'L0 SPA history' \
 -e 'L1 deferred free' -e 'L0 deferred free' \
| awk \
 '{sum+=$4} \
 END {printf "\nTotal Metadata\n %.0f Bytes\n" " %.2f GiB\n",sum,sum/1073741824}' \
)

Output

Total Metadata
 57844416512 Bytes
 53.87 GiB

Which is close enough to the 53.2G allocated that zpool list kpool -v gives me. It is possible that not all of my metadata is actually on the special vdev. If anyone finds any errors, let me know. Your metadata being over a TiB is a good indication the script is counting something it shouldn't.

So that combined with my findings here about the “new” histogram that shows up with zdb -bb (as detailed here: ZFS Metadata Special Device: Z - #94 by Log) should help people quickly figure out how large their drives need to be to hold metadata and small blocks.

@wendell
Curiously, I think I’ve somehow obtained the ability to edit your post. Is this on purpose or did I hack you?


8 Likes

this post is a wiki post, it's meant to be updated over time. I like your script, it's nice

4 Likes
grep -B 9999 'L1 Total' ~/kpool_metadata_output.txt
1 Like

Nothing wrong with using cat. In particular, grep’s output changes when you’re searching one file versus multiples. cat protects you from such unpleasant surprises.

Trawling reddit, I found some cautionary tales of both using and removing special vdevs.

I can’t yet post a link (forum noob), but a Google for the following will turn it up:

Special vdev questions - will it offload existing metadata and can i remove it if vdev is not mirrored

Search that page for "irretrievably corrupt" and "Removing your metadata special VDEV will corrupt your pool if you remove it before migrating the data".

Also, search GitHub openzfs for the issue:

Pool corruption after export when using special vdev #10612

That should be obvious to anyone using ZFS. If you pull your vdevs without telling ZFS, they're just treated as faulted vdevs, with all the consequences that may arise from that. And you certainly don't pull an entire top-level vdev during an evacuation process.
This isn't specific to special vdevs, but applies to any multi-disk RAID(-like) situation.

Non-reproducible issue. Read more than headlines next time.

yeah, pretty much sums up this fearmongering.

2 Likes

So i started using the special small block device a while ago…
the latter part of last year…
however i ran it at 16k special small blocks for a while because i wanted to see the performance gains… around the 50% capacity mark i reduced it to 4k.

it’s slowly but surely running out of capacity, and i initially figured that when i changed the block size, it would purge the larger 16k blocks, but this didn’t seem to happen…

is there some way to check or rebalance the blocks in the special small block device?

currently it’s a bit tricky for me to migrate the data… :smiley: because reasons.

i know i am most likely forced to simply expand the ssd devices’ capacity, for safety reasons…

but i was hoping there was a way to at least figure out if it actually still contains the 16k blocks, or some way to purge them without any danger to the stored data.

and yes i know i’m a disk short… i’m getting to that… lol

1 Like

Wanted to add that the raw benchmark numbers (read/write) do not do justice to the importance of special vdevs on HDD-based storage.

Been kicking the tires on it with a large number of mixed workloads and use cases, for a year, across 30 or so servers. And the difference can be really "night and day" in use cases where directory traversal occurs. Additionally, because my company's workload is constantly mirrored (HA requirements), it allowed me to do apples-to-apples comparisons.

  • Debugging and navigating TBs or PBs of data to debug an issue, in somewhat cold storage: ls commands dropped from 50+s to 3s in several folders.
  • rsync: unless you exclusively have large 10GB+ files, on a mixed workload with HDDs, directory traversal has a huge impact on rsync time, especially when your background processes trigger multiple rsyncs at once (HDD lookup contention).
  • Use NVMe where possible. Even a small "200GB" special vdev in RAID 1 makes a huge difference, even when the total metadata exceeds its storage space. Sure, the first few ls may be slow, but every time you "accidentally" go one step into a folder and out again (and ls again), you will feel the difference. If budget and redundancy are a huge concern, then go for 3 small mirrored special vdevs instead of 2 bigger ones.
  • SLOG: if you are already using this, you probably already have the SSDs required.

Also, in all the cases above, there is a factor of "when you need it, you really need it". E.g. cloning, or replicating data to replace a "down node that died due to Murphy".

Overall, from a productivity standpoint, it's really a no-brainer to toss in 2 (or 3) enterprise-grade NVMe drives just for a mirrored special vdev on any HDD-based storage server. The cost compared to the eventual productivity gains (when s*** happens) is huge. It's to the level where I refuse to work without a special vdev anymore.

If you are really that paranoid about write wear, you could always allocate a large buffer of free space to really stretch that wear leveling. I've been filling to only ~75% as a precaution (the 75% number has no basis, just plucked out of thin air).

PS: Because all our data is always redundant on a node level, having dual mirrored NVMe, is more than sufficient for our use case.

10 Likes

Is there a way to get this zdb functionality on Ubuntu 20.04 LTS? Running sudo zdb -bb tank does not generate similar output. I'm trying to identify how much fragmentation I have on my root dataset by figuring out the different sizes and how much space they take up.

Using a script I got these numbers, but I don't really know what they mean:

python3 zfs_frag.py data-forwardslash.txt
There are 71798 files. There are 8696287 blocks and 8624535 fragment blocks. There are 7708985 fragmented blocks 89.38%. There are 915550 contiguous blocks 10.62%.

@wendell I love this topic. Another reason why ZFS rules. I have really been enjoying doing stuff like find /pool/ -type f -printf "%M %s %t %p\n" > filedb to make little highly searchable databases of my entire storage pool. It runs pretty decently on spinning rust; over a million file entries are returned within a few minutes. With little "dbs" like this it's possible to generate things like disk-space consumption visualization trees, fuzzy text search (e.g. fzf) to look for files, etc.
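
Spelled out, that workflow looks roughly like this (the /pool/ path and filedb name come from the paragraph above; fzf is a separate tool you'd need installed):

find /pool/ -type f -printf "%M %s %t %p\n" > filedb   # one line per file: perms, size, mtime, path
fzf < filedb                                           # interactive fuzzy search over the listing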

Putting metadata on an SSD is indeed very attractive for speeding up metadata based workloads like this. I look forward to trying it and reaping the huge speed benefits. I wonder if we could take this another step further by leveraging RAM. The reason is that if we could make this not just 10x faster but 100x faster it could open up entirely new interaction paradigms.

Does ARC already cache metadata? Could we do something like:

  • make metadata "stickier" in ARC? Sounds like there is a way to set ARC to metadata only. Sounds interesting, but it also sounds like it would greatly negatively impact "normal" use cases. Something more nuanced may fare better.
  • pre-load metadata into RAM to get past even the initial NVMe access speed. I suppose, if the metadata fits in RAM, and if the above scheme could be a reality, this could be trivially accomplished by a startup script that just goes ahead and accesses all metadata (a rough sketch follows below).
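
Something like this would be a minimal sketch of such a warm-up script (the /tank mountpoint is hypothetical; it simply stats every file and directory so ZFS has to pull the relevant metadata into ARC):

find /tank -xdev \( -type f -o -type d \) -print0 | xargs -0 stat > /dev/null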

It does. There is a fraction of the ARC size reserved for metadata. This fraction is usually plenty. But if you see metadata getting evicted too often, you can change the values via tunables. I don't use it myself, so I don't have the tunable at hand, but the ZFS documentation should give you more insight there. If you are working with small recordsizes and the corresponding huge amounts of block pointers, etc., you may find that this inflated amount of metadata is more than your default ARC can handle.

Other than that, there are also things like secondarycache=metadata if you want more metadata cached. I would advise against primarycache=metadata for obvious reasons. Special vdevs are a solution, but ARC and L2ARC should be able to cache all your metadata except for very edge cases or if you went cheap on memory. I actually prefer L2ARC over special vdevs for my server, but the case for non-evictable metadata stored on fast SSD is a strong one for special vdevs, although I don't personally experience this in my use case as I have plenty of ARC+(persistent) L2ARC.
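
For reference, both cache properties are ordinary dataset properties and can be changed on a live pool (the dataset name here is hypothetical):

zfs set secondarycache=metadata tank/data    # L2ARC caches only metadata for this dataset
zfs get primarycache,secondarycache tank/data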

edit: Module Parameters — OpenZFS documentation

2 Likes

The special small block device is certainly a far more risky solution than just setting secondarycache=metadata.
i suppose that is why the option to do only metadata on the special small block device doesn’t exist.

ofc the benefit of the special small block device is the ability to also store small files rather than metadata only; it would be nice if the l2arc had a similar feature, where one could define the max size of the records stored in it.

i’m very happy with my special small block device because it soaks up all the small io.
but i’m sure there are many cases where recommending secondarycache=metadata would be the best choice…

i mean it can be a lot of trouble having a special small block device, while the l2arc one can basically pull while the pool is live and it doesn’t matter, because all the data is replicated on the data storage vdevs.
having to replace it, keeping it as a mirror… so many annoying downsides to the special small block device…

i will certainly say people should think twice about whether they really need it, and make sure they cannot manage with just an l2arc setup instead.

Non-evictable, as well as something that persists across reboots. There are talks about persistent L2ARC (and tbh idk how far along that is), but it's physically impossible to have a persistent L1ARC. Meanwhile, 4 (or 6, or 10) Samsung 980 Pros in a stripe+mirror config will be similar in speed while persisting.

Although this probably only really affects desktops instead of servers, and maybe desktops don't need a massive ARC anyway?

It is already implemented. I'm using it myself. Really helps keep the cache warm after a reboot. Persistent L1ARC would be great, but other than going into hibernate or using Intel NVDIMMs, we have to live with NVMe speed while the ARC is warming up (once every other week for my server, for security updates).

I find ARC very convenient on desktop. Keeps all applications and games in memory. Certainly better than your default pagecache as you also get things like compression.

There is a tunable to tell the L2ARC that it won't accept any data that is already present on a special vdev. Both are usually equal in speed, so no need to waste L2ARC on more redundant data than necessary.
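
If I'm not mistaken, the tunable being referred to is the l2arc_exclude_special module parameter (treat the exact name and availability as an assumption and check the OpenZFS Module Parameters page for your version):

echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special   # don't feed L2ARC with buffers that already live on the special vdev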

6 Likes

I’ve got some dumb questions… Sorry if it’s been already addressed in responses.

Looks like if we want to set up the special vdev (e.g. to put metadata on NVMe):

  • This must be done at pool creation?
  • Redundancy for the metadata must be done at this vdev? This is a huge drawback and added complexity.
  • Does this also mean that in such a configuration the other vdevs would not get metadata?

I had typed up a whole line of questions before I did more reading and realized these drawbacks of the special vdev. My line of reasoning was this: if we could put our metadata on NVMe, and if we also got the (desired) redundancy for metadata on the other HDD vdevs, then there are all these questions, since the added speed would allow the discrepancy between the metadata content on these devices to grow potentially unbounded if we slam the pool with write operations. Then what would happen when you cut power? …

Now that I understand that we have to make the choice to put all metadata on the special vdev from the beginning, that clearly sidesteps these thorny issues, and instead we need to provision for redundancy within the special vdev as losing the vdev means a lost pool, the same situation as with any other vdev. So it looks like my thought process was at least legitimate, that the issues there are truly thorny and ZFS had to design around them.

This also leads me to a better understanding of what happens with a secondarycache=metadata setup, which seems at this point more in line with what I’d envisioned in mind: As it is a mode for configuring L2ARC, which is a cache, the actual metadata stays on the data vdevs as usual and the L2ARC is being dedicated then to caching metadata.

  • Compared to the special vdev, the write performance of using secondarycache=metadata would not be enhanced, because metadata has to hit the slow data vdevs alongside the regular data, since that's where it all needs to go. This is fine and probably usually what we want. Indeed, since this is just a cache we're talking about, it would be ridiculous to expect writes to speed up.
  • Everything just makes so much more sense to me now! If you need better speed for small files, you clearly need the special vdev if you still hope to retain ZFS's data integrity properties. This is what pulls in the complexity that we have to figure out how to make this higher-performing special vdev itself redundant. Putting metadata on this vdev, from a design point of view, is merely an afterthought, albeit a super useful one.

Honestly this is a tough choice, whether to go with the special vdev or with secondarycache=metadata, since both are capable of speeding up metadata reads to NVMe speed (and I understand that metadata will also automatically get cached as necessary in ARC in RAM). Being able to seamlessly send smaller files to NVMe can speed things up a LOT; it wasn't something I thought I needed, but it now needs to be re-evaluated.

OK, one more dumb question, then:

  • Can we provision an NVMe partition and enable secondarycache=metadata L2ARC for an existing populated pool at a later time on-demand? That would be really nice.