ZFS Metadata Special Device: Z

Yes. Regular hard drives don't have PLP (power-loss protection), and that's what normally stores metadata.

I used the pile of old 250 GB Samsung EVOs I had lying around.


I found the calculation for hard drive space needed for special vdevs a little confusing to be honest (one zero here or there), so I ended up using online calculators to make sure I was calculating correctly. What I landed on was that for my 48T pool, a couple of 150G SSDs would be just enough. As it turns out, that seems to be about 10 times what I need.

Firstly, something else I didn't see mentioned: if you type zpool list -v, it actually tells you how much is used on the special device.
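For anyone following along, that check is just the following (the pool name here is a placeholder; the special mirror gets its own line in the output, and its ALLOC column is the number that grows):

# "tank" is a stand-in for your pool name.
zpool list -v tank

# Handy while a big copy is running, to watch the metadata allocation creep up:
watch -n 60 'zpool list -v tank'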

Here’s mine:

I have 23TB to go on the copy, but the used metadata doesn’t really seem to be going up by much. The remaining files are larger files with a 1M record size, perhaps that has something to do with it.

Finally, I did add a few smaller datasets with dedup enabled to see what would happen. I haven't noticed any extra RAM required (yes, they're new datasets with newly copied data) and I'm getting 1.1 as a dedup ratio across 347G of data. I assume 1.1 means roughly 10%? So that would mean I saved about 34.7G. I suspect one of the datasets, at 37G, is pretty much not deduplicating anything, but it's fun to try out. I'd be pretty impressed if I saved 37G; that's quite a lot really.
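(For what it's worth, a dedup ratio is logical size divided by allocated size, so if the 347G is the pre-dedup size, 1.1 works out to a bit less than 10% saved. A quick arithmetic check, using only the figures from this post:)

# dedup ratio = logical / allocated  =>  allocated = logical / ratio
awk 'BEGIN {
  logical = 347; ratio = 1.1;        # numbers quoted above, in G
  allocated = logical / ratio;
  printf "allocated ~= %.1f G, saved ~= %.1f G\n", allocated, logical - allocated;
}'
# prints: allocated ~= 315.5 G, saved ~= 31.5 G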

Anyway, I thought some real-world stats might be interesting, rather than just theoretical ones.

Edit: The 23T finished and I'm still at 11.8G on the special device. So if it did use some of it, it must have been within 100M. At this rate, I'm going to have plenty of space to add some small files as well.


So, coming back for a final number: in the end (including the now 450G of dedup data) and with a disk size of 46T, I'm using a total of 13G of special device space. That's a lot less than I thought it was going to be based on the calculations above. Yes, I've recopied the whole lot.

So now it will be fun to see how many small files I have around and whether it's worth chucking those on there as well. :slight_smile:

Also, the zdb command and the awk script return two different results on the pool, which is interesting.

zdb:

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  87.7K  43.8M  43.8M  87.7K  43.8M  43.8M      0      0      0
     1K:   164K   209M   253M   164K   209M   253M      0      0      0
     2K:  92.5K   240M   493M  92.5K   240M   493M      0      0      0
     4K:  1.37M  5.50G  5.98G   358K  1.49G  1.97G      0      0      0
     8K:   562K  5.41G  11.4G  52.4K   604M  2.56G  1.70M  14.7G  14.7G
    16K:   688K  15.1G  26.5G   376K  6.53G  9.09G   719K  12.8G  27.4G
    32K:  1015K  43.3G  69.8G  56.9K  2.38G  11.5G  1016K  42.1G  69.5G
    64K:  1.58M   150G   219G  30.5K  2.72G  14.2G  1.06M  93.2G   163G
   128K:  28.6M  3.58T  3.79T  32.9M  4.12T  4.13T  29.7M  5.07T  5.22T
   256K:  5.38K  2.15G  3.80T     10  3.25M  4.13T  3.32K  1.23G  5.23T
   512K:   353K   283G  4.07T     15  9.87M  4.13T  39.9K  34.7G  5.26T
     1M:  17.7M  17.7T  21.8T  18.1M  18.1T  22.2T  18.0M  24.0T  29.3T
     2M:      0      0  21.8T      0      0  22.2T      0      0  29.3T
     4M:      0      0  21.8T      0      0  22.2T      0      0  29.3T
     8M:      0      0  21.8T      0      0  22.2T      0      0  29.3T
    16M:      0      0  21.8T      0      0  22.2T      0      0  29.3T

script:

  1k: 187402
  2k:  67049
  4k:  71165
  8k:  46139
 16k:  94997
 32k:  72828
 64k:  32618
128k:  43503
256k:  43383
512k:  19386
  1M:  38933
  2M:  50268
  4M:  45346
  8M:  39260
 16M:  66647
 32M:  20784
 64M:   6496
128M:   8510
256M:  14412
512M:   9076
  1G:   4869
  2G:   1567
  4G:    386
  8G:     64
 16G:     11
 32G:      3
 64G:    112
128G:      1

I'm going to go with zdb here, because I know I have no block size greater than 1M; zdb agrees with that and the script does not.


I’m getting this when trying to add the special vdev:

zpool add zp-rust special mirror /dev/sdk2 /dev/sdl2 /dev/sdm2
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 1 vs. 2 (3-way)

This is the pool I want to add it to:

zpool list -v zp-rust
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zp-rust                                   29.1T  45.8M  29.1T        -         -     0%     0%  1.00x    ONLINE  /mnt
  raidz1                                  29.1T  45.8M  29.1T        -         -     0%  0.00%      -    ONLINE
    f55e856d-b47d-4296-ab60-8ccdc34563dc      -      -      -        -         -      -      -      -    ONLINE
    8deb0a26-aa39-4b5e-9540-e18f8f67364f      -      -      -        -         -      -      -      -    ONLINE
    6766189b-ca03-4fda-b3ec-35456d06dac1      -      -      -        -         -      -      -      -    ONLINE
    a1890772-f538-42ed-9a3c-65e6c859a2cc      -      -      -        -         -      -      -      -    ONLINE
    3bdf4b46-4fa7-42df-89b0-9b1dadb1b8a1      -      -      -        -         -      -      -      -    ONLINE
    5898cde4-cdfb-49d2-8e42-5a9ff6b90b2d      -      -      -        -         -      -      -      -    ONLINE
    6d19921d-0841-4221-8448-951dcd175e32      -      -      -        -         -      -      -      -    ONLINE
    d5248ee8-32dd-4c58-8148-3e04dd1e4187      -      -      -        -         -      -      -      -    ONLINE

I'd wait for more answers just in case, but I think your command is fine and ZFS warns you because mirror != raidz. It's saying your mirror has two-disk-failure redundancy while your raidz only has one-disk redundancy.

This error message is intended for the opposite case: adding a lower-redundancy vdev to a pool. Accidentally adding a single-disk vdev would be bad (and unremovable, because you have a raidz vdev).


Okay, that makes sense. I’ll wait a little bit for confirmation.

I saw someone say to triple mirror the metadata, and since I’m using old SSDs I thought “why not?”


This is just ZFS's way of saying "well, the pool is raidz1, are you sure you don't also want raidz1 for the metadata?" It correctly detected that you want two-drive redundancy on the metadata but only have one-drive redundancy on the z1 pool.

The whole pool can still only suffer one drive failure. If a second drive fails, you have to be "lucky" enough for it to be a metadata drive; otherwise the whole pool is lost.

Adding -f will force it to work.
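In this case that's just the original command with the force flag added:

zpool add -f zp-rust special mirror /dev/sdk2 /dev/sdl2 /dev/sdm2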

Mirror devices will automatically stripe reads across all three devices, which is a nice bonus.


Keeping the thread alive in the new year…

Just here to "confirm" some stuff I… "read on the internet".

Even an aggressively set metadata pool is still going to require the main vdev to store files bigger than the ashift size × the vdev width.

Say that's 128K with a RAIDz2 vdev of 8x HDDs… anything over 1M is going to be written to the main pool (with some mitigation by the ZIL)…?

Which will "protect me" from the worst of the minimum-throughput speeds my main vdev would give with those tiny (metadata-sized) files, where spinning-drive performance drops off.

Can I ask for some insight on the following stats and logic…?

Based on:
4TB - 6TB HD
Usually HGST Ultrastars
7200 rpm
~40-50% Full
100% Read or …
100% Write …

At what file size does the throughput drop below 50MB/s per HDD?

AND… how would the above drive roughly perform with these file sizes?
100% R or 100% W … MB/s of 8KB files:
100% R or 100% W … MB/s of 16KB files:
100% R or 100% W … MB/s of 32KB files:
100% R or 100% W … MB/s of 64KB files:
100% R or 100% W … MB/s of 128KB files:

(my guesses)
8KB files: … 0.6MB/s
16KB files: … 1.5MB/s
32KB files: … 4MB/s
64KB files: … 8MB/s
128KB files: 16MB/s
256KB files: 32MB/s
512K files: 60MB/s
1MB files: ~ 100MB/s

Does that sound about accurate…? Thus, with an
8-HDD 7200rpm array, and even a metadata (Fusion Pool) threshold set to 128KB,
it'd provide a "performance asymptote" approaching ~100MB/s (because of RAIDz2 overhead)…?

With a good ZIL … maybe I could get the “low asymptote” up to 150MB/s …?

Does that all make sense…?
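For what it's worth, here's a back-of-the-envelope model (assuming roughly 8 ms average seek plus ~4.2 ms rotational latency at 7200 rpm, ~120 MB/s sequential transfer, and one random I/O per file; all of those figures are rough assumptions, not measurements):

# Effective throughput ~= file_size / (seek_time + transfer_time)
awk 'BEGIN {
  seek_ms = 8 + 4.2; seq_mbs = 120;
  n = split("8 16 32 64 128 256 512 1024", kb, " ");
  for (i = 1; i <= n; i++) {
    xfer_ms = kb[i] / 1024 / seq_mbs * 1000;
    printf "%5d KB files: %6.1f MB/s\n", kb[i], (kb[i] / 1024) / ((seek_ms + xfer_ms) / 1000);
  }
}'

On those assumptions it lands around 0.6 MB/s at 8KB and only crosses 50 MB/s per drive somewhere around the 1MB file size, before any RAIDz2 overhead. So the small-file guesses look plausible, while the larger sizes may be a bit optimistic.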

Here's something that has managed to escape my attention since June 2020: as of the pull request "9158 - Add a binning histogram of blocks to zdb by sailnfool · Pull Request #10315 · openzfs/zfs · GitHub", zdb gives you a histogram of block sizes when using -bb or greater (-bbb, -bbbb, etc.), so you can quickly determine exactly what size you can get away with setting the special vdev small-block redirection to.

Normally the output is in "human readable" numbers, like "64K", but if you use -P then the output will use full numbers, like "65536".

Using -L supposedly cuts down on the time needed to generate the data, because it's not trying to walk through and verify that everything is correct.
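Putting those flags together, something like this should dump just the histogram (the pool name is a placeholder):

# -bbb = extended block stats (includes the histogram), -L = skip leak checking.
# Add -P if you'd rather have exact byte counts than "64K"-style numbers.
zdb -Lbbb tank | grep -A 25 'Block Size Histogram'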

My main pool’s output looks like this:

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   833K   416M   416M   833K   416M   416M      0      0      0
     1K:  1.04M  1.22G  1.62G  1.04M  1.22G  1.62G      0      0      0
     2K:   889K  2.31G  3.94G   889K  2.31G  3.94G      0      0      0
     4K:  2.63M  11.6G  15.6G   963K  5.31G  9.25G      0      0      0
     8K:  2.01M  21.8G  37.4G  1.61M  18.6G  27.8G  4.83M  50.6G  50.6G
    16K:  1.44M  30.5G  67.9G  2.15M  40.6G  68.4G  2.77M  65.0G   116G
    32K:  1.46M  65.3G   133G   802K  36.3G   105G  1.49M  67.1G   183G
    64K:  3.08M   310G   443G  1.01M  93.3G   198G  1.64M   151G   334G
   128K:   401M  50.2T  50.6T   405M  50.7T  50.9T   403M  75.6T  75.9T
   256K:   602K   214G  50.8T   798K   287G  51.1T   734K   259G  76.1T
   512K:   451K   320G  51.2T   585K   411G  51.5T   542K   385G  76.5T
     1M:   331K   485G  51.6T   264K   371G  51.9T   385K   540G  77.1T
     2M:   373K  1.03T  52.7T   153K   432G  52.3T   360K  1.03T  78.1T
     4M:  7.14M  28.6T  81.2T  7.58M  30.3T  82.6T  7.35M  43.9T   122T
     8M:      0      0  81.2T      0      0  82.6T      0      0   122T
    16M:      0      0  81.2T      0      0  82.6T      0      0   122T

I believe the asize Cum. ( ͡° ͜ʖ ͡°) column is what we are concerned with here. A few observations:

  • LSIZE is logical size, which is the original block size before compression or any funny business.

  • PSIZE is physical size, which is how much space the block itself consumes on disk (after compression).

  • ASIZE is allocation size: the total physical space consumed to write out the block, plus indexing overhead. It includes padding and parity, and obeys the vdev ashift as a lower limit and the dataset recordsize as the upper limit.

  • The asize entries for 512, 1K, 2K, and 4K are empty. 512, 1K and 2K make sense, because my ashift for everything in the pool is 12. Because the 4K column is also empty, I guess the counts are "counts for block sizes below this number", but I'm not sure yet. It could be some sort of padding weirdness.

  • Pending a better understanding of what exactly these numbers are doing, at the very least I can easily redirect blocks up to 32K (or 64K?) because they only take up 334G, and my special vdevs are 1TB enterprise Intel SATA SSDs. Hopefully I'm mistaken and I can set it to 64K (see the sketch at the end of this post). Either way this is awesome; I didn't realize I could get so much of the worst-performing blocks off my HDDs.

  • When blocks get above 64K (128K?), the block count shoots to the moon. This is what makes me think the 4K block column is weird or I'm reading it wrong, because what seems to be happening is that the vast majority of my blocks are at the default 128K recordsize. This means I need to refactor my pool's recordsize, because the majority of it is archived video that would benefit from larger block sizes.

  • At the end it jumps again to 122T, so I guess it’s just tacking on the rest of the free space? This is weird.

As always, I’m probably wrong on some of this so feel free to correct me, and take what I say with a grain of salt. This is just my “current working knowledge”.
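On the "redirect blocks up to 32K or 64K" point above, the knob itself is per-dataset and only applies to newly written blocks, so roughly (the dataset name and the 64K threshold here are just examples):

# Send data blocks <= 64K (plus all metadata) to the special vdev for this dataset.
zfs set special_small_blocks=64K tank/somedataset
zfs get special_small_blocks tank/somedataset

# Existing data stays where it is until it gets rewritten (copy/move or zfs send|recv).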


I'm using the zdb tool too; it's really great.

I can confirm your assumption. I was running special vdevs with special_small_blocks set to 64k across the entire pool. The asize Cum. (or was it psize Cum.?) was identical to the allocation shown for the special vdev in zpool list -v.

I only have two NVMe drives and later removed the special vdev from the pool (fasten your seatbelts for the immediate evacuation process!) and set the drives up as L2ARC.
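For reference, that dance looks roughly like this (pool, vdev and device names are placeholders; top-level vdev removal only works when the pool has no raidz top-level vdevs):

zpool remove tank mirror-1             # starts evacuating the special vdev back onto the pool
zpool status tank                      # shows the removal/evacuation progress
zpool add tank cache nvme0n1 nvme1n1   # afterwards, reuse the NVMe drives as L2ARC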

All those blocks will eventually end up in the (persistent) L2ARC anyway, so chances are good you'll read at NVMe speed in either case.

But special vdevs contribute to pool size, their blocks can't be evicted, they deliver performance even with a cold cache, and the TB written is negligible, so SSD wear is minimal. And when talking about (sync) writes, ARC+L2ARC won't help you, but NVMe vdevs will. Storing metadata certainly helps in no-cache or cold-cache scenarios, but I guess there are also people out there with pools where metadata eviction from both ARC and L2ARC is a thing.

BUT: you can't set the small-blocks property on a zvol, and that's really the biggest weakness in my case. It only works with datasets.

I'm reading it the same way: around 400 million 128k blocks sitting in your pool. If you feel like mass-murdering tens of millions of innocent metadata entries and block pointers, go ahead.

That 122T is the total allocated space for that pool, as seen in zpool list -v poolname. At least for me it is. I don't know what you mean by free space, though. You don't have any blocks larger than the ones in your 4M dataset(s), which are 43.9T asize; add that to all the blocks <4M (78.1T total asize) and you get the 122T.


2023 Edit: This script seemed to work with a freshly made pool, but it doesn't seem to work with my pool, which is now more "mature" and has more data on the special vdevs than this script reports. As such, this script should be considered inaccurate until I have time, far in the future, to figure out what it's missing.

2022 Original Post: OK, so I think I've got something "close enough" that should tell you how much pure metadata a pool has. A special thanks to @rcxb and @oO.o for their help here: The small linux problem thread - #4956 by rcxb

# Generate the zdb output file so we only have to run a slow command once:

zdb -PLbbbs kpool | tee ~/kpool_metadata_output.txt  

# Sum only the relevant numbers from the ASIZE column, removing rows with redundant numbers or things that aren't metadata. Hopefully I got everything.

(
cat ~/kpool_metadata_output.txt \
| grep -B 9999 'L1 Total' \
| grep -A 9999 'ASIZE' \
| grep -v \
 -e 'L1 object array' -e 'L0 object array' \
 -e 'L1 bpobj' -e 'L0 bpobj' \
 -e 'L2 SPA space map' -e 'L1 SPA space map' -e 'L0 SPA space map' \
 -e 'L5 DMU dnode' -e 'L4 DMU dnode' -e 'L3 DMU dnode' -e 'L2 DMU dnode' -e 'L1 DMU dnode' -e 'L0 DMU dnode' \
 -e 'L0 ZFS plain file' -e 'ZFS plain file' \
 -e 'L2 ZFS directory' -e 'L1 ZFS directory' -e 'L0 ZFS directory' \
 -e 'L3 zvol object' -e 'L2 zvol object' -e 'L1 zvol object' -e 'L0 zvol object' \
 -e 'L1 SPA history' -e 'L0 SPA history' \
 -e 'L1 deferred free' -e 'L0 deferred free' \
| awk \
 '{sum+=$4} \
 END {printf "\nTotal Metadata\n %.0f Bytes\n" " %.2f GiB\n",sum,sum/1073741824}' \
)

Output

Total Metadata
 57844416512 Bytes
 53.87 GiB

That is close enough to the 53.2G allocated that zpool list kpool -v gives me. It is possible that not all of my metadata is actually on the special vdev. If anyone finds any errors, let me know. Your metadata being over a TiB is a good indication the script is counting something it shouldn't.

So that, combined with my findings here about the "new" histogram that shows up with zdb -bb (as detailed here: ZFS Metadata Special Device: Z - #94 by Log), should help people quickly figure out how large their drives need to be to hold metadata and small blocks.

@wendell
Curiously, I think I’ve somehow obtained the ability to edit your post. Is this on purpose or did I hack you?



This post is a wiki post; it's meant to be updated over time. I like your script, it's nice.

grep -B 9999 'L1 Total' ~/kpool_metadata_output.txt

Nothing wrong with using cat. In particular, grep's output changes when you're searching one file versus multiple files; cat protects you from such unpleasant surprises.
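A concrete illustration of the difference (the file names here are just examples):

grep 'L1 Total' kpool_metadata_output.txt               # bare matching lines
grep 'L1 Total' kpool_metadata_output.txt other.txt     # now every line is prefixed with "filename:"
grep -h 'L1 Total' kpool_metadata_output.txt other.txt  # -h suppresses the prefix again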

Trawling reddit, I found some cautionary tales about both using and removing special vdevs.

I can't post a link yet (forum noob), but a Google search for the following will turn it up:

Special vdev questions - will it offload existing metadata and can i remove it if vdev is not mirrored

Search that page for "irretrievably corrupt" and "Removing your metadata special VDEV will corrupt your pool if you remove it before migrating the data".

Also, search the openzfs GitHub for this issue:

Pool corruption after export when using special vdev #10612

That should be obvious to anyone using ZFS. If you pull out your vdevs without telling ZFS, they're just treated as faulted vdevs, with all the consequences that may arise from that. And you certainly don't pull out an entire top-level vdev during an evacuation process.
This isn't specific to special vdevs; it applies to any multi-disk RAID(-like) situation.

Non-reproducible issue. Read more than headlines next time.

Yeah, that pretty much sums up this fearmongering.


So I started using the special small block device a while ago…
the latter part of last year…
However, I ran it at 16k special small blocks for a while because I wanted to see the performance gains… around the 50% mark I reduced it to 4k.

It's slowly but surely running out of capacity, and I initially figured that when I changed the block size it would purge the larger 16k blocks, but this didn't seem to happen…

Is there some way to check or rebalance the blocks on the special small block device?

Currently it's a bit tricky for me to migrate the data… :smiley: because reasons.

I know I'm most likely forced to simply expand the SSD devices' capacity, for safety reasons…

but I was hoping there was a way to at least figure out if it actually still contains the 16k blocks, or some way to purge them,
without any danger to the stored data.

And yes, I know I'm a disk short… I'm getting to that… lol


Wanted to add that the raw benchmark numbers (read/write) don't do justice to the importance of special vdevs on HDD-based storage.

I've been kicking the tires on it with a large number of mixed workloads and use cases, for a year, across 30 or so servers. The difference can be really "night and day" in use cases where directory traversal occurs. Additionally, because my company's workload is constantly mirrored (HA requirements), it allowed me to do apples-to-apples comparisons.

  • Debugging and navigating TBs or PBs of data to debug an issue in somewhat cold storage: ls commands dropped from 50+ seconds to 3 seconds in several folders.
  • rsync: unless you're exclusively dealing with large 10GB+ files, directory traversal has a huge impact on rsync time on a mixed HDD workload, especially when your background processes trigger multiple instances of it (HDD lookup contention).
  • Use NVMe where possible; even a small "200GB" special vdev in RAID 1 makes a huge difference, even when the total metadata exceeds its storage space. Sure, the first few ls calls may be slow, but every time you "accidentally" go one step into a folder and out again (and ls again), you will feel the difference. If budget and redundancy are a huge concern, then go for 3 small mirrored special vdev drives instead of 2 bigger ones.
  • SLOG: if you are already using this, you probably already have the SSDs required.

Also, in all the cases above, there is a factor of "when you need it, you really need it". E.g. cloning, or replicating data to replace a down node that died thanks to Murphy.

Overall, from a productivity standpoint, it's really a no-brainer to toss in 2 (or 3) enterprise-grade NVMe drives just for a mirrored special vdev on any HDD-based storage server. The cost versus the eventual productivity gain (when s*** happens) is hugely in your favour. It's to the level where I refuse to work without a special vdev anymore.

If you are really that paranoid about write wear, you could always allocate a large buffer of free space to really stretch that wear leveling. I've been doing ~75% as a precaution (the 75% number has no basis; it's just plucked out of thin air).

PS: Because all our data is always redundant at the node level, having dual mirrored NVMe drives is more than sufficient for our use case.


Is there a way to get this zdb functionality on Ubuntu 20.04 LTS? Running sudo zdb -bb tank does not generate similar output. I'm trying to identify how much fragmentation I have on my root dataset by figuring out the different sizes and how much space they take up.

Using a script I got these numbers, but I don't really know what they mean:
python3 zfs_frag.py data-forwardslash.txt
There are 71798 files.
There are 8696287 blocks and 8624535 fragment blocks.
There are 7708985 fragmented blocks 89.38%.
There are 915550 contiguous blocks 10.62%.

@wendell I love this topic. Another reason why ZFS rules. I have really been enjoying doing stuff like find /pool/ -type f -printf "%M %s %t %p\n" > filedb to make little, highly searchable databases of my entire storage pool. It runs pretty decently on spinning rust; over a million file entries are returned within a few minutes. With little "dbs" like this it's possible to generate things like disk-space-consumption visualization trees, fuzzy text search (e.g. fzf) to look for files, etc.
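A rough sketch of that workflow, in case anyone wants to try it (the paths and the search steps are just examples):

# Build the index once; this is a full metadata walk over the pool.
find /pool/ -type f -printf '%M %s %t %p\n' > ~/filedb

# Then search the index instead of hitting the disks again:
grep -i 'holiday' ~/filedb     # plain text search
fzf < ~/filedb                 # interactive fuzzy search, if fzf is installed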

Putting metadata on an SSD is indeed very attractive for speeding up metadata-heavy workloads like this. I look forward to trying it and reaping the huge speed benefits. I wonder if we could take this another step further by leveraging RAM; if we could make this not just 10x faster but 100x faster, it could open up entirely new interaction paradigms.

Does ARC already cache metadata? Could we do something like the following (rough sketch after the list):

  • Make metadata "stickier" in ARC? It sounds like there is a way to set ARC to metadata only, which sounds interesting, but it also sounds like it would greatly hurt "normal" use cases. Something more nuanced may fare better.
  • Pre-load metadata into RAM to get past even the initial NVMe access latency. I suppose, if the metadata fits in RAM, and if the above scheme could be a reality, this could be trivially accomplished by a startup script that just goes ahead and accesses all the metadata.
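A rough sketch of both ideas using stock OpenZFS knobs (dataset and pool names are placeholders; note that primarycache=metadata stops file data from being cached in ARC for that dataset, so it really is a blunt instrument):

# 1. Cache only metadata (not file data) in ARC / L2ARC for a given dataset:
zfs set primarycache=metadata tank/filedb
zfs set secondarycache=metadata tank/filedb

# 2. Crude metadata pre-load at startup: stat-ing every file drags its dnode
#    and the directory blocks into ARC.
find /tank -type f -printf '%s %p\n' > /dev/null 2>&1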