Yes. Regular hard drives don’t have PLP either, and that’s what normally stores metadata.
I used the pile of old 250GB Samsung EVOs I had lying around.
I found the calculation for the drive space needed for special vdevs a little confusing, to be honest; one zero here or there, and I ended up using online calculators to make sure I was calculating correctly. What I landed on was that for my 48T pool, a couple of 150G SSDs would be enough, just. As it turns out, it seems to be about 10 times what I need.
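(For reference, the back-of-the-envelope I ended up with, treating the commonly quoted figure of roughly 0.3% of pool capacity for metadata as an assumption rather than gospel: 48T x 0.003 ≈ 144G, hence “a couple of 150G SSDs, just”. The measured numbers further down came in closer to 0.03% of the data, which is where “about 10 times what I need” comes from.)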
Firstly, something else I didn’t see mentioned: if you type zpool list -v, it actually tells you how much is used on the special device.
Here’s mine:
I have 23TB to go on the copy, but the used metadata doesn’t really seem to be going up by much. The remaining files are larger files with a 1M record size, perhaps that has something to do with it.
Finally, I did add a few smaller datasets with dedup enabled to see what would happen. I haven’t noticed any extra RAM being required (yes, they’re new datasets with newly copied data) and I’m getting 1.1 as a dedup ratio across 347G of data. I assume 1.1 means 10%? So that would mean I saved 34.7G. I suspect one of the datasets, at 37G, is pretty much not deduplicating anything, but it’s fun to try out. I’d be pretty impressed if I saved 37G; that’s quite a lot really.
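(Doing the arithmetic on that: as I understand it the dedup ratio is referenced data divided by what’s actually stored, so 1.1x over 347G works out to roughly 347G / 1.1 ≈ 315G allocated, i.e. about 32G saved if the 347G is the referenced size, or about 35G if it’s the on-disk size. Either way, the same ballpark as the 34.7G above.)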
Anyway, I thought some real-world stats might be interesting, rather than just the theoretical ones.
Edit: The 23T finished and I’m still on 11.8G for the special device. So if it did use some of it, it must have been within 100M. At this rate, I’m going to have plenty of space to add some small files as well.
So, coming back for a final number: in the end (including the now 450G of dedup data) and with 46T on disk, I’m using a total of 13G of special device space. That’s a lot less than the calculations above suggested it would be. Yes, I’ve recopied the whole lot.
So now, it will be fun to see how many small files I have around and if it’s worth chucking those on there as well.
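Something like this gives a rough feel for it (GNU find; the path and the 64k threshold are placeholders for your mountpoint and whatever special_small_blocks value you’re considering, and it counts files rather than ZFS blocks, so it’s only a proxy):
find /mnt/pool -type f -size -64k | wc -l
find /mnt/pool -type f -size -64k -printf '%s\n' | awk '{sum+=$1} END {printf "%.2f GiB\n", sum/1073741824}'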
Also, the zdb command and the awk script return two different results on the pool, which is interesting.
zdb:
Block Size Histogram
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 87.7K 43.8M 43.8M 87.7K 43.8M 43.8M 0 0 0
1K: 164K 209M 253M 164K 209M 253M 0 0 0
2K: 92.5K 240M 493M 92.5K 240M 493M 0 0 0
4K: 1.37M 5.50G 5.98G 358K 1.49G 1.97G 0 0 0
8K: 562K 5.41G 11.4G 52.4K 604M 2.56G 1.70M 14.7G 14.7G
16K: 688K 15.1G 26.5G 376K 6.53G 9.09G 719K 12.8G 27.4G
32K: 1015K 43.3G 69.8G 56.9K 2.38G 11.5G 1016K 42.1G 69.5G
64K: 1.58M 150G 219G 30.5K 2.72G 14.2G 1.06M 93.2G 163G
128K: 28.6M 3.58T 3.79T 32.9M 4.12T 4.13T 29.7M 5.07T 5.22T
256K: 5.38K 2.15G 3.80T 10 3.25M 4.13T 3.32K 1.23G 5.23T
512K: 353K 283G 4.07T 15 9.87M 4.13T 39.9K 34.7G 5.26T
1M: 17.7M 17.7T 21.8T 18.1M 18.1T 22.2T 18.0M 24.0T 29.3T
2M: 0 0 21.8T 0 0 22.2T 0 0 29.3T
4M: 0 0 21.8T 0 0 22.2T 0 0 29.3T
8M: 0 0 21.8T 0 0 22.2T 0 0 29.3T
16M: 0 0 21.8T 0 0 22.2T 0 0 29.3T
script:
1k: 187402
2k: 67049
4k: 71165
8k: 46139
16k: 94997
32k: 72828
64k: 32618
128k: 43503
256k: 43383
512k: 19386
1M: 38933
2M: 50268
4M: 45346
8M: 39260
16M: 66647
32M: 20784
64M: 6496
128M: 8510
256M: 14412
512M: 9076
1G: 4869
2G: 1567
4G: 386
8G: 64
16G: 11
32G: 3
64G: 112
128G: 1
I’m going to go with zdb here, because I know I have no block size greater than 1M, which it agrees with and the script does not. (Presumably the script is binning by file size rather than ZFS block size, which would explain the entries above 1M: a multi-gigabyte file is still stored as 1M records.)
I’m getting this when trying to add the special vdev:
zpool add zp-rust special mirror /dev/sdk2 /dev/sdl2 /dev/sdm2
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 1 vs. 2 (3-way)
This is the pool I want to add it to:
zpool list -v zp-rust
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zp-rust 29.1T 45.8M 29.1T - - 0% 0% 1.00x ONLINE /mnt
raidz1 29.1T 45.8M 29.1T - - 0% 0.00% - ONLINE
f55e856d-b47d-4296-ab60-8ccdc34563dc - - - - - - - - ONLINE
8deb0a26-aa39-4b5e-9540-e18f8f67364f - - - - - - - - ONLINE
6766189b-ca03-4fda-b3ec-35456d06dac1 - - - - - - - - ONLINE
a1890772-f538-42ed-9a3c-65e6c859a2cc - - - - - - - - ONLINE
3bdf4b46-4fa7-42df-89b0-9b1dadb1b8a1 - - - - - - - - ONLINE
5898cde4-cdfb-49d2-8e42-5a9ff6b90b2d - - - - - - - - ONLINE
6d19921d-0841-4221-8448-951dcd175e32 - - - - - - - - ONLINE
d5248ee8-32dd-4c58-8148-3e04dd1e4187 - - - - - - - - ONLINE
I’d wait for more answers just in case, but I think your command is fine; ZFS warns you because mirror != raidz. It’s saying your mirror has 2-disk failure redundancy while your raidz only has 1-disk redundancy.
This error message is intended for the opposite case: adding a lower-redundancy vdev to a pool. Accidentally adding a single-disk vdev would be bad (unremovable, because you have a raidz vdev).
Okay, that makes sense. I’ll wait a little bit for confirmation.
I saw someone say to triple mirror the metadata, and since I’m using old SSDs I thought “why not?”
This is just ZFS’s way of saying “well, the pool is raidz1, are you sure you don’t also want raidz1-level redundancy for the metadata?” It correctly detected that you asked for 2-disk redundancy on the metadata but only have 1-disk redundancy on the z1 pool.
The whole pool can still only survive one drive failure. If a second drive fails, you have to be “lucky” enough for it to be a metadata drive; otherwise the whole pool is lost.
-f will force it to work
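That is, the same command as above with the override (only do this if you’ve accepted the redundancy mismatch it’s warning about):
zpool add -f zp-rust special mirror /dev/sdk2 /dev/sdl2 /dev/sdm2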
Mirror vdevs will automatically stripe reads across all three devices, which is a nice bonus.
Keeping the thread alive in the new year…
Just here to “confirm” some stuff I… “read on the internet”.
Even an aggressively set metadata pool is still going to require the main vdev to store files, at a minimum allocation of the ashift size x the vdev width.
Say that’s 128K with a RAIDz2 vdev of 8x HDDs… anything over 1M is going to be written to the main pool (with some mitigation by the ZIL)…?
Which will “protect me” from the worst of the MINIMUM THROUGHPUT speeds my main vdev would give with those tiny (metadata-sized) files… where spinning-drive performance drops off.
Can I ask for some expert eyes on the following stats and logic…?
Based on:
4TB - 6TB HD
Usually HGST Ultrastars
7200 rpm
~40-50% Full
100% Read or …
100% Write …
At what file size does the throughput drop below 50MB/s per HDD?
AND … how would the above drive roughly perform with these file sizes…?
100% R or 100% W … MB/s of 8KB files:
100% R or 100% W … MB/s of 16KB files:
100% R or 100% W … MB/s of 32KB files:
100% R or 100% W … MB/s of 64KB files:
100% R or 100% W … MB/s of 128KB files:
(my guesses)
8KB files: … 0.6MB/s
16KB files: … 1.5MB/s
32KB files: … 4MB/s
64KB files: … 8MB/s
128KB files: 16MB/s
256KB files: 32MB/s
512K files: 60MB/s
1MB files: ~ 100MB/s
Does that sound about accurate…? Thus, with an 8-HDD 7200rpm array, and even with a metadata (Fusion Pool) cutoff set to 128KB, it’d provide a “performance asymptote” approaching ~100MB/s (because of RAIDz2 overhead)…?
With a good ZIL… maybe I could get the “low asymptote” up to 150MB/s…?
Does that all make sense…?
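(For what it’s worth, the back-of-the-envelope behind those guesses, taking ~100-150 random IOPS for a 7200rpm drive and one seek per small file as assumptions: 8KB x ~100 IOPS ≈ 0.8MB/s, 128KB x ~125 IOPS ≈ 16MB/s, and somewhere past 1MB per file the drive converges on its sequential rate, so treat the numbers above as ballpark rather than measured.)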
Here’s something that has managed to escape my attention since June 2020: as of the following pull request (Add a binning histogram of blocks to zdb by sailnfool · Pull Request #10315 · openzfs/zfs · GitHub), zdb gives you a histogram of block sizes when using -bb or greater (-bbb, -bbbb, etc.), so you can quickly determine exactly what size you can get away with setting the special vdev small-block redirection to.
Normally the output is in “human readable” numbers, like “64K”, but if you use -P then the output will use full numbers, like “65536”.
Using -L supposedly cuts down on the time needed to generate the data, because it’s not trying to walk through and verify that everything is correct.
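For example, something along these lines should dump just the histogram (pool name is a placeholder, the -A window is a rough guess at the histogram length, and zdb generally needs root):
sudo zdb -Lbbb tank | grep -A 25 'Block Size Histogram'
sudo zdb -PLbbb tank | grep -A 25 'Block Size Histogram'    # same, but exact byte counts instead of 64K-style units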
My main pool’s output looks like this:
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 833K 416M 416M 833K 416M 416M 0 0 0
1K: 1.04M 1.22G 1.62G 1.04M 1.22G 1.62G 0 0 0
2K: 889K 2.31G 3.94G 889K 2.31G 3.94G 0 0 0
4K: 2.63M 11.6G 15.6G 963K 5.31G 9.25G 0 0 0
8K: 2.01M 21.8G 37.4G 1.61M 18.6G 27.8G 4.83M 50.6G 50.6G
16K: 1.44M 30.5G 67.9G 2.15M 40.6G 68.4G 2.77M 65.0G 116G
32K: 1.46M 65.3G 133G 802K 36.3G 105G 1.49M 67.1G 183G
64K: 3.08M 310G 443G 1.01M 93.3G 198G 1.64M 151G 334G
128K: 401M 50.2T 50.6T 405M 50.7T 50.9T 403M 75.6T 75.9T
256K: 602K 214G 50.8T 798K 287G 51.1T 734K 259G 76.1T
512K: 451K 320G 51.2T 585K 411G 51.5T 542K 385G 76.5T
1M: 331K 485G 51.6T 264K 371G 51.9T 385K 540G 77.1T
2M: 373K 1.03T 52.7T 153K 432G 52.3T 360K 1.03T 78.1T
4M: 7.14M 28.6T 81.2T 7.58M 30.3T 82.6T 7.35M 43.9T 122T
8M: 0 0 81.2T 0 0 82.6T 0 0 122T
16M: 0 0 81.2T 0 0 82.6T 0 0 122T
I believe the asize Cum. ( ͡° ͜ʖ ͡°) column is what we are concerned with here. A few observations:
LSIZE is logical size, which is the original block size before compression or any funny business.
PSIZE is physical size, which is how much space the block itself consumes on disk (after compression).
ASIZE is allocation size: the total physical space consumed to write out the block plus indexing overhead. It includes padding and parity, and obeys the vdev ashift as a lower limit and the dataset recordsize as the upper limit.
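(A concrete toy example of how I read those definitions: with ashift=12, a block whose psize is 5K still allocates two 4K sectors, so its asize is at least 8K; on raidz, parity sectors and allocation padding get added on top of that, which is why the asize column runs well ahead of psize in the table above.)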
The 512, 1K, 2K and 4K rows are blank in the asize columns. 512, 1K and 2K make sense, because my ashift for everything in the pool is 12. Because the 4K row is also blank, I guess that the rows mean “counts for block sizes below this number”. But I’m not sure yet; it could be some sort of padding weirdness.
Pending a better understanding of wtf these numbers are doing exactly, at the very least I can easily redirect blocks up to 32K (or 64K?), because they only take up 334G and my special vdevs are 1TB enterprise Intel SATA SSDs. Hopefully I’m mistaken and I can set it to 64K. Either way this is awesome; I didn’t realize I could get so much of the worst-performing blocks off my HDDs.
When blocks get above 64K (128K?) the block count shoots to the moon. This is what makes me think the 4K row is weird or I’m reading it wrong, because what seems to be happening is that the vast majority of my blocks are at the default 128K recordsize. This means I need to rethink my pool’s recordsizes, because the majority of the data is archived video that would benefit from larger block sizes.
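For anyone following along, the two knobs being discussed here are per-dataset properties. A minimal sketch, with pool/dataset names as placeholders; note that both only affect newly written blocks, so existing data has to be rewritten (copied or zfs send/received) to pick them up:
zfs set special_small_blocks=32K tank/some-dataset
zfs set recordsize=1M tank/archived-video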
At the end it jumps again to 122T, so I guess it’s just tacking on the rest of the free space? This is weird.
As always, I’m probably wrong on some of this so feel free to correct me, and take what I say with a grain of salt. This is just my “current working knowledge”.
I’m using the zdb tool too, it’s really great.
I can confirm your assumption. I was running special vdevs with special_small_blocks set to 64k across the entire pool. The asize Cum. (or was it psize Cum.?) value was identical to the special vdev allocation shown by zpool list -v.
I only have two NVMe drives and later removed the special vdev from the pool (fasten your seatbelts for immediate evacuation process!) and set up the drives as L2ARC.
All those blocks will eventually end up in the (persistent)L2ARC anyway, so chances are good you read at NVMe speed in either case.
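(If anyone wants to replicate that swap, it was roughly the following; pool, vdev and device names are placeholders, mirror-1 being whatever name zpool status shows for the special mirror, and device removal only works if the pool layout allows it, e.g. no raidz data vdevs and matching ashift:)
zpool remove tank mirror-1              # evacuates the special vdev blocks back onto the pool
zpool add tank cache nvme0n1 nvme1n1    # re-add the NVMe drives as L2ARC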
But special vdevs contribute to pool size, their blocks can’t be evicted, they deliver performance even with a cold cache, and the TB written is negligible, so SSD wear is minimal. And when talking about (sync) writes, ARC+L2ARC won’t help you, but NVMe vdevs will. Storing metadata certainly helps with no-cache or cold-cache scenarios, but I guess there are also people with pools out there where metadata eviction from both ARC and L2ARC is a thing.
BUT: you can’t set the small-blocks property on a zvol, and that’s really the biggest weakness in my case. It only works with datasets.
I’m reading the same. Around 400 million blocks with 128k sitting in your pool. If you feel like mass murdering tens of millions of innocent metadata entries and block pointers, go ahead.
That 122T is the total allocated space for that pool as seen in zpool list -v poolname. At least for me it is. Don’t know what you mean by free space though. You don’t have any larger blocks than the ones in your 4M dataset(s), which are 43.9T asize, and adding that up with all blocks <4M (78.1T total asize), you get the 122T.
2022 Original Post: Ok so I think I’ve got something “close enough” that should tell you how much pure metadata a pool has. A special thanks to @rcxb and @oO.o for their help here: The small linux problem thread - #4956 by rcxb
# Generate the zdb output file so we only have to run a slow command once:
zdb -PLbbbs kpool | tee ~/kpool_metadata_output.txt
# Sum only the relevant numbers from the ASIZE column, removing rows with redundant numbers or things that aren't metadata. Hopefully I got everything.
(
cat ~/kpool_metadata_output.txt \
| grep -B 9999 'L1 Total' \
| grep -A 9999 'ASIZE' \
| grep -v \
-e 'L1 object array' -e 'L0 object array' \
-e 'L1 bpobj' -e 'L0 bpobj' \
-e 'L2 SPA space map' -e 'L1 SPA space map' -e 'L0 SPA space map' \
-e 'L5 DMU dnode' -e 'L4 DMU dnode' -e 'L3 DMU dnode' -e 'L2 DMU dnode' -e 'L1 DMU dnode' -e 'L0 DMU dnode' \
-e 'L0 ZFS plain file' -e 'ZFS plain file' \
-e 'L2 ZFS directory' -e 'L1 ZFS directory' -e 'L0 ZFS directory' \
-e 'L3 zvol object' -e 'L2 zvol object' -e 'L1 zvol object' -e 'L0 zvol object' \
-e 'L1 SPA history' -e 'L0 SPA history' \
-e 'L1 deferred free' -e 'L0 deferred free' \
| awk \
'{sum+=$4} \
END {printf "\nTotal Metadata\n %.0f Bytes\n" " %.2f GiB\n",sum,sum/1073741824}' \
)
Output
Total Metadata
57844416512 Bytes
53.87 GiB
Which is close enough to the 53.2G allocated that zpool list kpool -v gives me. It is possible that not all of my metadata is actually on the special vdev. If anyone finds any errors, let me know. Your metadata being over a TiB is a good indication the script is counting something it shouldn’t.
So that, combined with my findings here about the “new” histogram that shows up with zdb -bb (as detailed here: ZFS Metadata Special Device: Z - #94 by Log), should help people quickly figure out how large their drives need to be to hold metadata and small blocks.
@wendell
Curiously, I think I’ve somehow obtained the ability to edit your post. Is this on purpose or did I hack you?
This post is a wiki post; it’s meant to be updated over time. I like your script, it’s nice.
grep -B 9999 'L1 Total' ~/kpool_metadata_output.txt
Nothing wrong with using cat. In particular, grep’s output changes when you’re searching one file versus multiples. cat protects you from such unpleasant surprises.
Trawling Reddit, I found some cautionary tales of both using and removing a special vdev.
I can’t yet post a link (forum noob), but a Google for the following will turn it up:
Special vdev questions - will it offload existing metadata and can i remove it if vdev is not mirrored
Search that page for “irretrievably corrupt” and “Removing your metadata special VDEV will corrupt your pool if you remove it before migrating the data”.
Also, search GitHub openzfs for the issue:
Pool corruption after export when using special vdev #10612
That should be obvious for anyone using ZFS. If you pull out your vdevs without telling ZFS, they’re just treated as faulted vdevs, with all the consequences that may arise from that. And you certainly don’t pull out an entire top-level vdev during an evacuation process.
This isn’t specific to special vdevs; it applies to any multi-disk RAID(-like) situation.
Non-reproducible issue. Read more than the headlines next time.
Yeah, that pretty much sums up this fearmongering.
So I started using the special small-block device a while ago, in the latter part of last year.
However, I ran it with 16k special small blocks for a while because I wanted to see the performance gains; around the 50% mark I reduced it to 4k.
It’s slowly but surely running out of capacity, and I initially figured that when I changed the block size it would purge the larger 16k blocks, but this didn’t seem to happen.
Is there some way to check or rebalance the blocks on the special small-block device?
Currently it’s a bit tricky for me to migrate the data… because reasons.
I know I’m most likely forced to simply expand the SSD devices’ capacity, for safety reasons, but I was hoping there was a way to at least figure out whether it actually still contains the 16k blocks, or some way to purge them, without any danger to the stored data.
And yes, I know I’m a disk short… I’m getting to that… lol
Wanted to add that the raw benchmark numbers (read/write) do not do justice to the importance of special vdevs on HDD-based storage.
I’ve been kicking the tires on it with a large number of mixed workloads and use cases, for a year, across 30 or so servers, and the difference can be really “night and day” in use cases where directory traversal occurs. Additionally, because my company’s workload is constantly mirrored (HA requirements), it allowed me to do apples-to-apples comparisons.
ls: commands dropped from 50+s to 3s in several folders.
rsync: unless you exclusively have large 10GB+ files, on a mixed workload with HDDs directory traversal has a huge impact on rsync time, especially when your background processes trigger multiples of it (HDD lookup contention).
ls: may be slow, but every time you “accidentally” go one step into a folder and out again (and ls again), you will feel the difference. If budget and redundancy are a huge concern, go for 3 small mirrored special vdevs instead of 2 bigger ones.
SLOG: if you are already using this, you probably already have the SSDs required.
Also, in all the cases above, there is a factor of “when you need it, you really need it”. E.g. cloning, or replicating data to replace a down node that died due to Murphy.
Overall, from a productivity standpoint, it’s really a no-brainer to toss in 2 (or 3) enterprise-grade NVMe drives just for a mirrored special vdev on any HDD-based storage server. The cost versus the eventual productivity gain (when s*** happens) is hugely in its favour, to the level where I refuse to work without a special vdev anymore.
If you are really that paranoid about write wear, you could always allocate a large buffer of free space to really stretch the wear leveling. I’ve been doing ~75% as a precaution (the 75% number has no basis, just plucked out of thin air).
PS: Because all our data is always redundant at the node level, having dual mirrored NVMe is more than sufficient for our use case.
Is there a way to get this zdb functionality on Ubuntu 20.04 LTS? Running sudo zdb -bb tank does not generate similar output. I’m trying to identify how much fragmentation I have on my root dataset by figuring out the different sizes and how much space they take up.
Using a script I got these numbers, but I don’t really know what they mean:
python3 zfs_frag.py data-forwardslash.txt
There are 71798 files. There are 8696287 blocks and 8624535 fragment blocks. There are 7708985 fragmented blocks 89.38%. There are 915550 contiguous blocks 10.62%.
@wendell I love this topic. Another reason why ZFS rules. I have really been enjoying doing stuff like find /pool/ -type f -printf "%M %s %t %p\n" > filedb to make little, highly searchable databases of my entire storage pool. It runs pretty decently on spinning rust; over a million file entries are returned within a few minutes. With little “dbs” like this it’s possible to generate things like disk-space consumption visualization trees, fuzzy text search (e.g. fzf) to search for files, etc.
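A couple of purely illustrative things you can do with a flat index like that, assuming the %M %s %t %p field order above (so field 2 is the size in bytes):
awk '{sum+=$2} END {printf "%.1f GiB\n", sum/1073741824}' filedb    # total size of everything indexed
fzf < filedb    # fuzzy-search the whole pool by name/path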
Putting metadata on an SSD is indeed very attractive for speeding up metadata based workloads like this. I look forward to trying it and reaping the huge speed benefits. I wonder if we could take this another step further by leveraging RAM. The reason is that if we could make this not just 10x faster but 100x faster it could open up entirely new interaction paradigms.
Does ARC already cache metadata? Could we do something like: