ZFS Metadata Special Device: Z

Here’s something that has managed to escape my attention since June 2020: as of “9158 - Add a binning histogram of blocks to zdb” by sailnfool (Pull Request #10315, openzfs/zfs on GitHub), zdb gives you a histogram of block sizes when using -bb or greater (-bbb, -bbbb, etc.), so you can quickly determine exactly what size you can get away with setting the special vdev small block redirection (special_small_blocks) to.

Normally the output is in “human readable” numbers, like “64K”, but if you use -P the output will use exact numbers, like “65536”.

Adding -L supposedly cuts down on the time needed to generate the data, because it skips leak detection instead of walking through and verifying that everything is correct.
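For reference, a minimal sketch of the invocation that produces this kind of table (assuming a pool named “tank”; the pool name is a placeholder, and flag behavior can vary a bit between OpenZFS versions):

  # -bbb prints block statistics including the block-size histogram (-bb or more b's),
  # -L skips leak detection so it finishes faster, and the pool name goes last.
  zdb -bbb -L tank

  # Same thing with exact byte values instead of "64K"-style numbers:
  zdb -bbb -L -P tank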

My main pool’s output looks like this:

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   833K   416M   416M   833K   416M   416M      0      0      0
     1K:  1.04M  1.22G  1.62G  1.04M  1.22G  1.62G      0      0      0
     2K:   889K  2.31G  3.94G   889K  2.31G  3.94G      0      0      0
     4K:  2.63M  11.6G  15.6G   963K  5.31G  9.25G      0      0      0
     8K:  2.01M  21.8G  37.4G  1.61M  18.6G  27.8G  4.83M  50.6G  50.6G
    16K:  1.44M  30.5G  67.9G  2.15M  40.6G  68.4G  2.77M  65.0G   116G
    32K:  1.46M  65.3G   133G   802K  36.3G   105G  1.49M  67.1G   183G
    64K:  3.08M   310G   443G  1.01M  93.3G   198G  1.64M   151G   334G
   128K:   401M  50.2T  50.6T   405M  50.7T  50.9T   403M  75.6T  75.9T
   256K:   602K   214G  50.8T   798K   287G  51.1T   734K   259G  76.1T
   512K:   451K   320G  51.2T   585K   411G  51.5T   542K   385G  76.5T
     1M:   331K   485G  51.6T   264K   371G  51.9T   385K   540G  77.1T
     2M:   373K  1.03T  52.7T   153K   432G  52.3T   360K  1.03T  78.1T
     4M:  7.14M  28.6T  81.2T  7.58M  30.3T  82.6T  7.35M  43.9T   122T
     8M:      0      0  81.2T      0      0  82.6T      0      0   122T
    16M:      0      0  81.2T      0      0  82.6T      0      0   122T
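To spell out how that table reads: column 1 is the block size bucket, then Count/Size/Cum. for psize, lsize, and asize, in that order. If you want to pull a single row back out of the output, something like this works (a sketch that assumes the human-readable format shown above, with “tank” as a placeholder pool name and “64K:” only labeling this histogram row):

  # Fields 2-4 are psize, 5-7 are lsize, 8-10 are asize (each Count, Size, Cum.),
  # so $10 is the cumulative asize for the 64K bucket.
  zdb -bbb -L tank | awk '$1 == "64K:" { print "asize cum. at 64K:", $10 }'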

I believe the asize Cum. ( ͡° ͜ʖ ͡°) column is what we are concerned with here. A few observations:

  • LSIZE is logical size, which is the original block size before compression or any other funny business.

  • PSIZE is physical size, which is how much space the block itself consumes on disk (after compression).

  • ASIZE is allocation size: the total physical space consumed to write out the block plus indexing overhead. It includes padding and parity, and it obeys the vdev ashift as a lower limit and the dataset recordsize as the upper limit.

  • The asize columns for the 512, 1K, 2K, and 4K rows are all zero. 512, 1K, and 2K make sense, because my ashift for everything in the pool is 12 (4K minimum allocation). Because the 4K row is also zero, I’m guessing each row counts block sizes below that number rather than up to it. But I’m not sure yet; it could be some sort of padding weirdness.

  • Pending better understanding of wtf these numbers are doing exactly, at the very least I can easily redirect blocks up to 32K, and maybe up to 64K, because the cumulative asize at the 64K row is only 334G and my special vdevs are 1TB enterprise Intel SATA SSDs. Hopefully I’m mistaken about the conservative reading and I can set it to 64K. Either way this is awesome; I didn’t realize I could get so many of the worst performing blocks off my HDDs (the sketch after this list shows the relevant zfs set command).

  • When blocks get above 64K (the 128K row) the block count shoots to the moon. This is part of what makes me think the 4K row is weird or I’m reading it wrong, because what seems to be happening is that the vast majority of my blocks are at the default 128K recordsize. This means I need to refactor my pool’s recordsize, because the majority of it is archived video that would benefit from larger block sizes (also shown in the sketch after this list).

  • At the end the cumulative asize jumps again, to 122T at the 4M row, so I guess it’s just tacking on the rest of the free space? This is weird.
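For anyone wanting to act on this, here’s a minimal sketch of the two knobs discussed above. The pool/dataset names (“tank”, “tank/video”) are placeholders, and the thresholds are just the values from my numbers, not recommendations:

  # Send blocks at or below this size to the special vdev instead of the HDDs.
  # 64K assumes the ~334G cumulative asize fits on the special vdev with room
  # to spare; use 32K (~183G here) for more headroom.
  zfs set special_small_blocks=64K tank

  # Bigger recordsize for datasets that are mostly large sequential files
  # (archived video in my case).
  zfs set recordsize=1M tank/video

  # Confirm what's set
  zfs get special_small_blocks,recordsize tank tank/video

Both properties only affect newly written blocks, so existing data has to be rewritten (copied or replicated) before anything actually moves, and special_small_blocks should stay below the dataset’s recordsize or effectively everything lands on the special vdev.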

As always, I’m probably wrong on some of this so feel free to correct me, and take what I say with a grain of salt. This is just my “current working knowledge”.
