ZFS Metadata Special Device: Z

pondermatic · April 13, 2023, 2:18pm

I can get the -f switch to work with a special mirror vdev,

zpool create -f tank /dev/sdb special mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

but not with a special raidz vdev.

zpool create -f tank /dev/sdb special raidz /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
invalid vdev specification: unsupported 'special' device: raidz

Exard3k · April 13, 2023, 2:36pm

RAIDZ isn’t supported for special VDEVs

There are some talks to add this feature, but RAIDZ with all its disadvantages usually isn’t worth it when talking about mostly tiny stuff like metadata or small blocks. Write amplification and general waste of space (similar to what you see with ZVOLs on RAIDZ) usually makes it a poor choice. Performance being another disadvantage.

You can see double or triple the space usage just because if you want to write a 4k block (assuming ashift=12 for 4k sectors) across like 5 disks and parity, you got a problem. Only works if you write it to each disk to ensure integrity, thus wasting space, making it a de-facto “5-way mirror with parity”. The wider the RAIDZ, the worse it gets. Just make a 4k blocksize ZVOL on a RAIDZ and check the space usage, it’s horrible.

Gregory_Fricker · April 13, 2023, 3:57pm

how much benefit would people expect to see adding a special device of optane 905p on a pool of sata SSDs to speed up io? Is this a waste of money or benefit? Run VMs and Kubernetes off this pool in addition to general SMB

Exard3k · April 13, 2023, 4:30pm

I doubt you will see a noticeable impact. Main reason to use it is to bypass the abyssmal HDD random read performance on small stuff. SATA SSDs may not have the bandwidth, but still have NAND flash level of random read/writes.
If your SATA drives are stretched to the limit, offloading some random I/O is useful, but so is adding more drives or memory.
And optional small blocks setting doesn’t apply to ZVOLs, so that won’t be of use to any of your virtual disks either.

And you need at least two of the 905p, because redundancy.

Gregory_Fricker · April 14, 2023, 4:22am

Thanks - I have an HDD pool of 3 Zvols of 6 drive raidz2 HDDs that has mirrored optane special metadata devices with small blocks for small xmp files as well as a separate optane L2arc - and that has really helped the pool.

My SSD pool is 6 zvols of mirrored SSDs with an optane L2arc - but I have been wondering if a special metadata device would be wasted or not - but it seems even for the small level writes being done by K8s and VMs “eating up” the random IO - I wonder if it could be better there.

Exard3k · April 14, 2023, 6:05am

You neglect the fact that ZFS uses transaction groups (txg) to gather up all writes in memory and basically convert random IO to sequential IO.

Just to give you a perspective for async writes on ZFS: I’ve been playing with various backup methods two weeks ago. Stream of like 5T of user data with everything in it, millions of small files. I was transferring that from 2x striped NVMe 4.0 drives via 10Gbit connection to my pool. Turned out that ZFS was able to write the small stuff faster to my HDDs than the NVMe drives could actually read them.That’s what they call ZFS magic. The magic of transaction groups (txg).

async writes are not a problem you have to deal with, txg’s take care of everything in memory before flushing it to disk as a sequential stream.

Problem starts with sync writes where we can’t gather up random stuff to optimize write pattern on underlying hardware. The application commands to immediately write it to disk although writing single random blocks is really bad even for flash. Workaround being LOG vdev or sync=disabled.

Optane LOG or disabling sync (fastest) all together might be the better approach than throwing money at underlying hardware if indeed sync writes are a thing on your pool.

This is why ZFS performs so well even on HDDs. Because there aren’t thousands of random writes happening in a properly setup pool, it will be one big chunk of data coming from DRAM to disks every 5 seconds (default settings).

That’s how you deal with writes.
You deal with (random) reads by caching stuff in memory.

Gregory_Fricker · April 16, 2023, 4:35am

thanks for this - appreciate the explanation

Gregory_Fricker · April 17, 2023, 10:56pm

Also, as I have been digging into my HDD spinning rust pool with triple mirrored intel 900p special metadata device vdev, and I see that the GUI made the special metadata device with ashift of 9 as the intel 900p reports 512byte sectors and cannot be programmed. Whats the consensus on mixed LBA size on metadata device vs spinning rust? I had really wanted this to be 4K to keep everything the same between vdevs.

Impatient_Crustacean · April 19, 2023, 9:23pm

Late to the party, but adding this for completeness:

According to github com openzfs zfs blob master module zfs arc.c line 7963, this is not the case (sorry can’t include links as a new user):

There is no eviction path from the ARC to the L2ARC. Evictions from the ARC behave as usual, freeing buffers and placing headers on ghost lists. The ARC does not send buffers to the L2ARC during eviction as this would add inflated write latencies for all ARC memory pressure.

Trooper_ish · April 19, 2023, 9:44pm

Thanks @Impatient_Crustacean . I don’t bother with L2arc myself, so am interested to learn.
Same with special device.

I have since read a Klara blog post about it…

~~one sec~~

does this look more like it:

How the L2ARC receives data

Another common misconception about the L2ARC is that it’s fed by ARC cache evictions. This isn’t quite the case. When the ARC is near full, and we are evicting blocks, we cannot afford to wait while we write data to the L2ARC, as that will block forward progress reading in the new data that an application actually wants right now. Instead we feed the L2ARC with blocks when they get near the bottom, making sure to stay ahead of eviction.

Exard3k · April 19, 2023, 9:47pm

Welcome, just keep posting!

Yeah there is an entire CPU thread to feed the L2ARC from so called Ghost MFU and Ghost MRU. Rather efficient mechanic to avoid ARC/DRAM performance being bogged down by disks.
If explaining L2ARC, most people don’t usually mention it (including me) for the sake of simplicity.

e97 · April 20, 2023, 9:49pm

Hi

I have a 64tb zvol (67% cap) raidz-2 and ls and find . -iname are slow … so the ZFS Special device sounds like it might solve my performance issues.

I ran both:

# zdb -Lbbbs <zpool_name>

and

find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

but I’m lost at what to do with the output, even chatgpt hasn’t been able to help.

I believe I need find total usuable space using the following: (total allocated space - parity) * some percentage.

@wendell says it should be about 0.3% of the pool size. However, in his example it shows 5tb/172tb which is 3%… which is correct?

From the zdbb Which row(s) should I be looking at for metadata ASIZE?

Thanks

Exard3k · April 20, 2023, 10:11pm

0.3% is assuming ~128-256k records. Your average Zvol will produce massively more metadata depending on the volblocksize. 5T might be correct, depends on the rest of the pool and their record/volblocksize.
special vdev is made for these kinds of Zvols. You can’t really cache that amount of stuff.

Out of curiosity. How wide is your RAIDZ and how much space does the 67T Zvol actually use on the pool?

e97 · April 20, 2023, 10:59pm

Its a 10 x 14TB raidz2. Its an older system with dual Xeon v2 and ~250GB RAM

# zfs get recordsize tank
NAME     PROPERTY    VALUE    SOURCE
tank  recordsize  128K     default

# zfs list
NAME                USED  AVAIL     REFER  MOUNTPOINT
tank                62.7T  31.3T      219K  /tank
tank/data           38.9T  31.3T     38.9T  /tank/data
tank/database       13.1T  31.3T     13.1T  /tank/database
tank/surveillance   10.7T  31.3T     10.7T  /tank/surveillance

# zpool list -o capacity,alloc,free,expandsize,dedupratio,health
  CAP  ALLOC   FREE  EXPANDSZ  DEDUP    HEALTH
  64%  82.3T  45.0T         -  1.00x    ONLINE

output of zdb and find:

$ zdb -Lbbbs tank

Traversing all blocks ...

82.3T completed (4059MB/s) estimated time remaining: 10691hr 22min 00sec
        bp count:             689534270
        ganged count:                 0
        bp logical:      81996814331904      avg: 118916
        bp physical:     67222308160000      avg:  97489     compression:   1.22
        bp allocated:    90447276257280      avg: 131171     compression:   0.91
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    90432375140352     used: 64.60%

        additional, non-pointer bps of type 0:    8287590
         number of (compressed) bytes:  number of bps
                         17:  11464 *
                         18:  36819 *
                         19:   7108 *
                         20:  14007 *
                         21:  12185 *
                         22:   8179 *
                         23:   8118 *
                         24:  31013 *
                         25:  14916 *
                         26:   7324 *
                         27:   7676 *
                         28:   9251 *
                         29:   8500 *
                         30:   1491 *
                         31:  13204 *
                         32:  22783 *
                         33: 173986 *****
                         34:   9574 *
                         35:  12451 *
                         36:  16165 *
                         37:  15612 *
                         38:  13577 *
                         39:  19061 *
                         40:  18772 *
                         41:  17979 *
                         42:  21008 *
                         43:  14110 *
                         44:  12147 *
                         45:  14700 *
                         46: 196050 *****
                         47:  18076 *
                         48:  21033 *
                         49:  29458 *
                         50:  19323 *
                         51:  29391 *
                         52: 113753 ***
                         53:  67465 **
                         54: 125717 ***
                         55:  84661 **
                         56:  84002 **
                         57: 115790 ***
                         58: 259506 *******
                         59: 132745 ****
                         60: 123229 ***
                         61: 211594 *****
                         62: 209710 *****
                         63: 127805 ***
                         64: 154614 ****
                         65:  62928 **
                         66:  64459 **
                         67:  80159 **
                         68:  92735 ***
                         69: 212592 *****
                         70: 136472 ****
                         71: 187823 *****
                         72: 125797 ***
                         73:  80840 **
                         74:  88533 ***
                         75:  84059 **
                         76:  76497 **
                         77:  60183 **
                         78:  66207 **
                         79:  62351 **
                         80:  58502 **
                         81:  82693 **
                         82:  57489 **
                         83:  50976 **
                         84:  63656 **
                         85:  57190 **
                         86:  52961 **
                         87:  59251 **
                         88:  75505 **
                         89: 245231 ******
                         90: 1725712 ****************************************
                         91: 201635 *****
                         92:  79220 **
                         93:  98902 ***
                         94:  60891 **
                         95:  63860 **
                         96:  77003 **
                         97:  99050 ***
                         98:  71553 **
                         99:  58373 **
                        100:  46081 **
                        101:  61569 **
                        102:  59641 **
                        103:  46696 **
                        104:  56727 **
                        105:  54225 **
                        106:  74209 **
                        107:  61631 **
                        108:  65276 **
                        109:  51528 **
                        110:  60918 **
                        111:  56761 **
                        112:  73938 **
        Dittoed blocks on same vdev: 11889260
        Dittoed blocks in same metaslab: 3

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     72K     36K    4.00     0.00  object directory
     1   128K     12K     72K     72K   10.67     0.00      L1 object array
   128    64K   63.5K   4.46M   35.7K    1.01     0.00      L0 object array
   129   192K   75.5K   4.54M     36K    2.54     0.00  object array
     2    32K      4K     36K     18K    8.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     -      -       -       -       -       -        -  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
 7.33K   117M   29.3M    264M   36.0K    4.00     0.00      L1 SPA space map
 74.0K  1.64G    829M   4.78G   66.1K    2.03     0.01      L0 SPA space map
 81.3K  1.75G    858M   5.03G   63.4K    2.09     0.01  SPA space map
     3   200K    200K    288K     96K    1.00     0.00  ZIL intent log
     4   512K     16K     96K     24K   32.00     0.00      L5 DMU dnode
     4   512K     16K     96K     24K   32.00     0.00      L4 DMU dnode
     4   512K     16K     96K     24K   32.00     0.00      L3 DMU dnode
    10  1.25M    248K    780K     78K    5.16     0.00      L2 DMU dnode
 4.50K   576M    215M    646M    144K    2.67     0.00      L1 DMU dnode
 4.48M  71.7G   17.9G    108G   24.0K    4.00     0.13      L0 DMU dnode
 4.48M  72.2G   18.1G    108G   24.1K    3.98     0.13  DMU dnode
     5    20K     20K    132K   26.4K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     7  3.50K     512     36K   5.14K    7.00     0.00  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
     8    35K      8K     72K      9K    4.38     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     2   256K      8K     48K     24K   32.00     0.00      L3 ZFS plain file
 70.0K  8.75G    282M   1.65G   24.1K   31.79     0.00      L2 ZFS plain file
 3.72M   476G   36.3G    142G   38.3K   13.11     0.17      L1 ZFS plain file
  640M  74.0T   61.1T   81.9T    131K    1.21    99.60      L0 ZFS plain file
  644M  74.5T   61.1T   82.1T    130K    1.22    99.77  ZFS plain file
  248K  31.0G    993M   5.81G   24.0K   31.92     0.01      L1 ZFS directory
 8.54M  17.1G   6.34G   72.6G   8.50K    2.70     0.09      L0 ZFS directory
 8.78M  48.1G   7.31G   78.4G   8.93K    6.58     0.09  ZFS directory
     4     2K      2K     96K     24K    1.00     0.00  ZFS master node
     4    55K      4K     24K      6K   13.75     0.00  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K      8K     72K     72K   16.00     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
    12    10K      6K     96K      8K    1.67     0.00  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
 5.25K  5.51M   5.51M    126M     24K    1.00     0.00  System attributes
     -      -       -       -       -       -        -  SA master node
     4     6K      6K     96K     24K    1.00     0.00  SA attr registration
     8   128K     32K    192K     24K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     -      -       -       -       -       -        -  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
     1   128K      4K     36K     36K   32.00     0.00      L1 other
    54   701K    302K   2.64M     50K    2.32     0.00      L0 other
    55   829K    306K   2.67M   49.7K    2.71     0.00  other
     4   512K     16K     96K     24K   32.00     0.00      L5 Total
     4   512K     16K     96K     24K   32.00     0.00      L4 Total
     6   768K     24K    144K     24K   32.00     0.00      L3 Total
 70.0K  8.76G    282M   1.65G   24.1K   31.77     0.00      L2 Total
 3.97M   508G   37.5G    149G   37.5K   13.53     0.18      L1 Total
  654M  74.1T   61.1T   82.1T    129K    1.21    99.82      L0 Total
  658M  74.6T   61.1T   82.3T    128K    1.22   100.00  Total

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  8.42M  4.21G  4.21G  8.42M  4.21G  4.21G      0      0      0
     1K:  9.90M  11.6G  15.9G  9.90M  11.6G  15.9G      0      0      0
     2K:  13.0M  33.8G  49.6G  13.0M  33.8G  49.6G      0      0      0
     4K:  18.0M  73.7G   123G  6.13M  32.5G  82.1G      0      0      0
     8K:  7.15M  68.1G   191G  4.24M  47.1G   129G  38.1M   457G   457G
    16K:  7.04M   160G   351G  8.10M   148G   277G  19.4M   466G   923G
    32K:   127M  4.48T  4.82T  3.86M   201G   478G   126M  6.81T  7.71T
    64K:  27.2M  2.36T  7.18T  9.92M   856G  1.30T  25.2M  2.46T  10.2T
   128K:   432M  54.0T  61.1T   586M  73.3T  74.6T   441M  72.1T  82.3T
   256K:      0      0  61.1T      0      0  74.6T    872   296M  82.3T
   512K:      0      0  61.1T      0      0  74.6T      0      0  82.3T
     1M:      0      0  61.1T      0      0  74.6T      0      0  82.3T
     2M:      0      0  61.1T      0      0  74.6T      0      0  82.3T
     4M:      0      0  61.1T      0      0  74.6T      0      0  82.3T
     8M:      0      0  61.1T      0      0  74.6T      0      0  82.3T
    16M:      0      0  61.1T      0      0  74.6T      0      0  82.3T

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
tank                   82.2T 45.1T   496     0 3.87M     0     0     0     0
  raidz2                  82.2T 45.1T   496     0 3.87M     0     0     0     0
    /dev/mapper/0                49     0  396K     0     0     0    15
    /dev/mapper/1                49     0  395K     0     0     0    15
    /dev/mapper/2                49     0  397K     0     0     0    15
    /dev/mapper/3                49     0  396K     0     0     0    14
    /dev/mapper/4                49     0  396K     0     0     0    15
    /dev/mapper/5                49     0  396K     0     0     0    16
    /dev/mapper/6                49     0  397K     0     0     0    17
    /dev/mapper/7                49     0  396K     0     0     0    16
    /dev/mapper/8                49     0  396K     0     0     0    14
    /dev/mapper/9                49     0  396K     0     0     0    14

$ find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

 1k: 25556693
  2k: 9758600
  4k: 5124996
  8k: 4144854
 16k: 2883986
 32k: 4097472
 64k: 10308032
128k: 1304380
256k: 786550
512k: 305546
  1M: 276663
  2M: 231548
  4M: 113336
  8M: 165001
 16M:  60267
 32M:  77183
 64M:  30999
128M:  19839
256M:  23322
512M:  13339
  1G:   6948
  2G:   5669
  4G:   2421
  8G:    360
 16G:     48
 32G:     16
 64G:      1
256G:      1
512G:      1

Exard3k · April 20, 2023, 11:12pm

we were talking about the zvol, not the pool. The pool has a pool-wide recordsize which seems to be the default. But the zvol has a volblocksize that is different.

zfs get volblocksize tank/../zvolname

With RAIDZ, especially on wider pools, you waste quite a lot of space when using zvols. On top of the metadata inflation bogging down performance.

edit: I checked the output once more. Block histogram shows mostly 128k stuff, so I assume the zvol runs on 128k. Or there isn’t any zvol and you are just calling the pool a zvol, not sure

Yeah with 67%, pool starts to slow down. You may hit the limit on what your memory can cache. 128k isn’t problematic by any stretch and the rest of the pool is basically non-existent. With 128k as we see on your pool, that’s really average amount of metadata. But with these amounts of data, TBs worth of metadata are normal.

These numbers don’t add up. 10x14 RaidZ2 is 112T raw.

e97 · April 21, 2023, 12:14am

Don’t see volblocksize, I’m using ZFS on Linux

My vols use the same parameters as my pool, probably newbie configuration on my part.

$ zfs get volblocksize,recordsize {tank, tank/*}
NAME               PROPERTY      VALUE     SOURCE
tank               volblocksize  -         -
tank               recordsize    128K      default

tank/data          volblocksize  -         -
tank/data          recordsize    128K      default

tank/database      volblocksize  -         -
tank/database      recordsize    128K      default

tank/surveillance  volblocksize  -         -
tank/surveillance  recordsize    128K      default

Exard3k · April 21, 2023, 12:18am

These are called datasets, not vols or zvols. Zvols are totally different things. That’s why I went into the wrong direction. But we’re on track now

So the entire pool is 128k datasets. That’s good news. No Zvol being a troublemaker.

As edited above, that results in “normal” amounts of metadata. Maybe 0.5%. But even 0.5% is more than your RAM. Pool is reaching capacity limit, fragmentation is probably becoming a concern while 10-wide RAIDZ isn’t good at random IO in the first place and your memory is less able to cache all the stuff while amount of data is increasing.

special vdev will help to some degree. I can’t say how much metadata there is and how you get the 5T. In any case you have to backup and restore to get the metadata into the special vdev. ZFS doesn’t move data by itself. Otherwise ZFS will only allocate new stuff to the special vdev.

e97 · April 21, 2023, 12:23am

The 5tb/172tb was from @wendell 's example in the first post.

My setup is 10 x 14TB. I’m still unsure how to calculate metadata size so I can size the appropriate flash storage.

So does this sounds reasonable for metadata:

as of now: ~64 TiB of data * .005 (0.5%) = 351.8 GiB
max 94 TiB * .005 = 516.7 GiB

ah thank you for clarifying! I must have gotten confused with LVM (device > group > volume) because I was resizing a lvolume to use as L2ARC on nvme drive for this pool.

yes for zfs: vdev > zpools > datasets
not sure how zvols fit into the picture… (looked it up they are block devices sitting atop ZFS for virtual machines)

Exard3k · April 21, 2023, 12:32am

With 128k recordsize across the pool (and the block histogram clearly shows that >90% is 128k), you can expect around 1G of metadata for each TB. So a mirror of 120GB SSDs will be fine if you stick with 128k. It can’t fix bad random read/write performance on RAIDz, nor does it help with other performance concerns mentioned above, but it will spare your HDDs from reading those ~100GB over and over again when accessing those files.

So this isn’t quite the madness I imagined when I first heard of 67TB Zvol (which is insane, but sometimes needed) and TBs of metadata.

Yeah I was under the impression you got some customer with a 67TB VM disk or running the entire pool as a source for some VM.

500GB is a safe bet in any case if you prefer the 0.5% figure. Pretty much the smallest SATA/NVMe SSD size you can buy anyway.

e97 · April 21, 2023, 12:46am

How do I address those other performance issues?

mirorred Z special device for metadata > 128GB each
random read/write - nature of spinning rust;
for read → add L2ARC
for write → add a ZIL (w/ battery backup)
what were the other performance issues?

This is my NAS/backup so to speak, I only zfs send/receive when upgrading disk…
started with disk about a third of the 14TB currently using.

How would you suggest optimizing the datasets if starting from scratch?

data - photos and video and system backups
database - lots of random I/O, large files but lots of random r/w activity
surveillance - images and video, append only