ZFS Metadata Special Device: Z

I guess it's just because I can't see exactly what it is, and there wasn't exactly a tonne of info about this kind of setup out in the ether.

As I rewrote the data to the pool, both grew at pretty much exactly the same rate*, which, before your comment, had me certain it was just mirroring to both SPECIAL vdevs at the same time.

*Within GB-level rounding, rather than bytes or something smaller and more precisely measurable.

To be clear, it's not that I don't believe you (in fact I hadn't thought about ZFS balancing its writes at all until you said it), but alongside the small bit of write-up I've done here for the dual mirror, I'd like to see something that concretely shows it.

I've been messing around with stats from zdb, and they're definitely strong support for the balanced writes; that is, there are differences at a smaller level (not at home, so I don't have the output right now).

I'm probably at the stage where, unless someone has a particular trick for inspecting the metadata itself on each drive, between the slight differences in the stats and your comment that it's balanced writes, I'm okay with it. Still a small hole in the overall project, as it were :smiley:
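
One thing that might help, for what it's worth: zpool iostat -v reports allocation and I/O per leaf vdev, so rewriting some data while watching it should show whether the special vdevs fill and get written at the same rate (tank is just the pool name used below, and 5 is the refresh interval in seconds):

# zpool iostat -v tank 5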

Went with the following:

# zpool status
  pool: tank
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            sdg      ONLINE       0     0     0
            sdh      ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            sdi      ONLINE       0     0     0
            sdj      ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            sdk      ONLINE       0     0     0
            sdl      ONLINE       0     0     0
        special
          mirror-5   ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0

Everything seems to be working:

# time zdb -Lbbbs tank/smb
Dataset tank/smb [ZPL], ID 400, cr_txg 55, 7.07T, 756679 objects

real    0m1.303s
user    0m0.730s
sys     0m0.673s

The special vdev is two of those 118G Optanes.

Notice it's way faster than spinning rust??

New system, fio test with a 10GB file. Not sure if the metadata cache is adding anything to these results, but overall things are good:


   READ: bw=167MiB/s (175MB/s), 167MiB/s-167MiB/s (175MB/s-175MB/s), io=5007MiB (5251MB), run=30001-30001msec
   READ: bw=1461MiB/s (1532MB/s), 1461MiB/s-1461MiB/s (1532MB/s-1532MB/s), io=10.0GiB (10.7GB), run=7011-7011msec
  WRITE: bw=693KiB/s (709kB/s), 693KiB/s-693KiB/s (709kB/s-709kB/s), io=20.3MiB (21.3MB), run=30008-30008msec
  WRITE: bw=568KiB/s (581kB/s), 568KiB/s-568KiB/s (581kB/s-581kB/s), io=16.6MiB (17.4MB), run=30006-30006msec

Old system running the same test (a BTRFS NAS):



   READ: bw=597KiB/s (611kB/s), 597KiB/s-597KiB/s (611kB/s-611kB/s), io=17.5MiB (18.3MB), run=30003-30003msec
   READ: bw=591MiB/s (620MB/s), 591MiB/s-591MiB/s (620MB/s-620MB/s), io=10.0GiB (10.7GB), run=17319-17319msec
  WRITE: bw=119KiB/s (122kB/s), 119KiB/s-119KiB/s (122kB/s-122kB/s), io=3580KiB (3666kB), run=30004-30004msec
  WRITE: bw=123KiB/s (126kB/s), 123KiB/s-123KiB/s (126kB/s-126kB/s), io=3684KiB (3772kB), run=30020-30020msec
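
For anyone wanting to run a similar comparison: the exact fio job isn't shown above, but a generic sequential-read pass of this sort (file path, block size and runtime here are placeholders, not the actual parameters used) would look something like:

fio --name=seqread --filename=/tank/testfile --size=10G --rw=read --bs=1M --runtime=30 --time_based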

However, there is one thing running a lot slower: the SMB file share when counting the number of files in a directory with 200K files (using Windows to get folder properties on the share). Still working on that; not a deal-breaker, and not worth going back to BTRFS and giving up the ZFS advantages and performance.

Yeah, this is where I fall down unless someone has a solution for raidz2 users. :frowning:

The solution is -f. zpool throws a warning when you try to create a special vdev with a different level of redundancy than the rest of the pool; you override that warning by forcing the creation with the -f command-line option.
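
For example, adding a mirrored special vdev to an existing raidz2 pool despite the redundancy-mismatch warning would look something like this (device names are placeholders; double-check them before forcing anything):

zpool add -f tank special mirror /dev/nvme0n1 /dev/nvme1n1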

As of a couple of versions ago, the TrueNAS GUI will let you do this without any warnings. I just added a mirrored special device to a 3-mechanical-drive Z1 pool and it didn't even bother saying this is slightly atypical. It's not weird to have a 3-way mirror or a striped mirror of special devices either.

Hello, n00b here on the forums.

Hoping someone can help me make sense of the script output, and suggest an appropriate special device size:

Traversing all blocks ...


        bp count:            4689092545
        ganged count:                 0
        bp logical:      38976361862144      avg:   8312
        bp physical:     38560017262080      avg:   8223     compression:   1.01
        bp allocated:    38766664744960      avg:   8267     compression:   1.01
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    38766658138112     used: 71.91%
        Embedded log class         131072     used:  0.00%

        additional, non-pointer bps of type 0:     120615
         number of (compressed) bytes:  number of bps
                         27:      3 *
                         28:      4 *
                         29:     10 *
                         30:      0
                         31:      1 *
                         32:      0
                         33:      1 *
                         34:      0
                         35:      0
                         36:      0
                         37:      0
                         38:      0
                         39:      1 *
                         40:      0
                         41:      1 *
                         42:      1 *
                         43:      0
                         44:      1 *
                         45:      1 *
                         46:      0
                         47:  50039 ****************************************
                         48:   1228 *
                         49:    269 *
                         50:    977 *
                         51:   2682 ***
                         52:    909 *
                         53:   1676 **
                         54:   1568 **
                         55:    887 *
                         56:    434 *
                         57:    616 *
                         58:   2201 **
                         59:    356 *
                         60:    615 *
                         61:    837 *
                         62:   1117 *
                         63:    378 *
                         64:    527 *
                         65:   3916 ****
                         66:   1447 **
                         67:   4372 ****
                         68:   1242 *
                         69:    781 *
                         70:    935 *
                         71:   1138 *
                         72:    861 *
                         73:   1508 **
                         74:    899 *
                         75:   1155 *
                         76:   3964 ****
                         77:   4028 ****
                         78:   1367 **
                         79:    949 *
                         80:    922 *
                         81:    950 *
                         82:    633 *
                         83:    331 *
                         84:    369 *
                         85:    603 *
                         86:   1691 **
                         87:   1102 *
                         88:    281 *
                         89:    529 *
                         90:   1098 *
                         91:   2482 **
                         92:    553 *
                         93:    429 *
                         94:    779 *
                         95:    857 *
                         96:    538 *
                         97:    332 *
                         98:    355 *
                         99:    472 *
                        100:    461 *
                        101:   1073 *
                        102:    826 *
                        103:    950 *
                        104:    464 *
                        105:    617 *
                        106:    949 *
                        107:    543 *
                        108:    375 *
                        109:    265 *
                        110:   1362 **
                        111:   1087 *
                        112:    435 *
        Dittoed blocks on same vdev: 23134

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
     3   384K     12K     36K     12K   32.00     0.00      L1 object array
    51  25.5K   25.5K    612K     12K    1.00     0.00      L0 object array
    54   410K   37.5K    648K     12K   10.92     0.00  object array
     1    16K      4K     12K     12K    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     -      -       -       -       -       -        -  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
   420  6.56M   1.64M   4.92M     12K    4.00     0.00      L1 SPA space map
 3.63K   465M    109M    328M   90.4K    4.25     0.00      L0 SPA space map
 4.04K   471M    111M    333M   82.5K    4.24     0.00  SPA space map
     2   132K    132K    132K     66K    1.00     0.00  ZIL intent log
     2   256K      8K     16K      8K   32.00     0.00      L5 DMU dnode
     2   256K      8K     16K      8K   32.00     0.00      L4 DMU dnode
     2   256K      8K     16K      8K   32.00     0.00      L3 DMU dnode
     3   384K     12K     28K   9.33K   32.00     0.00      L2 DMU dnode
     5   640K    100K    292K   58.4K    6.40     0.00      L1 DMU dnode
 1.13K  18.2M   4.54M   13.6M   12.0K    4.00     0.00      L0 DMU dnode
 1.15K  19.9M   4.67M   14.0M   12.2K    4.26     0.00  DMU dnode
     3    12K     12K     28K   9.33K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     -      -       -       -       -       -        -  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
     -      -       -       -       -       -        -  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     -      -       -       -       -       -        -  ZFS plain file
     -      -       -       -       -       -        -  ZFS directory
     1    512     512      8K      8K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     1   128K      4K      8K      8K   32.00     0.00      L4 zvol object
     6   768K    300K    600K    100K    2.56     0.00      L3 zvol object
 4.37K   559M    228M    455M    104K    2.45     0.00      L2 zvol object
 4.36M   559G    192G    384G   88.0K    2.91     1.06      L1 zvol object
 4.36G  34.9T   34.9T   34.9T   8.00K    1.00    98.93      L0 zvol object
 4.37G  35.4T   35.1T   35.3T   8.07K    1.01   100.00  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K      4K     12K     12K   32.00     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
     1  1.50K   1.50K      8K      8K    1.00     0.00  SA attr registration
     2    32K      8K     16K      8K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     -      -       -       -       -       -        -  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
    20   460K     34K    132K   6.60K   13.53     0.00  other
     2   256K      8K     16K      8K   32.00     0.00      L5 Total
     3   384K     12K     24K      8K   32.00     0.00      L4 Total
     8     1M    308K    616K     77K    3.32     0.00      L3 Total
 4.37K   559M    228M    456M    104K    2.46     0.00      L2 Total
 4.36M   559G    192G    384G   88.0K    2.91     1.06      L1 Total
 4.36G  34.9T   34.9T   34.9T   8.00K    1.00    98.94      L0 Total
 4.37G  35.4T   35.1T   35.3T   8.07K    1.01   100.00  Total

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:     52    26K    26K     52    26K    26K      0      0      0
     1K:      4  5.50K  31.5K      4  5.50K  31.5K      0      0      0
     2K:      1     2K  33.5K      1     2K  33.5K      0      0      0
     4K:  4.89M  19.6G  19.6G      4    16K  49.5K  4.89M  19.6G  19.6G
     8K:  4.36G  34.9T  34.9T  4.36G  34.9T  34.9T  4.36G  34.9T  34.9T
    16K:    375  8.09M  34.9T  1.55K  24.8M  34.9T    708  16.4M  34.9T
    32K:  4.37M   192G  35.1T      1    33K  34.9T    417  17.7M  34.9T
    64K:    873  69.2M  35.1T      0      0  34.9T  4.37M   384G  35.3T
   128K:      1   128K  35.1T  4.37M   560G  35.4T    894   179M  35.3T
   256K:      0      0  35.1T      0      0  35.4T    275  72.0M  35.3T
   512K:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     1M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     2M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     4M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     8M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
    16M:      0      0  35.1T      0      0  35.4T      0      0  35.3T

The pool itself is 6x18TB HDDs in a mirror configuration (ZFS "RAID 10"), which is about ~75% full. The system is a Proxmox host with a CCTV server (migrated from a previous build, hence the already-high usage) which uses sync writes (no SLOG in use now, but that's coming). Performance is about what I'd expect from spinning rust, with some vioscsi problems every ~6 days. The hardware itself is fine, and I've tried all the VirtIO drivers from .204 up to the latest, and have found the latest lasts the longest before an error.

Anyway, I say all that because I'm trying to eliminate other potential issues by optimizing the storage pool (adding a SLOG and a special device).

Going by the 0.3% rule of thumb for special device sizing, that would see me needing a special device of at least 165.9GB, but that doesn't necessarily seem right?

So again, I’m hoping someone can help me make sense of all this.

Thank you!

Signed, long time lurker.

Or you just disable sync in ZFS. ZFS then just tells the application "alright, everything's on disk" while it is actually still in memory, to speed things up.
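
For reference, sync is a per-dataset (and per-zvol) property, so it can be flipped selectively; the dataset name below is a placeholder:

zfs set sync=disabled tank/vmdata
zfs get sync tank/vmdata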

"bp" stands for block pointer: the structure that holds all the metadata and txg bookkeeping and is created for each block. Its size is at most 128 bytes. I'm not sure what else contributes to that, but more block pointers = more metadata, and the fewer records/blocks you have, the less metadata.
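
As a rough sanity check against the zdb output above, assuming the full 128 bytes per block pointer:

        4,689,092,545 bps x 128 B ≈ 600 GB ≈ 559 GiB

which lines up with the 559G of L1 (indirect block) LSIZE in the report; after compression that is roughly 192G, or 384G allocated on disk.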

A special device may result in better performance, but ARC and L2ARC also store metadata, so if caching is working well, a special vdev isn't that useful. Also make sure to include small blocks in your special; the size needed depends entirely on how many small files you have. I'm fine with 1TB for metadata + small blocks on 6x16TB, although my pool isn't at 75%. That fill percentage alone will reduce performance; anything >50% is noticeable.
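
For completeness, small blocks are opted in per dataset via the special_small_blocks property; the 64K cutoff and dataset name here are just examples, and the cutoff needs to stay below the dataset's recordsize or every block will land on the special vdev:

zfs set special_small_blocks=64K tank/data
zfs get special_small_blocks,recordsize tank/data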

Oh, and you have to rewrite the data (e.g. restore the pool from backup) after you add a special vdev: all the existing metadata and small blocks are on your HDDs, and ZFS never moves data by itself. New data gets properly distributed, however.
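
A full restore isn't the only way to get there; a local send/receive into a new dataset rewrites everything so the metadata (and qualifying small blocks) land on the new special vdev, at the cost of temporarily needing space for a second copy. Snapshot and dataset names below are placeholders:

zfs snapshot -r tank/data@migrate
zfs send -R tank/data@migrate | zfs receive tank/data_new
zfs rename tank/data tank/data_old
zfs rename tank/data_new tank/data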

Note that special_small_blocks is not applied to zvol blocks: zvol data will not be stored on the special device no matter the block size, although the metadata still is.
A SLOG is highly recommended. I run zfs_txg_timeout = 300 (5 minutes) and rely solely on the SLOG to keep data integrity in case the system crashes.
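
For anyone wanting to try the same, zfs_txg_timeout is an OpenZFS module parameter on Linux; setting it at runtime and persistently looks roughly like this (value of 300 seconds as above):

echo 300 > /sys/module/zfs/parameters/zfs_txg_timeout
echo "options zfs zfs_txg_timeout=300" >> /etc/modprobe.d/zfs.conf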

I can get the -f switch to work with a special mirror vdev,

zpool create -f tank /dev/sdb special mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

but not with a special raidz vdev.

zpool create -f tank /dev/sdb special raidz /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
invalid vdev specification: unsupported 'special' device: raidz

RAIDZ isn’t supported for special VDEVs

There has been some talk of adding this feature, but RAIDZ, with all its disadvantages, usually isn't worth it for mostly tiny stuff like metadata or small blocks. Write amplification and general waste of space (similar to what you see with ZVOLs on RAIDZ) usually make it a poor choice, and performance is another disadvantage.

You can see double or triple the space usage simply because, if you want to write a 4K block (assuming ashift=12 for 4K sectors) across something like 5 disks plus parity, you have a problem. It only works if you write it to each disk to ensure integrity, thus wasting space and making it a de facto "5-way mirror with parity". The wider the RAIDZ, the worse it gets. Just make a 4K-volblocksize ZVOL on a RAIDZ and check the space usage; it's horrible.
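
If you want to see it for yourself, something along these lines demonstrates the inflation (pool, volume name and size are placeholders); on a wide RAIDZ the refreservation and used figures come out far above the nominal 10G:

zfs create -V 10G -o volblocksize=4k tank/raidz-waste-test
zfs get volsize,refreservation,used tank/raidz-waste-test
zfs destroy tank/raidz-waste-test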

How much benefit would people expect to see from adding a special device (Optane 905P) on a pool of SATA SSDs to speed up I/O? Is this a waste of money or a benefit? I run VMs and Kubernetes off this pool, in addition to general SMB use.

I doubt you will see a noticeable impact. The main reason to use one is to bypass the abysmal HDD random read performance on small stuff. SATA SSDs may not have the bandwidth, but they still have NAND-flash levels of random read/write performance.
If your SATA drives are stretched to the limit, offloading some random I/O is useful, but so is adding more drives or memory.
And the optional small blocks setting doesn't apply to ZVOLs, so that won't be of use to any of your virtual disks either.

And you need at least two of the 905p, because redundancy.

Thanks - I have an HDD pool of 3 vdevs of 6-drive RAIDZ2 that has mirrored Optane special metadata devices, with small blocks enabled for small XMP files, as well as a separate Optane L2ARC - and that has really helped the pool.

My SSD pool is 6 vdevs of mirrored SSDs with an Optane L2ARC - but I have been wondering whether a special metadata device there would be wasted or not. With the small-level writes from K8s and VMs "eating up" the random IO, I wonder if it could do better there.

You're neglecting the fact that ZFS uses transaction groups (txg) to gather up all writes in memory and basically convert random I/O into sequential I/O.

Just to give you some perspective on async writes in ZFS: I was playing with various backup methods two weeks ago, streaming something like 5T of user data with everything in it, millions of small files. I was transferring that from 2x striped NVMe 4.0 drives via a 10Gbit connection to my pool. It turned out that ZFS was able to write the small stuff to my HDDs faster than the NVMe drives could actually read it. That's what they call ZFS magic: the magic of transaction groups (txg).

Async writes are not a problem you have to deal with; txgs take care of everything in memory before flushing it to disk as a sequential stream.

The problem starts with sync writes, where we can't gather up random stuff to optimize the write pattern for the underlying hardware. The application demands that it be written to disk immediately, even though writing single random blocks is really bad even for flash. The workaround is a LOG vdev or sync=disabled.

An Optane LOG, or disabling sync altogether (fastest), might be a better approach than throwing money at the underlying hardware, if sync writes are indeed a thing on your pool.
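
If it turns out sync writes do matter, adding a mirrored LOG vdev is a one-liner; device names are placeholders, and a small partition on each Optane is typically enough, since the LOG only ever holds data that hasn't been committed in a txg yet:

zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1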

This is why ZFS performs so well even on HDDs: because there aren't thousands of random writes happening in a properly set up pool, it will be one big chunk of data going from DRAM to the disks every 5 seconds (default settings).

That’s how you deal with writes.
You deal with (random) reads by caching stuff in memory.

thanks for this - appreciate the explanation

Also, I have been digging into my spinning-rust HDD pool with its triple-mirrored Intel 900p special metadata vdev, and I see that the GUI created the special metadata device with an ashift of 9, since the Intel 900p reports 512-byte sectors and cannot be reprogrammed. What's the consensus on mixed LBA sizes between the metadata device and the spinning rust? I had really wanted this to be 4K to keep everything the same between vdevs.
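
For what it's worth, the ashift each vdev actually got can be read back out of the pool configuration with zdb (pool name is a placeholder; this works as long as the pool is in the default cachefile):

zdb -C tank | grep ashift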

Late to the party, but adding this for completeness:

According to github.com/openzfs/zfs/blob/master/module/zfs/arc.c, around line 7963, this is not the case (sorry, can't include links as a new user):

There is no eviction path from the ARC to the L2ARC. Evictions from the ARC behave as usual, freeing buffers and placing headers on ghost lists. The ARC does not send buffers to the L2ARC during eviction as this would add inflated write latencies for all ARC memory pressure.

Thanks @Impatient_Crustacean. I don't bother with L2ARC myself, so I'm interested to learn.
Same with the special device.

I have since read a Klara blog post about it…

one sec

Does this look more like it:

How the L2ARC receives data

Another common misconception about the L2ARC is that it’s fed by ARC cache evictions. This isn’t quite the case. When the ARC is near full, and we are evicting blocks, we cannot afford to wait while we write data to the L2ARC, as that will block forward progress reading in the new data that an application actually wants right now. Instead we feed the L2ARC with blocks when they get near the bottom, making sure to stay ahead of eviction.
