ZFS Metadata Special Device: Z

So (first post, hi!) this has been taking all of my attention in recent weeks and it's amazing. I picked up 4 of the 118GB Optane NVMe drives and added them to a carrier card using PCIe bifurcation. Everything looks great.

Log - most if not all of your posts here have been fantastically informative, with great stuff in them. I really appreciate it.

I’m getting ready to pull the trigger on adding the SPECIAL device to the pool, but have a few lingering questions.

My setup:
- ~120TB pool that is about 75% full, spread across mirrored vdevs (some 24-odd drives)
- 4x 118GB Optane drives on the bifurcated card

Rough metadata ASIZE is around 101GB, so not a huge amount of wiggle room on a single mirror of two 118GB drives.

  1. Most write-ups of the SPECIAL device seem to suggest a mirror of two drives. Can I run two SPECIAL vdevs and expect them to play nicely? (2x 2-drive Optane mirrors for a total of 236GB across two vdevs.)

  2. It appears that removing the SPECIAL device is possible iff all vdevs are mirrors and the ashift values match. Is that a reasonable assumption? (The man page for zpool remove heavily suggests it, and some folks here have had first-hand experience.) I don't expect to need this, but safety net and all.

  3. Because it's a pre-existing pool, only newly written metadata lands on the SPECIAL device. The typical recommendation is to rewrite everything to get existing metadata onto it. Can you do a somewhat hacky zfs send | zfs recv on the same pool to get this result?

Cheers!

I thought there was a setting to configure the minimum, although I use it through TrueNAS, so I'm not sure.

Yeah, I have set that to 512K, but for some reason the setting doesn't seem to be obeyed. I just wondered if it was something dumb, like I've set the threshold but not actually enabled the feature, or something.
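
If the setting in question is special_small_blocks (which, as far as I know, is what the TrueNAS small-block option sets underneath), a couple of quick sanity checks help; pool/dataset names below are placeholders, and note the threshold only affects blocks written after it was set:

# confirm the property actually landed on the datasets
zfs get -r special_small_blocks tank
# blocks at or below the threshold go to the special vdev; recordsize caps block size
zfs get recordsize tank/dataset
# and confirm the pool really has a special vdev attached
zpool list -v tank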

So some testing -

It appears that creating two SPECIAL devices can work (?)

Created in a VM, just testing various writes with some random dd'ed files.
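
Roughly this sort of thing for the test data (paths, sizes and counts are arbitrary):

# one big file plus a pile of small files, varying bs/count between runs
dd if=/dev/urandom of=/mnt/test/file1 bs=1M count=1024
for i in $(seq 1 1000); do
  dd if=/dev/urandom of=/mnt/test/small_$i bs=4k count=1 2>/dev/null
done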

root@truenas[~]# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool   31.5G  28.7G  2.82G        -         -    16%    91%  1.00x    ONLINE  -
  ada0p2    32.0G  28.7G  2.82G        -         -    16%  91.0%      -    ONLINE
test         112G  40.0G  72.5G        -         -     0%    35%  1.00x    ONLINE  /mnt
  ada1        50G  40.0G  9.54G        -         -     0%  80.7%      -    ONLINE
special         -      -      -        -         -      -      -      -  -
  mirror-1  31.5G  19.6M  31.5G        -         -     0%  0.06%      -    ONLINE
    ada2      32G      -      -        -         -      -      -      -    ONLINE
    ada3      32G      -      -        -         -      -      -      -    ONLINE
  mirror-2  31.5G  19.3M  31.5G        -         -     0%  0.05%      -    ONLINE
    ada4      32G      -      -        -         -      -      -      -    ONLINE
    ada5      32G      -      -        -         -      -      -      -    ONLINE

So it's looking good for multiple mirrored SPECIAL vdevs.

I tried this using the CLI in TrueNAS (as a quick spin-up). I'm slightly concerned that the GUI for pool creation doesn't allow for multiple SPECIAL vdevs, but nothing in the OpenZFS manual suggests this isn't possible, so it may just be a case of iX keeping things simplified.
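
For reference, the CLI route is just one zpool add per special mirror, something along these lines (device names taken from the VM output above):

zpool add test special mirror ada2 ada3
zpool add test special mirror ada4 ada5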

The in-place rewrite also appears to work:

root@truenas[~]# zfs send -R test@tesa | zfs recv -F test/test
cannot receive new filesystem stream: out of space
warning: cannot send 'test@tesa': signal received

But having enough free space to rewrite everything will be the bigger challenge. I'm constructing a plan to do it dataset by dataset, probably with some reliance on backups to get me over that line.
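
The rough per-dataset shape of that plan, for anyone curious (dataset names and the snapshot name are placeholders, and this assumes enough free space for one dataset at a time):

zfs snapshot -r tank/data@premigrate
zfs send -R tank/data@premigrate | zfs recv tank/data_rewrite
# spot-check the copy, then swap names and drop the original
zfs rename tank/data tank/data_old
zfs rename tank/data_rewrite tank/data
zfs destroy -r tank/data_old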

Finally, removing the special vdev mirrors appears to be fine too.

root@truenas[~]# zpool remove test mirror-1
root@truenas[~]# zpool remove test mirror-2
root@truenas[~]# cd /mnt/test
root@truenas[/mnt/test]# ls -la
total 41831922
drwxr-xr-x  2 root  wheel         133 Feb 11 15:16 .
drwxr-xr-x  3 root  wheel         128 Feb 11 14:55 ..
-rw-r--r--  1 root  wheel  1073741824 Feb 11 14:58 file1
root@truenas[/mnt/test]# zpool status test
  pool: test
 state: ONLINE
remove: Removal of vdev 2 copied 34.7M in 0h0m, completed on Sun Feb 12 04:25:06 2023
	16.9K memory used for removed device mappings
config:

	NAME          STATE     READ WRITE CKSUM
	test          ONLINE       0     0     0
	  ada1        ONLINE       0     0     0

So at this point I'm reasonably comfortable that, for the mirrored vdev setup, this is:
A. Reversible
B. Doable across two mirrored SPECIAL vdevs
C. Somewhat tricky if you have a pre-existing pool around 70% full, but backups are the winner here.

This of course comes with the standard disclaimer that this used some pretty non-standard, half-arsed test data. I tried to vary the file sizes and block sizes, but I can't be sure how that plays out against real-world data.

Minor update as I do the migration of data.

It appears that the two mirrored vdevs are being written with the same data rather than being used separately.

special                                                -      -      -        -         -      -      -      -  -
  mirror-12                                         110G  24.4G  85.6G        -         -    15%  22.1%      -    ONLINE
    nvme0n1                                         110G      -      -        -         -      -      -      -    ONLINE
    nvme1n1                                         110G      -      -        -         -      -      -      -    ONLINE
  mirror-13                                         110G  24.4G  85.6G        -         -    15%  22.1%      -    ONLINE
    nvme2n1                                         110G      -      -        -         -      -      -      -    ONLINE
    nvme3n1                                         110G      -      -        -         -      -      -      -    ONLINE

The used space has been rising steadily in parallel across both vdevs.

I should still have plenty of space, but it's a bit of a shame. Will investigate further.

ZFS is balancing writes between them just like it does when you have multiple data vdevs.


That'd be awesome, but I'm also suspicious because the balance is so nearly perfectly symmetrical. Is there a known way of inspecting what metadata is on each device?

…why?

I guess just because I can’t see exactly what it is and there wasn’t exactly a tonne of info about this kind of setup out in the ether.

As I rewrote the data to the pool, both grew at pretty much exactly the same rate*, which before your comment had me certain it was just mirroring to both SPECIAL vdevs at the same time.

*Within GB-level rounding, rather than bytes or something smaller and more measurable.

To be clear, it's not that I don't believe you (in fact I hadn't thought about ZFS balancing writes at all until you said it), but alongside the small write-up I've done here for the dual mirror, I'd like to see something that concretely shows it.

I've been messing around with stats from zdb and it's definitely strong support for the balanced writes. That is, there are differences at a finer-grained level (not at home, so I don't have the output right now).
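
For what it's worth, the coarse version of this is visible with stock tooling; it won't show what the metadata actually is, but it does show whether the special mirrors fill in lockstep or independently (pool name is a placeholder):

# per-vdev ops/bandwidth, refreshed every 5 seconds
zpool iostat -v tank 5
# running allocation per vdev
zpool list -v tank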

I'm probably at the stage where, unless someone has a particular trick for inspecting the metadata itself on each drive, between the slight differences in the stats and your comment that it's balanced writes, I'm okay with it. Still, a small hole in the overall project, as it were :smiley:

Went with the following:

# zpool status
  pool: tank
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            sdg      ONLINE       0     0     0
            sdh      ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            sdi      ONLINE       0     0     0
            sdj      ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            sdk      ONLINE       0     0     0
            sdl      ONLINE       0     0     0
        special
          mirror-5   ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0

Everything seems to be working:

# time zdb -Lbbbs tank/smb
Dataset tank/smb [ZPL], ID 400, cr_txg 55, 7.07T, 756679 objects

real    0m1.303s
user    0m0.730s
sys     0m0.673s

The special vdev is two of those 118G Optanes.


Notice it's way faster than spinning rust??

New system with an fio test on a 10GB file. Not sure if metadata caching is adding anything to these results, but overall things are good:


   READ: bw=167MiB/s (175MB/s), 167MiB/s-167MiB/s (175MB/s-175MB/s), io=5007MiB (5251MB), run=30001-30001msec
   READ: bw=1461MiB/s (1532MB/s), 1461MiB/s-1461MiB/s (1532MB/s-1532MB/s), io=10.0GiB (10.7GB), run=7011-7011msec
  WRITE: bw=693KiB/s (709kB/s), 693KiB/s-693KiB/s (709kB/s-709kB/s), io=20.3MiB (21.3MB), run=30008-30008msec
  WRITE: bw=568KiB/s (581kB/s), 568KiB/s-568KiB/s (581kB/s-581kB/s), io=16.6MiB (17.4MB), run=30006-30006msec

Old system (BTRFS NAS) running the same test:



   READ: bw=597KiB/s (611kB/s), 597KiB/s-597KiB/s (611kB/s-611kB/s), io=17.5MiB (18.3MB), run=30003-30003msec
   READ: bw=591MiB/s (620MB/s), 591MiB/s-591MiB/s (620MB/s-620MB/s), io=10.0GiB (10.7GB), run=17319-17319msec
  WRITE: bw=119KiB/s (122kB/s), 119KiB/s-119KiB/s (122kB/s-122kB/s), io=3580KiB (3666kB), run=30004-30004msec
  WRITE: bw=123KiB/s (126kB/s), 123KiB/s-123KiB/s (126kB/s-126kB/s), io=3684KiB (3772kB), run=30020-30020msec

However, there is one thing running a lot slower: the SMB file share, specifically counting the number of files in a directory with 200K files (using Windows to get folder properties over the SMB share). Still working on that; it's not a deal breaker, and not worth going back to BTRFS and giving up the ZFS advantages and performance.

Yeah, this is where I fall down unless someone has a solution for raidz2 users. :frowning:

The solution is -f. zpool throws a warning when you try to create a special vdev with a different level of redundancy than the rest of the pool. You override that warning by forcing the creation with the -f command-line option.
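
For a raidz2 pool it looks something like this (pool and device names are examples); -f is what suppresses the mismatched-replication-level warning:

zpool add -f tank special mirror /dev/nvme0n1 /dev/nvme1n1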


As of a couple of versions ago, the TrueNAS GUI will let you do this without any warnings. I just added a mirrored special device to a 3-mechanical-drive Z1 pool and it didn't even bother saying this is slightly atypical. It's not weird to have a 3-way mirror or a striped mirror of special devices either.


Hello, n00b here on the forums.

Hoping someone can help me make sense of the script output, and suggest an appropriate special device size:

Traversing all blocks ...


        bp count:            4689092545
        ganged count:                 0
        bp logical:      38976361862144      avg:   8312
        bp physical:     38560017262080      avg:   8223     compression:   1.01
        bp allocated:    38766664744960      avg:   8267     compression:   1.01
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    38766658138112     used: 71.91%
        Embedded log class         131072     used:  0.00%

        additional, non-pointer bps of type 0:     120615
         number of (compressed) bytes:  number of bps
                         27:      3 *
                         28:      4 *
                         29:     10 *
                         30:      0
                         31:      1 *
                         32:      0
                         33:      1 *
                         34:      0
                         35:      0
                         36:      0
                         37:      0
                         38:      0
                         39:      1 *
                         40:      0
                         41:      1 *
                         42:      1 *
                         43:      0
                         44:      1 *
                         45:      1 *
                         46:      0
                         47:  50039 ****************************************
                         48:   1228 *
                         49:    269 *
                         50:    977 *
                         51:   2682 ***
                         52:    909 *
                         53:   1676 **
                         54:   1568 **
                         55:    887 *
                         56:    434 *
                         57:    616 *
                         58:   2201 **
                         59:    356 *
                         60:    615 *
                         61:    837 *
                         62:   1117 *
                         63:    378 *
                         64:    527 *
                         65:   3916 ****
                         66:   1447 **
                         67:   4372 ****
                         68:   1242 *
                         69:    781 *
                         70:    935 *
                         71:   1138 *
                         72:    861 *
                         73:   1508 **
                         74:    899 *
                         75:   1155 *
                         76:   3964 ****
                         77:   4028 ****
                         78:   1367 **
                         79:    949 *
                         80:    922 *
                         81:    950 *
                         82:    633 *
                         83:    331 *
                         84:    369 *
                         85:    603 *
                         86:   1691 **
                         87:   1102 *
                         88:    281 *
                         89:    529 *
                         90:   1098 *
                         91:   2482 **
                         92:    553 *
                         93:    429 *
                         94:    779 *
                         95:    857 *
                         96:    538 *
                         97:    332 *
                         98:    355 *
                         99:    472 *
                        100:    461 *
                        101:   1073 *
                        102:    826 *
                        103:    950 *
                        104:    464 *
                        105:    617 *
                        106:    949 *
                        107:    543 *
                        108:    375 *
                        109:    265 *
                        110:   1362 **
                        111:   1087 *
                        112:    435 *
        Dittoed blocks on same vdev: 23134

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
     3   384K     12K     36K     12K   32.00     0.00      L1 object array
    51  25.5K   25.5K    612K     12K    1.00     0.00      L0 object array
    54   410K   37.5K    648K     12K   10.92     0.00  object array
     1    16K      4K     12K     12K    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     -      -       -       -       -       -        -  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
   420  6.56M   1.64M   4.92M     12K    4.00     0.00      L1 SPA space map
 3.63K   465M    109M    328M   90.4K    4.25     0.00      L0 SPA space map
 4.04K   471M    111M    333M   82.5K    4.24     0.00  SPA space map
     2   132K    132K    132K     66K    1.00     0.00  ZIL intent log
     2   256K      8K     16K      8K   32.00     0.00      L5 DMU dnode
     2   256K      8K     16K      8K   32.00     0.00      L4 DMU dnode
     2   256K      8K     16K      8K   32.00     0.00      L3 DMU dnode
     3   384K     12K     28K   9.33K   32.00     0.00      L2 DMU dnode
     5   640K    100K    292K   58.4K    6.40     0.00      L1 DMU dnode
 1.13K  18.2M   4.54M   13.6M   12.0K    4.00     0.00      L0 DMU dnode
 1.15K  19.9M   4.67M   14.0M   12.2K    4.26     0.00  DMU dnode
     3    12K     12K     28K   9.33K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     -      -       -       -       -       -        -  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
     -      -       -       -       -       -        -  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     -      -       -       -       -       -        -  ZFS plain file
     -      -       -       -       -       -        -  ZFS directory
     1    512     512      8K      8K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     1   128K      4K      8K      8K   32.00     0.00      L4 zvol object
     6   768K    300K    600K    100K    2.56     0.00      L3 zvol object
 4.37K   559M    228M    455M    104K    2.45     0.00      L2 zvol object
 4.36M   559G    192G    384G   88.0K    2.91     1.06      L1 zvol object
 4.36G  34.9T   34.9T   34.9T   8.00K    1.00    98.93      L0 zvol object
 4.37G  35.4T   35.1T   35.3T   8.07K    1.01   100.00  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K      4K     12K     12K   32.00     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
     1  1.50K   1.50K      8K      8K    1.00     0.00  SA attr registration
     2    32K      8K     16K      8K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     -      -       -       -       -       -        -  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
    20   460K     34K    132K   6.60K   13.53     0.00  other
     2   256K      8K     16K      8K   32.00     0.00      L5 Total
     3   384K     12K     24K      8K   32.00     0.00      L4 Total
     8     1M    308K    616K     77K    3.32     0.00      L3 Total
 4.37K   559M    228M    456M    104K    2.46     0.00      L2 Total
 4.36M   559G    192G    384G   88.0K    2.91     1.06      L1 Total
 4.36G  34.9T   34.9T   34.9T   8.00K    1.00    98.94      L0 Total
 4.37G  35.4T   35.1T   35.3T   8.07K    1.01   100.00  Total

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:     52    26K    26K     52    26K    26K      0      0      0
     1K:      4  5.50K  31.5K      4  5.50K  31.5K      0      0      0
     2K:      1     2K  33.5K      1     2K  33.5K      0      0      0
     4K:  4.89M  19.6G  19.6G      4    16K  49.5K  4.89M  19.6G  19.6G
     8K:  4.36G  34.9T  34.9T  4.36G  34.9T  34.9T  4.36G  34.9T  34.9T
    16K:    375  8.09M  34.9T  1.55K  24.8M  34.9T    708  16.4M  34.9T
    32K:  4.37M   192G  35.1T      1    33K  34.9T    417  17.7M  34.9T
    64K:    873  69.2M  35.1T      0      0  34.9T  4.37M   384G  35.3T
   128K:      1   128K  35.1T  4.37M   560G  35.4T    894   179M  35.3T
   256K:      0      0  35.1T      0      0  35.4T    275  72.0M  35.3T
   512K:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     1M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     2M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     4M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
     8M:      0      0  35.1T      0      0  35.4T      0      0  35.3T
    16M:      0      0  35.1T      0      0  35.4T      0      0  35.3T

The pool itself is 6x 18TB HDDs in a mirror configuration (ZFS RAID 10), which is about ~75% full. The system is a Proxmox host with a CCTV server VM (migrated from a previous build, hence the already high usage), which uses sync writes (no SLOG in use yet, but that's coming). Performance is about what I'd expect from spinning rust, with some vioscsi problems every ~6 days. The hardware itself is fine, and I've tried all the VirtIO drivers from .204 up to the latest, and have found the latest lasts the longest before an error.

Anyways, I say all that because I’m trying to eliminate other potential issues by optimizing the storage pool (adding SLOG and a special device).

Going by the 0.3%-of-pool rule of thumb for the special device, I'd need a special device of at least 165.9GB, but that doesn't necessarily seem right?

So again, I’m hoping someone can help me make sense of all this.

Thank you!

Signed, long time lurker.

Or you just disable sync in ZFS. ZFS then tells the application "alright, everything's on disk" while it's actually still in memory, to speed things up.
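
That's a per-dataset property if you want to try it (dataset name is an example); just keep in mind that anything acknowledged while still in RAM is what you lose on a crash or power failure:

zfs set sync=disabled tank/cctv
zfs get sync tank/cctv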

"bp" stands for block pointer: the structure that holds all the metadata and txg bookkeeping and is created for each block, with a maximum size of 128 bytes. I'm not sure what else contributes to that. But more block pointers = more metadata, and the fewer records/blocks you have, the less metadata.
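
As a rough read of your zdb table above (counting everything that isn't an L0 data block as metadata, which is only an approximation):

    L1 Total ASIZE                       ~384G
    L2-L5, dnodes, space maps, etc.      ~1G
    ------------------------------------------
    metadata actually allocated          ~385G  (~1.06% of the 35.3T allocated)

So for this particular zvol-heavy, 8K-block pool, the 0.3% rule of thumb undershoots by quite a bit.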

A special device may result in better performance. But ARC and L2ARC also store metadata, so if your caching is fine, a special vdev isn't that useful. Also make sure to include small blocks in your special; the size needed depends entirely on how many small files you have. I'm fine with 1TB for metadata + small blocks on 6x 16TB, although my pool isn't at 75%. That fill percentage alone will reduce performance; anything over 50% is noticeable.
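
Setting the small-block cutoff is per-dataset; the value and names here are only examples. Blocks at or below the threshold get allocated on the special vdev, so setting it equal to recordsize pushes all of that dataset's data onto flash:

zfs set special_small_blocks=64K tank/dataset
zfs get special_small_blocks,recordsize tank/dataset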

Oh, and you have to rewrite the pool (e.g. restore it from backup) after adding a special vdev if you want existing data to benefit. All the existing metadata and small blocks are on your HDDs, and ZFS never moves data by itself; new data gets properly distributed, however.

Note that special_small_blocks is not applied to zvol blocks. zvol data will not be stored on the special device no matter the block size, although its metadata is still stored on the special device.
An SLOG is highly recommended. I run zfs_txg_timeout = 300 (5 minutes) and rely solely on the SLOG to keep data integrity in case the system crashes.
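
For reference, zfs_txg_timeout is a ZFS module parameter on Linux; the usual ways to set the 300 seconds mentioned above are:

# runtime change
echo 300 > /sys/module/zfs/parameters/zfs_txg_timeout
# persistent across reboots
echo "options zfs zfs_txg_timeout=300" >> /etc/modprobe.d/zfs.conf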


I can get the -f switch to work with a special mirror vdev,

zpool create -f tank /dev/sdb special mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

but not with a special raidz vdev.

zpool create -f tank /dev/sdb special raidz /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
invalid vdev specification: unsupported 'special' device: raidz

RAIDZ isn’t supported for special VDEVs

There has been some talk of adding this feature, but RAIDZ with all its disadvantages usually isn't worth it when you're dealing with mostly tiny stuff like metadata and small blocks. Write amplification and general waste of space (similar to what you see with zvols on RAIDZ) usually make it a poor choice, with performance being another disadvantage.

You can see double or triple the space usage, because a 4K block (assuming ashift=12, i.e. 4K sectors) can't really be striped across a RAIDZ at all: it ends up as one data sector plus one or two parity sectors plus padding, so you burn 8-12K to store 4K. That's the space efficiency of a mirror or worse, just with RAIDZ performance. The higher the parity level, the worse the ratio gets. Just make a 4K-blocksize zvol on a RAIDZ and check the space usage; it's horrible.
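
The check-it-yourself version of that last suggestion, assuming an existing RAIDZ pool called tank (names and sizes are examples):

# sparse 4K-volblocksize zvol on the RAIDZ pool
zfs create -s -V 10G -o volblocksize=4k tank/blocktest
# write a few GB into it, then flush
dd if=/dev/urandom of=/dev/zvol/tank/blocktest bs=1M count=1024
sync
# compare logical vs. allocated space, then clean up
zfs get volblocksize,logicalreferenced,referenced tank/blocktest
zfs destroy tank/blocktest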
