ZFS Metadata Special Device: Z


ZFS Allocation Classes: It isn’t storage tiers or caching, but gosh darn it, you can really REALLY speed up your zfs pool.

From the manual:

Special Allocation Class
The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.

A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.

Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt-in by setting it to a non-zero value.

ZFS dataset property special_small_blocks=size - This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while greater blocks will be assigned to the regular class. Valid values are zero or a power of two from 512B up to 128K. The default size is 0 which means no small file blocks will be allocated in the special class. Before setting this property, a special class vdev must be added to the pool.

VDEV type special - A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.
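If you're building a pool from scratch, the special vdev can be declared right in zpool create. A minimal sketch with a hypothetical pool name and device names (match the special mirror's redundancy to your data vdevs, per the manual above):

# hypothetical pool "tank": raidz2 data disks plus a mirrored NVMe special vdev
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd special mirror /dev/nvme0n1 /dev/nvme1n1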

The ZFS special device can store metadata (where files are, allocation tables, etc.) AND, optionally, small files up to a user-defined size on this special vdev. This makes it a great use case for solid-state storage.

This will dramatically speed up random I/O and cut in half the average number of spinning-rust I/Os needed to find and access a file. Operations like ls -lR will be insanely faster.

The ZFS special vdev should still have a good level of redundancy (like all vdevs: if it fails, your pool fails).

In my mind there are two questions:

How much space do you need for Metadata storage?

In the leftover space, how many files below size X can also be stored on the special vdev (for speeeeed)?

How much space do you need for Metadata storage?

The rule of thumb is 0.3% of the pool size for a typical workload. It can vary!
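As a quick worked example of that rule of thumb (the 100TB pool size here is made up):

# 0.3% of a hypothetical 100TB pool
awk 'BEGIN { pool_tb=100; printf("~%.0f GB of special vdev for metadata\n", pool_tb * 1000 * 0.003) }'
# prints: ~300 GB of special vdev for metadata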

I was looking at the existing Level1Techs ZFS dataset for production workloads:

# zdb -Lbbbs elbereth

Traversing all blocks ...

 348M completed (  39MB/s) estimated time remaining: 432hr 31min 50sec

The ASIZE column should help you. Our dataset was only 1.9% metadata, which makes sense since most of our space is consumed by Very Large Video Files.
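Given that 400+ hour estimate, it's worth dumping the report to a file so you can dig through the per-type block table (with its ASIZE column) later; the output path here is just an example:

# capture the block statistics report for later study
zdb -Lbbbs elbereth > /tmp/elbereth-blocks.txt
less /tmp/elbereth-blocks.txt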

How many files below size X can also be stored on the special vdev?

What we need is a histogram of file sizes, and how many files are in that size range.

Here’s some CLI magic to give you that:

 find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTPEZY",a[2]+1,1),$2) }'

And the Output:

  1k: 270448
  2k:  72548
  4k:  74510
  8k:  62799
 16k:  36386
 32k:  74070
 64k:  21528
128k:  30624
256k:  25892
512k:  83264
  1M: 140595
  2M:  23716
  4M:  14846
  8M:   4843
 16M:   3735
 32M:   2996
 64M:   4876
128M:   6730
256M:   5312
512M:   3726
  1G:   2235
  2G:   1819
  4G:    908
  8G:    700
 16G:    277
 32G:     42
 64G:     10
128G:      1
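To turn those counts into capacity, a variation on the same idea totals the bytes in files at or below a candidate threshold. A sketch, assuming GNU find (for -printf) and run from the pool's mount point:

# approximate total size of files under 128K (candidates for the special vdev)
find . -type f -size -128k -printf '%s\n' | awk '{ s += $1 } END { printf("%.1f GiB across %d files\n", s / 2^30, NR) }'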

The metadata special device also stops accepting new allocations once it is more than 75% full, though this threshold can be tweaked. I’d be more comfortable keeping utilization around 10%, give or take.

Looks like if I’ve got 8TB of space on the special device (8TB of flash, in other words) I’d be safe storing small files on the flash, no problem.

Technically I’d want ~5TB of space just for metadata with a full pool (~172TB of pool space works out to ~5TB for metadata).

These calculations seem a bit like voodoo to me, so it is something I’ll be keeping an eye on as pool usage changes.

The “files stored on special vdev” threshold is set on a per-dataset basis.

Adding the special vdev to an existing zpool

# implied stripe of mirrors, for this special device
zpool add elbereth special mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
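Afterwards, the special mirrors should show up under their own heading; standard commands to confirm:

zpool status elbereth
zpool list -v elbereth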

Setting the Small Blocks size:

# zfs set special_small_blocks=128K elbereth 

Or if you have a “videos” dataset like ours:

# zfs set special_small_blocks=128K elbereth/videos 
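Child datasets inherit the property unless they override it, so a recursive get is a quick sanity check:

# show the effective value across the whole pool
zfs get -r special_small_blocks elbereth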

Also note my ZFS recordsize is 512K. The default recordsize is 128K, and the maximum for special_small_blocks is 128K, so you want to set special_small_blocks smaller than the dataset's recordsize, because larger files are broken into more than one record. This may be non-intuitive, so let me restate it another way:

If your ZFS recordsize is 128K, then files larger than that are broken into chunks that big. The magic of special_small_blocks should, ideally, apply only to files smaller than your recordsize. If you set it equal to your recordsize, then every block will go to the special device (not an ideal situation).
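So check the dataset's recordsize before picking a small-blocks threshold. The 512K below just mirrors my setup; recordsizes above 128K need the large_blocks pool feature (enabled by default on modern OpenZFS pools):

zfs get recordsize elbereth/videos
zfs set recordsize=512K elbereth/videos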

… and now, with that, enjoy blazing fast video editing and directory load/list commands.


I think you have a typo in those last commands, 128M should be 128K.

For inexperienced users, it should be noted that the default ZFS recordsize (the maximum block size for a dataset) is 128K, and if you set special_small_blocks to 128K, every block will go to the special vdev. That’s fine if you set it on an intended dataset, since it lets you accelerate that dataset without the extra management of a separate SSD pool, but it’s not what you want at the whole-pool level.

Supposedly, special vdevs can be removed from a pool, IF the special and regular vdevs use the same ashift, and if everything is mirrors. So it won’t work if you have raidz vdevs or mixed ashift.
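If you do need to pull one out, the removal itself is just zpool remove; a sketch with a hypothetical vdev name (take the real name from zpool status):

zpool remove elbereth mirror-2
# kicks off an evacuation; watch progress with 'zpool status'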

Also, if the special vdev dies, the whole pool dies, just like any normal vdev. I’d really recommend at least a triple mirror (a two-way mirror that loses a disk can detect, but not correct, errors). There are also power-loss safety concerns with SSDs that use a cache, so have a UPS to help minimize that risk. And backups, as always.

You can also view how full the various vdevs are with this command:

zpool list -v yourpool


Awk for the win!

There is a good section on SSDs and power loss protection in the OpenZFS wiki Hardware page.
