ZFS Metadata Special Device: Z

Introduction

ZFS Allocation Classes: It isn’t storage tiering or caching, but gosh darn it, it can really, REALLY speed up your ZFS pool.

From the manual:

Special Allocation Class
The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.

A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.

Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt-in by setting it to a non-zero value.

ZFS dataset property special_small_blocks=size - This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while larger blocks will be assigned to the regular class. Valid values are zero or a power of two from 512B to 1M normally, and up to 16M when zfs_max_recordsize is raised accordingly (see pull#9355). The default size is 0, which means no small file blocks will be allocated in the special class. Before setting this property, a special class vdev must be added to the pool.

VDEV type special - A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.


The ZFS special device can store metadata (where files are, allocation tables, etc.) AND, optionally, small files up to a user-defined size. This is a great use case for solid-state storage.

This will dramatically speed up random I/O and roughly halve the average number of spinning-rust I/Os needed to find and access a file. Operations like ls -lR will be insanely faster.

The ZFS special vdev should still have a good level of redundancy: like any other vdev, if it fails, your pool fails.

In my mind there are two questions:

How much space do you need for Metadata storage?

In the leftover space, how many files below size X can also be stored on the special vdev (for speeeeed)?

How much space do you need for Metadata storage?

The rule of thumb is about 0.3% of the pool size for a typical workload, which works out to roughly 300GB of metadata per 100TB of pool. It can vary a lot!

I was looking at the existing Level1Techs production ZFS pool:

# zdb -Lbbbs elbereth

Traversing all blocks ...

 348M completed (  39MB/s) estimated time remaining: 432hr 31min 50sec

The ASIZE column is the one to look at. Our metadata came to only 1.9%, which makes sense since most of our space is consumed by Very Large Video Files.

How many small files can also fit on the special vdev?

What we need is a histogram of file sizes, and how many files are in that size range.

Here’s some CLI magic to give you that:

 find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

And the Output:

  1k: 270448
  2k:  72548
  4k:  74510
  8k:  62799
 16k:  36386
 32k:  74070
 64k:  21528
128k:  30624
256k:  25892
512k:  83264
  1M: 140595
  2M:  23716
  4M:  14846
  8M:   4843
 16M:   3735
 32M:   2996
 64M:   4876
128M:   6730
256M:   5312
512M:   3726
  1G:   2235
  2G:   1819
  4G:    908
  8G:    700
 16G:    277
 32G:     42
 64G:     10
128G:      1

The metadata special device also stops accepting new small-block allocations once it is more than 75% full (anything that doesn’t fit spills back to the normal class), though this threshold can be tweaked. I’d be more comfortable keeping it around 10% +/- used.
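
If you want to see what your particular OpenZFS build exposes for tuning that threshold, poking at the module parameters is a quick way to check. Parameter names shift around between releases, so treat this as a starting point and confirm against your zfs-module-parameters man page:

# see what special-class tunables your build exposes (Linux)
ls /sys/module/zfs/parameters/ | grep -i special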

Looks like if I’ve got 8tb of space on the special device (8tb of flash, in other words) I’d be safe storing small files on the flash, no problem.

Technically I’d want ~5tb of space just for metadata with a full pool (~172tb space is ~5tb for metadata).

These calculations seem a bit like voodoo to me, so it is something I’ll be keeping an eye on as pool usage changes.

The “files stored on the special vdev” threshold (special_small_blocks) is set on a per-dataset basis.

Adding the special vdev to an existing zpool

# implied stripe of mirrors for this special device
zpool add elbereth -o ashift=12 special mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
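
If you’re (rightly) nervous about a command that permanently changes the pool layout, zpool add has a dry-run flag; something like this prints the configuration it would create without actually adding anything:

# -n = dry run: show the resulting layout, change nothing
zpool add -n -o ashift=12 elbereth special mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1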

Setting the Small Blocks size:

# zfs set special_small_blocks=128K elbereth 

Or if you have a “videos” dataset like ours:

# zfs set special_small_blocks=128K elbereth/videos 
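
To double-check what a dataset actually ended up with (and what child datasets will inherit), the usual zfs get works:

# zfs get recordsize,special_small_blocks elbereth/videos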

Also note my ZFS recordsize is 512K. The default recordsize is 128K and, by default, the limit on small blocks is 128K as well, so you want to set special_small_blocks smaller than the dataset’s overall recordsize, because files larger than the recordsize are broken into more than one record. This may be non-intuitive, so let me restate it another way:

If your ZFS recordsize is 128K, then files larger than that are broken into records that big. The magic of special_small_blocks, ideally, applies only to files that are smaller than your recordsize. If you set it equal to your recordsize, then everything will go to the special device (not an ideal situation, unless you meant to do that for one specific dataset).
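
As a sketch of a sane combination (the dataset name here is made up): with the default 128K recordsize, setting the threshold one or two powers of two below it keeps big files on the spinning rust while the little stuff lands on flash:

# hypothetical dataset left at the default recordsize=128K
# zfs set special_small_blocks=64K tank/projects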

… and now, with that, enjoy blazing fast video editing and directory load/list commands.

27 Likes

I think you have a typo in those last commands, 128M should be 128K.

For inexperienced users, it should be noted that the default ZFS recordsize (the maximum blocksize for a dataset) is 128K, and if you set special_small_blocks to 128K, then every block will go to the special vdev. That’s fine if you set it on a specific dataset you want to accelerate, since it saves you the extra management of a separate SSD pool, but it’s not what you want at the whole-pool level.

Supposedly, special vdevs can be removed from a pool, IF the special and regular vdevs use the same ashift, and if everything is mirrors. So it won’t work if you have raidz vdevs or mixed ashift.
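
If those conditions are met, removal should just be zpool remove with the special vdev’s name as shown by zpool status; mirror-2 below is only a placeholder for whatever your pool calls it:

zpool remove yourpool mirror-2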

Also, if the special vdev dies, the whole pool dies, just like any normal vdev. I’d really recommend at least a triple mirror (once a two-way mirror is degraded, the single remaining disk can detect, but not correct, errors). There are also power-loss safety concerns with SSDs that use a cache, so have a UPS to help minimize that risk. And backups, as always.

You can also view how full the various vdevs are with this command:

zpool list -v yourpool
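
And if you want to watch the special vdev actually soaking up I/O rather than just space, per-vdev iostat works too (the 5 is a refresh interval in seconds):

zpool iostat -v yourpool 5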

6 Likes

Awk for the win!

There is a good section on SSDs and power loss protection in the OpenZFS wiki Hardware page:
https://openzfs.org/wiki/Hardware#NAND_Flash_SSDs

4 Likes

Traversing all blocks ...
7.04T completed (18468MB/s) estimated time remaining: 0hr 25min 46sec

Traversing all blocks ...
14.0T completed (22875MB/s) estimated time remaining: 0hr 15min 46sec

I am running that zdb command and it is giving me some rather … overzealous throughput numbers. Is this from having cached the majority of the metadata already since I first looked at the file size histogram?

2 Likes

zdb -Lbbbs The-Tits did not work for me, but find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }' did so I thought I would at least share that for the sake of “oh, neat, data”.

  2k: 1244509
  4k: 855919
  8k: 657447
 16k: 928653
 32k: 1434982
 64k: 1074763
128k: 771807
256k: 689394
512k: 683003
  1M: 795979
  2M: 1052508
  4M: 211914
  8M: 314324
 16M:  40600
 32M:  34653
 64M:  40199
128M:  95098
256M: 103696
512M:  82326
  1G:  44262
  2G:  14805
  4G:  27809
  8G:   1573
 16G:    898
 32G:    186
 64G:     28
128G:      5
256G:      1

5 Likes

So for clarity, should you mirror this? I have a FreeNAS pool that I am about to upgrade to TrueNAS, and I have a 280gb Optane that I am not using to its full potential. Do I risk data loss by only having the one drive?

1 Like

Yes, the metadata special device should minimally be mirrored.

3 Likes

What is the consequence for not mirroring it? Is there a way to rebuild the metadata tables in the event the device dies?

No.

It’d be like stripping off the file allocation table on a regular file system, leaving behind just data blocks.

Imagine a jigsaw puzzle with 10,000 pieces. Now imagine each file larger than one block on disk (most of them) as its own jigsaw puzzle. Then imagine mixing all those jigsaw puzzles together.

Technically, you wouldn’t lose any data with the loss of the metadata device. But effectively you would. Unless you really like infinitely hard jigsaw puzzles where each piece is a square and it isn’t clear if it’s in the right spot in the puzzle until all the other pieces are in place, too.

The metadata is the index of where all the pieces are, in the right order, for each thing stored in the storage pool. There is no “extra” copy from which it could be rebuilt or reconstructed; the special vdev itself is the only place that data lives. If you want redundancy, it’s up to you to build it in when you create the vdev.

5 Likes

Thanks Wendell, that was my fear! I have a pool with two raidz2 vdevs and I have an Optane 900p that I am using for L2ARC, and another I’m not using, but I feel like I would want the same level of redundancy. I feel like Optane would be perfect for this but it’s hard to get multiple cards in. Are the redundancy options for the metadata special device the same as for vdevs?

Yep, same options. If you can sell off the 900p and get some of the 380gb m.2 Optane on one of those splitter/bifurcation cards, that’d be ideal. I’ve been looking to do that myself if I can snag a deal on a u.2 Optane. But they are rare and $$$$$ :frowning:

2 Likes

Just to be clear on this, to see if I understand:

This would require a backup of the regular vdevs and also the special vdev to ensure you have a proper backup of the dataset.

Any unforeseen gotchas on restoring this backup?

2 Likes

I imagine if you back up using native ZFS tools (zfs send, zfs recv) then it will pull it in automagically.

2 Likes

For my money, L2ARC is the way to go. The main problem with the L2ARC is that it disappears after a reboot and needs to warm up again after that, which makes the first accesses slower (the frequency dimension is also lost). But this will soon be fixed (in ZFS 2.0.0, I think). Then you can control per dataset whether you want to cache the metadata, the data, or both.
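
For the per-dataset part, the knob I have in mind is the existing secondarycache property (all, none, or metadata). For example, to keep only metadata from a dataset in the L2ARC (the dataset name is just an example):

# zfs set secondarycache=metadata tank/videos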

Also, with L2ARC you don’t really need redundancy, because it is checksummed, so any error will be caught. And the authoritative copy of the information is somewhere on a redundant vdev in the pool, so it can be retrieved from there. You can use the entire SSD capacity for actual stuff with no worries. I really don’t see how you can get any better than that.

BTW, ZFS limits the speed at which it will populate the L2ARC, and the default is a ridiculously slow 8MB/s. So if it needs to evict ARC data faster than that, the data will not end up in the L2ARC, which diminishes its effectiveness. Everyone who uses L2ARC should take a careful look at the relevant parameters to make sure they aren’t hurt by decisions made 20 years ago.
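
On Linux the parameters I mean are the l2arc_write_max module parameter (the ~8MB-per-second feed rate) and l2arc_write_boost (extra write bandwidth allowed while the ARC is still cold). A rough sketch of bumping them, with the values purely as an example to tune for your own SSD:

# raise the L2ARC fill rate (example values; not persistent across reboots)
echo $((256*1024*1024)) > /sys/module/zfs/parameters/l2arc_write_max
echo $((256*1024*1024)) > /sys/module/zfs/parameters/l2arc_write_boost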

3 Likes

You shouldn’t need to do anything special; only ZFS sees the metadata, and it manages the allocation on its own. So you can take a pool that has a special device and send it to a different pool that doesn’t, and it just works. It’s also invisible at the filesystem level.

If you were to do something strange, like remove the disks and clone them, then yes you’d need to do the same with the disks that the metadata is on. They are as much a part of the pool as any other disk.

2 Likes

Special vdevs can only do mirrors, not raidz… I think. Edit: looks like raidz is fine, see here and here

Ideally you want triple redundancy. If you have 1 disk, you can detect data corruption but not correct it. 2 disks can correct corruption. 3 disks let you lose a disk and still fix things. Note that ZFS by default makes multiple copies of metadata, and I’m not clear how it behaves when there is a special vdev.
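
For reference, adding a three-way mirror special vdev is the same syntax as earlier in the thread, just with a third device (the pool and disk names below are placeholders):

# three-way mirror special vdev; substitute your own by-id paths
zpool add tank special mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B /dev/disk/by-id/ssd-C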

My own special vdev is just 4x 250gb Samsung 860 EVOs that I had laying around. You can find the things cheap on eBay. Edit: it turns out EVOs sometimes like to generate a few weird errors, and I’ve seen a few others with this issue. I’ve moved on to old enterprise drives with real PLP (they actually have capacitors), which are also cheap enough from eBay.
I’m currently in the middle of rearranging my pool properly, but it’s currently at 26.9TiB, with most datasets set to a 4MiB recordsize (vs the standard 128KiB; most of my files are small, so I wouldn’t get anywhere near 32x amplification of metadata), and my special metadata takes up 10.3GiB.

Your Optane would better serve as an L2ARC. It’ll be persistent once ZFS 2.0 comes out, and it’s not the RAM hog it once was long ago.

2 Likes

Isn’t the L2ARC formed from stuff that gets evicted / drops out of the ARC? The data there is just a hot copy of what is still in the pool proper.
IIRC the special device moves the metadata out of the main pool vdevs, so it really needs to be redundant, else the pool is toast when a special device gets hosed?

1 Like

L2ARC will persist across reboots quite soon, if it isn’t already doing that in the bleeding edge. AFAIK.

4 Likes

Bleeding edge… so it might take a while to filter through to slower release distros?

Might be worth going git-master for, despite the ease of package-managed tools.

Sorry, I don’t know what is going on with this forum system. I was trying to reply to a reply, then to my post, and both times it ends up at the top level.

1 Like