ZFS Metadata Special Device: Z

Traversing all blocks ...
7.04T completed (18468MB/s) estimated time remaining: 0hr 25min 46sec

Traversing all blocks ...
14.0T completed (22875MB/s) estimated time remaining: 0hr 15min 46sec

I am running that zdb command and it is giving me some rather … overzealous throughput numbers. Is this because most of the metadata is already cached from when I first looked at the file size histogram?

2 Likes

zdb -Lbbbs The-Tits did not work for me, but find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }' did, so I thought I would at least share that for the sake of “oh, neat, data”. (A commented, line-by-line version of the same pipeline is below the histogram.)

  2k: 1244509
  4k: 855919
  8k: 657447
 16k: 928653
 32k: 1434982
 64k: 1074763
128k: 771807
256k: 689394
512k: 683003
  1M: 795979
  2M: 1052508
  4M: 211914
  8M: 314324
 16M:  40600
 32M:  34653
 64M:  40199
128M:  95098
256M: 103696
512M:  82326
  1G:  44262
  2G:  14805
  4G:  27809
  8G:   1573
 16G:    898
 32G:    186
 64G:     28
128G:      5
256G:      1
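
In case the one-liner above is hard to parse, here is the same pipeline split across lines with comments; it should behave identically (run it from the root of the filesystem or dataset you want to survey):

find . -type f -print0 |
  xargs -0 ls -l |
  awk '{
         n = int(log($5)/log(2));      # bucket each file by floor(log2(file size))
         if (n < 10) { n = 10; }       # lump everything under 1 KiB into the 1 KiB bucket
         size[n]++
       }
       END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' |
  sort -n |
  awk 'function human(x) { x[1] /= 1024; if (x[1] >= 1024) { x[2]++; human(x) } }
       {
         a[1] = $1; a[2] = 0; human(a)          # scale the bucket size to a k/M/G/T suffix
         printf("%3d%s: %6d\n", a[1], substr("kMGTEPYZ", a[2]+1, 1), $2)
       }'
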
5 Likes

So for clarity, should you mirror this? I have a FreeNAS pool that I am about to upgrade to TrueNAS, and I have a 280GB Optane that I am not using to its full potential. Do I risk data loss by only having the one drive?

1 Like

Yes, the metadata special device should at minimum be mirrored.
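
For reference, adding one to an existing pool looks roughly like this; the pool name and device paths below are placeholders, not taken from anyone's setup here:

# Add a two-way mirrored special vdev to the pool "tank" (example device names)
zpool add tank special mirror /dev/disk/by-id/nvme-optaneA /dev/disk/by-id/nvme-optaneB

# Confirm it shows up and keep an eye on how much metadata lands on it
zpool status tank
zpool list -v tank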

3 Likes

What is the consequence for not mirroring it? Is there a way to rebuild the metadata tables in the event the device dies?

No.

It’d be like stripping the file allocation table off a regular file system, leaving behind just the data blocks.

Imagine a jigsaw puzzle with 10,000 pieces. Now imagine each file larger than one block on disk (which is most of them) as its own jigsaw puzzle. Then imagine mixing all those jigsaw puzzles together.

Technically, you wouldn’t lose any data with the loss of the metadata device. But effectively you would. Unless you really like infinitely hard jigsaw puzzles where each piece is a square and it isn’t clear whether it’s in the right spot until all the other pieces are in place, too.

The metadata is the index of where all the pieces are, in the right order, for each thing stored in the pool. There is no “extra” copy from which it could be rebuilt or reconstructed; the special vdev is itself the only place that data lives. If you want redundancy, it’s up to you to provide it when you build the vdev.

5 Likes

Thanks Wendell, that was my fear! I have a pool with two raidz2 vdevs, an Optane 900p that I am using for L2ARC, and another that I’m not using, but I feel like I would want the same level of redundancy. Optane seems perfect for this, but it’s hard to get multiple cards in. Are the redundancy options for the metadata special device the same as for regular vdevs?

Yep, same options. If you can sell off the 900p and get some of the 380GB M.2 Optane drives on one of those splitter/bifurcation cards, that’d be ideal. I’ve been looking to do that myself if I can snag a deal on a U.2 Optane, but they are rare and $$$$$ :frowning:

2 Likes

Just to be clear on this, to see if I understand:

This would require a backup of the regular vdevs and also the special vdev to ensure you have a proper backup of the dataset, right?

Any unforeseen gotchas when restoring such a backup?

2 Likes

I imagine that if you back up using the native ZFS tools (zfs send, zfs recv), it will pull the metadata along automagically.
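
Something along these lines, with made-up pool/dataset names; the receiving pool writes the metadata onto whatever vdevs it has, so it does not itself need a special vdev:

# Snapshot the source and replicate it, including child datasets and properties
zfs snapshot -r tank/data@backup-2020-09
zfs send -R tank/data@backup-2020-09 | zfs receive -u backuppool/data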

2 Likes

For my money, L2ARC is the way to go. The main problem with L2ARC is that it disappears after a reboot and needs to warm up again afterwards, which makes the first accesses slower (the frequency dimension is also lost). But this will soon be fixed (in ZFS 2.0.0, I think). You can also control per dataset whether you want to cache the metadata, the data, or both.
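
For what it’s worth, the per-dataset knobs for that are the primarycache (ARC) and secondarycache (L2ARC) properties, which as far as I know are already in current releases; the dataset name here is just an example:

# Cache only metadata from this dataset in L2ARC, everything in ARC (the default for both is "all")
zfs set primarycache=all tank/media
zfs set secondarycache=metadata tank/media
zfs get primarycache,secondarycache tank/media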

Also, with L2ARC you don’t really need redundancy, because it is checksummed, so any error will be caught, and the authoritative copy of the data lives somewhere on a redundant vdev in the pool and can be retrieved from there. You can use the entire SSD capacity for actual stuff with no worries. I really don’t see how you can get any better than that.

BTW, ZFS limits the speed at which it will populate the L2ARC, and the default is a ridiculously slow 8 MB/s. So if it needs to evict ARC data faster than that, the data will not end up in the L2ARC, which diminishes its effectiveness. Everyone who uses L2ARC should take a careful look at the relevant parameters to make sure they’re not hurt by decisions made 20 years ago.
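
On Linux those knobs are ZFS module parameters; a rough example of checking and raising them (the numbers are only illustrative, tune for your own hardware):

# Current L2ARC feed limits in bytes per feed interval (roughly per second); defaults are 8 MiB each
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_write_boost

# Raise the steady-state feed rate to 64 MiB/s and the extra "cold cache" boost to 128 MiB/s
echo $((64 * 1024 * 1024))  | sudo tee /sys/module/zfs/parameters/l2arc_write_max
echo $((128 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/l2arc_write_boost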

3 Likes

You shouldn’t need to do anything special; only ZFS sees the metadata, and it manages the allocation on its own. So you can take a pool that has a special device and send it to a different pool that doesn’t, and it just works. It’s also invisible at the filesystem level.

If you were to do something strange, like remove the disks and clone them, then yes, you’d need to do the same with the disks the metadata is on. They are as much a part of the pool as any other disk.

2 Likes

Special vdevs can only do mirrors, not raidz… I think. Looks like raidz is actually fine; see here and here.

Ideally you want triple redundancy. With one disk you can detect data corruption but not correct it; two disks can correct corruption; three disks let you lose a disk and still fix things. Note that ZFS by default makes multiple copies of metadata, and I’m not clear how that behaves when there is a special vdev.
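
If you want to poke at that yourself, the properties I’d look at are redundant_metadata and copies (pool name is a placeholder, and I’m not making any claim about how they interact with a special vdev either):

# "all" (the default) keeps extra copies of most metadata; "most" relaxes that
zfs get redundant_metadata,copies tank
zfs set redundant_metadata=all tank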

My own special vdev is just 4x 250GB Samsung 860 EVOs that I had lying around. You can find them cheap on eBay. Edit: it turns out the EVOs sometimes like to generate a few weird errors, and I’ve seen a few others with this issue. I’ve moved on to old enterprise drives with real PLP (actual capacitors), which are also cheap enough on eBay.
I’m currently in the middle of rearranging my pool properly, but it’s at 26.9TiB right now, with most datasets set to a 4MiB recordsize (vs. the standard 128KiB recordsize; most of my files are small, so I wouldn’t get anywhere near the full 32x difference in metadata), and my special metadata takes up 10.3GiB.
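
If you want to pull the same numbers for your own pool, something like this should do it (pool/dataset names are placeholders):

# Show per-vdev allocation, including how much lives on the special vdev(s)
zpool list -v tank

# Raise the recordsize for a dataset of mostly large files (only affects newly written blocks;
# going past 1M also needs the zfs_max_recordsize module parameter raised, if I recall correctly)
zfs set recordsize=1M tank/media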

Your Optane would better serve as an L2ARC. It’ll be persistent once ZFS 2.0 comes out, and it’s not the RAM hog it once was long ago.
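
Adding it as a cache device is a one-liner, and since L2ARC contents are expendable a single disk is fine (device name is a placeholder):

# Attach the Optane as L2ARC; cache devices can be removed again at any time
zpool add tank cache /dev/disk/by-id/nvme-optane-900p-example

# Watch reads start landing on it
zpool iostat -v tank 5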

2 Likes

Isn’t the L2ARC formed from stuff that gets evicted from / drops out of the ARC? But that data is just a hot copy of what is still in the pool proper.
IIRC the special devices move the metadata off the regular data vdevs, so they really need to be redundant, or else the pool is toast when a special device gets hosed?

1 Like

L2ARC will persist across reboots quite soon, if it isn’t already doing so in the bleeding edge. AFAIK.

4 Likes

Bleeding edge… so it might take a while to filter through to slower-release distros?

Might be worth going git master for, despite the ease of package-managed tools.

Sorry, I don’t know what is going on with this forum system. I was trying to reply to a reply, then to my own post, and both times it ended up at the top level.

1 Like

Yeah it was added in the 2.0 RC1 (along with zstd compression): https://github.com/openzfs/zfs/releases/tag/zfs-2.0.0-rc1

A few days ago 2.0 RC2 dropped: https://github.com/openzfs/zfs/releases/tag/zfs-2.0.0-rc2

2 Likes

Well, both the ARC and L2ARC are currently non-persistent*. I’m not sure if they will have the ARC stored to disk on shutdown as well?

*on the ZFS version my distro ships

Sorry, I cannot wrap my head around replying to comments. This thread is now littered with my attempts to reply to you. I have to give up trying lest I get the scorn of everyone here…

1 Like