Traversing all blocks ...
7.04T completed (18468MB/s) estimated time remaining: 0hr 25min 46sec
Traversing all blocks ...
14.0T completed (22875MB/s) estimated time remaining: 0hr 15min 46sec
I am running that zdb command and it is giving me some rather … overzealous throughput numbers. Is this from having cached the majority of the metadata already since I first looked at the file size histogram?
zdb -Lbbbs The-Tits did not work for me, but find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }' did, so I thought I would at least share that for the sake of “oh, neat, data”.
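In case that one-liner is hard to read, here it is again broken out across lines. It's the same pipeline, nothing ZFS-specific: it buckets file sizes into power-of-two bins and prints a human-readable histogram. The only change is reordering the size-suffix string to k/M/G/T/P/E/Z/Y, which only matters for absurdly large files.

```sh
# Power-of-two file size histogram, same pipeline as the one-liner above.
find . -type f -print0 \
  | xargs -0 ls -l \
  | awk '{ n = int(log($5)/log(2)); if (n < 10) n = 10; size[n]++ }
         END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' \
  | sort -n \
  | awk 'function human(x) { x[1] /= 1024; if (x[1] >= 1024) { x[2]++; human(x) } }
         { a[1] = $1; a[2] = 0; human(a)
           printf("%3d%s: %6d\n", a[1], substr("kMGTPEZY", a[2]+1, 1), $2) }'
```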
So for clarity, should you mirror this? I have a FreeNAS pool that I am about to upgrade to TrueNAS, and I have an Optane 280GB that I am not using to its full potential. Do I risk data loss by only having the one drive?
It’d be like stripping off the file allocation table on a regular file system, leaving behind just data blocks.
Imagine a jigsaw puzzle with 10,000 pieces. Now imagine each file larger than one block on disk (most of them) as its own jigsaw puzzle. Then imagine mixing all those jigsaw puzzles together.
Technically, you wouldn’t lose any data with the loss of the metadata device. But effectively you would. Unless you really like infinitely hard jigsaw puzzles where each piece is a square and it isn’t clear whether it’s in the right spot until all the other pieces are in place, too.
The metadata is the index of where all the pieces are, in the right order, for everything stored in the storage pool. There is no “extra” copy of this information from which it could be rebuilt or reconstructed; the special vdev itself is the only place that holds it. If you want redundancy, it’s up to you to provide it when you build the vdev.
Thanks Wendell, that was my fear! I have a pool with two raidz2 vdevs, and I have an Optane 900p that I am using for L2ARC and another I’m not using, but I feel like I would want the same level of redundancy. I feel like Optane would be perfect for this, but it’s hard to get multiple cards in. Are the redundancy options for the metadata special device the same as for regular vdevs?
Yep, same options. If you can sell off the 900p and get some of the 380GB M.2 Optane drives on one of those splitter/bifurcation cards, that’d be ideal. I’ve been looking to do that myself if I can snag a deal on a U.2 Optane. But they are rare and $$$$$
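For anyone wondering what that looks like in practice, adding a mirrored special vdev is the normal zpool add syntax with the special keyword; the pool name and device paths below are placeholders:

```sh
# Add a mirrored special (metadata) vdev; substitute your actual Optane devices.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Confirm it shows up under its own "special" heading.
zpool status tank
```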
For my money, L2ARC is the way to go. The main problem with the L2ARC is that it disappears after reboot and needs to warm after after that which will make the first accesses slower (also the frequency dimension is lost). But this will soon be fixed (in ZFS 2.0.0 I think). Then you can control per dataset whether you want to cache the metadata, data or both.
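For the per-dataset control, I believe the relevant knob is the secondarycache property (all, metadata, or none); the dataset name below is just a placeholder:

```sh
# Only let metadata from this dataset spill into the L2ARC.
zfs set secondarycache=metadata tank/media

# Back to the default of caching both data and metadata.
zfs set secondarycache=all tank/media
```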
Also, with L2ARC you don’t really need redundancy, because it is checksummed so any error will be caught. And the authoritative source of the information is somewhere on a redundant VDEV in the pool, so it can be retrieved from there. You can use the entire SSD capacity for actual stuff with no worries. I really don’t see how you can get any better than that.
BTW, ZFS has limits on the speed with which it will populate the L2ARC, and the default is a ridiculously slow 8MB/s. So if it needs to evict ARC data faster than that, the data will not end up in the L2ARC, which diminishes its effectiveness. Everyone who uses L2ARC should take a careful look at the relevant parameters to make sure they’re not hurt by decisions made 20 years ago.
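On Linux, the parameters I have in mind are the l2arc_write_max and l2arc_write_boost module tunables (both in bytes); the 64 MiB figure below is just an example, not a recommendation:

```sh
# Inspect the current L2ARC fill limits (default is 8 MiB per feed interval).
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_write_boost

# Raise them at runtime, e.g. to 64 MiB (values are in bytes).
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_boost

# To persist across reboots, add something like this to /etc/modprobe.d/zfs.conf:
#   options zfs l2arc_write_max=67108864 l2arc_write_boost=67108864
```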
You shouldn’t need to do anything special; only ZFS sees the metadata, and it manages the allocation on its own. So you can take a pool that has a special device and send it to a different pool that doesn’t, and it just works. It’s also invisible at the filesystem level.
If you were to do something strange, like remove the disks and clone them, then yes you’d need to do the same with the disks that the metadata is on. They are as much a part of the pool as any other disk.
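In other words, a plain send/receive carries everything over and the destination pool lays out its own metadata wherever it likes; the pool and dataset names below are placeholders:

```sh
# Replicate a dataset from a pool with a special vdev to one without.
zfs snapshot -r tank/data@migrate
zfs send -R tank/data@migrate | zfs recv -F backup/data
```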
Special vdevs can only do mirrors, not raidz, I think. Looks like raidz is fine after all; see here and here
Ideally you want triple redundancy. If you have 1 disk, you can detect data corruption but not correct it. 2 disks can correct corruption. 3 disks let you lose a disk and still fix things. Note that ZFS by default makes multiple copies of metadata, and I’m not clear how it behaves when there is a special vdev.
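The behaviour I’m alluding to is governed by the redundant_metadata and copies dataset properties; how those interact with a special vdev is exactly the part I’m unsure about, but checking them is easy enough (pool name is a placeholder):

```sh
# redundant_metadata=all (the default) keeps extra ditto copies of all metadata;
# copies controls how many copies of user data are stored.
zfs get redundant_metadata,copies tank
```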
My own special vdev is just 4x Samsung 860 EVO 250GB drives that I had laying around. You can find the things cheap on eBay. Edit: it turns out EVOs sometimes like to generate a few weird errors, and I’ve seen a few others with this issue. I’ve moved on to old enterprise drives with real PLP (they actually have capacitors), which are also cheap enough from eBay.
I’m currently in the middle of rearranging my pool properly, but it’s currently at 26.9TiB, with most datasets set to a 4MiB recordsize (vs. the standard 128KiB recordsize; most of my files are small, so I wouldn’t get anywhere near the full 32x reduction in metadata), and my special metadata takes up 10.3GiB.
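If anyone wants to pull the same numbers for their own pool, zpool list -v breaks allocation out per vdev, so the special mirror reports its own used/free space (pool name is a placeholder):

```sh
# Per-vdev capacity breakdown; the special vdev shows up as its own row.
zpool list -v tank
```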
Your Optane would better serve as an L2ARC. It’ll be persistent once ZFS 2.0 comes out, and it’s not the RAM hog it once was long ago.
Isn’t the L2ARC formed from stuff that gets evicted / drops out of the ARC? But the data is just a hot copy of what is still in the pool proper.
IIRC the special devices move the metadata out of the main pool, so it really needs to be redundant, else the pool is toast when a special device gets hosed?
Sorry, I cannot wrap my head around replying to comments. This thread is now littered with my attempts to reply to you. I have to give up trying lest I get the scorn of everyone here…