ZFS Metadata Special Device: Z

ah, apologies for that. The documentation itself is perhaps not the best.

Mine is so much larger because I plan to store a lot more than just metadata on it.

I don’t have enough experience with this yet to really say for sure. Nothing super bad happens if you undersize or fill your special device – ZFS just stops allocating to it once it gets full, and new metadata spills back onto the regular vdevs.
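If you want to see how full the special vdev actually is, `zpool list -v` breaks usage out per vdev; the pool name below is just a placeholder:

```sh
# Per-vdev allocation and capacity, including the special vdev, for a pool named "tank"
zpool list -v tank
```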

This writeup seems decent


There is an edge case where you can lose (some) data with a non-redundant SLOG, specifically if the SLOG dies before its contents are read back after an unexpected shutdown. So mirrored SLOG does address a data integrity concern.

I would guess that you also have the issue of being able to detect, but not correct, errors in a non-redundant SLOG when reading it back.
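For anyone who wants the belt-and-suspenders setup, adding the log as a mirror is a one-liner; pool and device names here are placeholders:

```sh
# Add a mirrored SLOG to an existing pool (hypothetical pool/device names)
zpool add tank log mirror /dev/disk/by-id/nvme-slogA /dev/disk/by-id/nvme-slogB
```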


RAM is also non-redundant, so the redundancy only starts once the data is written to the pool

Which is why ZFS gurus swear by ECC RAM. If I’m not mistaken, ECC is similar to RAID 2? Just bit/byte XOR on chip? I may be way off base with that, but yeah

good point, it does have parity

Memory mirroring is certainly a real thing on most server platforms, and simple ECC is ubiquitous

I’m pretty sure ECC itself is parity enough.

Mirrored memory sounds like it’s more for uptime in the rare case of hardware failure, rather than error detection. Or high availability/failover?

Synchronous writes are acknowledged when they are written to the SLOG. So from the perspective of applications, data is written to the pool when it is written to the SLOG.

The argument goes that if you’re willing to sacrifice recent writes in exchange for performance, you can achieve that better by setting sync=disabled on the dataset(s). If you instead choose to use a SLOG to improve the performance, that implies that you care about its contents.
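For reference, that trade-off is a per-dataset property; something like the following (the dataset name is just an example), with the usual caveat that you can lose the last few seconds of writes on a crash or power loss:

```sh
# Check and set the sync behaviour on a dataset (example dataset name)
zfs get sync tank/vmstore
zfs set sync=disabled tank/vmstore   # acknowledge sync writes immediately; data-loss window on crash
zfs set sync=standard tank/vmstore   # revert to honouring sync requests (the default)
```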

Just a friendly reminder for everybody here though – remember that SSDs are consumable devices, i.e. the NAND flash memory chips/cells/modules (whatever you want to call them) have a finite number of program/erase cycles.

Please be aware of that.

I have an Intel 750 Series 400 GB PCIe 3.0 x4 NVMe SSD that I am using as a swap drive. It got RMA’d last year because I wore out the write endurance limit on it.

After about a year and three months, the drive health is already down to 67%.

My point is that if you are writing a lot of small files to the SSD special device, keep in mind that you DO run the risk of consuming that finite number of program/erase cycles early. If that SSD dies (unless, as shown here, you have multiple devices so there is some fault tolerance built into your deployment), you will lose that data, and I don’t know what happens when ZFS can no longer find your special metadata device.

Just keep that in mind please.

(There’s a really good reason why I decommissioned ZFS back in 2007, and why I have decommissioned all deployed SSDs in my entire network this year.)
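If you want to keep an eye on wear yourself, smartmontools reports it directly; a rough check might look like this (device paths are examples):

```sh
# NVMe health summary; "Percentage Used" is the drive's own estimate of consumed endurance
smartctl -a /dev/nvme0
# SATA SSDs expose similar data via vendor-specific attributes, e.g. "Wear_Leveling_Count"
smartctl -A /dev/sda
```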

Um, yeah, all drives die; that is expected behaviour. That is why redundancy is a thing?
And heck yeah, log/cache devices can chew through the P/E cycles on an SSD.
The trade-off is faster data in the meantime.
Special devices will not wear as much or as quickly as log/cache devices, but simply using NAND (even on your phone) is destructive.

But flash drives typically die before the NAND chips wear out.
In fact, the last study I saw of MLC drives had them lasting Faaaar in excess of their warrantied P/E rating, but yeah, all drives die.


Insert game of thrones meme here: All drives must die


I might suggest the surplus Intel 4600s that pop up on eBay from time to time.

Those have an endurance rating of 4 complete drive writes per day, mainly because e.g. the 4 TB models have a raw capacity several TB above the usable capacity.

They typically still have 2-3 years of warranty left. That Intel 750 was rated at ~0.3 DWPD, IIRC.
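For a rough sense of what a DWPD rating means in absolute terms, total rated writes work out to roughly DWPD × capacity × warranty length; the numbers below just plug in a 4 TB drive at 4 DWPD and a 400 GB drive at 0.3 DWPD, assuming a 5-year warranty in both cases:

```sh
# Back-of-the-envelope endurance: TBW ≈ DWPD * capacity_TB * 365 * warranty_years
awk 'BEGIN { print 4 * 4 * 365 * 5 " TB rated writes (~29 PB)" }'
# Compare with ~0.3 DWPD on a 400 GB drive over the same 5 years:
awk 'BEGIN { print 0.3 * 0.4 * 365 * 5 " TB rated writes (~219 TB)" }'
```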


Indeed, this is why L2ARC is by default limited to filling at a rate of 8 MiB/s.
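That cap is the `l2arc_write_max` module parameter on ZFS-on-Linux; it can be inspected and, if your cache device can take it, raised at runtime (the 64 MiB/s value below is just an example):

```sh
# Current L2ARC fill-rate cap in bytes/sec (8388608 = 8 MiB/s default)
cat /sys/module/zfs/parameters/l2arc_write_max
# Raise it to 64 MiB/s for this boot; persist via /etc/modprobe.d/zfs.conf if desired
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
```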

Most of us aren’t doing 24/7 enterprise write loads, though, and our SSDs tend to outlive the usefulness of their capacity first.

In any case, the correct answer to “what if something bad happens” is always “restore from backups”, regardless of the software used.


"The metadata special device stops allocating more stuff when it is more than 75% full as well, though this can be tweaked. I’d be more comfortable around 10% +/-."

How can I tune this? I haven’t found it in the manpage. Is it a module parameter?
Best, Micha

Where would you temporarily move that data to? A new feature could be added to ZFS that puts a vdev into deallocation mode and off-loads and stripes everything across all the other vdevs in the background. Now, that would be cool.


Can this benefit virtualization in any way? Virtual hard disks are large files, and there are relatively few of them.

I guess I’m a bit wasteful and have spare drives I can move stuff to in the meantime; 4 TB drives can be had for a pretty good price, and a few of them make for a nice array.
But I understand not everyone has spare drives.

Also, I don’t have any drives larger than 4 TB, so a complete resilver of a drive is reasonably quick.
All these new 10 TB+ drives are a good price/GB, but I’d rather spend the same money on a few older drives and split the pain when one dies, rather than have one large one die.

As for de-allocating, there is device removal, but it’s not something I know much about, apart from it using a bunch of RAM/ARC for pointers to where old blocks were moved?
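For what it’s worth, top-level vdev removal does exist in current OpenZFS for mirrors and single-disk vdevs (not raidz); a sketch with placeholder pool/vdev names:

```sh
# Evacuate and remove a top-level vdev; the remaining references are remapped through
# an indirection table, which is the RAM/ARC cost mentioned above
zpool remove tank mirror-1
zpool status tank        # shows evacuation progress and, later, the completed removal
```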

zfs_special_class_metadata_reserve_pct

parm: zfs_special_class_metadata_reserve_pct:Small file blocks in special vdevs depends on this much free space available (int)
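So on ZFS-on-Linux it should be adjustable like any other module parameter. If I’m reading the description right, the value is the percentage of free space to keep reserved, so the default of 25 corresponds to the ~75%-full cut-off mentioned above; the 10 below is just an illustration:

```sh
# Read the current reserve percentage (default 25 -> small blocks stop at ~75% full)
cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct
# Change it at runtime (if writable on your build); persist via /etc/modprobe.d/zfs.conf
echo 10 > /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct
```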


Hey @wendell do you know if there is an option to flush a copy of your metadata down to the pool periodically? I have not had time to RTFM yet.

I am also curious whether, when you add a metadata device to an already-established pool, it is smart enough to suck the existing metadata up from the spinning rust, or whether it only handles new metadata going forward, like changing dedup/recordsize/compression on a dataset. It would be a real bitch to have to copy files back and forth between datasets for that.

I am also curious what happens if you hit ~75% usage and it goes back to writing metadata to the pool, and you then stripe a new mirror into the metadata vdev: will it sync the delta, or just go back to writing new metadata there?

ZFS deliberately never moves data around that’s already in place.

To get metadata into a special vdev, you would create a new dataset, copy the data over, delete the old dataset, and then rename the new one. This is also how you would change various properties such as recordsize, encryption, and compression.
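A rough sketch of that dance, with hypothetical dataset names and illustrative property choices; copying at the file level means every block (and its metadata) is freshly allocated under the current pool layout:

```sh
# Create the replacement dataset with the properties you actually want
zfs create -o recordsize=1M -o compression=lz4 tank/data.new
# Copy the files over (assumes default mountpoints under /tank)
rsync -aHAX /tank/data/ /tank/data.new/
# Verify the copy, then swap the names
zfs destroy -r tank/data
zfs rename tank/data.new tank/data
```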

There is no option to mirror/copy the special vdev data to the other vdevs. In ZFS, redundancy is only at the vdev level. If you lose any vdev, even if it’s special, you lose the whole pool.

ZFS’s highest priority is returning correct data, not disaster protection. It is made with the assumption that you have appropriate redundancy and backups for data you are trying to preserve.
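Which is why, when people do add a special vdev, the usual advice is to give it at least the same redundancy as the rest of the pool; for example (placeholder pool and device paths):

```sh
# Add the special vdev as a mirror so losing one SSD doesn't take the whole pool with it
zpool add tank special mirror /dev/disk/by-id/nvme-metaA /dev/disk/by-id/nvme-metaB
```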
