ZFS Metadata Special Device: Z

Yeah it was added in the 2.0 RC1 (along with zstd compression): https://github.com/openzfs/zfs/releases/tag/zfs-2.0.0-rc1

A few days ago 2.0 RC2 dropped: https://github.com/openzfs/zfs/releases/tag/zfs-2.0.0-rc2

2 Likes

Well, both ARC and L2ARC are currently non-persistent*. I’m not sure whether the ARC will also get stored to disk on shutdown at some point?

*on my distro’s distributed zfs version

Sorry, I cannot wrap my head around replying to comments. This thread is now littered with my attempts to reply to you. I have to give up trying lest I get the scorn of everyone here…

1 Like

just use the edit function :smiley:

1 Like

ARC is not saved by default, though apparently you can adjust some settings and have ARC stored in L2ARC as well.
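
If anyone wants to experiment, I believe the relevant knobs on a Linux box running OpenZFS 2.0 look something like the below (parameter names are from the 2.0 module docs and may differ on other versions or platforms):

# check whether persistent L2ARC rebuild after reboot is enabled (run as root)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
# allow prefetched buffers into the L2ARC so more of the ARC’s contents end up there
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

I haven’t benchmarked either of these myself, so treat it as a starting point rather than a recipe.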

1 Like

Nice. I presumed it would be silly to retain/rebuild L2 if L1 has been blown away with a reboot.
I guess the L2ARC contents would also go away if you export the pool?
ARC is shared at the moment, right?
But L2 is specific to a pool?

Wait, I am yet again derailing another thread. My bad

No these are good questions. I’m glad you brought them up. I have no idea, but I’ll post back if I get an answer.

For the first part, I imagine that in many cases 8-16GB of ARC will fill up quickly enough in real use that it’s not a huge loss, as the L2ARC is generally going to be much larger. And if you’ve got 512GB of ARC, then your L2ARC is likely going to be set up mostly as a place to store the ARC’s contents rather than for increased capacity.

4 Likes

Don’t mind me, rambling again

Especially if you had a “fast” SSD pool, which wouldn’t really need an L2ARC, and a “slow” HDD pool with an L2ARC.
Maybe the ARC in main memory is shared between the two, and the L2 pointers from the slow pool would come out of its share of the ARC.
Unless ZFS doesn’t care about the source, and just populates the whole ARC with data that is read (if it’s not on an eviction list) and only the “slow” pool’s stuff is relegated to L2 when evicted.

A couple of years ago, when they were still on the illumos base talking about persistence, I was dreaming of a wide raidz3 with a few flash drives for L2, but gave up and just made fast+slow pools.
Maybe I can dream again? But I still shut down every night, so not going to lose sleep over it

For pools there’s at least an all/none/metadata-only option when it comes to the ARC and L2ARC, and it possibly applies to datasets as well.

I’ve not personally messed with it though.
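
If anyone wants to poke at those options, I believe they’re the primarycache and secondarycache dataset properties (ARC and L2ARC respectively), and they can be set per dataset; tank/media below is just a placeholder:

# cache only metadata in RAM for this dataset, but allow everything into the L2ARC
zfs set primarycache=metadata tank/media
zfs set secondarycache=all tank/media
# confirm what’s set
zfs get primarycache,secondarycache tank/media

Again, untested on my end, so double-check against your platform’s man pages.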

2 Likes

For my freenas brothers…
zdb -Lbbbs -U /data/zfs/zpool.cache Poolname/Dataset

Or just throw your poolname at it.

zdb assumes the cache is in /boot… whereas on FreeNAS it’s in /data.
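
On a stock Linux install the cache file should already be where zdb expects it (/etc/zfs/zpool.cache), so I believe the plain form works without -U (poolname is a placeholder):

zdb -Lbbbs poolname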

7 Likes

So I’m a little fuzzy on how to determine the size of the vdev needed for a specific dataset. When I run zdb against my movie dataset, I don’t get quite the same detail as I would against the pool it resides in. So the below is a snippet of the output I assume to be relevant. Currently around 7.24TB used of 32.6TB (net). Looks like the majority of my data is stored in 128K blocks. I have some ESXi servers using it for storage, hence the handful of 1M blocks.

So assuming the 0.3% figure was written correctly, that would be 0.003 * 7.24TB, which comes out to around 22GB… Or if I was proactive and sized for the whole 32TB, that would be 96GB?
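
For anyone checking the arithmetic, the rule of thumb works out roughly like this (decimal TB to GB):

# 0.3% of the 7.24TB currently used
echo "7.24 * 1000 * 0.003" | bc    # ~22 GB of metadata
# 0.3% of the full 32TB
echo "32 * 1000 * 0.003" | bc      # 96 GB of metadata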

Obligatory warning of incoming ZFS novice:

Would using a metadata special device be recommended for someone who does not have an L2ARC, a separate ZFS Intent Log device (I believe usually called a ZIL/SLOG?) or a large amount of ARC RAM?

I.e. Which is a more worthwhile addition for someone without any of the above items?

I presume the answer would be “it depends” without any context, I have provided relevant context below while trying to stay brief.

Context:
FreeNAS 2x4TB mirror, 8GB RAM, 1GbE networking. Only service running is Nextcloud.
Not used for VM storage or databases. Plain old file server that I store my crap on.

I’ve really liked what I’ve found with ZFS compared to Unraid. ZFS is very good at just storing data. Backups are so easy with ZFS send. My discovery & usage of ZFS is mostly due to Wendell’s great YT videos. Thanks Wendell!

3 Likes

The first question you have to ask yourself - is there anything to fix? Do you find transfer speeds slow, directory listings slow (the time it takes to show every file), or general latency high? If there’s no problem to fix, then don’t complicate it.

For me - sync writes were killing my performance so I added an SSD (SLC) for the ZIL. Now I want to address how long folders with a high number of files take to load (~2TB of music, for instance) and speed up the general “seek” time to find a file. So I want to add a pair of SSDs for the metadata.
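
For reference, the commands for both additions should look roughly like this, assuming a ZFS version with special allocation class support; ‘tank’ and the device names are placeholders:

# dedicated SLOG device for the ZIL
zpool add tank log sdb
# mirrored pair of SSDs for the metadata special vdev
zpool add tank special mirror sdc sdd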

1 Like

It seems like the special device is largely a win-win for all use cases, but if your workload is mostly or entirely cached in the ARC then it’s possible that you would be bottlenecked by your network before your disks. So yeah, it depends, but if you can pinpoint what part of the system is the slowest, then you can move on from there

Mike you are far more realistic than I am. If my use case were for business then I would completely agree with you. I would also save a lot of money, but this stuff is enjoyable.

My use case is personal/hobby. Currently the system performs fine, I wouldn’t say it’s problematic. However… if I could improve it while learning about some new things along the way, that is plenty enough reason for me!

The necessity for a redundant special device will likely be a showstopper for me at this point. My current system is not easily expandable to include another two 2.5" or 3.5" drives. However I do have a free M.2 slot, so I now plan to learn more about the L2ARC and ZIL.

That is a great point that I had failed to consider. I likely am very close to network bottlenecks. The system has 2x 1GbE onboard so maybe some research on SMB Multichannel or link aggregation is my next step. Thanks!

1 Like

I have 2 cents to consider?

I would recommend* against a special device if it is not redundant. In your case, I would suggest using the M.2 socket for an L2ARC (maybe with a partition for a SLOG, up to you)

*if you want a stable pool, long term, with the benefits of stable uptime ZFS affords with RAID. See ** at the end

The special device is a vdev. If you lose any vdev entirely, the whole pool stops.

All disks die, even flash, even NVMe flash.

Mirrored SSDs are recommended, because when one of the mirrored devices fails, a replacement can quickly resilver from the remaining good drive and the pool can carry on. (and they are faster, which is kinda the point)
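
Replacing a failed half of a mirrored special vdev should look the same as any other mirror member; something like the following, with placeholder pool/device names:

# see which device faulted
zpool status tank
# swap the dead SSD for a new one; the survivor resilvers onto it
zpool replace tank sdc sde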

This is different to SLOG / L2ARC.
The L2ARC is just a faster copy of data that is still on the main pool, but a little slower than the ARC in main memory. An L2ARC device can die or be removed with no lasting effects, so a single drive is fine; you can also add multiple cache devices (they stripe rather than mirror), so the L2ARC just shrinks if one of them dies.

A SLOG is just an added device that holds the ZIL (the log for sync writes). If there is no SLOG, then the ZIL lives on the main pool itself (but that on-pool ZIL is not really used when there is a SLOG)
When a SLOG dies, any incomplete transactions are disregarded, and the pool carries on as it was, reverting back to the on-pool ZIL. SLOGs are often just a single drive, but just as often seem to be mirrored, again so the system continues unfazed.

** You can of course use a single drive for ZFS, or for a part of the pool, but the system will only be as stable as the weakest link. In a laptop for example, there might only be a single 2.5" socket, and an M.2 socket. Then one could use the 2.5" for a large capacity HDD, and the M.2 for a special vdev, but even then, I would probably not go for it myself.
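
For the single-M.2-split idea above, the commands would look roughly like this (pool and partition names are hypothetical); the nice part is that both cache and log devices can be removed again later with zpool remove if the experiment doesn’t pan out:

# one partition as SLOG, the other as L2ARC
zpool add tank log nvme0n1p1
zpool add tank cache nvme0n1p2
# undo later if desired
zpool remove tank nvme0n1p2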

5 Likes

I always welcome the 2 cents!

I am in complete agreement with you. As mentioned previously, I think the need for redundancy is a show-stopper.

I moved most of my data from Unraid into a FreeNAS system because of all the fantastic ZFS benefits. Unraid is a fantastic jack of all trades OS. However a jack of all trades is often a master of none. It would be silly to jeopardize all the remarkable benefits of ZFS by introducing a non-redundant vdev into my ZFS pool.

2 Likes

The funny thing is, ZFS will allow you to do all sorts of things; it’s only later you might realise that a small SLOG device was accidentally added as a whole data vdev, and now you have to tear down a 40TB system to remove a 200GB drive
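
For anyone curious what that slip looks like on the command line, it’s roughly this (placeholder names; zpool normally warns about the mismatched redundancy unless you force it):

# the mistake: no 'log' keyword, -f pushes past the redundancy warning, SSD becomes a data vdev
zpool add -f tank sdb
# what was intended
zpool add tank log sdb
# -n previews the resulting layout without touching the pool, worth making a habit of
zpool add -n tank log sdb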

4 Likes

Oddly enough this is for my homelab as well lol. It started small and just kept growing. If the pool has data you can’t recover, I wouldn’t add this special vdev without redundancy. I’m all for learning too, so I would never say no for that reason - just wanted to make sure you weren’t digging yourself into a hole otherwise. I’m just used to working with bosses who see a new feature and have to have it, regardless of any real gains lol.

As far as your network goes - this isn’t really meant to increase raw transfer speeds. It only stores the metadata (the index) on faster storage, so think of it as improving latency: the time it takes to find the files. So, short answer: your network is not a bottleneck for this.
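
Related knob, in case it helps: I believe the special vdev can also hold small file blocks if you opt in per dataset via special_small_blocks (the value has to be below the dataset’s recordsize); tank/music is a placeholder:

# blocks of 32K or smaller for this dataset land on the special vdev
zfs set special_small_blocks=32K tank/music
zfs get special_small_blocks tank/music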

2 Likes

I’m seeing a fair bit of inconsistency in the first post here - particularly around the storage calculations. You mention the rule of thumb is 0.3% of the pool size for metadata, then say your dataset is “only” 1.9% - isn’t that significantly worse than the rule of thumb? Maybe it would make more sense if the zdb output was shown in full, with the columns labelled?
I think there’s also an unintentional duplication of headers - the first calculation has a bold header of “How much space do you need for metadata storage?”, and the second calculation has the exact same header - perhaps that one was intended to be “How many files can also be stored on the special vdev?”

Then you go on to mention you’d want 5TB of metadata space for a full pool, saying 172TB of space is the full pool, but 5TB is about 3% of 172TB, not 0.3%.

Long story short, I’m still pretty confused in terms of how large the metadata special vdevs need to be or how to calculate it with the commands given.

2 Likes