ZFS Metadata Special Device: Z

Ah, thank you.

1k: 270733
2k: 1665
4k: 2116
8k: 9244
16k: 15392
32k: 29942
64k: 86380
128k: 374239
256k: 343185
512k: 1084520
1M: 1189533
2M: 415210
1G: 3
4G: 3
8G: 11
16G: 8
32G: 19
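
(For anyone wanting to reproduce a histogram like this, here is a minimal sketch assuming a GNU/Linux userland and a placeholder mountpoint of /mnt/tank; the actual command used above may have differed.)

 # Bucket file sizes by power-of-two size class; /mnt/tank is a placeholder.
 find /mnt/tank -type f -print0 \
   | xargs -0 stat -c %s \
   | awk '{ b = 1024; while ($1 > b) b *= 2; hist[b]++ }
          END { for (s in hist) printf "%d\t%d\n", s, hist[s] }' \
   | sort -n | numfmt --field=1 --to=iec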


Thanks for the reply!

Yes, I am aware of the ARC providing primary caching. The reason I am so interested in the SvDev is that I want the metadata on persistent media (faster than 84 spinning-rust drives). Indexing that much storage can slow to a significant crawl. I was fortunate enough to jump on the Optane opportunity early and grab a significant amount of it for pennies. I want to leverage the low latency of the Optane across a number of storage tiers, which contain a variety of RAM quantities (1.5TB RAM → 256GB RAM).

Under normal workloads, these RAM allocations should do most of the heavy lifting. The problem I run into is that I am dealing with a lot of relatively massive files (video). That can chew up the ARC rather quickly, so I anticipate the ARC being flushed frequently, and the cached metadata foreseeably being flushed along with it.

I am tiering my storage, both in quantity and in speed, based on the hot→cold storage philosophy. However, the frequent flushing that results from this tiering will foreseeably put undue stress on the system(s), versus a proper and redundant SvDev deployment.

At least that is what I am trying to avoid (system stress and slow-down).

I get what you are saying about the L2ARC. But my understanding is that the only two dials you can turn on it relate to MFU and MRU, which doesn’t solve my perceived frequent-flush issue.

I managed to get the TrueNAS guys to go into more detail on their T3 Podcast, and got all of the info I needed out of their answers: specifically, a rule of thumb for planning and initial deployment numbers for metadata, and metadata only.

I can plan and allocate based on that, then fine-tune, with some wiggle room and potential scalability in mind.

Thanks for your reply though. I get what you are saying.

l2arc_mfuonly = 2 was targeted directly at your use-case. I can attest that an L2ARC configured thusly populates slowly with minimal churn/turnover. There are scripts to “walk” the directory structure via cron job if you want to “heat up” the metadata and encourage its population into L2ARC.
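
(If it helps, a minimal sketch of both pieces on Linux/OpenZFS; the mountpoint /mnt/tank and the cron schedule are placeholders, and the tunable should go into /etc/modprobe.d to survive reboots:)

 # Feed L2ARC from the MFU; as I understand it, 2 also admits metadata from the MRU.
 echo 2 > /sys/module/zfs/parameters/l2arc_mfuonly

 # Example /etc/cron.d entry to "walk" the tree nightly and warm up metadata:
 # 0 3 * * * root find /mnt/tank -xdev \( -type f -o -type d \) > /dev/null 2>&1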


Same here. Tunables make the world go round. Another approach is to set dataset properties for media datasets to exclude them from cache.
Even HDD sequential reads are usually plenty, and user data doesn’t necessarily need to be cached. If that is the case, and given that already-compressed video files are the least efficient thing you can put into a cache, you can do something like this:

 zfs set primarycache=metadata secondarycache=metadata pool/dataset   # "pool/dataset" stands in for your media dataset

Do NOT do this if the dataset contains valuable non-media data. I know a lot of people like to use only a single dataset for a pool “to rule them all”.

I see the point if you reboot a lot and want to warm up the ARC quickly rather than wait for the warmup/training of your read cache. L2ARC will fill organically over time, and with persistent L2ARC you don’t lose it on reboot anyway.
I personally just let it run and let it fill up by itself… I reboot everything about once every two weeks for updates and don’t really worry about it.
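
(For reference, persistent L2ARC is, as far as I know, governed by this module parameter, which defaults to on in current OpenZFS:)

 # 1 = rebuild L2ARC contents from the cache device headers on pool import
 cat /sys/module/zfs/parameters/l2arc_rebuild_enabled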

Great advice on your side.

Awesome reply and thank you!

However, unless I am mistaken (which is common), your advice still does not solve the issue for my workload. In fact, if I am understanding it correctly, it is aimed at the opposite workflow?

I suppose my question is ultimately: is the ZFS primary cache not RAM? Please correct my ignorance kindly.

My purpose is to place metadata specifically (small files be damned) on persistent, high-speed, low-latency media.

I get that under normal workloads, the ARC and L2ARC and the fine tunables would solve these issues in a number of combinations. In fact, with a stupidly large budget, I could solve this with a giant SvDev, triple-mirror and RAID10 the problem away…

I don’t have a stupid budget, and I am trying to fine-tune what I do have for the best performance.

So to refresh the needs and the issues as I see them:

  1. Even chucking in 1.5TB of RAM is going to get flushed by a couple of video clips that have not only been requested recently (MRU) but also frequently (MFU), by the very nature of video editing.

  2. L2ARC will of course lend a hand when that ARC is saturated. However, metadata is not stored there. L2ARC is an overflow for when the ARC is flushed and requires help. Please correct me if I am wrong.

  3. Not sure (if primary cache is ARC) why I would limit my metadata to volatile memory. This defeats the entire purpose (If I understand that correctly).

I realize my workload is a unicorn among black sheep. But doesn’t it suggest a higher-level ZFS step in the workflow that could ultimately turbocharge performance?

Separate the meta-data from the small-file allocations entirely.

Why?

Because as soon as you commit to the SvDev there is no going back. The risk involved in the SvDev is IMMENSE because of its importance to pool integrity. The cost is also IMMENSE on a large pool, due to its nature.

I could simply store small files on a faster pool/vdev and tier it all; that was my plan, as it were. But I still need to index them, never mind manage them at the filesystem level… They are no more or less frequently or recently accessed, and now I STILL need to treat them as critical to THAT pool/vdev. I am no further ahead. I have only increased my complexity.

I get (and appreciate!) your suggestions. I am onboard with the solution being an L2ARC direction. I suppose I am proposing an SvDev and L2ARC hybrid as it pertains specifically to small file size persistent allocation.

Coupling the meta-data with the small file-size allocation is not really a bonus. It’s a money pit.

By the very nature of the difference between metadata and small files, the endurance, redundancy and physical allocation requirements make it an absolute nightmare, particularly on large pools with large sequential files that are frequently accompanied by small file-based meta-files.

MFU and MRU work in normal workflows. If we could somehow divert those small files to the L2ARC instead of the SvDev, in a persistent caching methodology… the L2ARC would become an absolute POWERHOUSE, versus its current “also-ran” designation.

Or maybe it can be done now and I just haven’t found it (yet).

I truly appreciate your help/replies.

This is not good advice. Memory only saves your disks from reading metadata, not from writing new metadata; for each block written, you also have to write the metadata for that block, in addition to the parity data. Point is, if I assume a block size of 1M and I write a 1M file to the pool, that is really three write operations:

  1. Writing the 1M data block
  2. Writing the parity data in a separate block
  3. Writing the metadata in a separate metadata block

That’s three operations to write one file. Having a special vdev effectively offloads roughly 1/3 of all IOPS onto the special vdev. When working with IOPS-constrained HDDs this is HUGE. It makes pools perform way faster in IOPS-intensive workloads.

But wait, there’s more! ZFS doesn’t just write the metadata once to a vdev, it writes it multiple times so that it is very unlikely to be lost. So really the effect is quite noticeable and more than the 1/3.
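
(As a side note, how aggressively ZFS duplicates metadata is visible per dataset; a quick check, with a placeholder pool name:)

 # "all" (the default) keeps an extra copy of every metadata block; "most" relaxes this
 zfs get redundant_metadata tank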

That is the point of the ARC having both an MFU and an MRU. In my experience, ZFS >= 2.0 does a great job balancing the MFU (most frequently used) and MRU (most recently used). I would not do this.

The only time I would consider this is when you know your pool cannot read fast enough for your use case and you need the video files in either ARC or L2ARC, OR you have a very RAM-constrained system, in which case dedicating ARC to metadata is more beneficial than sharing it with non-metadata (think a mini PC with 8GB of RAM, connected to a 20TB pool of 3x10TB HDDs via a USB disk enclosure).

ZFS will not flush the MFU in favor of the MRU, especially while the data (your metadata) in the MFU is hot. My personal experience is that commonly used metadata is pretty sticky in ARC. You should not be worried about this.

This is not correct. The metadata for the video clips is tiny in comparison to the video itself, and non-video metadata scales with the size of the non-video files, which is small in comparison to the video files. If you’re opening files with any frequency, the 1GB of metadata for possibly 1000s of files will have no problem sticking in the MFU against the pressure from the MRU and from the video files’ metadata, which is KBs or less in size.

Have you used the arc_summary tool to watch how ZFS is handling your ARC and its behavior? If not, I highly recommend you do so.

What is your ARC’s hit and miss rate?

What are the percentages and sizes of the MFU and MRU as they stand?

What are the hit and miss rates of the MFU and MRU? (My MFU usually has a 99% hit rate and is mostly metadata, which explains why it’s always getting hit.)

Generally speaking, you will not see much benefit from an L2ARC if your miss rate is less than 10%. With that said, you can most definitely see a benefit from a special vdev even while your ARC has less than a 1% miss rate.
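
(A quick sketch of how to pull those numbers on a typical OpenZFS install:)

 # One-shot report with ARC/L2ARC sizes, MFU/MRU breakdown and hit rates
 arc_summary

 # Rolling view, refreshed every 5 seconds
 arcstat 5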


Also, ZFS has no concept of files in ARC; it thinks in terms of blocks: which blocks were recently used, which blocks are frequently used, and whether a block is a metadata block or a regular block. Metadata blocks generally contain metadata for many regular blocks, so it is very easy for a single metadata block to become quite hot in the MFU if all the blocks it has info on are being accessed a lot.

That is exactly what special vdevs are for.

You won’t need L2ARC when you have a special vdev.

Mmmmmmmmmmm, that depends; they’re solving different problems. Whether the metadata originates on the HDD vdevs or the special vdev, you still need sufficient ARC space at level 1 and/or level 2 for good performance. The latency difference between CPU cache and memory is smaller than between memory and SSDs, not to mention HDDs. When running data-heavy applications, data locality matters. Special vdevs are more about the IOPS than the throughput, whereas L2ARC is about throughput and caching.

Although I don’t quite agree with what you wrote, let’s stay with his workload for simplicity.

ARC he has either way, no matter if he uses L2ARC or a special vdev.
(He has a little bit more ARC when not using L2ARC, since the L2ARC headers take up some ARC space.)

Some metadata is not in ARC.
It has to be read from either the special vdev (an SSD) or the L2ARC (also an SSD).
Is it slower than RAM? Sure.
Does it matter? No.

So if the metadata is stored on the special vdev and it misses ARC, you don’t make it faster by reading it from L2ARC instead of from the special vdev.
The opposite is true: you make it more likely that it misses ARC and has to go down the slower route of reading from an SSD.
And that still leaves out all the performance benefits you get when writing metadata.

That is why I wrote that L2ARC will not help in his case.

AWESOME replies! @zmezoo & @ThisMightBeAFish

Thank you. Learning a lot.

So if I have read/understand correctly:

  1. ARC prioritizes Metadata.
  2. SvDev (metadata) is a supplementary persistent metadata storage location.
  3. The ARC is still GOD, and GOD likes metadata, and the SvDev (metadata) is simply a faster than HDD persistent repository.
  4. The biggest advantage of the SvDev is during the write function, with some supplemental/auxiliary benefits in read, depending on how sticky the metadata is in ARC.

I’m all in on ARC, always have been. If the metadata is stickier than documented/thought, perhaps I am overthinking this. I am trying to index 180 HDDs of varying sizes for large storage and am simply worried that large files will flush my metadata from ARC. Contrary to popular belief, ignorance is not bliss :upside_down_face:.

I have a more general question now:

Can I run 2 separate SvDevs simultaneously? One for Meta, and another for small files? Can I simply separate them logically and turn the dials to store metadata on one, and small files on another?

If so… This might be the path of least resistance, without trying to reinvent a perfectly round wheel.

I am literally deploying tomorrow. So I do not have the luxury of testing this. I am however in the conundrum of allocating what I have to best achieve performance for my workflow.

Thanks again for the great replies. SUPER helpful.

No. ARC has two buckets, frequently used and recently used, and it sorts blocks of any type into these two buckets depending on whether they are frequently used or recently used. Metadata blocks typically are frequently used and therefore make it into that bucket easily. Since ARC does not evict based solely on one or the other, each bucket will have some amount of data in it.

Not always. You could have a special vdev that is an HDD, which would be no faster than the main pool; it would still provide some benefit by relieving load from the main pool. It’s just that by the time special vdevs were introduced alongside dRAID, SSDs had long been around, and it makes the most sense for specials to be on small, fast SSDs, which are great with IOPS.

Any time you read, you must first read the metadata to know where to read, so specials relieve the main pool of an IOP; this can have as big an impact on read IOPS as it does for writes, or bigger. Any time we can keep the slow HDDs in a pool from seeking is big. Again, the metadata is relatively small; it’s not about throughput, it’s about IOPS.

Don’t know the answer to this one.

All specials will automatically have metadata stored on them and you can’t turn this off; it is their main function.

PSA reminder: if you have a special vdev, any metadata that is stored there is NOT DUPLICATED onto the pool. If you lose the special, you lose the related metadata, and by proxy the data on the related vdevs. Specials have to be mirrors, and it’s highly recommended to make them 3-way or greater, especially in production, of varying age and manufacture so it’s highly unlikely they all fail at the same time.
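
(For completeness, a sketch of what adding such a special vdev looks like; the pool name and device paths are placeholders, and zpool may ask for -f because the mirror does not match a RAIDZ pool's replication level:)

 # Add a 3-way mirrored special (allocation class) vdev to an existing pool
 zpool add tank special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B /dev/disk/by-id/nvme-C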

I feel like this is a bad plan; I would definitely try to fit in some quick tests, especially at the scale you’re talking about. Have you looked at dRAID?


No, and there is no reason to. When you create a dataset, you can define this under the advanced options: Metadata (special) small block size.

By default this is 0, so it will store only metadata.
If you set this to, for example, 16K, all files that are 16K or smaller will also be stored on the special vdev.
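
(Outside the TrueNAS UI, the same knob is the special_small_blocks dataset property; the dataset name below is a placeholder:)

 # Blocks of 16K or smaller from this dataset will be allocated on the special vdev
 zfs set special_small_blocks=16K tank/projects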

When the special vdev reaches 75% capacity, small files will no longer be stored on the special vdev but on the pool. So you don’t even have to worry about the special vdev overflowing.
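
(If I remember correctly, that threshold maps to a module parameter that reserves the last 25% of the special vdev for metadata:)

 cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct   # default: 25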

This small-file setting could come in especially handy if you use dRAID. dRAID has one big disadvantage: the minimum stripe size. Depending on how you set it up, it could be that the smallest stripe of your dRAID is 40K. Now you write a 4K file and have to use 40K for it. Huge waste. With a special vdev you could simply offload all files smaller than 40K.

Highly agree for this setup.

You are :slight_smile:
But seriously, L2ARC and SLOG you can still look at afterwards.
The svdev you have to decide on in advance, because existing metadata will NOT be moved just because you add one afterwards.

I would also set your dataset record size to 1MiB, or even 16MiB, if you only store huge video files.
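
(A sketch for a video-only dataset, with a placeholder name; note that anything above 1M needs the large_blocks feature and, on older releases, a raised zfs_max_recordsize:)

 zfs set recordsize=1M tank/video
 # or, for really huge sequential files:
 zfs set recordsize=16M tank/video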

You should think more about your pool. You said that you have what, 80 drives? How big are they?

You could go with 10x RAIDZ2 8 drives wide. That would give you 75% storage efficiency.

Or you use dRAID. Unfortunately I have no practical experience with dRAID. Take this with a huge grain of salt!
dRAID2:6d:2s:82c would be similar to the RAIDZ2 above, but with two spares (which is a big plus for dRAID, as they are not bound to a single RAIDZ) and also 75% storage efficiency (if we ignore the spares). With 6 data drives, the smallest stripe would be 24k, which you could catch with the sdev.

Or if you value storage more than performance, you could also use something like dRAID2:18d:2s:82c to get 88% storage efficiency and a 72k stripe size.
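
(The stripe-size arithmetic behind those numbers, assuming ashift=12, i.e. 4K sectors: the minimum stripe is roughly the number of data drives times 4K.)

 echo $(( 6 * 4 ))K    # dRAID2:6d  -> 24K minimum stripe
 echo $(( 18 * 4 ))K   # dRAID2:18d -> 72K minimum stripe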

Thank you @ThisMightBeAFish and @zmezoo AGAIN for first spending the time to reply, and AGAIN for doing it so eloquently and clearly. Your patience is saintly, and your kung-fu is STRONG.

I fully understand the sensitivity of the meta-data, the requirement for triple mirrors (as a result), and the underlying dynamics of WHY you would deploy an SvDev. But as you so eloquently pointed out… I need to make this particular decision at the START, with no real way of reverting. Hence my manic need to have a full grasp of this, since all of the above = Dollars, PCIE Lanes (more $), Physical Accommodation (More $).

I snatched up some 12th-gen Dell R820s as my head units (I have 2, and they will be tiered). They cost me almost nothing. I fully understand the restrictions of this generation, and thus my conundrum. I also snatched up 4 Optane 905P 1TB drives for C$300.00/ea before anyone noticed, and I would have grabbed 8 if they were available.

I have a Quad (mezzanine) 820 with 1.5TB of RAM. The other is a Dual with 758GB of RAM.

The Dual 758GB is going to be my ARCHIVE head-unit that is servicing a Dell 84 Bay Disk Shelf with 8TB 3.5" SAS2 6Gbps HDD.

The QUAD 1.5TB is going to be my “warm/current” storage head-unit where I am storing my current working projects AND frequently used everything, which will be backed up to the archive. This unit has 4 JBODs @ 1.2Gbps through to 96x 1TB 12Gbps HDDs.

BOTH have the R820 NVMe add-on (cheap as well, BTW), and I understand the PCIe 2.0 downgrade. But as you again pointed out, it’s not about the throughput… it’s about the IOPS, compared to bareback native. The Optane @ 3.0 was not purchased for the sequential speed. They were bought for the endurance and the latency (IOPS).

My problem, of sorts, is that I only have 4 suitable drives, and they are discontinued. So I am up in the air about whether to run them on the 96-drive WARM storage, where my ARC is strong, or on my archive, where my ARC is smaller but the access is also less.

Simple answer is neither; I have lots of ARC. Sequential is easy. Until you encounter my wife, who wants to browse the Jellyfin catalogue like a Parkinson’s patient, and then starts to bitch about the 42U rack in the garage that makes a racket. I’ve got solar, so no belly-aching about bills. But the struggle is real.

So;

My original question was prudent in its logic, because I lacked knowledge of a general rule of thumb for metadata size. Thanks to you and the T3 TrueNAS crew, I now have an excellent understanding.

But now that I have waded through the weeds and the swamps to finally find the gate to the Jade Castle, I have some “philosophical questions”, as it were.

WHY?

My situation is not common, but also not unique. With recent enlightenment, I find myself in this really weird “optimization” paradox, where the only common answer is predictably enterprise: “Throw more money at it!”

As a result. One question really BURNS for me. A question for the ZFS Devs (LOVE YOU).

If we can put small files on the SvDev dynamically in such a beautiful and elegant way… why can’t we put them dynamically and elegantly elsewhere? Preferably NOT on an SvDev that is so battered (that it needs M3), so critical (that it needs M3), and that must be high-IOPS (the unicorn of persistent memory) for the entire pool?

Am I missing something? Is there already a trick to do this, and I am the village idiot looking for MOMMA?

Why can’t I simply put (dynamically) small files on a separate SvDev (NVME)?

I will not pretend to understand the underlying logistics of ZFS, nor criticize it in any way.

But am I alone in this realization? From a logical and philosophical standpoint, of course…

Thanks for your patience.

This is speculation, but I think the small-blocks feature was intended to combat fragmentation. Any block smaller than the pool’s set block size is a fragment.

However, as Wendell pointed out, we home labbers can abuse it for other purposes. Honestly, if you’re in a commercial environment, you’re probably just building two pools, one for small files and one for large, each tailored to the needs of its respective workload. Us home labbers are just repurposing a ZFS feature to take advantage of it, so we don’t have to do that.

My opinion and advice is: if you have enough small files that you would like them on an SSD, just get a pair of 1TB SSDs, make a mirror pool, and put them there. Linux’s mount-anything-anywhere design means the mount points of the various ZFS pools and datasets can be arranged such that, from a user’s standpoint, you would never know certain folders are on different pools.
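
(A minimal sketch of that approach; the pool, dataset, device names and the mountpoint are all placeholders:)

 # Small, fast mirror pool on two SSDs
 zpool create fastpool mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B
 # Mount a dataset from it inside the big pool's directory tree
 zfs create -o mountpoint=/mnt/tank/projects/assets fastpool/assets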

That is the perfect match made in heaven use case for svdev! At least it was for me.

Where are you going with this?
What is it that you are trying to achieve?
Is it just the pool being mission-critical that bothers you?
If yes, you can use the read-only and not mission-critical L2ARC.
Or simply only your ARC. ARC even “caches” (I hate that word when talking about ZFS) async writes in 5-second groups, so data gets written down in a sequential fashion.
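
(That 5-second grouping corresponds, as far as I know, to the transaction group timeout:)

 cat /sys/module/zfs/parameters/zfs_txg_timeout   # default: 5 (seconds)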

But you can? Sure, you can’t move it afterwards. And because the sdev is not some kind of cache, the data lives ONLY there, which is why it is mission-critical. If the sdev is gone, the data is nowhere else.

I don’t see how the sdev could help with fragmentation. But it can help with IOPS offloading, with speeding up metadata or smaller files, and against suboptimal pool geometry or dRAID limitations.