ZFS: Metadata Special Device - Real World Perf Demo

The tool is called fio and it's available in basically all repos. Great tool.
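For anyone who wants to reproduce the comparison, here's a minimal sketch of a small-block random-read test; the /tank/fio-test directory, sizes, and job counts are just placeholders to adapt to your own pool:

    # random 4k reads against a dataset on the pool under test
    fio --name=smallrand --directory=/tank/fio-test --rw=randread \
        --bs=4k --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio \
        --runtime=60 --time_based --group_reporting

Running it once with a cold ARC (right after import) and again after the cache has warmed makes the metadata/special-vdev effect easier to see.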

64k or 128k is probably what most people use. My datasets mostly have a 256k recordsize and I'm using 128k for special_small_blocks; anything up to and including this value is stored on the special vdev.
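In case it helps anyone following along, that setup would look roughly like this (hypothetical pool/dataset names; note that setting special_small_blocks equal to or above the recordsize sends everything to the special vdev):

    # 256K records, with blocks of 128K and smaller going to the special vdev
    zfs set recordsize=256K tank/data
    zfs set special_small_blocks=128K tank/data
    zfs get recordsize,special_small_blocks tank/data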

Thank you for your answer. I want to know what size special metadata vdev should be added for what size pool. Is there a formula for that?

I was using special devices in my home NAS, but I don't think they were worth the effort. In most cases you're better off just adding some cache devices; a cache device can cache metadata too, you just have to wait for the cache to warm up.
Adding a log device is also recommended to improve sync writes.
As someone mentioned, a special device has to be mirrored, which then takes at least 8 PCIe lanes. Generally, I don't mirror cache and log devices.
Special devices also add a point of failure and an opportunity for human error: lose the special devices and you lose the pool. Losing cache or log devices is not an issue.
In my opinion, the priority should be cache > log > special.
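For reference, adding cache and log devices in that spirit is a one-liner each; a rough sketch with hypothetical device names, using the unmirrored layout described above:

    # add an SLOG for sync writes and an L2ARC cache device
    zpool add tank log /dev/nvme0n1
    zpool add tank cache /dev/nvme1n1
    zpool status tank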


There is a fixed amount of bytes per record or block. The more blocks/records you have, the more metadata you get. Storing data in large records/blocks results in relatively tiny amounts of metadata. Usually you end up with 1-10GB per TB of data.
Special vdevs are particularly useful if you use a lot of 4k or 8k records on a large pool, because you can't cache all that metadata, or for storing small files, which is probably the most beneficial part for home usage. Sizing depends entirely on how many small files you have.

A special vdev being too small is not a problem; excess data just gets allocated to the HDDs as normal and cached in ARC. Just get a mirror of 1-2TB NVMe drives, or SATA SSDs if you're short on PCIe lanes, and you'll be fine.
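If you want a number for your own pool rather than a rule of thumb, zdb can report block statistics broken down by type; a minimal sketch, assuming the pool is named tank (this walks all block pointers, so it can take a long time on a large pool):

    # block statistics per object type; the metadata lines give a rough idea
    # of how much space a special vdev would need for metadata alone
    zdb -bbb tank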


Thank you for your patient answer.

Hi all!

I replied elsewhere by mistake; sorry, I meant to find this thread.
I’ve got this board:
https://www.supermicro.com/en/products/motherboard/a2sdi-8c-hln4f
I have all 12 SATA ports in use, and the board only has a single PCIe 3.0 x4 slot.

Wendell is selling me hard on low latency 118GB discount Optane drives.

I do not see a PCIe bifurcation option for the board, despite Anandtech saying the processor is capable.

“the Atom C3000 features a PCIe 3.0 x16 controller (with x2, x4 and x8 bifurcation), 16 SATA 3.0 ports, four 10 GbE controllers, and four USB 3.0 ports.”


This helpful poster @jode has replied here basically telling me I’m boned if I want 4x M.2

So to my question(s)

PCIe 3.0 x4 can do about 4GB/s. If I'm comfortable losing bandwidth, can I purchase a PCIe-switched 4x M.2 card? (I'll still get the great latency, right?)

Unfortunately, finding a card that is only x4 yet carries 4 M.2 slots is very difficult:
PCIe 3.0 x4 to 4x M.2 NVMe Adapter Card, IO-PCE2806-4M2 - PCIe - Storage Products - My web (I think that one won’t work?)

Is this simply not possible? The 118GB drives are reasonably priced and I have no other use for the PCIe slot. The NAS has been particularly slow in accessing drives with lots of files of late, and I highly suspect Wendell's metadata 'trick' is right up my alley.

Continuing the discussion from HP NL36 Microserver - PCI-e Bifurcation (Yes, it's enterprise you cheeky person!):

Well that is unfortunate but it is what it is. (Funny enough I have an N40L and N54L under a desk here I’m needing to eventually sell, but that’s another story)

So the question is: if I can't do PCIe bifurcation, is there a solution that gives me 4x M.2 devices, fits into the x4 slot on this old classic motherboard, and doesn't hold me back too much?

Assuming this is a no?

This vastly more expensive thing is a yes? (Note the "no bifurcation" in the description.) Except it's not an x4 card.
https://www.amazon.com.au/KONYEAD-Card-PCI-Support-Non-Bifurcation-Motherboard/dp/B0CBRCFY69?th=1

Yes, you can use a card with PLX chip (no bifurcation needed). There are things to watch out for:

  • Most such cards are PCIe x8 or x16, meaning they are physically larger than the slot on your mobo. However (and I cannot really tell from the pics), it looks like the PCIe slot on your mobo is open at one end, which allows larger cards to fit. Assuming that is true (or you are willing to modify the slot as a customization), these cards would work in your mobo with speed reduced to the 4 available PCIe lanes.
  • Any hop required between device and CPU will add latency. I cannot guess how much latency is added by a PLX switch, but it’s going to be measurable.
  • PLX cards are pretty expensive (thanks, Broadcom). Expect to add a card that costs as much as your mobo.
  • Finally, using such a card you’ll only get speed benefits from adding up to two Optane 1600X devices, because these support up to 2 PCIe lanes (at superior latency). Adding more devices (say 4) would expand the Optane capacity you can access, but not the bandwidth. It will likely add a tiny bit of latency because the PLX chip has more switching work to do (but that’s just a wild guess on my part). If you need more capacity than the cheaper Intel Optane 1600X offers, look for the more expensive but higher capacity and faster (4 PCIe lanes) Intel Optane 905P.

Really loving the helpfulness here, thanks so much!

You’re bang on, good old Patrick STH.
https://www.servethehome.com/wp-content/uploads/2019/02/Supermicro-A2SDi-8C_-HLN4F-Storage.jpg - back is indeed open on this slot, though it does go straight into the SATA ports so I’d need to watch out for that.

Before I make further posts, can I confirm that even compromising to just 2x M.2 would still require a PLX-switch-based (expensive) card? There's no way for me to add a dirt cheap one?

Example:
https://www.amazon.com.au/Cablecc-Express-Raid0-Adapter-cablecc/dp/B08QDFMDG8?th=1

I think I’ve copied and pasted this data correctly (and added a row for 64GB)



How does one translate this data into the required drive space for metadata? Am I missing the formula for that?

(thank you!)

You can add a dirt cheap card that allows you to add a single m.2 drive. Any more m.2 drives connected through a dirt cheap card and the mobo needs to support bifurcation (which yours doesn’t).
Look at the first comment in your Amazon.au listing: “you need a motherboard that has PCI-E Bifurcation support”.

So, you can add two m.2 devices to your mobo (the first in the m.2 slot, the second with a cheap card in the PCIe slot). Looking at the block diagram on p.20 of the manual, it seems as if using the m.2 slot disables access to one of the miniSAS ports for 4 SATA disks. But the manual is not clear on that (at least not to me). Should you populate both the PCIe slot and the m.2 slot, you can connect at least 8 SATA disks, or 12 if I am wrong. How many SATA devices do you have/need?

I think we have not asked the fundamental question: What exactly do you want to achieve by adding 2 or 4 Optane 1600X devices?
The mobo has four 1GbE ports. That offers a max of roughly 400MB/s of aggregate network bandwidth, but only about 100MB/s for any single connection (a limitation of LAG). A ZFS pool of 8+ HDDs should in most cases saturate a 1GbE connection.
ZFS caches metadata by default in memory (ARC). So, special devices to store metadata improve performance on first access and after that only if there is not sufficient memory.
ZFS special devices can also store data blocks up to a specified size. This is beneficial because it allows faster disks to overcome the severe performance limitations of HDDs with small blocks. But any performance benefits beyond what your NICs can transfer would be lost. Obviously, storing not only metadata but (small) data blocks as well potentially consumes a lot of space. Will 118GB Optane drives suffice for this?
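One way to sanity-check whether 118GB would be enough, assuming a hypothetical mountpoint of /tank and a 128K small-block cutoff, is to count the small files (a rough sketch using GNU find):

    # count and total size of files smaller than 128K
    find /tank -type f -size -128k -printf '%s\n' \
        | awk '{n++; s+=$1} END {printf "%d files, %.1f GiB\n", n, s/2^30}'

That ignores metadata itself and compression, but it gives a ballpark for the small-block portion.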

Do you have your ZFS NAS assembled already? If yes, what performance limits do you observe? Why don't we think about whether adding Optane special devices can/will actually benefit your use case?

I hadn’t even contemplated the onboard m.2 slot because as you say, they often disable SATA ports. (I have not tested this on this board but you’ve checked the manual for me and I’d be surprised if they got it wrong, I can confirm in 3 weeks though)

I am using all 12 SATA ports and need all 12; the system has served me well for 6 years. (Heck, I even boot from USB and have done so for a decade without issue.)

My system can at times really pump along perfectly well with large sequential file transfers, but with some data it can really slow down when connecting to the mapped drive letter. (This drive is full of many, many small files and folders.)
Heck, the whole system sometimes just feels sluggish. Never an error, never lost data, she's just 'sluggy'.

I was considering using the Optanes for either L2ARC or metadata; Wendell's YouTube video is pretty compelling. I like instantaneous responses.

Also, I’ve recently sold a lot of old junk gadgets and so financially I can comfortably afford several hundred if it delivers some improvements.

Oh and I’m also upgrading from 32 to 64GB, I’ve already placed the order.

This Konyead 3003k device sounds OK at only $180 after a 10% coupon.
Uses the ASM2824 switch and that’s really not that much money.

Certainly tempting, even if I only get about 4GB/s shared across 4 NVMe/Optane drives (roughly 1GB/s each)
(if it does indeed work in an x4 slot)

https://www.amazon.com/KONYEAD-Card-PCI-Aluminum-Non-Bifurcation-Motherboard/dp/B0CBRCFY69?th=1

The Konyead 3003k should allow you to add 4 m.2 drives to your mobo. I hope you have a case that accepts full size cards.

I am certainly interested in what happens when you plug a card into the m.2 slot on the mobo. But please note that if this disables 4 SATA ports, your ZFS pool will not start (it should not cause any other issues, though).

Also, after adding special devices to a pool, existing data doesn't use them automatically; only newly written files will.
To store the metadata of existing files on the new special devices you will need to copy them within the pool.
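A minimal sketch of that rewrite, with hypothetical paths; copying the files makes ZFS write fresh blocks (and fresh metadata), which then land on the special vdev:

    # rewrite a directory in place so its metadata moves to the special vdev
    cp -a /tank/data/projects /tank/data/projects.new
    rm -rf /tank/data/projects
    mv /tank/data/projects.new /tank/data/projects

A zfs send/receive of the whole dataset achieves the same thing while keeping snapshots intact.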

Yet again you've saved my bacon. Can't thank you enough at this point. I'd forgotten about the drive cage. (Node 304, stacked heavily with hardware.)





(I’ve long since ditched that card, it doesn’t suit my needs, systems or network etc)

Old pic with a half height, 145mm deep card in there using most of the space.
I would guess the best I can do is about 165mm full height

So I can consider a 2xNVMe card, maybe!

Thanks again for all the help.

I've done my measuring and I feel I could get the Konyead PCIe quad M.2 card (the 3003) into the system. I'm having difficulty finding a PCIe x4 or x8 to x8 riser cable that I would trust, though.

Furthermore, while I suppose I should have known this, I've done some reading and I can see that the metadata is critically important and, once you make that decision, is stored only on the metadata device.

This feels a bit risky to me, to be honest, if it's hanging off a riser cable. What if the signalling is bad one day, or it pops out? How much risk is there if the metadata device is unplugged randomly? Can the pool even pull itself back together and figure it all out once I power off, fix it, and reboot the system?

I hope our interactions help someone else considering this. I am still considering going for it, but not for metadata. Perhaps 2x Optane 118GB for L2ARC and 2x regular NVMe SSDs. Heck, maybe 4x 4TBs - I'm not sure.

Ultimately I'd like to make my system more responsive in general interaction, specifically in paths with lots of files or just general filesystem calls, not necessarily raw bandwidth.
Unfortunately it seems the metadata special device is the best trick to improve that responsiveness based on the YouTube video, whereas I can see L2ARC ending up caching my wife's 90 Day Fiancé episodes or something horrific like that. (!!!) :frowning:
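For what it's worth, L2ARC can be told per dataset what it is allowed to cache, so the episodes don't have to crowd out the metadata; a rough sketch with hypothetical dataset names:

    # cache only metadata from the media dataset in L2ARC,
    # but cache everything from the small-file dataset
    zfs set secondarycache=metadata tank/media
    zfs set secondarycache=all tank/projects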

Any final thoughts on a lower risk solution?

Thanks for the help

I just want to mention the existence of USB 3.x to M.2 adapters. They are cheap.

Yeah, but they are vastly slower than even SATA. The whole point of the experiment was to try Optane to see if I can make spinning rust more responsive.

Regarding the old discussion about L2ARC vs Special and metadata in L2ARC:

l2arc_exclude_special

Has anyone tested this?
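Not tested here either, but for anyone who wants to try it, the parameter can be flipped at runtime; a minimal sketch for Linux (the modprobe.d line is the usual way to persist it):

    # 0 = blocks on the special vdev may also be written to L2ARC, 1 = exclude them
    cat /sys/module/zfs/parameters/l2arc_exclude_special
    echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special
    # persist across reboots, e.g. in /etc/modprobe.d/zfs.conf:
    #   options zfs l2arc_exclude_special=1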
