ZFS: Metadata Special Device - Real World Perf Demo

Yeah it’s fucked. Those are all individual special vdevs.

Well yeah, [quote=“Log, post:61, topic:191533”]
it’s fucked
[/quote]

Seems like you've got a nice capacity-optimized RAID0 configuration now. With RAIDZ vdevs in that pool you can't remove the special vdevs anymore, so you have to destroy the pool and restore from backup to change this (which is what you want to do anyway, because you want the metadata that's currently on the HDDs to end up on the special vdev, and ZFS doesn't move existing data by itself, so it stays on the HDDs).
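For reference, recreating a pool with the special vdev in place from the start generally looks something like this (just a sketch; the pool name, layout, and device names are placeholders, not your actual disks):

# destroy the old pool (after backing up!), then recreate it with the
# data vdev(s) and a mirrored special vdev in one command
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf \
  special mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1

That way the metadata lands on the special vdev as the data is restored from backup.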

Use the command I posted a couple of posts earlier in the shell to get a properly mirrored setup.

It took a few days, but I backed up the data and then rebuilt the pool. Before I transfer the data back, is this the configuration I want? The goal was a 330GB striped/mirrored vdev for metadata.

[Screenshot: rebuilt pool layout, 2023-01-20]


Yeah that looks like it’s probably what you want, presuming you wanted 3 vdevs of 2 disk mirrors, and not 2 vdevs of triple mirrors. I’m assuming the special vdev drives are ~110GB.

Note that, as a consequence, your RAIDZ2 vdev has to lose 3 disks to kill the pool and 2 to be vulnerable to bitrot, but a special vdev mirror only 2 and 1 respectively. As long as you have automatic backups, and another drive to throw in should a special vdev drive fail, this shouldn't be an issue. If you ever needed to, you could take almost any other HDD or SSD and throw it in to replace a failed special vdev drive, because they'd be larger. The performance of that special vdev would suck, but it'd work, and the drive could be switched out later when you have a proper replacement.
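Swapping in a spare for a failed special vdev member is just a normal replace; a sketch, with placeholder pool and device names:

# replace the failed special vdev member with whatever spare is on hand,
# then watch the resilver
zpool replace yourpool ata-FAILED_SSD_SERIAL ata-SPARE_DRIVE_SERIAL
zpool status yourpool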

Note you can explicitly check sizes, allocated and free space on a pool using zpool list -v yourpool

An example:

root@pve:~# zpool list -v kpool
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
kpool                                      165T   138T  26.4T        -         -     0%    83%  1.00x    ONLINE  -
  raidz2-0                                54.6T  46.4T  8.14T        -         -     0%  85.1%      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG16LHZ    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG1SSMZ    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG8MK2Z    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGDT2NN    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGGZWKN    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGJDJEZ    9.10T      -      -        -         -      -      -      -    ONLINE
  raidz2-1                                54.6T  45.8T  8.77T        -         -     0%  83.9%      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGKSBKZ    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGLDJ3Z    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGLWUHZ    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1SHL0DBZ    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YJ8PV5D    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YK1BK9D    9.10T      -      -        -         -      -      -      -    ONLINE
  raidz2-2                                54.6T  45.8T  8.74T        -         -     0%  84.0%      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YK20S1D    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_JEH2XVPM    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_JEHN2MXN    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_JEHNVVRM    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_JEJ4WBVN    9.10T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_JEKGYS0Z    9.10T      -      -        -         -      -      -      -    ONLINE
special                                       -      -      -        -         -      -      -      -  -
  mirror-3                                 894G   140G   754G        -         -    12%  15.6%      -    ONLINE
    ata-MK000960GWTTK_PHYG1141030Z960CGN   894G      -      -        -         -      -      -      -    ONLINE
    ata-MK000960GWTTK_PHYG114103CS960CGN   894G      -      -        -         -      -      -      -    ONLINE
    ata-MK000960GWTTK_PHYG114201LE960CGN   894G      -      -        -         -      -      -      -    ONLINE
    ata-MK000960GWTTK_PHYG114201RM960CGN   894G      -      -        -         -      -      -      -    ONLINE

As you can see, the totals for each vdev can be different from other vdevs. This is normal.


I like this idea, even with my flash pools.

HDDs, and especially flash, can be quick to rebuild; having any drive on hand with the right size/capacity can serve as a band-aid to maintain integrity until a better-quality replacement has been ordered and badblocked/tested.

As long as I have a mirrored 330GB special device for the recommended 0.03% size, I should be good.

I do have a spare hard drive and an Optane drive just in case of a failure.

Thanks for the help. I’m currently transferring the data back.

This value is highly dependent on the record/block size of your datasets. Lower recordsize means more records and thus more metadata, and vice versa. But unless you're running a lot of 4k/8k blocks on your pool, you'll be fine with metadata-only on the special vdev. With small blocks included, though, 330GB isn't that much really.

edit: but judging from the roughly 3GB per 5.5TB in that list, all is fine.

I want to know how I can measure this kind of information. What command produces these numbers?

Sequential Read: 1699MB/s IOPS=1
Sequential Write: 612MB/s IOPS=0

512KB Read: 1579MB/s IOPS=3159
512KB Write: 583MB/s IOPS=1166

Sequential Q32T1 Read: 1711MB/s IOPS=53
Sequential Q32T1 Write: 618MB/s IOPS=19

4KB Read: 352MB/s IOPS=90189
4KB Write: 252MB/s IOPS=64605

4KB Q32T1 Read: 1678MB/s IOPS=429603
4KB Q32T1 Write: 641MB/s IOPS=164312

4KB Q8T8 Read: 2047MB/s IOPS=524246
4KB Q8T8 Write: 718MB/s IOPS=183897

What is the special_small_blocks value you set?

The tool is called fio and available in basically all repos. Great tool.
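If you want to reproduce numbers like the ones above, a minimal fio run looks roughly like this (a sketch; the file path, size, and job parameters are placeholders to adapt, and you'd vary --rw, --bs and --iodepth to cover the other rows):

# 4k random read at queue depth 32 against a test file on the pool
fio --name=randread4k --filename=/yourpool/fio-testfile --size=4G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting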

64k or 128k is probably what most people use. I have mostly 256k recordsize and I'm using special_small_blocks=128k. Anything up to and including this value is stored on the special vdev.
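For reference, it's a per-dataset property; a sketch with a placeholder dataset name:

# send all blocks up to 128K to the special vdev for this dataset
zfs set special_small_blocks=128K yourpool/yourdataset
# check what is currently set, next to the recordsize
zfs get recordsize,special_small_blocks yourpool/yourdataset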

Thank you for your answer. I want to know what size of special metadata vdev should be added for a given pool size. Is there a formula for that?

I was using special devices on my home NAS. However, I don't think it was worth the effort. In most cases you're better off just adding some cache devices; a cache device can cache metadata too, you just need to wait for the cache to get warm.
Adding log devices is also recommended to improve sync writes.
As someone mentioned, a special device has to be mirrored, so it takes at least 8 PCIe lanes. Generally, I don't mirror cache and log devices.
Special devices also add a point of failure, and an opportunity for human error. If you lose the special devices, you lose the pool. That's not an issue if you lose cache or log devices.
In my opinion, the priority should be: cache > log > special.
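For what it's worth, adding cache and log devices to an existing pool is straightforward; a sketch with placeholder pool and device names:

# L2ARC cache device (no redundancy needed; the pool survives its loss)
zpool add yourpool cache nvme0n1
# SLOG for sync writes (often mirrored, though not strictly required)
zpool add yourpool log mirror nvme1n1 nvme2n1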


There is a roughly fixed amount of metadata per record or block, so the more blocks/records you have, the more metadata you get. Storing data in large records/blocks results in relatively tiny amounts of metadata; usually you end up with 1-10GB per TB of data.
Special vdevs are particularly useful if you use a lot of 4k or 8k records on a large pool, because you can't cache all that metadata. Or for storing small files, which is probably the most beneficial part for home usage. Sizing entirely depends on how many small files you have.
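For a concrete number on an existing pool rather than a rule of thumb, zdb can print block statistics broken down by type; a sketch, assuming a pool named yourpool (it traverses the whole pool, so it can take a while on a big one):

# per-type block statistics; the metadata rows show how much space
# metadata actually occupies on this pool
zdb -Lbbbs yourpool

As a rough cross-check: a block pointer is 128 bytes, so 1TB of data at 128K recordsize is about 8 million records and on the order of 1GB of indirect-block metadata, which lines up with the 1-10GB per TB above.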

A special vdev being too small is not a problem. Once it fills up, the overflow just gets allocated to the HDDs as normal and cached in ARC. Just get a mirror of 1-2TB NVMe or SATA SSDs if you're short on PCIe lanes and you'll be fine.


Thank you for your patient answer.

Hi all!

I replied elsewhere, sorry, meant to find this thread.
I’ve got this board:
https://www.supermicro.com/en/products/motherboard/a2sdi-8c-hln4f
I have all 12 SATA ports in use, and the board only has a single PCIe 3.0 x4 slot.

Wendell is selling me hard on low latency 118GB discount Optane drives.

I do not see a PCIe bifurcation option for the board, despite AnandTech saying the processor is capable of it:

“the Atom C3000 features a PCIe 3.0 x16 controller (with x2, x4 and x8 bifurcation), 16 SATA 3.0 ports, four 10 GbE controllers, and four USB 3.0 ports.”


This helpful poster @jode has replied here, basically telling me I'm boned if I want 4x M.2.

So, to my question(s):

PCIe 3.0 x4 can do about 4GB/s. If I'm comfortable losing bandwidth, can I purchase a PCIe-switching card with 4x M.2 slots? (I'll still get the great latency, right?)

Unfortunately, finding a card that is only x4 with four M.2 slots is very difficult:
PCIe 3.0 x4 to 4x M.2 NVMe Adapter Card, IO-PCE2806-4M2 - PCIe - Storage Products - My web (I think that one won't work?)

Is this simply not possible? The 118GB drives are reasonably priced and I have no other use for the PCIe slot. The NAS has been particularly slow at accessing drives with lots of files of late, and I highly suspect Wendell's metadata ‘trick’ is right up my alley.

Continuing the discussion from HP NL36 Microserver - PCI-e Bifurcation (Yes, it's enterprise you cheeky person!):

Well, that is unfortunate, but it is what it is. (Funnily enough, I have an N40L and an N54L under a desk here that I need to sell eventually, but that's another story.)

So the question is: if I can't do PCIe bifurcation, is there a solution that gives me 4x M.2 devices, fits into the x4 slot on this old classic motherboard, and doesn't hold me back too much?

Assuming this is a no?

This vastly more expensive thing is a yes? (Note the “no bifurcation” in the description.) Except it's not an x4-capable card:
https://www.amazon.com.au/KONYEAD-Card-PCI-Support-Non-Bifurcation-Motherboard/dp/B0CBRCFY69?th=1

Yes, you can use a card with a PLX chip (no bifurcation needed). There are things to watch out for:

  • Most such cards are PCIe x8 or x16, meaning they are physically larger than the slot on your mobo. However (and I cannot really tell from the pics), it looks like the PCIe slot on your mobo is open on one side, allowing larger cards to fit into the slot. Assuming this is true (or you would be willing to modify your slot as a customization), these cards would work in your mobo with speed reduced to the 4 available PCIe lanes.
  • Any hop required between device and CPU will add latency. I cannot guess how much latency is added by a PLX switch, but it’s going to be measurable.
  • PLX cards are pretty expensive (thanks, Broadcom). Expect to add a card that costs as much as your mobo.
  • Finally, with such a card you'll only get speed benefits from adding up to two Optane 1600X devices, because these use up to 2 PCIe lanes each (at superior latency). Adding more devices (say 4) would expand the capacity of Optane storage you can access, but not the bandwidth. It will likely increase latency a tiny bit, because the PLX chip has more switching work to do (but that's just a wild guess on my part). If you need more capacity than the cheaper Intel Optane 1600X offers, look for the more expensive but higher-capacity and faster (4 PCIe lanes) Intel Optane 905P.
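If you do go the switch-card route, you can at least verify what link each drive actually negotiated; a sketch (run as root, and the exact field layout can vary a bit between lspci versions):

# show capable vs. negotiated PCIe link speed/width per device
lspci -vv | grep -E "^[0-9a-f]|LnkCap:|LnkSta:"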

Really loving the helpfulness here, thanks so much!

You’re bang on, good old Patrick STH.
https://www.servethehome.com/wp-content/uploads/2019/02/Supermicro-A2SDi-8C_-HLN4F-Storage.jpg - the back is indeed open on this slot, though it runs straight into the SATA ports, so I'd need to watch out for that.

Before I make further posts, can I confirm that even compromising to just 2x M.2 slots would still require a PLX-switch-based (expensive) card?! Is there no way for me to add a dirt-cheap one?

Example:
https://www.amazon.com.au/Cablecc-Express-Raid0-Adapter-cablecc/dp/B08QDFMDG8?th=1

I think I’ve copied and pasted this data correctly (and added a row for 64GB)



How does one translate this data into required drive space for metadata? Am I missing the formula for that?

(thank you!)