ZFS: Metadata Special Device - Real World Perf Demo

I’m using a couple of the following bifurcation cards:

They work fine on PCIe4 connections for me. Decently affordable.

I have an idea…
@wendell

Is a special device/vdev even needed?

L2ARC can be persistent across reboots now.

So why not set up an L2ARC (redundancy not required) and set it to cache metadata only?

Then if your L2ARC device is lost, nothing bad happens to the pool, right?
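A minimal sketch of that setup, assuming a pool named tank and a device /dev/nvme0n1 (both placeholders); secondarycache and l2arc_rebuild_enabled are real OpenZFS knobs:

  # add a cache (L2ARC) device; no redundancy needed, it holds no unique data
  zpool add tank cache /dev/nvme0n1
  # restrict what L2ARC caches for this pool to metadata only
  zfs set secondarycache=metadata tank
  # persistent L2ARC is rebuilt on import while this is 1 (the default in OpenZFS >= 2.0)
  cat /sys/module/zfs/parameters/l2arc_rebuild_enabled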

2 Likes

Theoretically, yes? But stuff gets evicted from the ARC/L2ARC in ways I don’t fully understand.

@malventano has a find . script that was running on cron once a minute on a huge pool, and the metadata was still evicted from L2ARC even though he had gobs of RAM and L2ARC space. My experience has been similar, even with 128GB of RAM and 1TB of L2ARC on machines that hadn’t been rebooted in ages and ages.
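For reference, a guess at the shape of that cron job (the path and flags are assumptions; only the once-a-minute schedule is from the post):

  # crontab entry: walk the whole pool every minute to keep metadata warm
  * * * * * find /tank -type f > /dev/null 2>&1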

I got some datasets to behave better than others with some fiddling and other difficult-to-remember incantations, whereas the special vdev was more or less idiot-proof and didn’t require me to remember anything special in order to enjoy a breathtaking pool speedup.

3 Likes

Ah, very nice to know.

Cool that you’ve already tried it.

1 Like

Just to add on to @wendell’s response: L2ARC is pretty stupid by design, and although there have been proposals in the past to make it less stupid, it’s fallen off the priority list now that the special vdev is stable and widely available. It solves this problem handily, as well as the small data blocks problem, and it mitigates the dedup table problem. And it does so without wasting RAM or requiring tuning ad infinitum to get some specific desired behavior.
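As an editor's aside on the small-blocks point: which blocks count as "small" is set per dataset via the real special_small_blocks property; the dataset name and threshold below are made up:

  # route data blocks of 32K or smaller for this dataset to the special vdev
  zfs set special_small_blocks=32K tank/dataset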

All that being said, if you wanted to explore this, the key is to get metadata out of MRU and into MFU; then you can play around with l2arc_mfuonly and other tunables to prevent MRU data from pushing MFU metadata out of L2ARC. To do that you have to touch each piece of metadata (metadatum??) twice within a very short period of time, 60ms IIRC. And probably repeat that process every so often to keep it fresh.
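A hedged sketch of that experiment (tank/dataset is a placeholder; l2arc_mfuonly is a real OpenZFS module parameter):

  # only admit MFU buffers into L2ARC, so MRU churn can't push cached metadata out
  echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
  # touch every piece of metadata twice in quick succession to promote it from MRU to MFU
  find /tank/dataset -ls > /dev/null
  find /tank/dataset -ls > /dev/null
  # repeat the double walk periodically to keep the metadata fresh in MFU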

1 Like

Ah ok… yeah, I see now why the ideal solution is the special device.

It’s been a lightbulb moment.

This needs to be developed.

The one thing stopping me from adding special vdevs for metadata is the volatility it introduces. It’s like: oh great, now I have a completely separate thing that could tank my pool. I don’t know if I want to risk the inherent loss of reliability for the performance it might bring.

However, if it could be made more L2ARC-like, so that it survived reboots but was safe to lose, I’d throw a bunch of Optane at it and not lose a ton of space or throw away money just trying to maintain quadruple redundancy.

1 Like

Can someone please explain to me why a special metadata vdev couldn’t work in tandem with redundant metadata stored on the disks along with the data to begin with? If that were an option, it would put my mind at ease and make losing that special vdev less of a worry.

To restate this another way: if you lost your special vdev for any reason, you would be able to restore it from the redundant metadata still stored on the slower medium (so, you know, not a catastrophic loss of your zpool). But at the same time, you could still leverage the speed of the special device day to day by using it as a much faster lookup table and as small-file storage.

Because there are two hard problems in computer science: naming schemes, cache invalidation, and off-by-one errors.

6 Likes

Counting from 0. Nice touch. LOL

1 Like

I’m still looking through the previous thread, but if you add a special vdev to an existing pool, will metadata be moved over to the new vdev automatically, or only when new files are added to the pool?

Only when data is added. Or moved/changed, but I’m not clear on when exactly ZFS will determine it needs to just rewrite metadata.

The surefire way is to copy to a new dataset, then rename the dataset.
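A minimal sketch of that approach, with made-up dataset names:

  # copying rewrites every block, letting the allocator place metadata on the new special vdev
  zfs create tank/data_new
  rsync -a /tank/data/ /tank/data_new/
  zfs rename tank/data tank/data_old
  zfs rename tank/data_new tank/data
  # after verifying the copy, reclaim the old space
  zfs destroy -r tank/data_old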

This script might work: GitHub - markusressel/zfs-inplace-rebalancing: Simple bash script to rebalance pool data between all mirrors when adding vdevs to a pool.

I encountered similar issues with evictions in L2ARC. Nevertheless, there is only so much RAM, and user data isn’t covered by the special vdev. For me, it’s not an either-or decision; the sweet spot for me is a special mirror along with a big L2ARC.

Non-evictable (meta)data on the special vdev is just too nice to have, and the memory extension that L2ARC provides has no substitute if you regularly deal with data far bigger than your RAM. Being able to exclude data already on the special vdev from L2ARC removes some friction when using both.

The FIFO eviction policy and the primitive round-robin allocation between cache devices make L2ARC feel a bit outdated, but persistence and the upcoming shared L2ARC across all pools are great new features. Having a bunch of user data pre-cached after a reboot is really great.

1 Like

I’m using this 10Gtek 4-Port M.2 NVMe Adapter (M-Key, PCIe x16 Gen3):

I’ve got my boot M.2 (Samsung 970 Evo Plus), the two special vdev drives (WD SN700), and a varying fourth M.2 (e.g. an Intel Optane M10) installed. On the ASRock Rack X470D4U, with x4x4x4x4 bifurcation set on the x16 slot, I’m able to boot from and see all four NVMe drives.

I also have the 2-Port version from the same company, which also works a treat.

I’m not sure if this is the thread for this, but how do you create a striped/mirrored vdev in TrueNAS SCALE?

I recently bought 8 Optane drives: two 58GB for a mirrored SLOG and six 118GB for a striped/mirrored metadata vdev. The GUI does not give me the option for a striped/mirrored vdev. Do I need to do this in the CLI?

If it matters, it’s a raidz2 pool with 110TB of usable storage.

You add one mirror vdev at a time via the GUI: add two drives, then repeat. zpool status will show several mirrors.
CLI command just in case: zpool add <pool> special mirror disk1 disk2 mirror disk3 disk4 mirror disk5 disk6, where each disk is the /dev path

I think I fumbled. I created a 3-drive striped vdev, then I created another 3-drive striped vdev. I was hoping I would get an option to mirror them afterwards, but I didn’t. Is there any way to fix this without deleting the pool?

I’m not familiar with that GUI, but it seems most likely it’s messed up, as the drives aren’t listed under a “MIRROR” label. Because the pool isn’t made entirely of mirrors (it has a RAIDZ vdev), there is zero chance that you can remove the single-disk vdevs without having to completely redo the pool.
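For completeness, the removal command that would be refused here (pool and device names are hypothetical):

  # zpool remove only works on a normal top-level vdev when the pool contains no raidz vdevs
  zpool remove tank nvme0n1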

What does zpool status output in the CLI?

As an example, my output with properly mirrored disks looks like this:

root@pve:~# zpool status
  pool: kpool
 state: ONLINE
  scan: scrub repaired 0B in 20:25:34 with 0 errors on Sun Jan  8 20:49:35 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        kpool                                     ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EG16LHZ    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EG1SSMZ    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EG8MK2Z    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGDT2NN    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGGZWKN    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGJDJEZ    ONLINE       0     0     0
          raidz2-1                                ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGKSBKZ    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGLDJ3Z    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1EGLWUHZ    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_1SHL0DBZ    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_2YJ8PV5D    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_2YK1BK9D    ONLINE       0     0     0
          raidz2-2                                ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_2YK20S1D    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_JEH2XVPM    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_JEHN2MXN    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_JEHNVVRM    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_JEJ4WBVN    ONLINE       0     0     0
            ata-WDC_WD100EMAZ-00WJTA0_JEKGYS0Z    ONLINE       0     0     0
        special
          mirror-3                                ONLINE       0     0     0
            ata-MK000960GWTTK_PHYG1141030Z960CGN  ONLINE       0     0     0
            ata-MK000960GWTTK_PHYG114103CS960CGN  ONLINE       0     0     0
            ata-MK000960GWTTK_PHYG114201LE960CGN  ONLINE       0     0     0
            ata-MK000960GWTTK_PHYG114201RM960CGN  ONLINE       0     0     0

calling a 4-way mirror “properly” is highly subjective :wink:

1 Like

Here is the output: