ZFS: Metadata Special Device - Real World Perf Demo

Try adding your new devices with the same sector size as the existing ones, specifying it with the ashift=<value> option.
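
For example (a sketch only, with a hypothetical pool name tank and 4K-sector devices), the ashift of the existing vdevs can be checked and then forced at add time:

# Check the per-vdev ashift of the existing pool
zdb -C tank | grep ashift

# Add the special mirror with a matching ashift (12 = 4K sectors)
zpool add -o ashift=12 tank special mirror nvme0n1 nvme1n1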

1 Like

I think this is the limiting issue: the pool is 4 RAIDZ2 vdevs, while the special device is a 3-way mirror. My test VM also had all the disks created the same way and got the same error with this pool configuration.

1 Like

Hi, there seems to be confusion about this:

In order to add / remove / modify the special VDEVs

  • Given top-level VDEVs and special VDEVs have same ashift

which configuration is correct?

a) Top-level VDEVs and special VDEVs need to both be Mirror or both be same RAID Z
b) Top-level VDEVs can be RAID Z or Mirror but special VDEVs can only be Mirror
c) Top-level VDEVs and special VDEVs can ONLY BOTH be mirror

The correct answer would be "b)"

Thanks!

As mentioned, it's B. However, in order to remove any vdev, including special, ALL vdevs in the pool must be mirrors. They must also have the same ashift.

Any vdev can be added to the pool at any time.

Mirror vdevs can have disks attached to or detached from them at any time, though the final disk in the vdev cannot be removed (if there are non-mirror or mixed-ashift vdevs in the pool).
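
A quick sketch of what that looks like (pool and disk names are placeholders):

# Attach a new disk to an existing mirror member (the mirror grows by one disk)
zpool attach tank ata-EXISTING_DISK ata-NEW_DISK

# Detach a disk from the mirror again
zpool detach tank ata-NEW_DISK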

In the future, raidz vdevs will allow expansion by adding a disk at any time.

A special vdev can theoretically be removed, but this is not to be done casually. You should have a full, confirmed-good backup before trying it, as there remains a slim possibility of killing the pool if something goes wrong, like here: Pool corruption after export when using special vdev · Issue #10612 · openzfs/zfs · GitHub

Also note, my understanding is that removing a vdev does a weird thing: the evacuated data is basically remapped onto a sort of virtual vdev, adding a layer of indirection for reads of those blocks. Usually not something that really matters in practice, but it is also not optimal and will permanently follow the pool from then on.
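
For illustration, a minimal sketch of a removal and the resulting indirection (pool name hypothetical):

zpool remove tank mirror-2    # evacuate and remove a vdev
zpool status tank             # the removed vdev shows up again as indirect-N
zpool list -v tank            # confirm the remaining vdevs absorbed the data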

Thanks for that GitHub link. The issue is rather concerning.

The last thread entry is Nov 19, 2021:

thatChadM commented Nov 19, 2021

Tried recreating the failure of exporting a pool with a special vdev and losing the special vdev during the export, but so far I haven't been able to. Appears this is a somewhat random event, since my digging so far only found 2 instances of it occurring: @don-brady's event earlier in this ticket and when I ran into it.

Since then there has been no action or updates on the issue.

From reading the entire GitHub thread, it would seem that there are two known cases where exporting and importing a pool with special vdevs (the import step is likely the actual problem) corrupted the pool, because the import did not recognize the special vdevs as special.

The original poster tested and reproduced the problem many times with many different combinations of controllers and disks, with and without NVMe.

Perhaps a good idea would be to test exporting and importing a special-vdev pool for oneself first, before putting data on it.

I will do that: I am preparing the LSI 3008 HBA with the three Toshiba HDDs as RAIDZ1 and the two Kingston NVMes (partitioned, with special and SLOG on the same devices) as a mirror. Before I migrate all my data to the new pool, I will export the empty pool and import it again.

And see what happens.
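
Roughly, that layout and round-trip test would look something like this (device names are placeholders):

# RAIDZ1 data vdev, mirrored special and SLOG carved out of the two NVMe drives
zpool create newpool raidz1 sda sdb sdc \
    special mirror nvme0n1p1 nvme1n1p1 \
    log mirror nvme0n1p2 nvme1n1p2

# Export/import round trip before trusting the pool with data
zpool export newpool
zpool import newpool
zpool status -v newpool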

Agreed on the testing, and a confirmed backup before any major change. ZFS is a weird mix of a very stable core with neat but still-beta features that you don't want to shake out without a fallback.

Jim Salter is also known as mercenary_sysadmin, and you have probably seen his various guides on ZFS. He's also the author of sanoid/syncoid, and manages many VMs on ZFS professionally. He's not infallible, but he generally knows reasonably well what he's doing, so any issue he has is definitely one to pay attention to. That said, no one else can really reproduce the issue. There's no telling what sort of dark and weird code paths he's managing to touch, or how. Bugs are constantly being fixed in ZFS; in fact, here's a neat summary of Ryao's recent work using a static analyzer to clean up a lot of loose ends that few would ever be able to hit.
I'm not sure if Jim has tested it recently or not, and he probably isn't inclined to, as he doesn't see a performance benefit (which I disagree with), so I'm not sure if he can still reproduce the issue himself.

The overwhelming majority of people I see mentioning special vdevs have no issues, myself included.

1 Like

Technically, striped mirrors are also fine for special vdevs, and stripe(mirror, mirror, mirror) is also fine, and so on.
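
In zpool syntax that is just several mirrors listed after the special keyword, e.g. (a sketch with placeholder names):

# Two striped special mirrors: stripe(mirror, mirror)
zpool add tank special mirror ssd0 ssd1 mirror ssd2 ssd3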

With that Optane sale I picked up 2 x 118 GB P1600X drives to use for my metadata + small blocks. I currently have this stored on a 3-way mirror of 256 GB SATA SSDs.

NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
block                                     25.7T  9.02T  16.7T        -         -     0%    35%  1.00x    ONLINE  /mnt
  mirror-0                                12.7T  4.51T  8.21T        -         -     0%  35.5%      -    ONLINE
    9c49ef16-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
    9c68698c-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
  mirror-1                                12.7T  4.50T  8.22T        -         -     0%  35.4%      -    ONLINE
    9c1f9065-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
    9c09833f-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
special                                       -      -      -        -         -      -      -      -  -
  mirror-2                                 238G  10.6G   227G        -         -     5%  4.43%      -    ONLINE
    9ae1c0c6-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
    9af5dd71-716d-11ec-8ca0-e41d2d7f4000      -      -      -        -         -      -      -      -    ONLINE
    ef72e0f5-fdd0-4b42-9ea2-b6459171972b      -      -      -        -         -      -      -      -    ONLINE

Now, I think I know the answer, but let me ask: is there any way to add the 2 Optanes to the special mirror and then remove the 3 SSDs, or, because the drives currently in the mirror are larger than the 118 GB Optanes, can I not add them? And if that's the case, do I need to export the data and rebuild the zpool to get the Optane drives in the mix?

Thanks

1 Like
zpool add <poolname> special mirror <optane1> <optane2>
zpool remove <poolname> mirror-2
4 Likes

So I add another special mirror that's the 2 Optanes, and I'm guessing that will show up as something like mirror-3 under the special section. Will it then have 2 mirror vdevs holding the same metadata? And then can I remove the first 3-way SSD mirror?

1 Like

Sadly Optane in Germany is not really cheap.

I went with WD SN700s for the special vdevs, as they are cheap and have a very high write endurance. They are not as insanely write-proof as Optane, but SLC NAND is the closest you can get.

NAME                                              SIZE  ALLOC   FREE
asphodel                                         15.4T  5.51T  9.86T
  mirror-0                                       14.5T  5.49T  9.06T
    ata-TOSHIBA_MG08ACA16TE_XXXXXXXAFWTG         14.6T      -      -
    ata-TOSHIBA_MG08ACA16TE_XXXXXXXDFVGG         14.6T      -      -
special                                              -      -      -
  mirror-1                                        848G  24.2G   824G
    nvme-WD_Red_SN700_1000GB_22012N8XXXXX-part1   850G      -      -
    nvme-WD_Red_SN700_1000GB_22012N8XXXXX-part1   850G      -      -

special_small_blocks=64K and the default recordsize of 128K. The speedup is very noticeable (for example, zdb -PLbbbs just flies and doesn't even take a minute).
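
For reference, a minimal sketch of that setup (dataset name hypothetical); note that if special_small_blocks is raised to the recordsize or above, all data blocks of that dataset land on the special vdev:

# Send blocks of 64K and smaller to the special vdev for this dataset
zfs set special_small_blocks=64K asphodel/data

# recordsize stays at the 128K default
zfs get recordsize,special_small_blocks asphodel/data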

As per best practice, I have to wait a couple of months before adding the third SN700 to the special mirror, to stagger the device aging.

1 Like

OK, so I've added the new mirror-3, but I don't see any data on it?

NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bigblock                                  25.8T  9.08T  16.7T        -         -     0%    35%  1.00x    ONLINE  /mnt
  mirror-0                                12.7T  4.53T  8.18T        -         -     0%  35.6%      -    ONLINE
    9c49ef16-716d-11ec-8ca0-e41d2d7f4000  12.7T      -      -        -         -      -      -      -    ONLINE
    9c68698c-716d-11ec-8ca0-e41d2d7f4000  12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-1                                12.7T  4.52T  8.20T        -         -     0%  35.5%      -    ONLINE
    9c1f9065-716d-11ec-8ca0-e41d2d7f4000  12.7T      -      -        -         -      -      -      -    ONLINE
    9c09833f-716d-11ec-8ca0-e41d2d7f4000  12.7T      -      -        -         -      -      -      -    ONLINE
special                                       -      -      -        -         -      -      -      -  -
  mirror-2                                 238G  21.7G   216G        -         -     5%  9.12%      -    ONLINE
    9ae1c0c6-716d-11ec-8ca0-e41d2d7f4000   238G      -      -        -         -      -      -      -    ONLINE
    9af5dd71-716d-11ec-8ca0-e41d2d7f4000   238G      -      -        -         -      -      -      -    ONLINE
    ef72e0f5-fdd0-4b42-9ea2-b6459171972b   238G      -      -        -         -      -      -      -    ONLINE
  mirror-3                                 110G   184K   110G        -         -     0%  0.00%      -    ONLINE
    nvme1n1                                110G      -      -        -         -      -      -      -    ONLINE
    nvme2n1                                110G      -      -        -         -      -      -      -    ONLINE

Will that data move when I put in the remove mirror-2 command? I've taken backups so I'm not too scared, but I'm a little scared.

ZFS never moves data by itself, by definition. In the so-called evacuation process (zpool remove), the contents of the evacuated device are distributed across all remaining vdevs. Those 22G should be allocated to mirror-3.

The worst thing that could happen is that ZFS distributes the data across your data vdevs :wink:
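
A sketch of the removal and how to watch it (pool name taken from the listing above):

zpool remove bigblock mirror-2     # start evacuating the old special mirror
zpool status bigblock              # a remove: section reports evacuation progress
zpool iostat -v bigblock 5         # watch the allocations shift toward mirror-3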

1 Like

You should be fine as long as you don't lose power while the data is in flight, and as long as your system is stable.

1 Like

There is no in-flight. The "evacuee" stays active and running until everything is finished and checksummed. I tested this by pulling the plug myself (a "hey, let's test this ZFS" afternoon). It just treats the evacuation/removal as cancelled, and you have to re-enter the remove command (or resume the process; it's a bit like a scrub, with the same syntax and the same listing under zpool status). Only after everything is done is the transaction committed to the pool.
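
For completeness, an in-progress removal can also be cancelled on purpose (sketch, same pool name as above):

zpool remove bigblock mirror-2    # start (or restart) the evacuation
zpool remove -s bigblock          # stop and cancel the in-progress removal
zpool status bigblock             # the remove: section shows the current state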

3 Likes

Built-in safeguards do not prevent unexpected behavior, is all I am saying.
Data corruption happens when power disappears.

Does anyone know what search terms I have to use to find the card at 7:49 in the video? I haven't been able to find any quad-slot cards for $20, and I haven't found one by Gigabyte.

I believe that low profile blue card is the GIGABYTE CMT4030. Good luck finding one for a decent price.

"low profile pcie 4 slot|port m.2 riser x16" is probably what you need to search for to find similar things. Here is a double-sided Linkreal card:
https://www.aliexpress.us/item/3256803691464173.html
https://www.newegg.com/p/17Z-00TX-000B5

Do note it's Gen 3.0.