ZFS Metadata Special Device: Z

Thank you. So to clarify: as long as I use two devices as a mirror and later decide that I don’t want the metadata special device anymore, will I be able to remove the entire metadata config from my pool?

Oh, that however is something else. From what has been said here, you can replace (remove) sides of a mirror, and add mirrors, and mirrors of mirrors, but unless I missed it in the discussion, you can’t remove the last VDEV of a special device mirror.

You can send your pool to another pool without special VDEVs, and then the metadata gets put right back onto the normal VDEVs along with the data and all the other types of metadata. You can also rsync the data, or use another method, to move it somewhere that doesn’t have special VDEVs.

Correct. You can remove it if your pool doesn’t have RAIDZ vdevs. There is always metadata; the special vdev just lets you store that metadata on fast NVMe.

You did. See: zpool-remove.8 — OpenZFS documentation

The entire pool must be mirrors. It doesn’t work with RAIDZ. The term “top-level vdev” refers to vdevs where actual data is stored, like the HDD vdevs, and the special vdev where metadata is stored.

edit: I removed the special vdev from my pool half a year ago and was using both SSDs as L2ARC, but lately decided to revert back because my workload changed. Mirrors are great and flexible.

1 Like

Indeed I did miss it.
Mirrors are GUD.

Removes the specified device from the pool. This command supports removing hot spare, cache, log, and both mirrored and non-redundant primary top-level vdevs, including dedup and special vdevs.
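
So, on a pool whose regular top-level vdevs are all mirrors (no RAIDZ), removing a mirrored special vdev is a single command along these lines (pool and vdev names here are placeholders; use whatever zpool status shows for your special mirror):

zpool remove tank mirror-3

The evacuation then copies the metadata that lived on the special mirror back onto the remaining regular vdevs.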

Thank you. That explains why I can’t remove the metadata special vdev from my pool. I am using RAIDZ2 on my data tier, so even though the special setup is mirrored, I can’t remove it and would need to rebuild my entire pool.

Looking at your suggestions, I will probably stay away from a metadata special device unless my workload changes.

Just got my 118GB P1600Xs today.

I have “demoted” my Samsung 9A1 500GB drives (previously Tier 1 VM storage) to be the special devices on my main pool. They are arranged as two striped 4-way mirrors (8 drives total, usable capacity of two).
The P1600Xs are now L2ARC. Maybe some day they will give me a flag that lets me not cache metadata in L2ARC.
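
For anyone following along, adding those special mirrors and the L2ARC devices looks roughly like this (device names below are placeholders, not my actual disks):

zpool add tank special mirror sda sdb sdc sdd
zpool add tank special mirror sde sdf sdg sdh
zpool add tank cache nvme0n1 nvme1n1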

I am finally encroaching into big boy SAN land.


pool                                         137T  65.3T  72.0T        -         -     2%    47%  1.20x    ONLINE  /mnt
  raidz1-0                                  45.5T  20.9T  24.6T        -         -     5%  45.9%      -    ONLINE
    24f5b474-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    24bd6bc9-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    2511846b-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    24d7465b-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    24f69fb7-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
  raidz1-1                                  45.5T  22.6T  22.9T        -         -     3%  49.7%      -    ONLINE
    24aadaec-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    spare-1                                     -      -      -        -         -      -      -      -    ONLINE
      82e1befb-c6f5-47cc-9b7a-a81351d10437  9.09T      -      -        -         -      -      -      -    ONLINE
      6919a517-7b86-49c8-a2f2-0b65d03c57d3  9.09T      -      -        -         -      -      -      -    ONLINE
    255a65de-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    24f77b18-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
    24bf6bcb-f68c-11ec-9fa9-246e9642ff38    9.09T      -      -        -         -      -      -      -    ONLINE
  raidz1-2                                  45.5T  21.5T  24.0T        -         -     0%  47.3%      -    ONLINE
    b90af04b-6aba-4c1a-bd53-0fde02098d2b    9.09T      -      -        -         -      -      -      -    ONLINE
    451f468f-73b1-4eb8-aec5-780594ce367a    9.09T      -      -        -         -      -      -      -    ONLINE
    e9dd89c9-24b3-49f8-b319-32a6b26fcf69    9.09T      -      -        -         -      -      -      -    ONLINE
    29083760-6de5-48a5-bdf9-3281c02c79dd    9.09T      -      -        -         -      -      -      -    ONLINE
    fd472396-91b8-48c1-8240-efe931defb4a    9.09T      -      -        -         -      -      -      -    ONLINE
special                                         -      -      -        -         -      -      -      -  -
  mirror-3                                   476G   191G   285G        -         -    10%  40.2%      -    ONLINE
    529b8e0e-89d0-44b9-ab02-ac1327e738b6     477G      -      -        -         -      -      -      -    ONLINE
    5de80780-19d2-4a41-a90f-97a4ec0e88fe     477G      -      -        -         -      -      -      -    ONLINE
    4a877232-bc85-40b3-aedf-99ee2c481d0c     477G      -      -        -         -      -      -      -    ONLINE
    04ffc4c8-7a7f-46e1-a1b1-b49daebda47f     477G      -      -        -         -      -      -      -    ONLINE
  mirror-4                                   476G   192G   284G        -         -    11%  40.3%      -    ONLINE
    06af2010-9f76-41c7-ab3f-e08fd09629f1     477G      -      -        -         -      -      -      -    ONLINE
    f84cecd7-6e7d-4ba8-9c51-379b664ddf5f     477G      -      -        -         -      -      -      -    ONLINE
    8e61280b-d2d9-4442-9651-c72f8d532f29     477G      -      -        -         -      -      -      -    ONLINE
    d260dbe8-a426-4b72-b7c4-d2f70a27e04c     477G      -      -        -         -      -      -      -    ONLINE
cache                                           -      -      -        -         -      -      -      -  -
  1960bed3-d6c2-48b8-8c83-f4bb63bd4082       110G  2.93G   107G        -         -     0%  2.65%      -    ONLINE
  6ef0339d-c3c2-4387-9d4f-f08414775810       110G  2.86G   107G        -         -     0%  2.59%      -    ONLINE
spare                                           -      -      -        -         -      -      -      -  -
  6919a517-7b86-49c8-a2f2-0b65d03c57d3      9.09T      -      -        -         -      -      -      -     INUSE
  3388494b-9f79-43cb-8da0-b73d197b9032      9.09T      -      -        -         -      -      -      -     AVAIL
  11014265-e5ed-490c-a68c-6f914764fe0a      9.09T      -      -        -         -      -      -      -     AVAIL
  50309bdd-3ece-45fc-a305-33bfd43c9a9e      9.09T      -      -        -         -      -      -      -     AVAIL

The 9A1s were replaced by a pair of mirrored 905P 960GB drives.


root@prod[~]# zpool list -v optane_vm
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
optane_vm                                  888G   622G   266G        -         -    42%    70%  1.08x    ONLINE  /mnt
  mirror-0                                 888G   622G   266G        -         -    42%  70.0%      -    ONLINE
    0a3ec38c-61bc-4c92-989e-bb886b22c241   892G      -      -        -         -      -      -      -    ONLINE
    e33c70aa-009f-4032-8ae6-0465156ad9ca   892G      -      -        -         -      -      -      -    ONLINE
root@prod[~]#

So now I have a really fast Tier 1 storage pool also.
If electricity rates weren’t so high I’d be building a middle-ground storage tier with a bunch of flash disks… I even have a NetApp disk shelf waiting in the wings… My rate was $0.09, is doubling to $0.16, and would have nearly tripled to $0.27 if we hadn’t changed suppliers…

TL;DR, my wife hates my hobby.

In any case, Thanks, Intel, for killing off one of your best products so I could get a deal. lol

Took about a month to get the money together, do the upgrades, and have fun. Well worth it. Thanks @wendell for the tip in your video recently.

Add this to your kernel command line/module options:

zfs.l2arc_exclude_special=1

Added in 2.1.6 IIRC. Have fun!
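
If you’d rather persist it as a module option than on the kernel command line, something like this should also work (a sketch; /etc/modprobe.d/zfs.conf is a common location, your distro may differ):

options zfs l2arc_exclude_special=1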

4 Likes

Great, I didn’t know they added that :)

Added it as a POSTINIT script:

echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special
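
And a quick sanity check after boot (it should print 1 if the setting took):

cat /sys/module/zfs/parameters/l2arc_exclude_special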

Intel still seems to have stock left. Today only, Newegg has an additional $60 off the $399 Optane 905P (960GB).

Intel Optane 905P Series 960GB, 2.5" x 15mm, U.2, PCIe 3.0 x4, 3D XPoint Solid State Drive (SSD) SSDPE21D960GAM3 - Newegg.com-20-167-463--12222022

This is your chance if you haven’t got your fill, yet.

(No, I’m not affiliated with Newegg or Intel, but I’m excited that this technology is finally being priced accessibly for prosumers.)

Not that Newegg sells here in Europe, but that is a nice price for Optane.

However, $700 for 2x1TB Optanes for special VDEVs is… still quite a hefty price.

An alternative that I took is the WD Red SN700 NVMe M.2. The 1TB models have a write endurance of 2000 TBW, and at around €100 apiece, that is a no-brainer.

Not that we have a choice here, since that nice Newegg price isn’t available to us anyway.

2 Likes

I don’t share the hype for Optane as a special vdev either. A standard consumer SATA or NVMe SSD is totally fine. And you certainly won’t wear out a drive with just metadata and 8k files; the same goes for L2ARC, which is even more read-heavy. For SLOG, Optane has a (3DX-)Point.

I’d rather have a 1-2TB special vdev on SATA SSDs than one with 110G of Optane.

My zpool list -v shows 360G out of 1T used (incl. 64k small blocks pool-wide), and traffic is rather low all things considered; the PCIe 3.0 NVMe certainly doesn’t break a sweat. Its endurance will last until my retirement.
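
For reference, which blocks count as “small” enough to land on the special vdev is a per-dataset property; a minimal sketch with placeholder names (setting it on the pool’s root dataset lets the children inherit it, which is what I mean by pool-wide):

zfs set special_small_blocks=64K tank
zfs get -r special_small_blocks tank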

4 Likes

I finally got around to doing this…

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  32.1K  16.0M  16.0M  30.5K  15.3M  15.3M      0      0      0
     1K:  31.9K  38.3M  54.4M  29.0K  35.0M  50.3M      0      0      0
     2K:  42.7K   113M   167M  36.9K  97.3M   148M      0      0      0
     4K:  1.24M  4.97G  5.13G  31.6K   166M   314M      0      0      0
     8K:   541K  5.87G  11.0G  22.2K   247M   561M   242K  2.84G  2.84G
    16K:   182K  3.85G  14.9G   650K  10.3G  10.8G  1.24M  29.8G  32.6G
    32K:   381K  17.4G  32.3G   775K  24.5G  35.3G   739K  34.7G  67.3G
    64K:  1.85M   177G   209G  20.4K  1.80G  37.1G  1.08M   111G   179G
   128K:   111M  13.8T  14.0T   113M  14.2T  14.2T   112M  18.3T  18.5T
   256K:      0      0  14.0T      0      0  14.2T  1.17K   439M  18.5T
   512K:      0      0  14.0T      0      0  14.2T      0      0  18.5T
     1M:      0      0  14.0T      0      0  14.2T      0      0  18.5T
     2M:      0      0  14.0T      0      0  14.2T      0      0  18.5T
     4M:      0      0  14.0T      0      0  14.2T      0      0  18.5T
     8M:      0      0  14.0T      0      0  14.2T      0      0  18.5T
    16M:      0      0  14.0T      0      0  14.2T      0      0  18.5T
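
For reference, and as comes up further down the thread, a histogram like this comes from zdb’s block statistics, something along these lines with poolname as a placeholder (older OpenZFS releases don’t print this section):

zdb -Lbbbs poolname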

I am still trying to figure out whether a special metadata vdev would make much sense for my use case… My biggest gripe with my setup is the slow write speeds, even over gigabit networking… I can’t seem to get more than ~350 Mbps of writes to my 10x4 TB Z2 array.

A couple of days ago I finally got the hardware to expand my pool with a mirrored special device. I went for low-latency consumer-grade SSDs. Now, a couple of days in, I’m starting to massively regret that decision, since my pool is quickly accumulating data errors.

$ sudo zpool status main -x

Shows

  pool: main
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 85K in 00:08:48 with 13 errors on Sat Dec 31 15:30:47 2022
config:

	NAME        STATE     READ WRITE CKSUM
	main        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	special	
	  mirror-1  ONLINE       0     0     0
	    sdi     ONLINE       0     0 2.23M
	    sdj     ONLINE       0     0 2.23M
	cache
	  nvme1n1   ONLINE       0     0     0
	spares
	  sdh       AVAIL   
	  sde       AVAIL   

errors: 9 data errors, use '-v' for a list

I don’t know how to confirm whether the special device is the cause of these errors, but given that I had no errors before and they started not long after adding the special vdev, I’m betting it’s the cause.

Since my pool is not entirely backed up (please don’t judge, I’m not in a position to get there), I would rather not rebuild the whole pool just to get rid of the special devices.

I know there’s no real undo button here. I’ve looked around seeing if there are ways to migrate the metadata back to the data vdevs, but I am either too unfamiliar with ZFS, or can’t find the right resources to guide me here.

I’m also unclear if and how this might be related, but since the errors started showing up, the zed process has been continuously consuming huge amounts of resources. This persists even after a reboot of the system, which should rule out a software update mismatching versions with the loaded drivers/modules/what have you.

I’m really hoping someone would be able to help me out and guide me on any way of removing the special vdev without loss of data.

Thanks for reading this and for your time and consideration.

3 Likes

Bummer

Probably won’t matter, but for reference, what models of SSDs did you get and how are they connected? It’s known that a particular Samsung EVO drive will sometimes generate a few errors every so often, but nothing like this, and certainly not on both drives at the same time.

What’s strange is that it’s both drives, which points more to something like a SATA controller being bad or overheating. Bad cables (the most common error-generating problem) cause errors only for the drive they’re connected to.

However it’s also very possible you hit some kind of software bug and ZFS is getting confused.

I’d search the GitHub issues, as well as take this to the ZFS mailing list and the subreddit, since you’re more likely to find someone who can help figure out what’s going on with the software internals.

Beyond that, how I’d personally approach this is to clear it all, make a completely fresh pool from scratch, and then transfer the data back from the backup pool; if that doesn’t work, possibly even use rsync (with byte-for-byte verification) instead of sending/receiving a snapshot, to make sure everything is completely rewritten.
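
A sketch of that last-resort rsync pass (paths are placeholders; --checksum forces rsync to compare file contents rather than just size and mtime):

rsync -aHAX --checksum /mnt/backuppool/ /mnt/newpool/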

1 Like

I am looking at adding a metadata special device to my pool while I am rebuilding my NAS. Does zdb -Lbbbs *pool name* only provide a block size histogram in the output on certain platforms? Because my NAS is currently on Ubuntu 20.04 and the output of that command does not contain a block size histogram.

EDIT: Never mind, it looks like this feature was added in a later version of ZFS according to the merge request and was not backported to the 0.8 version that 20.04 is still on. I will need to wait until I have my new hardware installed with a newer OS to run these commands.

I found this, which might explain why you strangely have no read errors: Topicbox

Christian Quest
Mar 25 (9 months ago)
I got something very similar with one of my pools a few months ago: millions of checksum errors, no read errors.

It was caused by a cache SSD that was silently dying. ZFS was not reporting errors from the cache SSD, but from the vdev HDDs. I removed the cache SSD from the pool, did a scrub and all checksum problems disappeared.

Your pool doesn’t use cache, but has a special vdev. I don’t know if we are in a more or less similar situation, and I have no clue how to get away from your problem, because data on the special vdev is not duplicated anywhere else, if I’m not wrong.

It’s possible your nvme1n1 is dying.

Additionally, check to make sure that cat /sys/module/zfs/parameters/zfs_no_scrub_io returns 0, as this is also something I found.

1 Like

The SSDs used for the special vdev are two of model WDC WDS120G2G0A-.

They are both connected to the last two SATA ports on the motherboard, an X570 Phantom Gaming 4. The only “off” thing about how I connected the drives is their power: they are fed from a 4-pin Molex connector through an adapter that splits into two SATA power connectors.

I see you added an additional reply. Thanks a lot for helping out so much!

I can look into detaching the cache device from the pool to see if that helps at all. It’s not my expectation that the drive is dying, because it has barely any write cycles on it, but who knows.

The result of catting the zfs_no_scrub_io is indeed 0, as expected.

Before detaching the cache drive to see if that solves the problem, I’ll first have to try to better reproduce the error rates, as they seem to have stabilized since I wrote my initial post. I’m not sure whether that is due to low activity on the pool or to not having run the proper scrubs since, but I am currently looking for a way to properly test whether detaching the cache device solves my problem.

Given the state of things, I will give it some time for now. Thanks a lot for all the help, you really gave me the courage and motivation to continue. I’ll post back with any updates, good or bad.

The errors still seem to be increasing. For a bit more insight, I’m sharing the files that are marked as errored:

errors: Permanent errors have been detected in the following files:

        /media/anime/Pokémon/Season 3/Pokémon - S03E06 - Flower Power SDTV.mkv
        /var/lib/mysql/nextcloud/oc_activity_mq.ibd
        /var/lib/mysql/nextcloud/oc_twofactor_providers.ibd
        /var/lib/mysql/undo_001
        main/applications/docker-compose:<0x0>
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui31.log
        /srv/docker-compose/bazarr/config/log/bazarr.log.2022-12-29
        /srv/docker-compose/plex/config/Library/Application Support/Plex Media Server/Logs/Plex Crash Uploader.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui02.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui30.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui29.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzfilelists/bigfilelist.dat

These mostly seem to be highly active files.

  • Log files
  • Database files
  • A recently transcoded file

One file has remained in the list of errors, without a proper name, after I manually deleted it when it showed up in the list earlier and deemed it unimportant.

I will try removing the cache drive for a couple of days and see if that changes anything.
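
For what it’s worth, removing and later re-adding a cache device is non-destructive, since L2ARC only holds copies of cached data. Using the device name from the earlier status output, something like:

zpool remove main nvme1n1

and later, to put it back:

zpool add main cache nvme1n1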

$ sudo zpool status main
  pool: main
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 45K in 00:08:33 with 19 errors on Mon Jan  2 23:10:42 2023
config:

	NAME        STATE     READ WRITE CKSUM
	main        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	special	
	  mirror-1  ONLINE       0     0     0
	    sdi     ONLINE       0     0   355
	    sdj     ONLINE       0     0   355
	spares
	  sdh       AVAIL   
	  sde       AVAIL   

errors: 12 data errors, use '-v' for a list

Do the drives have the latest firmware? A 4-pin Molex cable can handle two SSDs just fine; the only potential issue is that the molded ones are known to sometimes short out or catch fire.

I’m still leaning toward a software bug being the problem here, given that both drives have identical numbers of checksum errors. If they had differing numbers of errors, that would point more to a pure hardware issue.

These mostly seem to be highly active files.

That’s good to know.

I basically can’t realistically give any further help, but I’m definitely interested in what you find. I’d strongly suggest making a GitHub issue and/or a mailing list post, as (1) you’re more likely to talk to someone with more in-depth knowledge, and (2) it’ll put your issue on the radar if it is a software bug, so it can get fixed.