ZFS Metadata Special Device: Z

Bummer

Probably won’t matter, but for reference, what models of SSDs did you get and how are they connected? It’s known that a particular Samsung EVO drive will sometimes generate a few errors every so often, but nothing like this, and certainly not on both drives at the same time.

What’s strange is that it’s both drives, which points more to something like a SATA controller being bad or overheating. Bad cables (the most common error-generating problem) will cause errors only for the drive they’re connected to.

However it’s also very possible you hit some kind of software bug and ZFS is getting confused.

I’d search the GitHub issues, as well as take this to the ZFS mailing list and the subreddit since you’re more likely to find someone who can help figure out what’s going on in regards to the software internals.

Beyond that, how I’d personally approach this: clear it all, make a completely fresh pool from scratch, and then transfer the data back from the backup pool. If that doesn’t work, possibly even use rsync (with byte-for-byte verification) instead of sending/receiving a snapshot, to make sure everything is completely rewritten.
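
If it comes to that, a rough sketch of the rsync variant, assuming the backup pool is mounted at /backup/main and the rebuilt pool at /main (both paths are hypothetical):

$ rsync -aHAX --checksum /backup/main/ /main/    # preserve hard links, ACLs and xattrs; compare by checksum rather than mtime/size
$ rsync -aHAXn --checksum /backup/main/ /main/   # dry-run afterwards with the same flags; anything it lists still differs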

1 Like

I am looking at adding a metadata special device to my pool while I am rebuilding my NAS. Does zdb -Lbbbs *pool name* only provide a block size histogram in the output on certain platforms? Because my NAS is currently on Ubuntu 20.04 and the output of that command does not contain a block size histogram.

EDIT: Nevermind, it looks like this feature was added in a later version of ZFS according to the merge request and was not backported to the 0.8 version that 20.04 is still on. I will need to wait until I have my new hardware installed with a newer OS to run these commands.
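
For anyone else checking, the installed OpenZFS version can be confirmed quickly before relying on that histogram output (assuming the standard zfs CLI is on the PATH):

$ zfs version    # Ubuntu 20.04 ships the 0.8 series; the block size histogram landed in a later release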

I found this which might explain why you strangely have no read errors: Topicbox

Christian Quest
Mar 25 (9 months ago)
I got something very similar with one of my pools a few months ago: millions of checksum errors, no read errors.

It was caused by a cache SSD that was silently dying. ZFS was not reporting errors from the cache SSD, but from the vdev HDDs. I removed the cache SSD from the pool, did a scrub and all checksum problems disappeared.

Your pool doesn’t use cache, but has a special vdev. I don’t know if we are in a more or less similar situation, and I have no clue how to get away from your problem, because, if I’m not wrong, data on the special vdev is not duplicated anywhere else.

It’s possible your nvme1n1 is dying.

Additionally, check to make sure that cat /sys/module/zfs/parameters/zfs_no_scrub_io returns 0, as this is also something I found.
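
If you want to test what that post suggests, it would look roughly like this, assuming nvme1n1 really is a cache device in your pool and the pool is named main (both taken from elsewhere in this thread, so adjust as needed):

$ cat /sys/module/zfs/parameters/zfs_no_scrub_io   # should print 0
$ sudo zpool remove main nvme1n1                   # detach the suspect cache device
$ sudo zpool scrub main
$ sudo zpool status -v main                        # see whether the checksum errors keep growing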

1 Like

The SSDs used for the special vdev are two of the same model, WDC WDS120G2G0A-.

They are both connected to the last two SATA ports on the motherboard, an X570 Phantom Gaming 4. The only “off” thing about how I connected the drives is their power: they are fed from a 4-pin Molex connector through an adapter that splits into two SATA power connectors.

I see you added an additional reply. Thanks a lot for helping out so much!

I can look into detaching the cache device from the pool to see if that helps at all. I don’t expect the drive is dying, because it has barely any write cycles on it, but who knows.

The result of catting the zfs_no_scrub_io is indeed 0, as expected.

Before detaching the cache drive to see if that solves the problem, I’ll first have to try to better reproduce the error rates, as they seem to have stabilized after writing my initial post. I’m not sure if this is something that is due to the low activity on the pool, or not having run the proper scrubs since, but I am currently looking for a way to properly test that detaching the cache device solves my problem.

Given the state of things, I will give it some time for now. Thanks a lot for all the help, you really gave me the courage and motivation to continue. I’ll post back with any updates, good or bad.

The errors still seem to be increasing. For a bit more insight, I’m sharing the files that are marked as having errors:

errors: Permanent errors have been detected in the following files:

        /media/anime/Pokémon/Season 3/Pokémon - S03E06 - Flower Power SDTV.mkv
        /var/lib/mysql/nextcloud/oc_activity_mq.ibd
        /var/lib/mysql/nextcloud/oc_twofactor_providers.ibd
        /var/lib/mysql/undo_001
        main/applications/docker-compose:<0x0>
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui31.log
        /srv/docker-compose/bazarr/config/log/bazarr.log.2022-12-29
        /srv/docker-compose/plex/config/Library/Application Support/Plex Media Server/Logs/Plex Crash Uploader.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui02.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui30.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui29.log
        /srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzfilelists/bigfilelist.dat

These mostly seem to be highly active files.

  • Log files
  • Database files
  • A recently transcoded file

One file has remained in the error list without a proper name; I manually deleted it when it showed up in the list earlier because I deemed it unimportant.

I will try removing the cache drive for a couple of days and see if that changes anything.

$ sudo zpool status main
  pool: main
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 45K in 00:08:33 with 19 errors on Mon Jan  2 23:10:42 2023
config:

	NAME        STATE     READ WRITE CKSUM
	main        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	special	
	  mirror-1  ONLINE       0     0     0
	    sdi     ONLINE       0     0   355
	    sdj     ONLINE       0     0   355
	spares
	  sdh       AVAIL   
	  sde       AVAIL   

errors: 12 data errors, use '-v' for a list

Do the drives have the latest firmware? A 4-pin Molex cable can handle two SSDs just fine; the only potential issue is that the molded connectors are known to sometimes short out or catch fire.
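
If you want to check the firmware from the OS, smartctl will report it, assuming smartmontools is installed and sdi/sdj are still the special vdev SSDs from your status output:

$ sudo smartctl -i /dev/sdi    # "Firmware Version" line; compare against WD's latest for that model
$ sudo smartctl -a /dev/sdi    # full SMART attributes and error log
$ sudo smartctl -i /dev/sdj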

I’m still leaning toward a software bug being the problem here, given that both drives have identical numbers of checksum errors. If they had differing amounts of errors, then that’d point more to a pure hardware issue.

These mostly seem to be highly active files.

That’s good to know.

I can’t realistically give any further actual help, but I’m definitely interested in what you find. I would strongly suggest making a GitHub issue and/or a mailing list post, as #1 you’re more likely to talk to someone with more in-depth knowledge, and #2 it’ll put your issue on the radar if it is a software bug, so it can get fixed.

Wondering if these are PCIe errors on a board with AER squelched…

Another thing that crossed my mind: have you tested for memory errors?
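
Besides a full memtest86 run, the kernel log sometimes already shows corrected memory or machine-check events; a quick (non-exhaustive) check:

$ sudo dmesg | grep -iE "edac|mce|machine check"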

I was able to dig up problems much more similar to yours, though they only involve regular mirrors, not special vdev mirrors.

https://www.reddit.com/r/zfs/comments/vlfy2o/high_cpuio_leads_to_identical_checksum_errors/
https://www.reddit.com/r/zfs/comments/rqsion/identical_zfs_checksum_errors_on_mirrored_nvme/
No resolution to these

And this: https://www.reddit.com/r/zfs/comments/od3t32/help_wanted_zfs_mirror_always_has_the_same_number/
In that case, replacing both SATA cables and a disk seemed to work.

And this: Checksum errors, need help diagnosing. | TrueNAS Community

  • Updated my BIOS
  • Replaced my power supply (the original one was an off-brand that required SATA adapters; that felt like my most egregious hardware part)

Perhaps unplug and reseat the power connections, like the adapter you used.

Hi wendell, thanks for chiming in.

Unless I’m looking in the wrong place, I don’t think that is happening currently. I do have an add-in card for the additional SATA ports, but I’m not seeing AER errors.

$ sudo dmesg | grep -i AER
[    0.728018] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
[    1.321877] pcieport 0000:00:01.2: AER: enabled with IRQ 28
[    1.322065] pcieport 0000:00:01.3: AER: enabled with IRQ 29
[    1.322221] pcieport 0000:00:03.1: AER: enabled with IRQ 30
[    1.322439] pcieport 0000:00:07.1: AER: enabled with IRQ 32
[    1.322598] pcieport 0000:00:08.1: AER: enabled with IRQ 33

I’m especially intrigued by the high CPU/IO loads. I am running into some aberrant behavior: my NVIDIA GPU gets evicted from the Docker container runtime, resulting in high CPU loads during transcode operations, and Backblaze consumes lots of CPU resources even though it has not had anything to back up in a while. Both are suspect.

I ran memtest86 and found no memory errors. I reseated cables to the drives and memory for good measure.

After detaching the cache, the CKSUM errors have not gone away, and they are still mirrored across both devices.

At this point I am choosing to bow out, and I thank you all for thinking along; my SO overruled me, and we will simply rebuild the pool, for both our sanity. If I had gotten a bit longer with this, I would have opened an issue, but as I won’t be reintroducing special devices in the new pool, it seems wrong to open an issue for a pool that can’t reproduce the problem.

Thank you all for your input!

1 Like

To wrap this up: we successfully rebuilt the pool together.

zpool status
  pool: main
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	main        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	logs	
	  nvme1n1   ONLINE       0     0     0
	cache
	  sdi       ONLINE       0     0     0
	  sdj       ONLINE       0     0     0
	spares
	  sde       AVAIL   
	  sdh       AVAIL

errors: No known data errors

I swapped around some duties here, making the previous cache device a log device and the former special devices cache devices. They might still end up showing hardware issues, but at least now I can safely detach those devices from the pool.
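
For reference, the reshuffle amounts to something like this, assuming the pool was first recreated without those vdevs (device names as in the status output above):

$ sudo zpool add main log nvme1n1      # former cache NVMe becomes a log device
$ sudo zpool add main cache sdi sdj    # former special-vdev SSDs become L2ARC cache devices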

Thanks again for all the input and support!

2 Likes

if it were me…

I’d export the pool

and import it with

zpool import poolname -d /dev/disk/by-id/ 
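
In full, with the pool name from this thread, that would be something like:

$ sudo zpool export main
$ sudo zpool import -d /dev/disk/by-id/ main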
1 Like

Thanks a bunch for the suggestion!

$ zpool status
  pool: main
 state: ONLINE
  scan: scrub repaired 0B in 00:07:10 with 0 errors on Tue Jan  3 14:46:47 2023
config:

	NAME                                                                                       STATE     READ WRITE CKSUM
	main                                                                                       ONLINE       0     0     0
	  raidz2-0                                                                                 ONLINE       0     0     0
	    wwn-0x50014ee2bfb9ebed                                                                 ONLINE       0     0     0
	    wwn-0x50014ee26a640583                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d1a6f1                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d6963d                                                                 ONLINE       0     0     0
	    wwn-0x50014ee2bfba002a                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d165b1                                                                 ONLINE       0     0     0
	logs	
	  nvme-nvme.1987-3139323938323239303030313238353535304334-466f726365204d50363030-00000001  ONLINE       0     0     0
	cache
	  wwn-0x5001b444a7b30dc7                                                                   ONLINE       0     0     0
	  wwn-0x5001b444a7b30db5                                                                   ONLINE       0     0     0
	spares
	  wwn-0x50014ee25cc3127d                                                                   AVAIL   
	  wwn-0x50014ee25cc2f341                                                                   AVAIL   

errors: No known data errors
1 Like


What happens if my metadata device fills up? Is it possible?

Or does it write to the pool instead?

I’m guessing a bunch of node_modules/ directories filled this up. I excluded them in Robocopy, but /PURGE (or /MIR) doesn’t remove them if they’re excluded, so they’re artificially bloating my metadata drives right now.

I have 1TB metadata on my other NAS, but I thought 200+ gigs would be plenty here. It looks like I might need to add some mirrors.

Once the special vdev is 75% full, ZFS starts writing small blocks to the pool as it would normally do without a special vdev, leaving a 25% reserve for metadata and general performance reasons. If the special vdev does fill up completely, metadata simply spills over to the normal vdevs as well; the pool keeps working.

You can change the reserve percentage in /etc/modprobe.d/zfs.conf by setting the zfs_special_class_metadata_reserve_pct module parameter to 10, for example, which should still be perfectly fine. The larger the drives are, the smaller the reserve percentage needed to keep enough free space for ZFS to keep functioning quickly.
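
A minimal sketch of what that file could look like (the 10 is just an illustration; the value is a plain number, no % sign, and you may need to regenerate your initramfs and reboot, or reload the zfs module, before it takes effect):

# /etc/modprobe.d/zfs.conf
options zfs zfs_special_class_metadata_reserve_pct=10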

Calculating actual space in ZFS is a little bit complicated even before metadata is involved, so I refuse to comment on what reported numbers actually mean and why napkin math doesn’t give entirely expected results. Because fuck if I know.

1 Like

The important thing for me is knowing the pool doesn’t fail when metadata drives fill up.

I wish I could dedicate metadata only to specific datasets like I can with dedupe. Is that possible?

Yes it is! I can find the exact way to do that in a bit when I’m not on the phone, but essentially you set something on the datasets you don’t want included to “none” or something like that.

I am not finding an obvious way to exclude a dataset from utilizing a special vdev. I take it back; I may have gotten confused with some other things, like setting primarycache=none|metadata|all on a dataset, or excluding dedup data by changing the zfs_ddt_data_is_special module property (though that is system-wide).
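
For completeness, these are the two knobs I was actually thinking of, neither of which excludes a dataset’s metadata from the special vdev (dataset name is hypothetical):

$ sudo zfs set primarycache=metadata main/somedataset     # per-dataset ARC caching policy
$ cat /sys/module/zfs/parameters/zfs_ddt_data_is_special  # defaults to 1; 0 keeps dedup tables off the special vdev, system wide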

Hello. I have a basic understanding of what is going on here, but I’m not grasping some of it. My situation is that I don’t even have a ZFS pool yet, but I am looking to create one with new drives at some point (my RAID array is full). If I create a histogram of the files on this array, is there any way I can calculate how large the SSDs would need to be to store the metadata (to figure things out before even migrating to ZFS)? I wouldn’t be storing small-block files on them, just metadata. My mind is going all over the place thinking about this.

It might be better to see real data. I don’t know how to get the file stats, but I have a mix of games, long video streams, short videos, pictures, and code projects as well as documents, music compositions, you name it.

All that data is currently 8.22TB. This zpool has 204GB of metadata. Unless you have 8TB of super tiny files, you’re probably fine. The quickest way to test is to copy your files. I’m assuming you have an extra copy of your data to play with.
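
If you do copy a test set onto a pool that already has a special vdev, you can watch how much of the metadata device it actually uses with something like this (pool name from this thread):

$ zpool list -v main    # shows per-vdev ALLOC/FREE, including the special mirror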

I have an all-SSD array in this NAS. I’m only using metadata vdevs because I put four Optane drives in there as two mirrors. Those drives allow very low-latency access; otherwise, I wouldn’t have bothered. In my offsite HDD array, I used regular NAND flash SSDs.