ZFS Metadata Special Device: Z

Wondering if this is PCIe errors on a board with AER squelched. …

Another thing that crossed my mind: have you tested for memory errors?

I was able to dig up problems much more similar to yours, though they only involve mirrors, not special vdev mirrors.

https://www.reddit.com/r/zfs/comments/vlfy2o/high_cpuio_leads_to_identical_checksum_errors/
https://www.reddit.com/r/zfs/comments/rqsion/identical_zfs_checksum_errors_on_mirrored_nvme/
No resolution to these

And this: https://www.reddit.com/r/zfs/comments/od3t32/help_wanted_zfs_mirror_always_has_the_same_number/
In which case replacing both sata cables and a disk seemed to work.

And this: Checksum errors, need help diagnosing. | TrueNAS Community

  • Updated my BIOS
  • Replaced my power supply (the original one was an off-brand unit that required SATA adapters, which felt like my most egregious hardware part)

Perhaps unplug and reseat the power connections, like the adapter you used,

Hi wendell, thanks for chiming in.

Unless I’m looking in the wrong place, I don’t think that is happening currently. I do have an add-in card for the additional SATA slots, but I’m not seeing AER errors.

$ sudo dmesg | grep -i AER
[    0.728018] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
[    1.321877] pcieport 0000:00:01.2: AER: enabled with IRQ 28
[    1.322065] pcieport 0000:00:01.3: AER: enabled with IRQ 29
[    1.322221] pcieport 0000:00:03.1: AER: enabled with IRQ 30
[    1.322439] pcieport 0000:00:07.1: AER: enabled with IRQ 32
[    1.322598] pcieport 0000:00:08.1: AER: enabled with IRQ 33
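
The dmesg lines above only show that AER reporting is enabled on those ports. As a further check, here is a minimal sketch that lists any PCI device with a non-zero AER error total. It assumes a kernel recent enough to expose the per-device aer_dev_correctable, aer_dev_nonfatal and aer_dev_fatal files in sysfs, and that each file carries its total on a line beginning with TOTAL_. The function name aer_nonzero is made up for illustration.

```shell
# Print "<device> <counter file> <total>" for every PCI device whose AER
# counters report a non-zero total. The directory is an argument (defaulting
# to the real sysfs path) only so the function can be tried on a copy.
aer_nonzero() {
    root="${1:-/sys/bus/pci/devices}"
    for f in "$root"/*/aer_dev_correctable \
             "$root"/*/aer_dev_nonfatal \
             "$root"/*/aer_dev_fatal; do
        [ -r "$f" ] || continue          # skip devices without AER counters
        dev=$(basename "$(dirname "$f")")
        kind=$(basename "$f")
        awk -v dev="$dev" -v kind="$kind" \
            '/^TOTAL_/ && $2 > 0 { print dev, kind, $2 }' "$f"
    done
}
```

Running aer_nonzero with no arguments and getting no output would suggest that no AER errors have been counted since boot, which is stronger evidence than a quiet dmesg.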

I’m especially intrigued by the high CPU/IO loads. I am running into some aberrant behavior: my NVIDIA GPU gets evicted from the Docker container runtime, resulting in high CPU load on transcode operations, and Backblaze consumes lots of CPU resources even when it has not had anything to back up in a while. Both are suspect.

I ran memtest86 and found no memory errors. I reseated cables to the drives and memory for good measure.

After having the cache detached, the CKSUM errors have not gone away and they are still mirrored across devices.

At this point I am choosing to bow out, and I thank you all for thinking along. My SO overruled me: we will simply rebuild the pool, for both our sanity. If I had had a bit longer with this, I would have opened an issue, but as I won’t be reintroducing special devices in the new pool, it seems wrong to open an issue for a pool that can’t reproduce the problem.

Thank you all for your input!

I conclude that we successfully rebuilt the pool together.

zpool status
  pool: main
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	main        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	logs	
	  nvme1n1   ONLINE       0     0     0
	cache
	  sdi       ONLINE       0     0     0
	  sdj       ONLINE       0     0     0
	spares
	  sde       AVAIL   
	  sdh       AVAIL

errors: No known data errors

I swapped around some duties here, making the previous cache device a log device and the former special devices cache devices. It might still end up showing hardware issues, but at least now I can safely detach devices from the pool.

Thanks again for all the input and support!

if it were me…

id export the pool

and import it with

zpool import -d /dev/disk/by-id/ poolname

Thanks a bunch for the suggestion!

$ zpool status
  pool: main
 state: ONLINE
  scan: scrub repaired 0B in 00:07:10 with 0 errors on Tue Jan  3 14:46:47 2023
config:

	NAME                                                                                       STATE     READ WRITE CKSUM
	main                                                                                       ONLINE       0     0     0
	  raidz2-0                                                                                 ONLINE       0     0     0
	    wwn-0x50014ee2bfb9ebed                                                                 ONLINE       0     0     0
	    wwn-0x50014ee26a640583                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d1a6f1                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d6963d                                                                 ONLINE       0     0     0
	    wwn-0x50014ee2bfba002a                                                                 ONLINE       0     0     0
	    wwn-0x5000c500c3d165b1                                                                 ONLINE       0     0     0
	logs	
	  nvme-nvme.1987-3139323938323239303030313238353535304334-466f726365204d50363030-00000001  ONLINE       0     0     0
	cache
	  wwn-0x5001b444a7b30dc7                                                                   ONLINE       0     0     0
	  wwn-0x5001b444a7b30db5                                                                   ONLINE       0     0     0
	spares
	  wwn-0x50014ee25cc3127d                                                                   AVAIL   
	  wwn-0x50014ee25cc2f341                                                                   AVAIL   

errors: No known data errors

What happens if my metadata device fills up? Is it possible?

Or does it write to the pool instead?

I’m guessing a bunch of node_modules/ directories filled this up. I excluded them in Robocopy, but /PURGE (or /MIR) doesn’t remove them if they’re excluded, so they’re artificially bloating my metadata drives right now.

I have 1TB metadata on my other NAS, but I thought 200+ gigs would be plenty here. It looks like I might need to add some mirrors.

ZFS starts writing to the pool, as it normally would without a special vdev, once the special vdev is 75% full, leaving a 25% reserve for general performance reasons.

You can change the reserve percentage in /etc/modprobe.d/zfs.conf by adding, for example, options zfs zfs_special_class_metadata_reserve_pct=10 (a bare number, no % sign), which should still be perfectly fine. The larger the drives are, the less reserve is needed to maintain sanity in free space availability that allows ZFS to function quickly.
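
To put numbers on that, here is a small sketch of the arithmetic, taking the description above at face value: given a special vdev size and a reserve percentage, it prints how much can be used before the reserve kicks in. The function name and the 200 GB figure are illustrative only, and as noted below, reported ZFS space numbers rarely line up this neatly.

```shell
# Usage: special_reserve_start_gb <special vdev size in GB> <reserve percent>
# Prints the amount of used space (GB) at which the reserve threshold is hit.
special_reserve_start_gb() {
    awk -v size="$1" -v pct="$2" \
        'BEGIN { printf "%.1f\n", size * (100 - pct) / 100 }'
}
```

For a 200 GB special vdev, special_reserve_start_gb 200 25 gives 150.0, while lowering the reserve to 10 with special_reserve_start_gb 200 10 gives 180.0.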

Calculating actual space in ZFS is a little bit complicated even before metadata is involved, so I refuse to comment on what reported numbers actually mean and why napkin math doesn’t give entirely expected results. Because fuck if I know.

The important thing for me is knowing the pool doesn’t fail when metadata drives fill up.

I wish I could dedicate metadata only to specific datasets like I can with dedupe. Is that possible?

Yes it is! I can find the exact way to do that in a bit when I’m not on the phone, but essentially you set something on the datasets you don’t want included to “none” or something like that.

I am not finding an obvious way to exclude a dataset from utilizing a special vdev. I take it back; I may have gotten confused with some other things, like setting primarycache=none|metadata|all on a dataset, or excluding dedupe data by changing the zfs_ddt_data_is_special module parameter (though this is system-wide).

Hello. I have a basic understanding of what is going on here, but I’m not grasping some of it. My situation is that I don’t even have a ZFS pool yet, but I am looking to create one with new drives at some point (my RAID array is full). If I create a histogram of the files on this array, is there any way I can calculate how large the SSDs would need to be to store the metadata (to figure things out before even migrating to ZFS)? I wouldn’t be storing any small-block files on it, just metadata. My mind is going all over the place thinking about this.
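
For reference, one way to build such a histogram on the existing array is sketched below. It assumes GNU find (for -printf) and awk; size_histogram is a made-up name. It buckets file sizes by the next power of two and prints each bucket with its file count. That is mostly useful for judging how many files fall under a given small-block threshold; on its own it does not predict metadata size.

```shell
# Usage: size_histogram <directory>
# Prints "<bucket upper bound in bytes> <file count>", smallest bucket first.
size_histogram() {
    find "$1" -type f -printf '%s\n' | awk '
    {
        b = 1
        while (b < $1) b *= 2     # smallest power of two >= the file size
        count[b]++
    }
    END { for (b in count) printf "%d %d\n", b, count[b] }' | sort -n
}
```

Running size_histogram /path/to/array (the path is a placeholder) gives one line per bucket, so a large count in the low buckets is the signal that small files dominate.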

It might be better to see real data. I don’t know how to get the file stats, but I have a mix of games, long video streams, short videos, pictures, and code projects as well as documents, music compositions, you name it.

All that data is currently 8.22TB. This zpool has 204GB of metadata. Unless you have 8TB of super tiny files, you’re probably fine. The quickest way to test is to copy your files. I’m assuming you have an extra copy of your data to play with.
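
As a rough worked example of that rule of thumb: 204 GB of metadata on 8.22 TB of data is about 2.5%. The sketch below scales that one observed ratio to a different amount of data. It assumes decimal units and a similar mix of files, and the function name is hypothetical, so treat the result as a ballpark rather than a sizing guarantee.

```shell
# Usage: estimate_metadata_gb <data size in GB>
# Scales the ratio seen on this pool (204 GB of metadata per 8220 GB of data).
estimate_metadata_gb() {
    awk -v data_gb="$1" 'BEGIN { printf "%.0f\n", data_gb * 204 / 8220 }'
}
```

For a hypothetical 16 TB array, estimate_metadata_gb 16000 prints 397, so a pair of 500 GB SSDs would be in the right range under these assumptions.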

I have an all-SSD array in this NAS. I’m only using metadata vdevs because I put four Optane drives in there as 2 mirrors. Those drives allow very low-latency access; otherwise, I wouldn’t have bothered. In my offsite HDD array, I used regular NAND flash SSDs.

Is that even true? I mean you should also have checksums with ZFS and if one record from a 2-way mirror has correct checksums and the other has not, then it should be obvious which record is the correct one?!

What I was describing was if a disk in a mirror fails or is removed, leaving only one disk. In which case ZFS can tell if there is an error, but there is no practical way to fix regular data.

One thing I didn’t consider, however, is that normal metadata and files stored within metadata blocks should have an extra copy by default, so that should still be correctable. I don’t actually know if this is still the case when using special vdevs. If there is an extra copy, then there’s a chance that things are still correctable, assuming the data corruption isn’t spread across areas of the disk affecting both copies. Regular data placed on the special vdev via special small blocks would still likely be at risk.