Unusable ZFS pool

HawtNoodles · April 5, 2022, 1:29am

I suffered an unclean shutdown during a power outage a few days ago, resulting in an effectively degraded ZFS pool. To summarize the erratic behavior: the system is sometimes able to boot, but takes ~10x as long as usual. If it does boot, interactive processes block every ~5 minutes for ~2 minutes until an eventual lockup, forcing another unclean shutdown.

As a corrective measure, I scrubbed the pool today, to no avail.

The following bits were retrieved after booting from a live CD:

root@ubuntu:~# zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 06:14:37 with 0 errors on Mon Apr  4 23:04:45 2022
config:

	NAME                                 STATE     READ WRITE CKSUM
	rpool                                ONLINE       0     0     0
	  mirror-0                           ONLINE       0     0     0
	    ata-ST2000DM008-2FR102_ZFL1XK66  ONLINE       0     0     0
	    ata-ST2000DM008-2FR102_ZFL1WR7D  ONLINE       0     0     0
	logs	
	  sda3                               ONLINE       0     0     0
	cache
	  sda4                               ONLINE       0     0     0

errors: No known data errors

Please let me know if and how I can provide any additional information, particularly diagnostic.

ChrisA · April 5, 2022, 12:35pm

Sorry to hear this. You might get better answers on the TrueNAS forum, especially if you can SSH in to gather information first. Ask politely and they can be incredibly helpful.

I think in your situation I would wonder about the non-drive hardware and see if I had a spare machine that I could plug the drives into. Not so easy of course, but failing that I would refer to my snapshot backup and restore all files after validating the health of the drives via SMART.
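
A minimal sketch of the SMART check mentioned above, assuming smartmontools is installed. The run and smart_check helpers are hypothetical names, not something from this thread. With DRY_RUN left at its default of 1 the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: SMART checks per drive.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

smart_check() {
    for dev in "$@"; do
        run smartctl -H "$dev"   # overall health self-assessment
        run smartctl -A "$dev"   # vendor attributes, e.g. reallocated sectors
    done
}
```

For example, smart_check /dev/disk/by-id/ata-ST2000DM008-2FR102_ZFL1XK66 /dev/disk/by-id/ata-ST2000DM008-2FR102_ZFL1WR7D prints four smartctl commands. Rerunning with DRY_RUN=0 as root would execute them.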

Trooper_ish · April 5, 2022, 1:39pm

If you haven’t run zpool clear (so those zero error counters are genuine), it doesn’t look like a pool problem per se.
Might be a hardware issue, like a slightly loose cable.
Or a boot order issue where it is trying to load the pool before having all providers available.

A photo of the error message would help a lot. Presumably it does not get far enough into the OS to write any logs?

I haven’t looked at zdb; it might have info, but I’m not sure what to look for…

HawtNoodles · April 5, 2022, 2:59pm

The pool hasn’t reported any errors.

“like a slightly loose cable”

I had the same thought and wiggled them all during early troubleshooting. Along that line, however, I incidentally booted this morning to find my cache device was unavailable. This happens occasionally, since I’ve bound (and never fixed) the cache and log devices using /dev/sdXn paths, which are not stable across boots.
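
A sketch of how a stable name for a /dev/sdXn partition might be looked up, assuming a Linux system where udev populates /dev/disk/by-id and a readlink that supports -f. The stable_names function is a hypothetical helper. It only reads symlinks and changes nothing on the pool.

```shell
#!/bin/sh
# Hypothetical helper: print every by-id alias that resolves to a device node.
# $1 = device node (e.g. /dev/sda4)
# $2 = directory of symlinks to search (defaults to /dev/disk/by-id)
stable_names() {
    target=$(readlink -f "$1") || return 1
    dir="${2:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        if [ "$(readlink -f "$link")" = "$target" ]; then
            echo "$link"
        fi
    done
}
```

Running stable_names /dev/sda4 lists the aliases for the cache partition. One of those paths could then be used in place of sda4 when re-adding the device, for example with zpool add rpool cache followed by that path.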

Without the cache device, the pool is performing nominally. If we are to assume the issue is hardware-rooted, that drive is by far the oldest in use.

Should I follow through and remove the cache device from the pool? Is this all adding up?

Trooper_ish · April 5, 2022, 3:06pm

I think the main thing is working out what the error is.

Yeah, removing the cache/log might help, if that provider is the one causing the block/delay.
If it were either of the data drives, it should still boot normally, but the pool would stay in degraded status until the missing drive is available again, at which point a resilver should bring it back up to date.
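
A rough sketch of pulling the cache and log devices out of the pool, using the pool and device names from the zpool status output earlier in the thread. The run and detach_aux helpers are hypothetical. DRY_RUN defaults to 1 so that nothing is changed until the printed commands have been reviewed.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: remove auxiliary (cache/log)
# devices from a pool. DRY_RUN=1 (the default) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

detach_aux() {
    pool="$1"
    shift
    for dev in "$@"; do
        run zpool remove "$pool" "$dev"   # zpool remove handles cache and log vdevs
    done
    run zpool status "$pool"              # confirm the resulting layout
}
```

detach_aux rpool sda4 would cover just the cache device; adding sda3 as a further argument would also remove the log device. The mirror itself is not touched either way.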

But glad you gave the cables a poke; it's always good to check the basics when troubleshooting.

Any error message would help. I set up GRUB/systemd to drop “quiet” from the boot options, so I have the text output if/when it stalls or freezes.
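
One way to drop “quiet” might look like the sketch below. It assumes a Debian/Ubuntu-style setup where the kernel options live in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; other distributions may differ. strip_quiet is a hypothetical filter that reads the file on stdin and writes the edited copy to stdout, so the original is never modified in place.

```shell
#!/bin/sh
# Hypothetical filter: drop the "quiet" token from GRUB_CMDLINE_LINUX_DEFAULT.
# Lines other than that setting pass through unchanged.
strip_quiet() {
    sed -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ {
        s/([" ])quiet([" ]|$)/\1\2/
        s/  +/ /g
        s/" /"/
        s/ "/"/
    }'
}
```

For example, strip_quiet < /etc/default/grub > /tmp/grub.new writes an edited copy for review. After copying it back as root, running update-grub (again assuming a Debian/Ubuntu-style system) regenerates the boot configuration.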