I suffered an unclean shutdown during a power outage a few days ago, and my ZFS pool has effectively been degraded ever since. To summarize the erratic behavior: the system sometimes boots, but takes roughly 10x as long as usual. When it does boot, interactive processes block for ~2 minutes every ~5 minutes until the system eventually locks up, forcing another unclean shutdown.
As a corrective measure, I scrubbed the pool today, to no avail.
The following output was retrieved after booting from a live CD:
root@ubuntu:~# zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 06:14:37 with 0 errors on Mon Apr 4 23:04:45 2022
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST2000DM008-2FR102_ZFL1XK66  ONLINE       0     0     0
            ata-ST2000DM008-2FR102_ZFL1WR7D  ONLINE       0     0     0
        logs
          sda3                               ONLINE       0     0     0
        cache
          sda4                               ONLINE       0     0     0

errors: No known data errors
Please let me know if and how I can provide any additional information, particularly diagnostics.
Sorry to hear this. I wonder if you’d get better answers if you could SSH in to gather information and then head over to the TrueNAS forum? Ask politely and they can be incredibly helpful.
In your situation I would suspect the non-drive hardware and look for a spare machine that I could plug the drives into. Not so easy, of course, but failing that I would fall back on my snapshot backup and restore all files after validating the health of the drives via SMART.
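For the SMART check mentioned above, here is a minimal sketch. It assumes smartmontools is installed and that you run it as root; `smart_report` is just a hypothetical helper name. The two disk IDs are copied from the `zpool status` output, and `/dev/sda` is assumed to be the drive holding the log and cache partitions.

```shell
# Sketch only: assumes smartmontools is installed and the script runs as root.
# smart_report DEVICE: print overall health (-H) and vendor attributes (-A).
smart_report() {
  if [ -e "$1" ] && command -v smartctl >/dev/null 2>&1; then
    # smartctl uses a non-zero exit status to flag problems, so report it
    # instead of letting it abort a script running under "set -e".
    smartctl -H -A "$1" || echo "smartctl exited non-zero for $1"
  else
    echo "skipping $1 (device or smartctl not present)"
  fi
}

# Disk IDs copied from the zpool status output; /dev/sda is assumed to be
# the drive that holds the log (sda3) and cache (sda4) partitions.
for dev in \
  /dev/disk/by-id/ata-ST2000DM008-2FR102_ZFL1XK66 \
  /dev/disk/by-id/ata-ST2000DM008-2FR102_ZFL1WR7D \
  /dev/sda
do
  smart_report "$dev"
done
```

In the attribute table, rising values for Reallocated_Sector_Ct, Current_Pending_Sector or UDMA_CRC_Error_Count are the usual things to look at first; the last one tends to point at cabling rather than the disk itself.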
If you haven’t run zpool clear (which would have reset the error counters), the zero READ/WRITE/CKSUM counts suggest this isn’t a pool problem per se.
Might be a hardware issue, like a slightly loose cable.
Or a boot order issue where it is trying to load the pool before having all providers available.
A photo of the error message would help a lot. Presumably it doesn’t get far enough into the OS to write any logs?
I haven’t looked at zdb, it might have info, but I’m not sure what to look for…
I had the same thought and wiggled all the cables during early troubleshooting. Along those lines, however, I booted this morning and happened to find that my cache device was unavailable. This happens occasionally, since I added the cache and log devices using a /dev/sdXn path (and never fixed it).
Without the cache device, the pool is performing nominally. If we are to assume the issue is hardware-rooted, that drive is by far the oldest in use.
Should I follow through and remove the cache device from the pool? Is this all adding up?
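Since the /dev/sdXn naming came up: one way to avoid it is to look up the stable /dev/disk/by-id name for the partition and use that when re-adding the device. A rough sketch follows; `by_id_for` is a hypothetical helper, and the pool and device names in the comments are taken from the `zpool status` output above. The `zpool` lines are left commented out on purpose, so nothing touches the pool unless you run them yourself.

```shell
# by_id_for DEVICE [DIR]: print every symlink under DIR (default
# /dev/disk/by-id) that resolves to the same node as DEVICE.
# The body runs in a subshell so its variables do not leak.
by_id_for() (
  target="$(readlink -f "$1")"
  dir="${2:-/dev/disk/by-id}"
  for link in "$dir"/*; do
    [ -L "$link" ] || continue
    if [ "$(readlink -f "$link")" = "$target" ]; then
      echo "$link"
    fi
  done
)

# Possible usage (as root), assuming the cache partition is currently /dev/sda4:
#   by_id_for /dev/sda4                 # list the stable names for the partition
#   zpool remove rpool sda4             # drop the cache device from the pool
#   zpool add rpool cache "$(by_id_for /dev/sda4 | head -n 1)"
```

Removing a cache device does not put pool data at risk, since L2ARC only holds copies of data that already lives on the mirror. The log device (sda3) could be handled the same way with `zpool remove` and `zpool add rpool log`.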
I think the main thing is working out what the error is.
Yeah, removing the cache/log might help, if that provider is the one causing the block/delay.
If it were either of the data drives, it should still boot normally, but the pool would be in DEGRADED state until the missing drive is available again, at which point a resilver should bring the mirror back in sync.
But glad you gave the cables a poke; it’s always good to check the basics when troubleshooting.
Any error message would help. I configured GRUB/systemd to drop “quiet”, so I have the text output if/when it stalls or freezes.
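In case it helps, this is roughly how “quiet” can be dropped on a GRUB-based Ubuntu install. It is a sketch under the assumption that the kernel options live in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub and that GNU sed is available; `strip_quiet` is a hypothetical helper that only prints the edited file, so you can review the result before writing anything.

```shell
# strip_quiet FILE: print FILE with the word "quiet" removed from the
# GRUB_CMDLINE_LINUX_DEFAULT line. Nothing is modified in place.
# Assumes GNU sed (for \b word boundaries).
strip_quiet() {
  sed -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ {
    s/\bquiet\b//
    s/" +/"/
    s/ +"/"/
    s/  +/ /g
  }' "$1"
}

# Possible usage (as root), keeping a backup of the original file:
#   cp /etc/default/grub /etc/default/grub.bak
#   strip_quiet /etc/default/grub.bak > /etc/default/grub
#   update-grub
```

On a stock Ubuntu config the line usually also contains “splash”, which may need to go as well before the boot messages become visible.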