Thank you to anyone who takes the time to read my post and respond. I have a TrueNAS Scale system that I've been running for nearly 3 years with no issues at all. First, a quick breakdown of my system:
ASRock X570M Pro4 M-ATX motherboard
Ryzen 9 3950X (in ECO Mode)
64 GB of Kingston ECC RAM
6× 8 TB HGST Ultrastar He8 (HUH728080ALE601) drives for storage
2× Intel 240 GB SSDs for boot
2× Intel P1600X 118 GB Optane drives for SLOG
TrueNAS Scale 24.10.1
1 Pool with 3 mirrored VDEVs
About 7 or 8 days ago, one of the hard drives in one of the mirrored VDEVs went down. All the other drives were reporting back as good, so I decided to offline the failed drive and remove that VDEV entirely, since the rest of the pool has enough free space to absorb the data. I then reassigned the now-extra drive as a spare for the pool. Everything seemed to run fine: I ran a SMART test to make sure all the other drives were good, and a scrub to make sure the pool was still healthy, and all checks came back with no errors.

Then last night I tried to access my SMB share and noticed it was inaccessible, so I tried to log in to the TrueNAS server and realized it was completely frozen. This is the first time that has ever happened, so I rebooted it, and during startup it went straight into a kernel panic. I've tried to research what might be causing this, but I haven't found much aside from one issue that seemed unrelated. So far the only troubleshooting I could think of is running MemTest86 to make sure the memory is good, but beyond that I don't know what else to do. Any and all help would be greatly appreciated, as this is my production system.
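For reference, the rough CLI equivalents of what I did through the web UI would be something like the following (pool and device names here are placeholders, not my real ones):

```bash
# Take the failed disk offline (placeholder pool "tank", placeholder disks)
zpool offline tank /dev/sdf

# Remove the now-degraded mirror vdev; ZFS evacuates its data onto the remaining vdevs
zpool remove tank mirror-2

# Re-add the surviving disk from that mirror as a hot spare
zpool add tank spare /dev/sde

# Long SMART tests on the remaining disks, plus a scrub of the pool
smartctl -t long /dev/sda
zpool scrub tank
zpool status -v tank
```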
Dumb question, but how would I get those details if it goes into a panic as soon as I attempt to boot? Also, I did come across that issue, but it was on a Fedora system, so I didn't think it would be applicable to TrueNAS Scale. Thanks!
I would assume it's possible to look up which ZFS version a given TrueNAS Scale version ships with. (I don't run TrueNAS myself, so I'm not sure.) From your photo, the kernel version seems to be linux-6.6.44.
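For the record, if you can get into any environment where the pool's ZFS module is loaded (a live stick, for example), something like this should show the exact versions to put in a report:

```bash
uname -r                       # running kernel version
zfs version                    # OpenZFS userland and kernel module versions
cat /sys/module/zfs/version    # kernel module version only, if the module is loaded
```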
Since the error messages are exactly the same (down to the same source file and line number, vdev_indirect_mapping.c:528), I'm guessing it's the same error.
The fact that you removed a vdev from the pool (which, IIUC, adds what they call indirect mappings for the relocated data), together with the name of that source file, could point to a bug in the code that handles vdev removal. Make sure to include the info about the removed vdev in your report as well.
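If you can get the pool visible from some rescue environment, something like this should show the removed vdev (it turns up as an indirect-N entry) and dump the related config, which would be useful material for the report. The pool name is a placeholder:

```bash
# Removed top-level vdevs show up as indirect-N entries in the pool layout
zpool status -v tank

# Dump the pool configuration, including indirect vdev metadata
zdb -C tank

# For a pool that isn't imported, add -e so zdb scans the devices instead of the cache file
zdb -e -C tank
```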
So I was able to mount my pools using an Ubuntu live boot, and everything looks good. I ran a scrub on my boot and storage pools and no errors were found. Tried to reboot and it failed with the same error as in the original screenshot.
The issue seems to be related to the indirect-0 device, which, from everything I found online, is a “ghost” device that ZFS creates as a pointer to the VDEV that was removed. The reason I think they're connected is that the error states “PANIC at vdev_indirect_mapping.c:528”. However, I have no clue where to go from here, especially since everything in my pool is showing as healthy.
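For anyone who finds this thread later, the rough shape of what I did from the live stick was something like this (the pool name is a placeholder for mine):

```bash
# Get the ZFS tools in the Ubuntu live session
sudo apt update && sudo apt install -y zfsutils-linux

# List the pools visible on the attached disks, then import under an alternate root
sudo zpool import
sudo zpool import -f -R /mnt/tank tank

# Kick off a scrub, then watch progress and results; the removed vdev shows up as "indirect-0"
sudo zpool scrub tank
sudo zpool status -v tank

# Export cleanly before rebooting back into TrueNAS
sudo zpool export tank
```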
So I guess this means you tried to reboot into TrueNAS again, the same install that failed before? If so, whatever the issue is, it seems to be fixed (or worked around) in the ZFS version on your Ubuntu stick.
Is your TrueNAS root on ZFS, or just your storage pools? If the latter, you should be able to disconnect your ZFS drives, boot up TrueNAS, upgrade to a fixed version, power off, hook up your drives, and hopefully have a working system. (Note that I’m not familiar with TrueNAS so I’m not 100 % sure it won’t do something stupid like “forget” about your pool if you boot it without the drives, but…)
If the former, you are in a bit of a pickle: how do you update a system that you cannot get to boot? I don't have an obvious solution to that; perhaps it's better to ask on the TrueNAS forum?
Yeah, I tried to reboot into TrueNAS after I performed the scrub, to no avail. I was thinking the same thing, but I was already running the latest stable version of TrueNAS, which one would assume works correctly with the ZFS version that they (the TrueNAS devs) distribute.
My TrueNAS root is on a ZFS mirror, so yeah, I am in an unfortunate pickle. I did post this to the TrueNAS forums, which is where I got the idea to boot from a live Ubuntu disk, but the person I was chatting with there was also very perplexed by this issue.
I guess the only good thing so far is that I know my data is “safe”, so I think I might try to migrate my data to somewhere else and rebuild my entire server if I can’t find a fix. Anyways, thanks for the tips and ideas!!
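If I do end up going the rebuild route, my rough plan would be to replicate everything with zfs send/receive from the live environment; a very rough sketch, with placeholder pool, host, and snapshot names:

```bash
# Recursive snapshot of the whole source pool
zfs snapshot -r tank@migrate

# Replicate to a pool on another machine over SSH (-F overwrites whatever is on the target)...
zfs send -R tank@migrate | ssh backupbox zfs receive -F newtank

# ...or to a second pool attached locally
zfs send -R tank@migrate | zfs receive -F newtank
```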
Okay. So there is a tentative way to rescue this situation - or at least there would have been if this were a normal Linux distro and you weren't already on the latest version:
It is possible to boot into a Live environment (or really into any Linux system as long as it boots to a text console), use ‘chroot’ to “switch” to the broken, non-bootable system’s file system while still running the “rescue” system’s kernel, and then use the broken system’s tools to update it (or otherwise fix it). But yeah, you need some newer version to update to, and I would strongly prefer doing this from a text console rather than trying to start and use a web GUI in the chrooted environment.
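On a generic Linux system, the mechanics would look roughly like this; the pool name, dataset layout, and paths below are placeholders, and the TrueNAS-specific parts (how its boot pool is laid out and how you would actually trigger the update) are exactly what I don't know:

```bash
# From the live environment: import the broken system's root pool under /mnt
# ("rootpool" and the dataset name below are placeholders, not TrueNAS specifics)
zpool import -f -R /mnt rootpool

# Depending on the layout, the root dataset may need to be mounted explicitly
zfs mount rootpool/ROOT/default

# Bind-mount the virtual filesystems that the chrooted tools expect
for fs in dev proc sys; do mount --rbind "/$fs" "/mnt/$fs"; done

# Switch into the broken system and use its own tools from a text console
chroot /mnt /bin/bash

# ...update or repair from inside the chroot, then clean up:
exit
umount -R /mnt/dev /mnt/proc /mnt/sys
zpool export rootpool
```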
In short, I'm not sure how to do this specifically in TrueNAS. But someone on their forums should know…