BtrFS: System Crash during Restorecon, now mounting drive hangs

I was running restorecon -R /path/to/drive/mount when my system crashed and rebooted due to a different issue.

Now, when I try to mount the drive with mount -a, it hangs and checking journalctl -xe shows that the device’s start service failed and timed out.

Doing btrfs check on the device works fine and reports no errors.

SMART reports no errors either. Just a healthy new drive.

I imagine this is an issue with SystemD/SELinux due to the incident before this issue occurred, but I am uncertain of how to fix the drive if I cannot mount it to do context operations on it.

For reference, I had upgraded from Fedora 30 to Fedora 31, and found I couldn’t access my files over SMB share as I could before. Just the directories. I did restorecon -R /path/to/drive/mount and this resolved my issue as I could now access the files.

However, upon a normal reboot, I lost access again, and ran the command again, and that fixed it once more. That’s why I was running restorecon, as I had just rebooted once more.

Anyway, here is the journalctl output showing the timeout error:

I can’t get the hung mount command to stop, even with kill -s 9 <pid>, but I’m more concerned about why this is happening at all.

When I try to mount, the drive goes up to 99% usage, then drops back to 0% after a minute or two, presumably when the timeout limit is hit.

Does anyone have any idea why this might be happening?

Edit: Here is journalctl -f after disabling SELinux entirely by setting SELINUX=disabled in /etc/sysconfig/selinux.

Edit 2: Googling led me to these two Stack Exchange questions, which revolved around /etc/fstab and finding a device by its UUID.

However, when I do blkid, it matches the UUID in the /etc/fstab. And more importantly, I’m trying to mount the device with this command:

mount -t btrfs /dev/sdb /path/to/drive/mount

So UUIDs shouldn’t be an issue. Targeting /dev/sdb with btrfs check works fine. Same with smartctl, fdisk, etc.

So I’m not understanding why it might not be finding the UUID, unless SystemD is somehow hiding it from itself.
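
One way to cross-check this is to list every UUID that /etc/fstab refers to and see whether a matching device node exists. This is a minimal sketch, assuming a standard fstab layout; fstab_uuids is a hypothetical helper name, not an existing tool.

```shell
# fstab_uuids is a hypothetical helper, not part of any package.
# It prints each UUID used as a UUID=... source in an fstab-style file.
fstab_uuids() {
    # Comment lines start with '#', so their first field never matches ^UUID=.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); gsub(/"/, "", $1); print $1 }' "$1"
}

# Usage (left commented out so the snippet has no side effects):
# for u in $(fstab_uuids /etc/fstab); do
#     [ -e "/dev/disk/by-uuid/$u" ] || echo "no device for UUID=$u"
# done
```

Any UUID reported as missing would be a candidate for the entry SystemD is waiting on.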

After a little more sleuthing, the dependency that’s failing is the .swap for the drive.

I tried swapoff -a, but it gives this error:
image
However, I don’t have a device with that UUID and never have to my knowledge:

Edit:

More sleuthing = reading the actual error log

So now I have to figure out why that’s even a thing. How do you disable Swap on a drive that isn’t mounted yet?

Are you able to mount it from another OS (either live usb or passing the drive to a vm)?

Also, can you disable swap in fstab and/or systemd, reboot and then see what happens?

Since swap is managed by systemd, it will be activated again on the next system startup. To disable the automatic activation of detected swap space permanently, run systemctl --type swap to find the responsible .swap unit and mask it.
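
For what it's worth, that advice could be scripted roughly like this. It is a sketch only: swap_unit_names is a made-up helper that pulls unit names out of systemctl's listing, and the actual mask step is left as a comment since it needs root and the right unit name.

```shell
# swap_unit_names is a hypothetical helper.
# It reads `systemctl --type swap --all --no-legend --plain` output on stdin
# and prints only the unit names (first column ending in .swap).
swap_unit_names() {
    awk '$1 ~ /\.swap$/ { print $1 }'
}

# Usage (commented out; run as root):
# systemctl --type swap --all --no-legend --plain | swap_unit_names |
#     while read -r unit; do echo "candidate: $unit"; done
# Then, for the unit that keeps failing:
# systemctl mask "$unit"
```

The read -r matters here because swap unit names usually contain escaped characters such as \x2d.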

Good questions.

I shall have to set up a separate VM to try that.

I have disabled swap in the fstab and systemctl --type swap returns 0 units loaded after rebooting. However, the mount command still hangs.

journalctl -f no longer shows the timeout, yet the command still hangs.

Time to test in another OS.

Try a btrfs scrub too.

I don’t understand most of your first post, I know little about SELinux, but I’d like to make two comments:

  1. My main btrfs had a problem once that manifested as a send failure and “BTRFS error: did not find backref in send_root” in the system log. My various attempts to fix it made it worse and it started hanging as you describe, unkillable. This was on Ubuntu 19.04. I booted a really old systemrescuecd USB stick, with some old 4.3 kernel, and that hung on mount too. I was forced to back up (at the file level), then wipe and recreate the btrfs.
  2. UUIDs in /etc/fstab are painful. If you can, set sensible labels on the file systems and use LABEL= instead (PARTLABEL= for swap). A good thing for grub entries, too.

I am not able to mount it from a live OS for Fedora. Damn.

I’ve done this before to my detriment. I’ll try it after I’ve run out of research topics to figure this out.

I’m not seeing any errors with BTRFS.

image

Womp. Can’t do that unless there’s a way to scrub a device without mounting it.

Based on that, and that I can’t open it in another VM running a Live OS image, I take it it’s a problem with the drive itself.

Which is odd, given btrfs check reports no errors, and crystaldiskinfo reports no SMART problems.

It sounds like there’s a 3rd system here I’m unaware of that’s messed up, but I don’t know what that could be.

Edit: Found this.

Edit 2:

https://patchwork.kernel.org/patch/11370677/#23143831

dmesg shows the following:
image

So, it’s doing a tree-log replay. This is a 4TB drive with 100,000+ files on it, so if that’s the case, should this take hours? Should I let it finish?

My issue is that this then becomes a loop. The mount takes too long, so it never completes, but the mount process stays hung.

So I have to reboot since I can’t kill it, which gives it an unclean mount, and it has to replay the log again.

I doubt that’s my issue, but it’s one potential self-defeating cycle.

https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-rescue

I see there’s a zero-log argument for btrfs rescue, but I worry about using that.

Honestly, I just need to see where an error would go for this. journalctl only shows the timeout on mount, though it doesn’t do this anymore since I disabled swap.

dmesg only shows the 3 lines above.

Where else would I look for mount hanging errors?
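
One place that might help: a mount that ignores kill -s 9 is usually stuck in uninterruptible sleep (state D) inside the kernel, so the useful information is what the kernel is doing rather than anything mount prints. This is a rough sketch, assuming the procps version of ps; d_state_procs is a hypothetical helper name.

```shell
# d_state_procs is a hypothetical helper.
# It filters `ps -eo pid,stat,wchan:32,comm` output (stdin) down to processes
# whose state starts with D, printing pid, kernel wait channel and command.
d_state_procs() {
    awk '$2 ~ /^D/ { print $1, $3, $4 }'
}

# Usage (commented out; the /proc read needs root, and <pid> is the hung mount):
# ps -eo pid,stat,wchan:32,comm | d_state_procs
# cat /proc/<pid>/stack
# dmesg | grep -i btrfs
```

If /proc/<pid>/stack is available on the running kernel, it should show which kernel function the hung mount is sitting in.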

Edit 3:

When doing this, I tried mount -f -v -t btrfs /dev/sdb /path/to/drive/mount, but Verbose gave me nothing, and Faking the mount succeeded without issue. So that points to it being the drive further, I’d think. However, when I try mount -v -t btrfs /dev/sdb /path/to/drive/mount it hangs as expected and produces no Verbose output.

Edit 4:

I will be trying to let it play out the log over the next 12 hours. If it isn’t complete by then (it is an 8TB drive after all), I’ll probably reboot the VM again and try to mount with -o recovery,nospace_cache and if that fails instead trying -o recovery,nospace_cache,clear_cache.
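
That fallback order could be written down as a small dry-run helper. This is only a sketch: print_mount_attempts is a hypothetical name, it prints the commands rather than running them, and the device and mount point are whatever gets passed in. Note that the btrfs documentation lists recovery as deprecated since kernel 4.5 in favour of usebackuproot, so the option name may need adjusting on newer kernels.

```shell
# print_mount_attempts is a hypothetical helper.
# It prints, in order, the mount commands to try; it does not run them.
print_mount_attempts() {
    dev=$1
    mnt=$2
    for opts in recovery,nospace_cache recovery,nospace_cache,clear_cache; do
        printf 'mount -t btrfs -o %s %s %s\n' "$opts" "$dev" "$mnt"
    done
}

# Usage:
# print_mount_attempts /dev/sdb /path/to/drive/mount
```

Printing first makes it easy to check the exact option string before committing to a mount that might hang.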

It makes sense to me that this log replay would take forever since I was changing the context of the entire disk for SELinux. I presume this requires changes on the whole thing since it took a while every time.

image

So I take it the base root is the one that’s screwed up?

Interesting. What is the recovery process here now that I’ve witnessed that this works?

I should probably unmount, then try again without recovery, I imagine, and see if it’s just the cache.

Edit 1:

Aaaand it seems once mounting it successfully, then unmounting it successfully, mounting it normally works fine.

Huzzah.

It may have been that letting it try to mount for over 12 hours did the trick, but I didn’t bother to check and instead jumped to the recovery/cache options for mount -t btrfs.

Edit 2:

After this issue came up again, it was mounting with recovery,nospace_cache that worked, not letting it run for 12 hours.
