BTRFS: cannot upgrade Debian release

I have a server that uses BTRFS as the root filesystem.
I am attempting to upgrade from Debian Bullseye to Bookworm.

I have a full system backup I can easily restore to, using a different machine, which is already in Debian Bookworm.

So I tried apt full-upgrade as normal, and I get BTRFS errors and the filesystem goes read-only. I figured no big deal: I simply reformat and rsync my backup data back onto the server's SSD. Then, while still on the Bookworm system, I run btrfs scrub and check --readonly on the server's SSD before I put it back into the server and boot. In every attempt I've made, these checks come back clean and show no issues.
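Concretely, the restore-and-verify loop I'm describing looks roughly like this (the device node and paths here are made up; adjust for your own layout):

```shell
# Restore-and-verify loop, run from the Bookworm machine.
# /dev/nvme0n1p2 and /srv/backup are hypothetical placeholders.
mkfs.btrfs -f /dev/nvme0n1p2
mount /dev/nvme0n1p2 /mnt/target
rsync -aHAX --numeric-ids /srv/backup/ /mnt/target/

# Verify data checksums of everything just written (-B waits for completion):
btrfs scrub start -B /mnt/target

# Offline metadata check; the filesystem must be unmounted for this:
umount /mnt/target
btrfs check --readonly /dev/nvme0n1p2
```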

I have tried simply using "mkfs.btrfs -r backup/ <device>", but the result almost always fails a btrfs scrub and check. So rsync it is.
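For completeness, that one-step variant is the --rootdir form, which pre-populates the new filesystem at mkfs time instead of mounting and rsyncing (device name hypothetical):

```shell
# -r / --rootdir copies the contents of a directory into the freshly
# created filesystem, skipping the separate mount + rsync step:
mkfs.btrfs -r /srv/backup/ /dev/nvme0n1p2
```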

The server boots perfectly, like nothing ever happened. Just as expected. I even btrfs scrub the root filesystem while the server is booted, just to be sure, and it always comes back fine. So I start the upgrade attempt again, and BTRFS will always get a checksum error at some point in the process. Always. I've made six attempts, and it keeps happening, to different files even. I keep restoring the same backup and verifying that all the BTRFS checks return no errors, but the upgrade process always triggers a checksum error and forces the server to read-only.
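When the checksum error hits, the kernel log names the affected inode, which can then be mapped back to a file path with btrfs inspect-internal inode-resolve. A sketch of extracting the inode number (the log line below is a paraphrased sample, not my actual dmesg output; the exact fields on your kernel may differ):

```shell
# Sample (paraphrased) kernel line of the kind btrfs logs on a csum failure;
# in practice you'd pull it from `dmesg` or `journalctl -k | grep -i csum`:
line='BTRFS warning (device nvme0n1p2): csum failed root 5 ino 257 off 4096 mirror 1'

# Extract the inode number so it can be resolved to a path, e.g.
#   btrfs inspect-internal inode-resolve 257 /
ino=$(printf '%s\n' "$line" | sed -n 's/.* ino \([0-9]*\) .*/\1/p')
echo "$ino"   # → 257
```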

What am I doing wrong? Why do btrfs scrub and check keep falsely telling me everything is fine when there's clearly a checksum issue? How can I fix this so I can upgrade to Debian Bookworm?

The server's SSD passes its SMART tests, so I don't believe hardware failure is the culprit.

The server is using default mount options for its root filesystem. Linux 5.18, Debian Bullseye.

Maybe check the SSD with OS tools, like badblocks? Have you done an extended SMART test or just a quick one?

Just did the long test. It passed.
I don't think badblocks is effective at all on NVMe devices; isn't that the NVMe controller's job?
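For the record, the drive-reported health counters can also be read directly with nvme-cli (Debian package nvme-cli; device name hypothetical):

```shell
# The controller's own health log; "critical_warning" and "media_errors"
# are the fields worth watching for signs of a failing drive:
nvme smart-log /dev/nvme0
```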

I thought badblocks was just for ATA devices.

I’m not aware of any other “os tools” that test for failing storage hardware.

I COULD replace the SSD, but I won't unless I'm sure that will fix the issue. Nothing indicates the SSD is failing, so the issue is probably not the hardware.

Hmm, in principle, yes. But what if the controller is buggy? badblocks writes test patterns to disk and then reads them back to check that they match. If the controller is not mapping out bad blocks correctly, that check would still work.
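The write-mode pass I mean would look like this (warning: this WIPES the entire drive; device name is hypothetical):

```shell
# Destructive read-write surface test: writes four patterns across the whole
# device and reads each back for comparison, exercising the controller
# rather than trusting its own bookkeeping. DESTROYS ALL DATA.
badblocks -wsv /dev/nvme0n1
```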

The only other reason I could think of is some kind of btrfs bug.

Since you restore using rsync, you could of course switch to ext4. Though ext4 may fail silently if the hardware does have issues?

Funny you should mention that. I've already tried it. Switching to ext4 renders the system unbootable: the kernel refuses to see my initramfs file, even though I've verified it DOES exist where it should.

I've been using this setup for quite some time, on this exact kernel even. I can't find any documented btrfs bug on kernel 5.18 that gives a falsely clean check/scrub like this and then throws errors anyway. But I do know my google-fu is not the strongest.

Before anybody tells me to update my kernel: that's exactly what I'm trying to do here.

Unlikely, but I can't rule it out. This system's been working flawlessly for years. Is there any way to be certain? The board is the ASUS Prime X399-A; the SSD is an Intel SSDPE21D280GA.

If the OS drive’s configuration can be restored from backup easily, then can you try a totally fresh install of bullseye with both ext4 and btrfs and go through the upgrade process with each?

I could do that, but I don’t see what that would accomplish. If I were okay with just reinstalling, I wouldn’t have bothered to make a backup.

I want to update my current setup, not have to recreate it from fresh Debian.

If I do try your suggestion, I will have to wipe the fresh Debian anyway to restore my backup and bring my server back online.

At best, I would have an updated fresh Debian, which I’d have to wipe away as I restore from backup. I’d be back to my current state.

If you do a fresh install of Bullseye on ext4 (or figure out whatever issue is preventing you from switching the current system to ext4), successfully upgrade the system to Bookworm, and are able to use it for its intended purpose without further issues, it makes it more likely that your issue lies specifically with btrfs. No matter what tree you bark up software-wise, the software folks are going to request more extensive evidence that the hardware is sound and working. SMART tests aren't always the most reliable thing.

Very well. I will try this.
I never considered that SMART might be giving me a false negative when testing for errors.

… maybe this SSD is actually dying.
But then why does the Bookworm machine not fail any scrubs or checks when I’m writing my backup into this SSD?

SMART is a brilliant concept and isn't totally worthless, but it's also not the be-all end-all source of truth when it comes to actual drive health. I had a 6TB hard drive in a ZFS pool throw a SMART error over 5 years ago; it's still in use today and has never produced a ZFS error. I have the replacement drive on hand and ready to go, but its time to shine still hasn't come.

Just a guess, but it could be that there’s a range of bad sectors and your existing data fits on the drive without touching those bad sectors. New data being added for the upgrade stretches into those bad sectors, causing issues.

The easiest test of all would be to just clone the drive to a new drive and try on there. I know you said you don’t want to replace the drive, but even if the current drive winds up being fine and the issue is entirely software, it is still good practice to have another drive on hand for immediate recovery for when a drive does eventually fail, or for testing potential drive failures in future similar situations. Personally I keep a spare drive on hand in every size and format that is currently in use and important to me.

I have ordered a replacement drive. Another Intel NVMe U.2 one.

Debian will not install fresh to the current SSD. Keeps kernel panicking or segfaulting when I try.

Now, if the new SSD arrives and still has this behavior then we’d really have a mystery.

That sounds very strange… Maybe it is some other component causing corruption: memory, motherboard? Do you have ECC? If the server is not usable you might as well run a memtest.

I do not have ECC. I don't think the RAM or board is dead; wouldn't I fail POST?

Not necessarily. And bad ram can absolutely corrupt your file system, as everything written to disk will be held there first.

Hardware also doesn't have to be completely dead before it starts malfunctioning.

I cannot run a memtest. It never outputs to the screen; it's always just black. This server happens not to have a serial connection either. Don't suppose any of you know how to make memtest output on a Navi10 graphics card?

Or is there a way I can run a memtest from a Linux kernel booted in text mode?

Do you have any BIOS settings relating to video output? Otherwise try a different memtest, e.g. Memtest86+ instead of MemTest86, or vice versa?

I see, at least on Fedora, that I can install memtest86+. But using a bootable one should be better, if possible, because memory already in use can't be tested.
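For the text-mode question above: memtester is one option that runs from a normal shell (Debian package memtester; size and pass count below are just examples):

```shell
# Userspace RAM tester: locks the requested amount of memory and runs
# pattern tests over it. It can only test memory the kernel will hand it,
# so a bootable tester is still more thorough.
memtester 8G 3   # test 8 GiB, 3 passes; needs root to lock that much
```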

Neither MemTest86, Memtest86+, nor the kernel parameter output anything to the screen.
My motherboard's settings have no entries governing display output.

Update: I got MemTest86 to actually cooperate with this machine. Running the memory test now. I will update this post with the results.

Yup: RAM is fucked.

Anybody know of an 8x16GB kit that's compatible with the Prime X399-A?

There's ASUS's QVL (Qualified Vendor List) for that board. Probably many other kits would work too, but those should be guaranteed.

2 Likes