Kernel panic plus other system errors

Hi

I built a new system back in August, and lately I've been having different errors pop up. I'm not yet skilled enough in Linux to track down the cause, so I was hoping I could get some advice.

My system:
Ryzen 9 3900X
Gigabyte Aorus Master with F5n BIOS (I changed some settings per a thread on here about optimizing Ryzen chips: I turned off PBO and set the CPU voltage manually, but other than that, settings should be baseline, with no overclocking)
32GB Crucial Ballistix RAM
2x ADATA SX8200 Pro NVMe drives
/boot and /boot/efi are on separate partitions on drive #1; /, /home, and swap are on RAID 0 across both drives
Fedora 30

The most recent error is a kernel panic after I installed the latest kernel update, 5.3.11-200: "broken padding". So far, this happens every time I try to boot this kernel.

But I can load into the previous kernel, 5.3.8-200. Also, last week, when I updated to 5.3.8-200, I had the same problem: I could load into the previous kernel (5.2.18-200), but not the current one. Kernel panic again, but with a slightly different message. (I don't have a picture of this error.)

On November 15, I had the following error.

A couple of days ago I had this pop up in problem reporting:

WARNING: CPU: 10 PID: 0 at arch/x86/kernel/cpu/common.c:1678 debug_stack_reset.cold+0xc/0xf

Also, on the same day, this error:

traps: ImgDecoder #12[23873] general protection fault ip:7fc101127e6a sp:7fc0e5ffa698 error:0 in libxul.so[7fc0fe2f5000+38be000]

Two weeks ago I had this error:

BUG: Bad rss-counter state mm:00000000f54264e1 idx:1 val:3

Then, also two weeks ago, several errors of this nature (different CPU IDs, but the same message):

watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [kworker/16:0:3292]

Googling the kernel panics, I've seen reports saying the problem is the initramfs; others say it's the RAM, others the drives. But given the CPU lockups, I'm also concerned that I might have a bad processor.

I tried running memtest86 this morning, but it told me it doesn't work on EFI systems. I ran memtester for about 50 passes instead, and everything came back OK.
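
(For anyone curious, memtester runs from userspace, so it can't lock all 32GB at once; I tested a large chunk at a time, roughly like this, with the size and pass count just examples:)

sudo memtester 8G 50    # lock 8GB of RAM and run 50 test passes over it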

So what I would really like is help figuring out what exactly the issue is. If it's hardware, then I need answers so I can get warranty service. If it's something else, then of course I just want to get it fixed.

Any advice or direction would be appreciated.

Thanks for reading

Something is wrong with your RAID configuration. The kernel panic is due to the kernel not being able to find your rootfs on the expected drive. If that location changes between builds of your initrd, you will not boot. I am surprised it didn't give you the option for interactive mode.

I don't really know how to help you if you are still new to the *nix world. My recommendation would be to use md (Linux software RAID) instead of the motherboard's RAID, as that seems to be the preferred way to do things in the GNU/Linux world now.
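
A quick way to check what the kernel thinks you have (the array name below is just an example):

cat /proc/mdstat                  # lists any active md arrays
sudo mdadm --detail /dev/md127    # details for one array, if present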

Thanks for your reply.

I am using md for my RAID array; sorry if I wasn't clear on that. The drive and initrd locations aren't changing, which is why I was thinking that maybe one of the drives is bad. But fsck says the /boot filesystem is clean, and smartctl says the boot drive is healthy. Here is my lsblk output:

sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 100M 0 part
├─sda2 8:2 0 128M 0 part
└─sda3 8:3 0 446.9G 0 part
nvme1n1 259:0 0 477G 0 disk
├─nvme1n1p1 259:2 0 100M 0 part /boot/efi
├─nvme1n1p2 259:3 0 500M 0 part /boot
├─nvme1n1p3 259:4 0 50.1G 0 part
│ └─md127 9:127 0 100.1G 0 raid0 /
├─nvme1n1p4 259:5 0 2G 0 part
│ └─md126 9:126 0 4G 0 raid0 [SWAP]
└─nvme1n1p5 259:6 0 424.3G 0 part
└─md125 9:125 0 848.3G 0 raid0 /home
nvme0n1 259:1 0 477G 0 disk
├─nvme0n1p1 259:7 0 50.1G 0 part
│ └─md127 9:127 0 100.1G 0 raid0 /
├─nvme0n1p2 259:8 0 2G 0 part
│ └─md126 9:126 0 4G 0 raid0 [SWAP]
└─nvme0n1p3 259:9 0 424.3G 0 part
└─md125 9:125 0 848.3G 0 raid0 /home
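
(For clarity, the health checks I mentioned were roughly along these lines, with the device names taken from the lsblk output above:)

sudo smartctl -a /dev/nvme1n1    # NVMe SMART/health log (needs a recent smartmontools)
sudo fsck -n /dev/nvme1n1p2      # read-only check of the /boot filesystem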

I'm not new to Linux, just new to dealing with this level of problem. So far, any troubleshooting I've had to do has been simple, straightforward, and fairly easy. So I posted here in hopes that other people might point out anything I could be missing, overlooking, or just plain didn't know.

Do me a favor: boot to the last known working kernel and then check what UEFI is looking for in regards to the bootable device by typing sudo efibootmgr.

I had an issue when my boot devices actually changed: sdc became sda, and sdb and sdc swapped as well. When I rebuilt my boot options and wrote them to EFI, I had an unbootable system. I had to rebuild using the --gpt flag to set the system to use the UUID instead of the sdXy designation.
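
For reference, recreating an entry looks something like this (the disk, partition number, and loader path here are examples for your layout; -g/--gpt is the flag I mean):

sudo efibootmgr -c -g -d /dev/nvme1n1 -p 1 -L "Fedora" -l '\EFI\fedora\shimx64.efi'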

BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0001,000C,000D,0000
Boot0000 Windows Boot Manager HD(1,GPT,d4d0dcb3-9f5a-4678-8c8c-1124c1c5ebb9,0x800,0x32000)/File(\EFI\MICROSOFT\BOOT\BOOTMGFW.EFI)WINDOWS…x…B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}…
Boot0001* Fedora HD(1,GPT,d4d0dcb3-9f5a-4678-8c8c-1124c1c5ebb9,0x800,0x32000)/File(\EFI\FEDORA\SHIMX64.EFI)
Boot000C* Fedora HD(1,GPT,d4d0dcb3-9f5a-4678-8c8c-1124c1c5ebb9,0x800,0x32000)/File(\EFI\FEDORA\SHIM.EFI)…BO
Boot000D* Windows Boot Manager HD(1,GPT,e34ef75b-84a1-4e3d-ae4c-e05880594d0f,0x800,0x32000)/File(\EFI\MICROSOFT\BOOT\BOOTMGFW.EFI)…BO

Looks interesting. I know that on Arch Linux, the default is the systemd-bootx64.efi file for the systemd-boot loader. On Debian, they use GRUB as the default, so you get grubx64.efi. I use both of those OSes.

What happens when you press the F8 key at POST and select options 0000, 000C, or 000D?

Also, do you have Windows and Linux installed on the same RAID/NVMe device(s)? Take a look at your UUIDs on the devices.
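
Something like this will list them:

sudo blkid                              # filesystem UUIDs per device
lsblk -o NAME,FSTYPE,UUID,MOUNTPOINT    # same info in tree form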

Yeah, that first Windows entry is from when I first built the computer and put Win10 on there to run benchmarks on the disks. It doesn't exist anymore, as it was overwritten by Fedora; I need to delete that entry. The second Windows is an install I did a few weeks ago on a different HD, just to have access to the CPU/case lighting, since there is no way to control it in Linux.

I didn't notice until you mentioned it that the two Fedora entries actually point to different files.

I will try the different boot options, see what happens, and get back to you.

OK, I tried booting with the different Fedora options.
Choosing either one still results in the same kernel panic with the latest kernel.

However, there were a couple of other things I noticed in the BIOS:

  1. The clock was off by 6 hours.
  2. The first time I turned my PC on, my CPU was clocking at 4.199 GHz (the clock multiplier is at 39). That's the first time I've ever seen it clock that high in the BIOS.
  3. Memory was clocking at 3499 MHz (it should be 3200 MHz).

Not sure if that means anything or not.

Hmm, that is interesting. That might be ACPI.

You may need to supply new boot options. I don't know why the new kernels are not working while the old ones are.

Have you tried bisecting the kernel releases until you find where it breaks? Then you could report the bug.
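
Rough sketch of what I mean (each bisect step means building and booting the checked-out kernel; the tags assume 5.3.8 is your last good build):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect bad v5.3.11    # first version that panics
git bisect good v5.3.8    # last version that boots
# build and boot the commit git checks out, then tell git the result:
#   git bisect good    (or)    git bisect bad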

Actually, it's gotten even more frustrating. I just installed a new kernel today, and now the first two kernels don't work and give the same error message.

Think I might just have to wipe the whole thing, get rid of the RAID, and just use the 2 NVMe's as separate drives. At least then maybe I can see if it's a drive issue.
I keep having other issues pop up. ABRT went and erased all previous error reports. The daemon is throwing an error that it can't access a folder due to permissions, and the log talks about having to delete problem abrt folders, so I'm guessing that's why all my error reports are now gone.
Also, now I'm getting some Radeon video errors, something about forbidden registers, but I haven't been able to find much info about it online yet.

Unfortunately, I don't know how to bisect the kernel to find the source of the issue.

It's just one of those days today :frowning:

EDIT: OK, now that I've had a chance to cool down and think clearly again.

This is my boot cmdline:

BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.3.8-200.fc30.x86_64 root=UUID=67e32de0-8719-47ec-bb47-7b1203b67d1e ro resume=UUID=85b148b7-c28d-46e1-a60b-d30acaa77d47 rd.md.uuid=ade7702e:9b613259:149278b2:980da072 rd.md.uuid=1335dc25:6ff79998:e43d8960:9ca4310e rhgb iommu=pt amd_iommu=on rd.driver.pre=vfio-pci kvm_amd.npt=1 pcie_aspm=off

and the boot with the newer kernels fails at "cannot open root device UUID=67e32de0-8719-47ec-bb47-7b1203b67d1e". It doesn't see the RAID devices; it only sees a SATA disk I have for a Windows VM. Just before it fails at this line, md does an autodetect for RAID devices, but it doesn't list any devices; it just prints autorun, then autorun DONE.
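
For what it's worth, from the working kernel the root UUID does resolve (blkid -U looks up a device by filesystem UUID), so the UUID itself isn't stale:

sudo blkid -U 67e32de0-8719-47ec-bb47-7b1203b67d1e    # should print /dev/md127, matching the lsblk output above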

EDIT #2 (11/28/19)

OK. After spending all morning digging through modules, logs, etc., I first thought the issue was md RAID not finding the array. But it turns out that both the 5.3.11 and 5.3.12 kernels had corrupt initramfs files. After regenerating the initramfs for 5.3.12… it boots. YAY!
Now to figure out why both files ended up corrupt, and maybe, if I'm lucky, why my CPUs have the occasional soft lockup.
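
In case anyone finds this thread later, the regeneration was basically a one-liner (dracut is what Fedora uses to build the initramfs; the version string should match your new kernel, mine being along the lines of 5.3.12-200.fc30.x86_64):

sudo dracut --force /boot/initramfs-5.3.12-200.fc30.x86_64.img 5.3.12-200.fc30.x86_64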

And thank you very much, @Mastic_Warrior, for your patience and time in helping me try to figure all this out.

@xavier01 I honestly did nothing. You did all of the work, including having the patience to work through this and not rage quit.

As for the corrupted files, that may be because the device did not sync properly before a shutdown, or TRIM is not working correctly. I don't use Fedora, but when you install a new kernel, does it automagically build the initramfs, or do you have to follow up and build it manually? Old Fedora Core (~2005) used to require you to do this on your own back when I played with it.

You may also want to check that the symlinks point to the correct initramfs file in your boot directory/device.
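
A quick look like this shows whether each installed kernel has a matching image (paths assume a standard Fedora /boot layout):

ls -l /boot/vmlinuz-* /boot/initramfs-*    # each kernel should have a matching initramfs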

You did help.
It was your idea about changing the boot options that got me looking at the parameters, and finally using lsinitrd to list out the initramfs and see the corruption errors inside.
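
For anyone else who hits this, the check was along these lines (the image name will vary per kernel; lsinitrd ships with dracut):

lsinitrd /boot/initramfs-5.3.11-200.fc30.x86_64.img | less    # a corrupt image shows errors here instead of listing contents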

Fedora automatically creates the initramfs when it updates to a new kernel. The symlinks are correct, so I'll have to wait for the next kernel update and see if it happens again.
And I definitely should be checking that the NVMe drives are using TRIM correctly; I know that in the past this has been an issue with certain distros.
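
First things I'll check (fstrim.timer is how Fedora schedules periodic TRIM):

systemctl status fstrim.timer    # is periodic TRIM enabled?
sudo fstrim -v /home             # one-off manual trim; prints how much was trimmed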
