Understanding MCE Errors - Won't Boot Linux 6.5

I get an MCE error whenever I boot my PC, and it prevents Linux from loading with the 6.5 kernel. Oddly, I can boot 6.4 and use the PC without issue, although the same MCE error still shows up in the system logs. I’m running a Ryzen 9 5900X CPU and want to know how I can figure out what is causing the error. I already ran memtest, and all tests passed over several passes.

Here is what seems to be the relevant log:

[    0.839041] microcode: CPU0: patch_level=0x0a201025
[    0.839042] microcode: CPU3: patch_level=0x0a201025
[    0.839043] microcode: CPU2: patch_level=0x0a201025
[    0.839043] microcode: CPU4: patch_level=0x0a201025
[    0.839043] microcode: CPU5: patch_level=0x0a201025
[    0.839044] microcode: CPU7: patch_level=0x0a201025
[    0.839044] microcode: CPU6: patch_level=0x0a201025
[    0.839045] microcode: CPU8: patch_level=0x0a201025
[    0.839045] microcode: CPU9: patch_level=0x0a201025
[    0.839046] microcode: CPU10: patch_level=0x0a201025
[    0.839046] microcode: CPU11: patch_level=0x0a201025
[    0.839047] microcode: CPU12: patch_level=0x0a201025
[    0.839047] microcode: CPU13: patch_level=0x0a201025
[    0.839048] microcode: CPU14: patch_level=0x0a201025
[    0.839049] microcode: CPU16: patch_level=0x0a201025
[    0.839049] microcode: CPU17: patch_level=0x0a201025
[    0.839049] microcode: CPU15: patch_level=0x0a201025
[    0.839050] microcode: CPU18: patch_level=0x0a201025
[    0.839051] microcode: CPU19: patch_level=0x0a201025
[    0.839052] mce: [Hardware Error]: Machine check events logged
[    0.839053] microcode: CPU20: patch_level=0x0a201025
[    0.839053] microcode: CPU21: patch_level=0x0a201025
[    0.839054] microcode: CPU22: patch_level=0x0a201025
[    0.839054] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: faa000000000080b
[    0.839055] microcode: CPU23: patch_level=0x0a201025
[    0.839056] mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d000000 
[    0.839061] fbcon: Taking over console
[    0.839069] IPID 1002e00000500 
[    0.839071] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1696856580 SOCKET 0 APIC 0 microcode a201025
[    0.839077] microcode: CPU1: patch_level=0x0a201025
[    0.839085] microcode: Microcode Update Driver: v2.2.
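
For anyone who wants to read the raw values in that log, here is a rough Python sketch that splits the Bank 27 status word and the IPID into fields. The bit positions are an assumption based on how the Linux kernel's x86 MCE code names them (bits 63 to 57 are architectural, TCC and SyndV are AMD-specific), and the function names are made up for illustration. It only extracts fields; it does not tell you which device is at fault.

```python
from datetime import datetime, timezone

# Assumed bit positions in the MCA status word, following the names used
# by the Linux kernel's x86 MCE code. Bits 63..57 are architectural;
# TCC (55) and SyndV (53) are AMD-specific.
STATUS_FLAGS = {
    "Val": 63, "Overflow": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flag bits and error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    fields["ErrorCodeExt"] = (status >> 16) & 0x3F  # assumed bits 21:16
    fields["ErrorCode"] = status & 0xFFFF           # bits 15:0
    return fields


def decode_ipid(ipid: int) -> dict:
    """Split an AMD MCA IPID value into its assumed fields."""
    return {
        "McaType": (ipid >> 48) & 0xFFFF,     # assumed bits 63:48
        "HardwareID": (ipid >> 32) & 0xFFF,   # assumed bits 43:32
        "InstanceId": ipid & 0xFFFFFFFF,      # assumed bits 31:0
    }


def mce_time(epoch_seconds: int) -> str:
    """Convert the TIME field (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    # Values copied from the log above.
    for name, value in decode_status(0xFAA000000000080B).items():
        print(f"{name:13} {value}")
    for name, value in decode_ipid(0x1002E00000500).items():
        print(f"{name:13} {value:#x}")
    print("TIME          ", mce_time(1696856580))
```

For this log, the status decodes to an uncorrected error (UC) with PCC set and no valid address (AddrV clear). The IPID fields are what you would look up in the kernel's SMCA tables to name the reporting hardware block.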

I'm also on a 5900X here. I get segfaults in FIO and various other problems on 6.5.5+, but can run 6.5.4 with minimal issues. Can you try older revisions of 6.5?
Which microcode version do you have? Which AGESA version? Maybe there’s a conflict between the new AMD_PSTATE features and older microcode, or between an older AGESA and newer microcode?

But I am a simple pork and can only guess. :woozy_face:

I don’t have any software issues on 6.4.15.

Here are those versions:

desktop ~> grep 'stepping\|model\|microcode' /proc/cpuinfo | head -4
model		: 33
model name	: AMD Ryzen 9 5900X 12-Core Processor
stepping	: 0
microcode	: 0xa201025
desktop ~> dmesg | grep -e 'DMI.*BIOS'
[    0.000000] DMI: Micro-Star International Co., Ltd. MS-7C91/MPG B550 GAMING EDGE WIFI (MS-7C91), BIOS 1.E0 06/27/2023

I found the culprit: after I removed my PCIe-to-FireWire card, I get no errors and can boot kernel 6.5. I know there were some FireWire changes, so I’m wondering whether this is a bug or just bad timing.

Edit: Looks like it’s a bug: 1215436 on SUSE’s Bugzilla.
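
If you are hitting something similar and are not sure whether your machine has a FireWire controller at all, a small Python sketch like the one below can list them from sysfs. It assumes the usual Linux layout under /sys/bus/pci/devices and relies on the PCI class code for IEEE 1394 controllers (base class 0x0c, subclass 0x00). The function name is hypothetical, and it only lists devices; it does not confirm that a given card triggers the MCE.

```python
from pathlib import Path

# PCI base class 0x0c (serial bus controller), subclass 0x00 (IEEE 1394).
# sysfs exposes the class as a string such as "0x0c0010".
FIREWIRE_CLASS_PREFIX = "0x0c00"


def find_firewire_controllers(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, class_code, driver_name) for FireWire controllers."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            continue  # not a device directory, or unreadable
        if not pci_class.startswith(FIREWIRE_CLASS_PREFIX):
            continue
        # "driver" is a symlink to the bound driver, absent if none is bound.
        driver = dev / "driver"
        driver_name = driver.resolve().name if driver.exists() else None
        found.append((dev.name, pci_class, driver_name))
    return found


if __name__ == "__main__":
    controllers = find_firewire_controllers()
    if not controllers:
        print("No FireWire controllers found")
    for address, pci_class, driver_name in controllers:
        print(f"{address} class={pci_class} driver={driver_name}")
```

Taking the sysfs path as a parameter keeps the function easy to try against a copy of the directory tree rather than the live system.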
