Linux reporting Hardware Errors (long-term, sporadic issue)

Hi there,

I built my current rig in around October of 2019, with a Ryzen 3900X, an ASUS Pro WS X570-ACE (mostly because it was all black, judge away), with 32GB of Trident Z Neo 3600MHz CL16 RAM.

Since putting this together, despite updating my BIOS with every new revision hoping that the issue would just quietly disappear, I have been having random freezes/crashes in Linux. Initially, about 90% of the time, the issue only appeared after I slept and then woke the system, meaning I could keep my system running for several days and would (almost) never see it crash. If, however, it were to sleep and then wake, it might suddenly freeze up during use the next day. More recently, there have been several occasions where the computer just reboots out of the blue, mostly when I’m not even touching it and it’s just idling.

Here are some things I’ve attempted to do:
1) I’ve run Prime95 tests in Windows overnight with no errors reported (Blend stress tests) so, at least according to these tests, both my CPU and RAM are supposedly fine. I should also mention that while I mostly just use Windows for gaming, random crashes are very infrequent. They have occurred, but I have always just assumed those were classic Windows things.
2) I’ve run memtest and stress in Linux for a few hours each. After 5 loops of memtest no errors are reported, and running stress does not crash the system either.

The only modifications I make to the default BIOS: 1) PBO set to Enabled (from auto) 2)

With Prime95 my system sometimes reaches 90 degrees C without crashing, which is far higher than it ever reaches during normal use (if I load it up with my usual workload it can reach 75-80). So I don’t think it is crashing due to heat. Nor, I believe, due to the power supply(?). It doesn’t seem to be a memory issue or a CPU issue from my testing. However, finally I get to the hardware errors that Linux (Mint) is reporting:

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
[Hardware Error]: Machine check events logged
[Hardware Error]: CPU 13: Machine Check: 0 Bank 0: baa0000000010145
[Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
[Hardware Error]: PROCESSOR 2:870f10 TIME 1621245389 SOCKET 0 APIC 3 microcode 8701021
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 0: baa0000000010145
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1621690972 SOCKET 0 APIC 3 microcode 8701021

I am at a complete loss as to what could be causing my issues, so I’m asking for the insight of more experienced builders/linux users. If you could please spare a moment to guess at what might be the cause, and where I could start to isolate the issue I’m having, that would be fantastic.
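
For anyone reading the raw values in the log above, here is a minimal Python sketch that pulls the flags out of an MCA status value and converts the TIME field to a date. It assumes the status bit layout that, as far as I understand it, the Linux kernel uses when decoding AMD family 17h machine checks (bit 61 = UC, bit 46 = CECC, extended error code in bits 21:16), and that TIME is seconds since the Unix epoch. The names STATUS_FLAGS, decode_status and mce_time are made up for this illustration and are not part of any tool.

```python
from datetime import datetime, timezone

# Assumed bit positions in the 64-bit MCA status register (illustrative labels).
STATUS_FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    55: "TCC",
    53: "SyndV",
    46: "CECC",
    45: "UECC",
    44: "Deferred",
    43: "Poison",
}


def decode_status(status: int) -> dict:
    """Return the set flags and error codes found in a raw status value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # The kernel prints CE when the UC bit is clear, UE when it is set.
        "corrected": not ((status >> 61) & 1),
        # Assumption: 6-bit extended error code in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


def mce_time(epoch_seconds: int) -> str:
    """Format the TIME field of an mce line as a UTC timestamp."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    # Raw values copied from the log in this post.
    for raw in ("0x98004000003e0000", "0xbaa0000000010145"):
        print(raw, decode_status(int(raw, 16)))
    for seconds in (1621245389, 1621690972):
        print(seconds, mce_time(seconds))
```

If the layout assumption holds, the first value comes out as a corrected error with extended code 62, which matches what the kernel printed for bank 25. The second value (CPU 13, bank 0) has the UC and PCC bits set, so it is a different kind of event from the corrected ones and is probably the more interesting one to chase.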

go into bios and check to see if hpet (high precision event timer) is enabled.
if it is, leave it. if not, enable it, reboot and see how you go. if it’s disabled then it’s likely a key cause of the crashing.
why?
some processes rely on hardware clocks being enabled in bios.
if they aren’t enabled, it will cause random bsods any time an application makes a call to a clock it can’t find.

as for pbo, it shouldn’t cause any issues with boosting the cpu or throttling it, assuming you have the cooling and you haven’t been fiddling with the power states manually.
also, unless you have a poor psu that can’t handle the super low power states, you shouldn’t see crashing when idling down the system from a heavy workload that’s just completed.
if you do, try a different psu with a better than 80+ rating that’s less than 5 years old.

Thank you very much for your suggestions, I’ll reboot in a bit and check the HPET you mention.

I bought everything new when I pieced this system together, so the power supply is a Corsair RM850X 80+ Gold, so unless it’s faulty that shouldn’t be a problem.

I haven’t messed with any of the additional power settings, purely because I don’t know anything about them and I’m fine with stock speeds using PBO. I was sometimes setting a -0.05v offset to reduce thermals in the system so that they never peak above 80, but these crashes happen when I leave everything else at stock while enabling the RAM D.O.C.P. profile. PBO left at default Auto setting.

Okay so on my ASUS motherboard there doesn’t appear to be a setting related to HPET.

As an additional note, not sure if it’s unrelated, I get

usb 1-4: device descriptor read/64, error -71

This is with just a mouse, keyboard and audio DAC connected. Even when I connect my USB hub everything works (seemingly) correctly, but just in case this could be related, these are the only errors I see being reported when booting Linux Mint.

yeah, seems they didn’t implement the hpet switch in bios on that board.
much to the chagrin of some other asus users.

in windows you can enable it via the cmd prompt.
as for linux, i have no idea if there is a software switch to enable it.
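
On the Linux side, one thing that can at least be checked from software is which clocksource the kernel is currently using and whether hpet is among the ones it offers. A minimal Python sketch follows, assuming the usual sysfs location /sys/devices/system/clocksource/clocksource0; the function name read_clocksources is made up for this example. It only reads the files, it does not enable or switch anything.

```python
from pathlib import Path

# Usual sysfs directory for the kernel's clocksource selection (assumed present).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR) -> dict:
    """Report the active clocksource and whether hpet is offered by the kernel."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return {
        "current": current,
        "available": available,
        "hpet_available": "hpet" in available,
    }


if __name__ == "__main__":
    try:
        print(read_clocksources())
    except OSError as exc:
        # Not on Linux, or sysfs is laid out differently on this machine.
        print(f"could not read clocksource info: {exc}")
```

If hpet shows up in the available list, the firmware is exposing it even without a bios switch. Seeing tsc as the current clocksource would not by itself mean anything is wrong.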

as for the usb hub, are you sure it’s not drawing too much power from your usb connection? if you can, split off a couple of your devices onto another connector that isn’t shared on the back of the motherboard… (just for testing’s sake)

The only things I have connected to it at the moment are a Bluetooth transceiver and my old Logitech webcam. I will try to boot in a minute with everything unplugged, but it’s complaining about all four ports 1-4 apparently. I’m starting to think that some of this could be from a bad installation, but the sporadic crashes have been harder to pin down. [Edit: After unplugging everything and rebooting with devices reconnected one-by-one, the error has cleared. It’s still likely to be my USB hub, maybe it’s a little flaky.]

I googled for HPET on my ASUS board as well, and it seems like it’s typically auto-enabled by default and that in Windows some prefer to disable it for gaming, so presumably that isn’t my issue? I will see if I can double check this in Windows later.

Thanks again for any suggestions on this. I’m planning to potentially swap out the motherboard and PSU this week, downsize to an ITX case, and while it would be unsatisfying, a part of me hopes that it miraculously fixes my issue (after perhaps doing a clean install of linux).

Darkfeign,

What distro and kernel version are you running? I’d be surprised if a default configuration is properly reporting AMD MCE errors, as most only come prepared for Intel’s MCE.

Best regards,

vhns

So I suspended linux last night and woke it this morning, wanting to see if I could induce some more of these weird issues I’ve been seeing. I haven’t suspended the system in a long time, normally I just shut it down now because it always used to be buggy.

Seems I was correct.


Message from syslogd@meshify-workstation at May 25 10:05:21 …
kernel:[27184.375227] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:05:49 …
kernel:[27212.376669] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:06:29 …
kernel:[27252.377729] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:06:57 …
kernel:[27280.378048] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:07:25 …
kernel:[27308.378159] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:07:53 …
kernel:[27336.378134] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:08:21 …
kernel:[27364.378023] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u65:0:21125]

Message from syslogd@meshify-workstation at May 25 10:08:49 …
kernel:[27392.377857] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u65:0:21125]

Granted, this seemed to be the result of me trying to reconnect a bluetooth speaker, but this is the kind of funky stuff that seems to happen after suspending the system.
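
To check whether the lockups always hit the same CPU and the same worker, the watchdog lines can be summarized with a few lines of Python. This is only a sketch that assumes the message format shown above; LOCKUP_RE and summarize_lockups are names invented for this example.

```python
import re

# Matches lines like:
# kernel:[27184.375227] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def summarize_lockups(log_text: str) -> dict:
    """Count soft lockup reports and list the CPUs and tasks involved."""
    events = [m.groupdict() for m in LOCKUP_RE.finditer(log_text)]
    uptimes = [float(e["uptime"]) for e in events]
    return {
        "count": len(events),
        "cpus": sorted({int(e["cpu"]) for e in events}),
        "tasks": sorted({e["task"] for e in events}),
        "max_stuck_seconds": max((int(e["secs"]) for e in events), default=0),
        # Kernel uptime between the first and last report, in seconds.
        "span_seconds": uptimes[-1] - uptimes[0] if uptimes else 0.0,
    }


if __name__ == "__main__":
    # First and last messages copied from the output above.
    sample = (
        "kernel:[27184.375227] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u65:0:21125]\n"
        "kernel:[27392.377857] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u65:0:21125]\n"
    )
    print(summarize_lockups(sample))
```

In the output above every report names CPU#2 and the same kworker (kworker/u65:0, PID 21125), so it looks like one stuck worker being reported repeatedly rather than several separate lockups.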

Ah is this a thing? I’m running Linux Mint 20.1, Linux Kernel 5.4.0-73-generic.
