Hi there,
I built my current rig around October 2019, with a Ryzen 3900X, an ASUS Pro WS X570-ACE (mostly because it was all black, judge away), and 32GB of Trident Z Neo 3600MHz CL16 RAM.
Since putting this together, I have been having random freezes/crashes in Linux, despite updating my BIOS with every new revision in the hope that the issue would just quietly disappear. Initially, 90% of the time, the issue only appeared after I had put the system to sleep and woken it again, meaning I could keep my system running for several days and would (almost) never see it crash. If, however, it had slept and then woken, it might suddenly freeze up during use the next day. More recently, there have been several occasions where the computer just reboots out of the blue - mostly when I’m not even touching it and it’s just idling.
Here are some things I’ve attempted: 1) I’ve run Prime95 (Blend stress test) in Windows overnight with no errors reported so, at least according to these tests, both my CPU and RAM are supposedly fine. I should also mention that while I mostly just use Windows for gaming, random crashes there are very infrequent. They have occurred, but I have always just assumed those were classic Windows things. 2) I’ve run memtest and stress in Linux for a few hours each. After 5 loops of memtest no errors were reported, and running stress did not crash the system either.
The only modifications I make to the default BIOS: 1) PBO set to Enabled (from auto) 2)
With Prime95 my system sometimes reaches 90 degrees C, which is far higher than it ever reaches during normal use (if I load it up with my usual workload it can reach 75-80). So I don’t think it is crashing due to heat, nor, I believe, due to the power supply(?). From my testing it doesn’t seem to be a memory issue or a CPU issue either. However, here, finally, are the hardware errors that Linux (Mint) is reporting:
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
[Hardware Error]: Machine check events logged
[Hardware Error]: CPU 13: Machine Check: 0 Bank 0: baa0000000010145
[Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
[Hardware Error]: PROCESSOR 2:870f10 TIME 1621245389 SOCKET 0 APIC 3 microcode 8701021
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 0: baa0000000010145
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1621690972 SOCKET 0 APIC 3 microcode 8701021
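For anyone who wants to read the raw values rather than take the log text on faith, here is a minimal sketch in Python that unpacks the two status words and the TIME fields from the output above. The flag bit positions are an assumption taken from the Linux kernel's MCI_STATUS_* definitions and AMD's public MCA documentation, and the names `FLAG_BITS`, `decode_status` and `mce_time` are made up for this illustration, so treat it as a rough aid rather than an authoritative decoder.

```python
from datetime import datetime, timezone

# ASSUMPTION: bit positions follow the Linux kernel's MCI_STATUS_* masks
# and AMD's public MCA register layout; they are not taken from this log.
FLAG_BITS = {
    63: "Val",       # register holds a valid error
    62: "Overflow",  # an earlier error was overwritten
    61: "UC",        # uncorrected error (clear means corrected)
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register is valid
    58: "AddrV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # SYND register is valid
    46: "CECC",      # corrected ECC error
    45: "UECC",      # uncorrected ECC error
    44: "Deferred",
    43: "Poison",
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status word into flag names and error codes."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def mce_time(epoch_seconds: int) -> str:
    """Convert the TIME field (Unix epoch seconds) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    # Status values and timestamps copied from the log above.
    print(decode_status(0x98004000003E0000))
    print(decode_status(0xBAA0000000010145))
    print(mce_time(1621245389))  # 2021-05-17T09:56:29+00:00
    print(mce_time(1621690972))  # 2021-05-22T13:42:52+00:00
```

Under those assumptions, the repeated MC25 entries decode to Val, En, MiscV and CECC with extended error code 62, which matches what the kernel already printed. The Bank 0 value reported on CPU 13 decodes to Val, UC, En, MiscV, PCC, TCC and SyndV, so it was flagged as uncorrected rather than corrected.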
I am at a complete loss as to what could be causing my issues, so I’m asking for the insight of more experienced builders/Linux users. If you could please spare a moment to guess at what might be the cause, and where I could start isolating the issue, that would be fantastic.