Please help me debug a "hardware" problem under Linux

vlycop · February 8, 2023, 6:41pm

Hi,
I’m having a lot of issues with a fairly new PC under Linux.
It looks like it works OK under Windows, but to be fair, I use Windows so rarely that it could also be random luck.

So I have a Ryzen 9 5900X on an ASRock X570 Taichi Razer Edition.
I run Fedora 37, but it has never been stable.

Randomly, the computer will freeze, power off and reboot.
I could be on Discord, YouTube or playing a game; it’s never really the same thing.

At reboot I get a message that says:

The kernel log indicates that hardware errors were detected.

It recommends using mcelog to get more information, but mcelog is installed and refuses to start, asking for a kernel module that is already loaded:

➜  ~ sudo mcelog start
mcelog: ERROR: AMD Processor family 25: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
CPU is unsupported
➜  ~ sudo lsmod | grep amd     
edac_mce_amd           57344  0
kvm_amd               172032  0
kvm                  1126400  1 kvm_amd
amdgpu              10661888  11
drm_ttm_helper         16384  1 amdgpu
ttm                    94208  2 amdgpu,drm_ttm_helper
iommu_v2               24576  1 amdgpu
video                  65536  1 amdgpu
gpu_sched              49152  1 amdgpu
drm_buddy              20480  1 amdgpu
ccp                   122880  1 kvm_amd
drm_display_helper    208896  1 amdgpu

To me, the first thing seems to be getting this running, but I’m open to other options.

Can you help?

jode · February 8, 2023, 6:54pm

In a terminal, run:

journalctl -k | grep -i error

That should give you the list of errors.

vlycop · February 8, 2023, 7:47pm

Sadly there is almost nothing, since a new desktop OS doesn’t keep those on disk:

[sudo] password for vaarlion: 
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: Machine check events logged
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 1: bc800800060c0859
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: TSC 0 ADDR 34e387280 MISC d012000000000000 IPID 100b000000000 
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1675884195 SOCKET 0 APIC 18 microcode a201016
févr. 08 20:23:16 tesla-v2 kernel: RAS: Correctable Errors collector initialized.
févr. 08 20:23:16 tesla-v2 kernel: be2net 0000:03:00.0: PCIe error reporting enabled
févr. 08 20:23:17 tesla-v2 kernel: be2net 0000:03:00.1: PCIe error reporting enabled
févr. 08 20:23:17 tesla-v2 kernel: usbhid: probe of 5-3.4.4:1.1 failed with error -32
févr. 08 20:23:17 tesla-v2 kernel: usbhid: probe of 5-3.4.4:1.2 failed with error -32
févr. 08 20:24:13 tesla-v2 kernel: audit: type=1338 audit(1675884253.262:29): module=crypt op=ctr ppid=1 pid=706 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="systemd-cryptse" exe="/usr/lib/systemd/systemd-cryptsetup" subj=kernel dev=253:0 error_msg='success' res=1

jode · February 8, 2023, 7:59pm

My first theory is insufficient cooling. If you have not done so yet, I recommend installing the sensors package:

sudo dnf install lm_sensors

There are a bunch of tools that help with graphical display of sensor data; I mostly use the terminal, like so:

watch -n1 sensors

Look at the temperature sensors, starting with the CPU temperatures, then the motherboard and chipset sensors.
I have also had lockups from overheating NVMe sticks in the past.

vlycop · February 8, 2023, 8:24pm

My temperatures seem to be well under control:
60°C at load; I have an NH-D15 on it and the paste is very new.
The NVMe is at 41°C as well.

AbsolutelyFree · February 8, 2023, 8:31pm

Have you run memtest86 on it? If not, do a full round of it.
Do you have a different PSU you can put in that system?

vlycop · February 8, 2023, 8:56pm

Memtest: yup, see A years and a half after losing a luks partion, it broke again - #14 by risk

PSU: kind of? I don’t have the cable length to do it, so I’d have to disassemble everything.
My current power supply is a Seasonic Prime 1000W… I would trust it, especially since the cut-outs don’t happen only in triple-A games, and don’t happen in VR games under Windows.

vic · February 9, 2023, 5:36pm

Your 5900X’s L2 cache was corrupted when your system rebooted/crashed. This is a very rare hardware issue IF you were running everything at defaults or as recommended by AMD. If that’s the case, I suggest you RMA your 5900X, because AMD does sell marginal-quality silicon from time to time.

However, I suspect it was more likely caused by an overclock (either by you or by someone else in your chain of component suppliers). If you overclocked it yourself, you’re pretty much on your own. AMD officially doesn’t offer warranty once PBO etc. is turned on.

vlycop · February 9, 2023, 5:55pm

Hi!
May I ask what makes you think that? I’m still lacking any log of the actual issue, since I can’t get mcelog to work.

PBO was enabled (but left as is) by the company that sold me the CPU and the board: the board was an open box and they couldn’t ship the RAM until after the board’s 15-day warranty expired, so I had them test it with their own RAM.

vic · February 9, 2023, 6:25pm

The hardware error was shown in your previous post:

vlycop:

févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 1: bc800800060c0859
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: TSC 0 ADDR 34e387280 MISC d012000000000000 IPID 100b000000000 
févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1675884195 SOCKET 0 APIC 18 microcode a2

If you check the Windows event log, you may find equivalent errors in Microsoft-speak, something like WHEA xxx, which might give you a more human-readable interpretation. I’m not very sure about this, since these days I spend little time in Windows.

I suggest you restore the BIOS to defaults, without O/C. If the problem persists, then exchange or refund the CPU/MB combo.

vlycop · February 9, 2023, 7:55pm

The MB would be a loss: I had a 15-day window, 9 months ago. The CPU I’ll check.

If someone who knows Windows can help me filter the Event Viewer to see this, I would find it very useful, as I haven’t touched it since 2013.

vlycop · February 21, 2023, 6:47pm

After more than a week, I can’t reproduce this issue under Windows, while it’s very regular under Linux…
Having neither a Linux log nor a Windows event log means I have nothing to take to support.

This is what shows up on Linux after a crash

I really need every pointer you can give me.

vic · February 22, 2023, 2:24am

Try Linux kernel 6.1

If you’ve O/C’ed, revert to defaults. Or reduce the memory & IF clocks and try again.

Windows and Linux initialise Ryzen differently in some aspects. Windows is more ‘friendly’ to an O/C’ed system.

Good luck

vlycop · February 22, 2023, 4:36pm

I cleared the CMOS last time, so no OC whatsoever, and I’m already running 6.1.11.

SquirrelMan5k · February 22, 2023, 5:02pm

mce/mcelog https://mcelog.org/ indicates that your processor is bad, specifically the cache on it. You might try updating/downgrading the BIOS to see whether a different version still shows that error. Or you might need a new processor.

xzpfzxds · February 22, 2023, 5:22pm

The status code is: fc00 0800 0101 0135

It is output from the kernel by this code:

https://github.com/torvalds/linux/blob/v6.2/arch/x86/kernel/cpu/mce/core.c#L167-L170

    pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
             m->extcpu,
             (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
             m->mcgstatus, m->bank, m->status);
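As a rough illustration of how that format string maps onto the log lines posted earlier in the thread, here is a minimal Python sketch that pulls the fields back out of one journal line. The regular expression and the `parse_mce_line` helper are hypothetical, written only to mirror the format string above; they are not part of any existing tool.

```python
import re

# Mirrors the kernel format string quoted above:
#   "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx"
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check(?P<exception> Exception)?: "
    r"(?P<mcgstatus>[0-9a-fA-F]+) Bank (?P<bank>\d+): "
    r"(?P<status>[0-9a-fA-F]{16})"
)


def parse_mce_line(line):
    """Return the fields of one 'Machine Check' log line, or None if no match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        # " Exception" is only printed when MCG_STATUS_MCIP is set.
        "exception": m.group("exception") is not None,
        "mcgstatus": int(m.group("mcgstatus"), 16),
        "bank": int(m.group("bank")),
        "status": int(m.group("status"), 16),
    }


if __name__ == "__main__":
    # Line copied from the journalctl output posted on February 8.
    line = ("févr. 08 20:23:16 tesla-v2 kernel: mce: [Hardware Error]: "
            "CPU 10: Machine Check: 0 Bank 1: bc800800060c0859")
    fields = parse_mce_line(line)
    print(fields["cpu"], fields["bank"], hex(fields["status"]))
```

The 64-bit `status` value is the part that the rest of this post picks apart.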

For AMD CPUs, the last 2 bytes indicate the type:

https://github.com/mchehab/rasdaemon/blob/master/mce-amd.c#L21

    #include <stdio.h>
    #include <string.h>

    #include "ras-mce-handler.h"

    /* Error Code Types */
    #define TLB_ERROR(x)                    (((x) & 0xFFF0) == 0x0010)
    #define MEM_ERROR(x)                    (((x) & 0xFF00) == 0x0100)
    #define BUS_ERROR(x)                    (((x) & 0xF800) == 0x0800)
    #define INT_ERROR(x)                    (((x) & 0xF4FF) == 0x0400)

    /* Error code: transaction type (TT) */
    static char *transaction[] = {
        "instruction", "data", "generic", "reserved"
    };
    /* Error codes: cache level (LL) */
    static char *cachelevel[] = {
        "reserved", "L1", "L2", "L3/generic"
    };

0x0135 & 0xFF00 == 0x0100, so this is a memory error.
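The same mask arithmetic can be sketched in Python. The function below only re-applies the four masks from the rasdaemon excerpt to the low 16 bits of a status word; the order in which the masks are tried is an assumption on my part, and `classify_error_code` is a hypothetical name, not a rasdaemon function.

```python
def classify_error_code(status):
    """Classify the low 16 bits of an MCA status word with the quoted masks."""
    code = status & 0xFFFF  # "the last 2 bytes"
    # Check order is an assumption; the masks are copied from the excerpt.
    if (code & 0xFFF0) == 0x0010:
        return "TLB_ERROR"
    if (code & 0xFF00) == 0x0100:
        return "MEM_ERROR"
    if (code & 0xF800) == 0x0800:
        return "BUS_ERROR"
    if (code & 0xF4FF) == 0x0400:
        return "INT_ERROR"
    return "unknown"


if __name__ == "__main__":
    # The two status words quoted in this thread.
    for status in (0xFC00080001010135, 0xBC800800060C0859):
        print(f"{status:016x} -> {classify_error_code(status)}")
```

Run on the two status words from this thread, it reports MEM_ERROR for fc00 0800 0101 0135 and BUS_ERROR for bc80 0800 060c 0859.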

The ADDR field is the failing address, but you’d need to know the bank organisation, interleave settings and channel layout to determine which module/chip.

The earlier message from Feb 8 had a status of bc80 0800 060c 0859; that’s a BUS_ERROR, and I have no idea what would cause that. I’d try to find a stress test that can reproduce it as reliably and quickly as possible. Try the prime95/mprime stress tests. Then try things like reseating the CPU, running a single memory module, or backing off the memory timings.
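The top bits of the same status word carry generic flags that are worth reading before deciding how serious an event is. A minimal sketch, assuming only the architectural x86 MCi_STATUS flag positions (the same ones the Linux kernel names `MCI_STATUS_VAL`, `MCI_STATUS_UC` and so on); AMD defines further bits that are deliberately not decoded here, and `decode_flags` is a hypothetical helper.

```python
# Bit positions of the architectural MCi_STATUS flags.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_flags(status):
    """Return the names of the flags set in a 64-bit status word."""
    return [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    # The two status words quoted in this thread.
    for status in (0xBC800800060C0859, 0xFC00080001010135):
        print(f"{status:016x}: {' '.join(decode_flags(status))}")
```

Both status words in this thread have UC set, meaning the hardware reported the error as uncorrected, and ADDRV set, so the ADDR value logged next to them is meaningful. The later one also has OVER set, so at least one earlier error in that bank was overwritten before it could be read.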

If you have ECC memory, make sure ECC mode is enabled; if it is a memory timing/stability issue, you should see ECC corrections happening.
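One way to watch for those corrections on Linux is the EDAC counters in sysfs. The sketch below assumes an EDAC memory-controller driver is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/mcN/`; on a machine without ECC or without such a driver the directory is typically absent and the function just returns an empty dict. `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": n, "ue": n}} for each mcN directory under root."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC memory controller exposed
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text()),
                "ue": int((mc / "ue_count").read_text()),
            }
        except (OSError, ValueError):
            continue  # skip controllers with missing or unreadable counters
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected count that climbs during a stress test would point towards memory timing or stability rather than the CPU.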

vlycop · February 23, 2023, 8:44pm

I’ve got another one of those. It’s almost always when I have some media (Discord or YouTube) and a game going on…

I’ve re-seated my CPU; fingers crossed I don’t have to buy a new CPU just to test.

vlycop · February 25, 2023, 5:27pm

And one more

Would you recommend buying a new Ryzen 9 5900X at full cost, hoping it’s not the motherboard?

I thought about buying the cheapest CPU for my motherboard, but I feel like the issue may not reproduce with another CPU.

vic · February 25, 2023, 6:32pm

Is RMA’ing your 5900X one of the options available to you?

AMD does sell marginally okay processors… I have first-hand experience.

I’d opt for an RMA if this option is still available to you.

vlycop · March 1, 2023, 3:31pm

I mean, shouldn’t I first have a proven issue that isn’t OS-based?
I don’t see how I can justify an RMA to support.

Sorry for the delay, I’m now officially unemployed.
Life really doesn’t want to give me a break, but I’ll beat her.