Ryzen crashing while idle

Here is my grub config. Let me know if anything seems out of the ordinary.

GRUB_DEFAULT=saved
GRUB_TIMEOUT=10
GRUB_TIMEOUT_STYLE=menu
GRUB_DISTRIBUTOR="Manjaro"
GRUB_CMDLINE_LINUX_DEFAULT="quiet apparmor=1 security=apparmor resume=UUID=c8e8be7f-12f8-4260-b15b-8ee0d6203cf8 udev.log_priority=3"
GRUB_CMDLINE_LINUX="processor.max_cstate=5 rcu_nocbs=0-11 quiet splash"

# If you want to enable the save default function, uncomment the following
# line, and set GRUB_DEFAULT to saved.
GRUB_SAVEDEFAULT=true

# Preload both GPT and MBR modules so that they are not missed
GRUB_PRELOAD_MODULES="part_gpt part_msdos"

# Uncomment to enable booting from LUKS encrypted devices
#GRUB_ENABLE_CRYPTODISK=y

# Uncomment to use basic console
GRUB_TERMINAL_INPUT=console

# Uncomment to disable graphical terminal
#GRUB_TERMINAL_OUTPUT=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command 'videoinfo'
GRUB_GFXMODE=auto

# Uncomment to allow the kernel use the same resolution used by grub
GRUB_GFXPAYLOAD_LINUX=keep

# Uncomment if you want GRUB to pass to the Linux kernel the old parameter
# format "root=/dev/xxx" instead of "root=/dev/disk/by-uuid/xxx"
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
GRUB_DISABLE_RECOVERY=true

# Uncomment and set to the desired menu colors.  Used by normal and wallpaper
# modes only.  Entries specified as foreground/background.
GRUB_COLOR_NORMAL="light-gray/black"
GRUB_COLOR_HIGHLIGHT="green/black"

# Uncomment one of them for the gfx desired, a image background or a gfxtheme
#GRUB_BACKGROUND="/usr/share/grub/background.png"
GRUB_THEME="/usr/share/grub/themes/manjaro/theme.txt"

# Uncomment to get a beep at GRUB start
#GRUB_INIT_TUNE="480 440 1"

That is correct, did you run update-grub?
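Since update-grub only regenerates the boot menu, one quick sanity check after rebooting is to compare the running kernel's command line with what was put in the GRUB config. Here is a minimal sketch of that check, assuming the two parameters from the config above are the ones of interest; the helper names `parse_cmdline` and `missing_params` are made up for illustration, and quoted values containing spaces are not handled.

```python
from pathlib import Path
from typing import Dict, List, Optional


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in text.split():
        # Split on the first '=' only, so resume=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(expected: List[str], cmdline_text: str) -> List[str]:
    """Return the expected parameters that are absent or set to another value."""
    active = parse_cmdline(cmdline_text)
    wanted = parse_cmdline(" ".join(expected))
    missing = []
    for name, value in wanted.items():
        if name not in active or active[name] != value:
            missing.append(name if value is None else f"{name}={value}")
    return missing


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")  # command line of the running kernel
    if cmdline_path.exists():
        expected = ["processor.max_cstate=5", "rcu_nocbs=0-11"]
        gaps = missing_params(expected, cmdline_path.read_text())
        print(gaps if gaps else "all expected parameters are active")
```

If the script prints a non-empty list, the regenerated config was probably not the one that got booted.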

Something seems very wrong with your power delivery then. It might be time to RMA your motherboard.

I did. It hasn’t crashed yet, but it hasn’t been very long.

@FurryJackman Will MSI accept an RMA request if I don’t have the socket cover?

I see TSC in the kernel messages, so I’d assume your TSC clock source is unstable and the issues arise when the CPU changes frequency steps.

While I see others have posted about limiting C-states, you could also use make menuconfig and compile the kernel without any CPU power management support.

This should make the CPU run at a fixed clock speed. Make sure to also turn off anything in the BIOS that manipulates frequency.

Latency is also a bit better with all the frequency scaling disabled.

People with dual-socket systems should know what I’m talking about; latency is a big problem for us in KVM environments.

When trying to use the TSC clock source on dual-socket systems, it’s almost a must to disable all the power management stuff anyway.
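To see which clock source the kernel actually ended up with, the standard sysfs files can be read directly. This is only a rough sketch; the function name `read_clocksources` is made up here, and the path is the usual one on x86 Linux but is still an assumption about your system.

```python
from pathlib import Path

# Usual sysfs location for the system clock source (assumed to be present).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported by sysfs."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__" and CLOCKSOURCE_DIR.is_dir():
    current, available = read_clocksources()
    print(f"current: {current}; available: {', '.join(available)}")
    if current != "tsc":
        # Not proof of a problem, just a hint to go look at the kernel log.
        print("TSC is not the active clock source; check dmesg for 'tsc' messages.")
```

If the current source is something other than tsc while tsc is listed as available, the kernel log usually says why it was not chosen.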

Best Regards,

@Gandhi Did you get anywhere?
I’m having similar issues with my 5900X on an Asus Crosshair VIII Hero. XMP turned off, no PBO.

The system freezes, becomes unresponsive and then reboots. Sometimes the sound also “freezes”. It seems to happen when a program or script is starting or stopping. It can happen after 5 minutes of running or after ~2 days.

I tried the PSU idle voltage setting and it’s still crashing.
I haven’t tried the CPU core voltage or the CPU NB/SoC voltage offset yet.

Besides the MCE hardware error I’m also getting:

>     Nov 26 14:02:33 mymachine kernel: __common_interrupt: 10.55 No irq handler for vector
> 
>     Nov 26 14:02:33 mymachine kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
> 
>     Nov 26 14:02:34 mymachine kernel: EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)

Additional things I’ve tried:

  • BIOS versions: 2402, 2502, 2702
  • Disable XMP and PBO (XMP with these sticks was no problem on a 7700K & Asus Strix 270E Gaming)
  • Latest linux-lts kernel (5.4)
  • Different PSU (Be Quiet! Dark Power 12 1500 W)

What’s kind of weird is that the benchmarks seem to run stable:

  • “stress-ng --cpu 6 --vm 6 --verify 1 --vm-bytes 80%” ran without issues for 20min
  • “phoronix-test-suite” completes: Appleseed, browser suite, x264, x265, Embree, CppPerformanceBenchmarks
  • “memtest86+” completed one pass without errors; I didn’t have the time to run more yet

I was thinking of trying the following:

  • Try Windows
  • Try kernel 5.10
  • Try a different GPU (but only have a super old Radeon and a 1080TI)
  • Try different RAM sticks (have a pair of 8GB)
  • Try a CPU NB/SoC Voltage offset
  • Try setting the core voltage to normal
  • Try to do something about the C states? But zenstates only works for Zen 2 so far.

Sadly, no. I’ve tried Windows 10 and Linux kernel 5.10, but I still get crashes on both. I’ve also tried new RAM with no success. Tried a voltage offset, no luck. Disabled the C-states both in the BIOS and using Zenstates.py, again no luck. I haven’t tried a new GPU though, and I don’t think I’ve tried setting the core voltage to normal. However, at this point I’m beginning to think there is something very wrong with either the CPU or the motherboard. Fortunately, I have a friend with a Ryzen 5 3600 that I can trade with for a bit to see if the problem persists. Whichever it turns out to be, I’ll probably have to RMA. I’ll let you guys know how it goes.


Ok. I also tried Windows 10 now :stuck_out_tongue:
Also crashes…
Trying less RAM now. And then I’ll try the 2nd set of RAM sticks…
And then I’ll RMA it. It’s a bundle anyway.

How finicky is Ryzen with RAM?
I.e., should any set of RAM work at 2133 MHz, even if it’s not on the list of supported RAM?

Yeah, 2133 shouldn’t be a problem at all, especially not for a 5000 series. Since you already ran memtest, the problem is most likely going to be either the mainboard or the CPU.

I’m having the same issue:

EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)

joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: Corrected error, no action required.
joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: CPU:1 (19:21:0) MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400004020136
joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: Error Addr: 0x0000000406c70650
joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x000111081a44352c
joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: L2 Cache Ext. Error Code: 2, L2M Data Array ECC Error.
joulu 06 03:27:04 TreeOfLight kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD

[    0.028616] Booting paravirtualized kernel on bare hardware
[    0.528204] mce: [Hardware Error]: Machine check events logged
[    0.528205] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 2: bea0200004020152
[    0.528207] mce: [Hardware Error]: TSC 0 ADDR 2ddea9c50 MISC d012000100000000 SYND 167101d442129 IPID 200b000000000 
[    0.528209] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1607216150 SOCKET 0 APIC 2 microcode a201009
[    3.818545] systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
[ 1873.059558] mce: [Hardware Error]: Machine check events logged
[ 1873.059561] [Hardware Error]: Corrected error, no action required.
[ 1873.059565] [Hardware Error]: CPU:1 (19:21:0) MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400004020136
[ 1873.059569] [Hardware Error]: Error Addr: 0x0000000406c70650
[ 1873.059570] [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x000111081a44352c
[ 1873.059572] [Hardware Error]: L2 Cache Ext. Error Code: 2, L2M Data Array ECC Error.
[ 1873.059574] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
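For anyone comparing their own MCE output, the flags in the `MC2_STATUS[...]` line can be reproduced from the raw status value. Below is a small sketch that decodes a 64-bit MCA status word using the bit positions I understand the Linux kernel to use on AMD CPUs; the table and the function name `decode_mca_status` are my own illustration, not something taken from the logs.

```python
# Bit positions of the MCA status flags (assumed from the Linux kernel's
# MCE definitions for AMD CPUs; double-check against your kernel version).
STATUS_BITS = {
    63: "Val",       # register holds a valid error
    62: "Over",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting was enabled
    59: "MiscV",     # MISC register is valid
    58: "AddrV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register is valid
    46: "CECC",      # corrected ECC error
    45: "UECC",      # uncorrected ECC error
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_mca_status(status: int) -> list:
    """Return the names of the flags set in an MCA status value, high bit first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # The runtime value and the boot-time value from the logs above.
    for raw in (0x9C20400004020136, 0xBEA0200004020152):
        print(f"{raw:#018x}: {'|'.join(decode_mca_status(raw))}")
```

If that bit layout is right, the runtime value decodes to a corrected ECC error, matching the kernel's own line, while the value logged at 0.528205 during boot has UC, PCC and UECC set, which would point to an uncorrected error left over from before the reboot.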

Ryzen 5600X, Asus ROG Strix X570-E, BIOS “Version 2816 Beta Version”

Ubuntu 20.10
Linux TreeOfLight 5.8.0-31-generic #33-Ubuntu SMP Mon Nov 23 18:44:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Sensors don’t work :frowning: so I have no idea what my CPU voltage is. It seems to be around 1.34 V in the BIOS/UEFI, which doesn’t seem normal at all. When I go to set an offset it shows 1.1 V as normal, but the BIOS voltage reading is still 1.34 V.

(Dunno if it’s related, but I’m getting USB slow errors too.)
/usr/libexec/gdm-x-session[2483]: (EE) event2 - ASUS ROG GLADIUS: client bug: event processing lagging behind by 11ms, your system is too slow

/usr/libexec/gdm-x-session[2483]: (EE) event6 - Logitech USB Keyboard: client bug: event processing lagging behind by 30ms, your system is too slow

Okay, I’m seeing this a lot with the recent BIOSes. Downgrade your BIOS and turn off C-states until a proper fix is in place. It seems the hardware errors are related to deep sleep states, and they affect USB 2.0 ports as well:

USB 2.0 weirdness and Hardware Errors are currently common with the recent batch of Beta BIOSes.

For Gigabyte, use B550 series F11J BIOS and for X570 use F31K.

@JoneK
I’m only running it with 2 sticks of RAM now and it’s gotten more stable…
But I got the same mce hardware errors as you once!
so wuhuu ?? :thinking:

I sent a ticket to AMD. No response yet.
I saw that the Hero VIII non-wifi got a new BIOS today…
Maybe that will make it better…

@FurryJackman
Does C-state disabling in the BIOS affect Linux or do you need ZenStates? AFAIK Zenstates unfortunately only works for Zen 2 so far.
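One way to see whether a BIOS C-state setting reached Linux is to look at the idle states the kernel exposes through cpuidle in sysfs. The following is a hedged sketch that only reads those files; the function name `list_idle_states` is hypothetical, and it assumes the usual cpuidle layout under cpu0.

```python
from pathlib import Path

# Usual cpuidle location for the first CPU (assumed; needs a cpuidle driver).
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_idle_states(base: Path = CPUIDLE_DIR):
    """Return (directory, name, disabled, usage) for each idle state of one CPU."""
    states = []
    # Sort numerically so state10 comes after state2.
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in dirs:
        name = (state_dir / "name").read_text().strip()
        # Any non-zero value in 'disable' means the state is switched off.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        # 'usage' counts how often this CPU was asked to enter the state.
        usage = int((state_dir / "usage").read_text())
        states.append((state_dir.name, name, disabled, usage))
    return states


if __name__ == "__main__" and CPUIDLE_DIR.is_dir():
    for directory, name, disabled, usage in list_idle_states():
        flag = "disabled" if disabled else "enabled"
        print(f"{directory}: {name:<10} {flag:<8} entered {usage} times")
```

If deeper states still show up as enabled with a growing usage count after changing the BIOS option, the setting probably did not take effect for Linux.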

Maybe I’ll try using only 3.0 ports…

Zenstates is supposed to work unless you need a new kernel to expose more MSRs.

On my 5900X I needed to set VDDG and VDDP, as I mentioned above, before it was stable. Not a single problem since changing that shortly after release.
Mobo is an Aorus X570 Pro.

Are you saying that, apart from changing VDDG and VDDP, no other changes have been made and the system no longer runs into MCE errors and reboots?

I want to jump on the thread too. I built a system 9 days ago and every single day I’m having a random reboot.

I’m running a 5900X on an MSI X570 Tomahawk with a G.Skill 32 GB (2x16) kit at 3600 MHz CL16.

My current summary is:

  • If the system goes from Off to On (power on), it randomly reboots
  • If the system goes from Sleep to On (wake from sleep), it randomly reboots, normally within 5 minutes
  • Once the reboot has happened, it never happens again, no matter the load or uptime, until the next power-on or wake from sleep

Playing around with BIOS settings had no effect; I’ve tried power settings, disabling PBO, IOMMU, etc.

As per one of the suggestions in a lengthy BZ discussion, I’ve tried disabling SMT and for now it’s running stable, but I didn’t have a lot of time today to play around.

So just a quick update from my side.
I had to set a tiny SoC voltage offset (I think 0.0025V). I also set the idle to “typical”.
Since then my computer is stable…
Don’t even need to disable C6 states.

I will test XMP and PBO in the next days.

Yep, I adjusted those voltages (and RAM, but that is separate) and set the RAM to 3733 with tight timings. It’s been rock solid ever since I did that, on both Windows and Linux. Sometimes it was left compiling for hours on 24 threads with no problems whatsoever.

I RMA’ed my 3700X and received a new one just before Christmas. While things looked promising initially, the computer crashed again shortly after logging into Windows. I haven’t had any problems with Linux yet. MSI has released a beta BIOS update for my motherboard. I may try that in the meantime.

Nvm, the beta BIOS is far too buggy right now.

I do have to wonder about my power supply, though. I’ve only had it for a few years, but it is an older model. It’s an Antec HCP 1000W Platinum. Would it being old matter? Also, would it matter if the CPU cable is plugged into the 12V1 slot as opposed to the 12V2, 12V3, or 12V4? I really know next to nothing about power supplies.

Almost certainly the PSU. I had an old Antec with the racing stripe and it was just unusable even with first-gen Ryzen.