Random freezes on Ryzen in Linux, even if Linux is in a VM

Duplicating my post from Reddit

TL;DR: Four different Ryzen systems (one 1700 and three 2700X) that I have access to sometimes freeze with an "rcu_sched detected stalls" message. The worst part is that I can trigger the freeze by running Xubuntu 18.04 in VirtualBox, and that freezes the host (Windows 10, for example) as well. Since the issue exists on four different systems, I think it is unlikely that a single faulty component is to blame.

To trigger the freeze I compile curl in a loop in tmpfs; it usually happens in less than 2 hours. This is the only way I have been able to trigger the freeze; otherwise the system seems stable in both Windows and Xubuntu. I would be grateful if someone could try to reproduce it, either in a VM or in native Xubuntu 18.04.
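A minimal sketch of such a build loop in Python, assuming a curl source tree is already unpacked on a tmpfs mount. The function names, the job count, and the exact build commands (./configure, make, make clean) are assumptions for illustration, not the script actually used for the tests described here.

```python
import subprocess
import time


def build_commands(jobs):
    """Commands for one build-and-clean cycle of an autotools project."""
    return [
        ["./configure"],
        ["make", f"-j{jobs}"],
        ["make", "clean"],  # clean up so the next iteration rebuilds everything
    ]


def run_build_loop(src_dir, jobs=16, max_hours=2.0,
                   runner=subprocess.run, clock=time.monotonic):
    """Rebuild the tree in src_dir until max_hours have passed.

    src_dir is assumed to live on tmpfs (hypothetical example:
    /mnt/ramdisk/curl). Returns the number of completed iterations.
    """
    deadline = clock() + max_hours * 3600
    iterations = 0
    while clock() < deadline:
        for cmd in build_commands(jobs):
            # check=True stops the loop on a failed build step.
            runner(cmd, cwd=src_dir, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        iterations += 1
        print(f"iteration {iterations} finished", flush=True)
    return iterations
```

It could be started with something like run_build_loop("/mnt/ramdisk/curl", jobs=16). If the machine freezes, the last printed iteration number gives a rough idea of how long it survived.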

Disabling SMT seems to help, but that is not exactly a solution. The strangest thing is that updating gcc from 7 to 8 also helps, but I don’t think gcc is the cause of the problem; it probably just happens to trigger the bug. I believe so because it should not matter what kind of user applications run in a VM: crashes in them should not lead to a total system freeze, including the host.

Things I’ve tried and that didn’t help:

Updating BIOS

Obviously, systems are not overclocked (but overclocking, unsurprisingly, didn’t help either)

Changing memory to another kit (anyway, memtest didn’t fail in 12 hours, so that’s something)

Setting memory to 2133MHz (kits are rated for 3000MHz)

Compiling the latest kernel (4.20.x) and adding various kernel boot parameters (idle=nomwait/idle=halt, processor.max_cstate=5, rcu_nocbs=0-15 with a recompiled kernel that supports that option). idle=halt helped a lot, but freezes still happen.

Using zenstates to disable C6

Increasing SoC voltage to 1.1v

Increasing CPU voltage by 0.0125v (don’t want to go any higher because XFR voltages for 2700x are high enough already)

Increasing DRAM voltage to 1.3v

Setting mysterious BIOS parameter to typical current idle

Disabling cores on CPU down to 2 (4 threads total)

Using another, high-end, PSU

Connecting a couple of mechanical HDDs to the PSU, because there were reports that some PSUs can’t handle low loads when the system idles

Setting cpu governor to performance

The Ryzen 1700 was RMA’d earlier because of the segfault bug

The temperatures are fine: I’ve never seen more than 65 °C (Tdie) on the 2700X and 45 °C on the non-overclocked 1700.

Systems:
Ryzen 1700 + 2x8 Corsair 3000 MHz RAM + ASRock X370 Gaming K4 + GTX 1080 + Samsung 960 Evo + a couple of HDDs + Fractal Design Newton R3 800w 80+ Platinum PSU
Ryzen 2700x + 4x16 Corsair 3000MHz RAM + ASRock B450 Pro4 + GT1030 + Samsung 860 Evo + Aerocool KCAS 650w 80+ Gold PSU
2x(Ryzen 2700x + 4x16 Corsair 3000MHz RAM + ASUS X470 Pro + GT1030 + Samsung 860 Evo + Aerocool KCAS 650w 80+ Gold PSU)

If I set you up with IPMI access to a Ryzen 7 2700X system, will you configure it to save me time testing? I can set up Win10 or Linux on the host. From there it will be easier for me to diagnose. I have some “known Ryzen-incompatible” PSUs as well as known compatible systems. If that works, you can ship me parts and I will ship you my working parts to help get to the bottom of it.

You mean you set up a machine and I configure it? I can try. I’ve never had any experience with IPMI, but it can’t be that hard, can it? An Ubuntu 18.04 host should be OK.

As for shipping parts, that’s not likely to happen: 1) I’m in Russia, and it would take a long time to ship parts to or from you, wherever you are; 2) these are not my machines but company ones, and I doubt I would get permission to ship parts anywhere.

Can you please give me a list of compatible and incompatible PSUs? I can try to find a good PSU here.

Yes, exactly. OK, let me start the setup. It will take about one day. I will PM you, or you can PM me. The goal is to run through the setup as described here or in your Reddit post and experiment with the conditions for the hard lock.

We can at least identify “known good” configs or escalate up the chain if it is a h/w bug.

I’ve been discussing this issue with one of the Linux developers for the last several weeks, and he mailed me today to say they can’t reproduce it on their machines. The most obvious difference is that he has an MSI motherboard and I have ASUS and ASRock boards, so at least it’s less likely that this is a CPU bug; more likely it’s BIOS settings or something else entirely.

Have you tried any workarounds? As far as I know, these are the most likely to help if freezes happen when you are not doing anything specific:

  • setting “typical current idle” under AMD CBS in the BIOS
  • adding idle=halt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (you have to update grub after that; in Ubuntu it’s sudo update-grub, I don’t know if it’s the same for your distro)
  • disabling C6 states either with zenstates or with processor.max_cstate=5 in GRUB_CMDLINE_LINUX_DEFAULT
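The GRUB edit from the list above can be sketched in Python as a pure text transformation. This is only an illustration: the function name add_kernel_param is made up, and it assumes the file uses the common double-quoted form GRUB_CMDLINE_LINUX_DEFAULT="...".

```python
import re

# Matches the double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line.
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
)


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Does nothing if the parameter is already present; raises ValueError
    if the variable is not found in the expected double-quoted form.
    """
    def _append(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = _CMDLINE_RE.subn(_append, grub_text, count=1)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text
```

The function only works on a string, so the result can be inspected before it is written back to /etc/default/grub (which needs root). After that, sudo update-grub and a reboot are still required, as noted above.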

The interesting thing is, you have B450 Pro4 too.

Interested in this topic. I have a similar issue with a Ryzen processor and a B450 board, not only on Linux but also on Windows. If I understood Wendell’s reply correctly, there are PSUs that are incompatible, so I will try swapping the power supply for an older one I have.

I had random lockups with my Ryzen 1800X and Asus VII Hero motherboard. The solution for me was to make sure the RAM voltage was set correctly (1.35 V).

After making that change, it hasn’t crashed since.

Some say you need a Haswell-compatible PSU. I have one (Fractal Design Newton R3 800W), but it does not help in my case.

As for RAM, increasing the RAM voltage to 1.3 V (which should be more than enough for 2133 MHz) or even 1.4 V does not help.

You can try to add this to your bootloader entry under options*:
idle=nomwait rcu_nocbs=0-7

This fixed system freezes for me with a R5 2500U.

*I do not remember 100% how to do that in Ubuntu/Debian and Grub. I use systemd-boot, and just put it under “options” in the appropriate loader config file and update.
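Whichever bootloader is used, the parameters only matter if they actually reach the running kernel. A small sketch for checking that after a reboot, assuming a Linux system where /proc/cmdline is readable (the function name missing_params is made up for illustration):

```python
def missing_params(cmdline, wanted):
    """Return the entries of wanted that do not appear in the kernel command line.

    cmdline is the text of /proc/cmdline; parameters are compared as
    whole whitespace-separated tokens, e.g. "idle=nomwait".
    """
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]
```

For example, missing_params(open("/proc/cmdline").read(), ["idle=nomwait", "rcu_nocbs=0-7"]) returns an empty list when both options were passed at boot.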

At least in the case of Ubuntu, you need to recompile the kernel for rcu_nocbs to have any effect.
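The rcu_nocbs option depends on the kernel being built with CONFIG_RCU_NOCB_CPU. A rough way to check a given kernel, assuming the distro ships its build config as a text file (on Ubuntu this is normally /boot/config-<kernel release>; the function name is hypothetical):

```python
def has_rcu_nocb(config_text):
    """Return True if the kernel config text enables CONFIG_RCU_NOCB_CPU."""
    for line in config_text.splitlines():
        # A disabled option shows up as "# CONFIG_RCU_NOCB_CPU is not set",
        # so only an exact "=y" line counts as enabled.
        if line.strip() == "CONFIG_RCU_NOCB_CPU=y":
            return True
    return False
```

Typical use would be to read the config for the running kernel, for example the file /boot/config- followed by the output of platform.release(), and pass its contents to has_rcu_nocb. If it returns False, adding rcu_nocbs to the boot options is not expected to change anything.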

Ah, good to know with regards to Ubuntu. That little line in my systemd-boot loader entry worked like a charm on Arch though.

I have created a Dockerfile and a run script that crash the system in my case (even when run on Windows 10). If anyone was reluctant to install Ubuntu instead of their own distro for testing, you can now try running it from anywhere.
ryzen_crash_docker.zip (571 Bytes)

ASUS released a new BIOS for the X470 Pro: 24 hours of compiling and still stable, whereas previously I couldn’t get past 6. Now I have to wait for ASRock to fix their motherboards.