Broken virtualisation (and more) on ASRock AB350 Pro4

stoatally · April 23, 2018, 11:21am

So back in February this year I decided to update my BIOS in anticipation of new Ryzen CPUs launching. At the time I was on v3.20, so I did the upgrade process to v3.30 and then v3.40. This is when I first noticed something was wrong.

When powering on the machine for a regular boot, it would go past the BIOS screen and then lock up, with the screen turning on and off every few seconds. The keyboard was non-responsive; tapping caps lock would not do anything. I rebooted and entered the BIOS setup and discovered that the screen would randomly turn off for a few seconds, sometimes minutes apart, other times in very rapid succession.

Thankfully I still had a few BIOS updates to apply. I figured this must just be a problem with this specific version, so I upgraded all the way to v4.70. The problem still persisted, however, and the only way to actually boot this machine now is to enter the BIOS and, between the screen turning on and off, try to select the device to boot from.

I’ve had a support request open with ASRock for this issue since mid March and have heard nothing back from them. So much for “We will have technical support personnel to contact you soon.”

Just this morning I noticed something else broken: virtual machines run in VirtualBox will most often crash while booting with an “rcu_sched detected stall on CPU” message. I managed to get a Fedora live CD to boot once and got screenshots of the dmesg output:

I also recorded a video demonstrating the frustration:

I’m not actually sure if this is a hardware level issue, or some kind of freak VirtualBox issue. Can anyone help?

My system is as follows:
Arch Linux
ASRock AB350 Pro4
AMD Ryzen 1700 @ 3.7GHz
Corsair Vengeance LPX 32GB DDR4 2666MHz CL16 KIT CMK32GX4M2A2666C16
MSI Radeon RX 580 Gaming X 8GB

blackfire · April 23, 2018, 11:54am

Do you have secure boot enabled at all? Have you done any hardware testing? Try removing all but one RAM stick; it could be a hardware conflict.

anon75264233 · April 23, 2018, 1:59pm

Hey @stoatally I moved your thread to the Software & Operating Systems subforum because I believe this to be more appropriate for the issue at hand; I also added tags to help people search for your thread better.

Dje4321 · April 23, 2018, 2:21pm

Have you tried clearing CMOS?
Have you tried downgrading the BIOS back to the last known working version?
Have you tried to reseat the CPU, GPU, RAM, etc?

stoatally · April 23, 2018, 3:14pm

I’ve not reseated any hardware since this happened. Before that it was running stable for the best part of a year and had not been moved or bumped, so I had not considered doing so.

The CMOS was cleared as part of the BIOS upgrade process. I re-applied my overclock between each upgrade and checked that it was stable with mprime.

I’ve not tried to downgrade the BIOS back to a known working version as I was worried that this would cause more issues. Is this safe to do? Because if it is, I’ll do it before I try reseating hardware.

Thanks for your suggestions.

dent_nz · April 23, 2018, 5:13pm

Step 1: Turn off your overclock and fully test it.
Perhaps force re-flash the latest BIOS while the system is at default non overclocked settings.

stoatally · April 23, 2018, 5:26pm

The system was run at non-overclocked settings when the BIOS updates were applied; it was faulty with completely stock clocks.

MarcT · April 23, 2018, 5:28pm

It does sound like a hardware issue, BUT be aware the latest Ryzen BIOSes include a CPU microcode update to add support for the “ibpb” CPU flag (indirect branch prediction barrier), which helps with Spectre mitigation.

You may need to be on a recent host kernel version, and latest VirtualBox hypervisor version to handle this.
There were lots of RCU kernel updates recently too.

stoatally · April 23, 2018, 5:35pm

Yeah, I’m a bit worried about rolling back to a BIOS release from mid last year. I’m not even sure if that’s a thing that is safe to do ever.

The video was recorded with kernel 4.16.3 and VirtualBox 5.2.10; these appear to be the latest stable versions in both cases.

MarcT · April 23, 2018, 6:04pm

Run a dmesg as root in the host OS and see what (if anything) is reported. You may see DMA timeouts or something.
Post the output here if you can.

If you have ECC RAM, ensure the EDAC module is loaded (modprobe amd64_edac_mod)
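To confirm the module actually loaded after running modprobe, a minimal Python sketch can look for it in /proc/modules. The helper name `is_module_loaded` is hypothetical, and the module name is simply the one given above; it may differ on other kernel versions.

```python
def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    """Check whether a kernel module appears in /proc/modules content."""
    # Each line of /proc/modules starts with the module name followed by a space.
    return any(
        line.split(" ", 1)[0] == name
        for line in proc_modules_text.splitlines()
    )
```

On the host this would be called as `is_module_loaded("amd64_edac_mod", open("/proc/modules").read())`.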

As a data point, I have a Ryzen 7 1800X running in an ASRock X370 Taichi with BIOS 4.60, Slackware Linux 64bit-current (kernel 4.14.35) and VirtualBox 5.2.6. It’s stable to the extent that I know I have one of the pre-week 25 CPUs which crash under heavy compilation workload. I have an approved RMA to swap the CPU, but have not sent it off yet…

stoatally · April 23, 2018, 6:12pm

Here’s a short snippet from dmesg that covers starting VirtualBox and up to the point where a machine hangs:

[29888.981943] vboxdrv: 00000000da811cc0 VMMR0.r0
[29889.061337] VBoxNetFlt: attached to 'vboxnet0' / 0a:00:27:00:00:00
[29889.061890] device vboxnet0 entered promiscuous mode
[29889.170453] vboxdrv: 0000000001a6ec03 VBoxDDR0.r0
[29889.235309] vboxpci: created IOMMU domain 00000000746b3096

There doesn’t seem to be anything obvious in that. Here’s the full log:

dmesg.txt (77.6 KB)

catsay · April 23, 2018, 6:39pm

Interestingly your ASRock board also seems to have the clocksource instability/calibration problem when overclocked via the BIOS.

A workaround is to OC from within linux itself rather than via the BIOS settings:

For now, post the output of this command

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

If I’m correct it will report hpet, under normal circumstances it should be tsc. But that’s not the only issue.
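For anyone who wants to script this check, here is a minimal Python sketch, assuming the standard Linux sysfs layout. The helper name `read_clocksource` and its `base` parameter are hypothetical (the parameter only exists so the function can be pointed at a test directory); `available_clocksource` is the sibling sysfs file listing the clocksources the kernel could use.

```python
from pathlib import Path

# Standard sysfs directory on Linux; same location as the cat command above.
SYSFS_CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = SYSFS_CLOCKSOURCE):
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling `read_clocksource()` with no arguments reads the live values. If the first element is not tsc while tsc is still listed as available, the kernel has most likely picked something else on purpose, which is worth chasing in dmesg.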

I’m also suspicious of this line:

[   71.268601] acpi_cpufreq: overriding BIOS provided _PSD data

since I don’t think that code has been adapted for Ryzen yet:

github.com

torvalds/linux/blob/master/drivers/cpufreq/acpi-cpufreq.c

As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.

Try connecting to the machine via ssh from another system and view journalctl or dmesg with journalctl -f / dmesg -w to catch the output on another machine.

catsay · April 23, 2018, 6:52pm

Just tested on my own ASRock X370 Gaming K4 platform with Arch and the 4.16.3 kernel and I can’t reproduce this VirtualBox behavior.
It just works, even when I break the clocksource into hpet mode. Hpet is real slow btw.

This btw is the full extent of IOMMU kernel events when starting a VBox VM on my system.

journalctl -k | grep vbox

Note how IOMMU domains are only created and freed as VBox VMs are started and stopped.

MarcT · April 23, 2018, 6:59pm

I’m also concerned about the clocksource instability. I don’t see that on my machine (which is not overclocked).

root@deepthought:~# dmesg | egrep -i "clocksource|PSD"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.043420] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.310039] clocksource: Switched to clocksource hpet
[    0.322039] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.503318] tsc: Refined TSC clocksource calibration: 3600.282 MHz
[    1.503876] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x33e55d41bc4, max_idle_ns: 440795207871 ns
[    2.511970] clocksource: Switched to clocksource tsc
[    8.406697] acpi_cpufreq: overriding BIOS provided _PSD data

…but I do have the _PSD data line.
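To compare logs like these without eyeballing them, a small Python sketch can pull out the order in which the kernel switched clocksources. The function names are hypothetical; the regular expression only relies on the "Switched to clocksource" wording visible in the output above, so it works on both dmesg and journalctl -k text.

```python
import re
from typing import List, Optional

# Matches kernel messages such as "clocksource: Switched to clocksource tsc".
SWITCH_RE = re.compile(r"clocksource: Switched to clocksource (\S+)")


def clocksource_switches(log_text: str) -> List[str]:
    """Return the clocksources the kernel switched to, in log order."""
    return SWITCH_RE.findall(log_text)


def final_clocksource(log_text: str) -> Optional[str]:
    """Return the last clocksource selected in the log, or None if none found."""
    switches = clocksource_switches(log_text)
    return switches[-1] if switches else None
```

A healthy boot like the one above ends on tsc. A log whose last switch is hpet, after tsc was already selected, would suggest the kernel fell back at runtime.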

stoatally · April 23, 2018, 7:01pm

Yeah, the output of

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Is indeed hpet. Can you tell me/link me to more information about this?

catsay wrote: “As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.”

I just want to be clear that it’s the virtual machine that hangs, not the host machine. I’ve not been able to access any of the virtual machines over SSH, one of those demonstrated in the video is a vagrant machine which fails to provision as it always hangs before SSH is started.

Also I think that the IOMMU domains are being created/destroyed correctly, it was just hard to see in the earlier log, when looking at the output of journalctl you can see those actions were a few minutes apart, enough time to start the machine and for it to hang:

Apr 23 20:07:16 smeg kernel: VBoxNetFlt: attached to 'vboxnet0' / 0a:00:27:00:00:00
Apr 23 20:07:16 smeg kernel: device vboxnet0 entered promiscuous mode
Apr 23 20:07:16 smeg kernel: vboxdrv: 0000000001a6ec03 VBoxDDR0.r0
Apr 23 20:07:16 smeg kernel: vboxpci: created IOMMU domain 00000000746b3096
Apr 23 20:12:53 smeg kernel: vboxpci: freeing IOMMU domain 00000000746b3096
Apr 23 20:13:00 smeg kernel: device vboxnet0 left promiscuous mode
Apr 23 20:13:00 smeg kernel: vboxnetflt: 0 out of 4 packets were not sent (directed to host)
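The gap between those two vboxpci lines can also be measured with a short Python sketch that pairs each "created" event with its matching "freeing" event. The function name is hypothetical, and the year is an assumption: journalctl's short timestamp format omits it, so 2018 is supplied only to make the timestamps parseable.

```python
import re
from datetime import datetime
from typing import Dict

# Assumption: the short journalctl format has no year, so one is supplied here.
ASSUMED_YEAR = 2018

EVENT_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*vboxpci: (created|freeing) IOMMU domain (\w+)"
)


def iommu_domain_lifetimes(log_text: str) -> Dict[str, float]:
    """Map each IOMMU domain id to seconds between 'created' and 'freeing'."""
    created: Dict[str, datetime] = {}
    lifetimes: Dict[str, float] = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue
        stamp, action, domain = match.groups()
        when = datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
        if action == "created":
            created[domain] = when
        elif domain in created:
            # Only count a "freeing" event if the creation was seen earlier.
            lifetimes[domain] = (when - created.pop(domain)).total_seconds()
    return lifetimes
```

For the journal excerpt above this gives 337 seconds for domain 00000000746b3096, which fits a VM that was started, hung, and was then powered off by hand.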

catsay · April 23, 2018, 7:02pm

Yours is about the same as mine on a stock 1700X

cat@jupiter:~$ journalctl -k | egrep -i "clocksource|PSD"
Apr 23 13:33:34 jupiter kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
Apr 23 13:33:34 jupiter kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
Apr 23 13:33:34 jupiter kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x30e974a77bc, max_idle_ns: 440795308615 ns
Apr 23 13:33:34 jupiter kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc-early
Apr 23 13:33:34 jupiter kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Apr 23 13:33:34 jupiter kernel: tsc: Refined TSC clocksource calibration: 3393.623 MHz
Apr 23 13:33:34 jupiter kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30eac649395, max_idle_ns: 440795252668 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc
Apr 23 13:33:34 jupiter kernel: acpi_cpufreq: overriding BIOS provided _PSD data

catsay · April 23, 2018, 7:06pm

The fact that it’s the VM that hangs is crucial; I was under the impression that the entire host crashed.

First let’s find a way to get your clocksource onto tsc.
Set your system to baseline stock clocks. It MUST be stable on stock settings and default to tsc again.

If it is not, you may well have a hardware (mainboard/CPU) issue at hand.

stoatally · April 23, 2018, 7:18pm

Yep. I reset to UEFI defaults and enabled CPU virtualisation, now running stock CPU and memory clocks. All three virtual machines featured in the video booted fine first time.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

And here is the full dmesg output, including starting the virtual machines: dmesg-2.txt (74.6 KB)

catsay · April 23, 2018, 7:22pm

Odd question:
Are you by any chance using this system with a PS/2 keyboard?

stoatally · April 23, 2018, 7:23pm

Nope, all USB peripherals.