Broken virtualisation (and more) on ASRock AB350 Pro4

The system was run at non overclocked settings when the BIOS updates were applied, it was faulty with completely stock clocks.

It does sound like a hardware issue, BUT be aware the latest Ryzen BIOSes includes a CPU microcode update to add support for the “ibpb” CPU flag (indirect branch prediction barrier), which helps with Spectre mitigation.

You may need to be on a recent host kernel version, and latest VirtualBox hypervisor version to handle this.
There were lots of RCU kernel updates recently too.

Yeah, I’m a bit worried about rolling back to a BIOS release from mid last year. I’m not even sure if that’s a thing that is safe to do ever.

The video was recorded with kernel 4.16.3 and VirtualBox 5.2.10, this appears to be the latest stable version in both cases.

Run a dmesg as root in the host OS and see what (if anything) is reported. You may see DMA timeouts or something.
Post the output here if you can.

If you have ECC RAM, ensure the EDAC module is loaded (modprobe amd64_edac_mod)

*As a data-point I have a Ryzen 7 1800x running in an ASRock x370 Taichi with bios 4.60, Slackware Linux 64bit-current (kernel 4.14.35) and VirtualBox 5.2.6. It’s stable to the extent that I know I have one of the pre-week 25 CPUs which crash under heavy compilation workload. I have an approved RMA to swap the CPU, but have not sent it off yet…

Here’s a short snippet from dmesg that covers starting VirtualBox and up to the point where a machine hangs:

[29888.981943] vboxdrv: 00000000da811cc0 VMMR0.r0
[29889.061337] VBoxNetFlt: attached to 'vboxnet0' / 0a:00:27:00:00:00
[29889.061890] device vboxnet0 entered promiscuous mode
[29889.170453] vboxdrv: 0000000001a6ec03 VBoxDDR0.r0
[29889.235309] vboxpci: created IOMMU domain 00000000746b3096

There doesn’t seem to be anything obvious in that. Here’s the full log:

dmesg.txt (77.6 KB)

Interestingly your ASRock board also seems to have the clocksource instability/calibration problem when overclocked via the BIOS.

A workaround is to OC from within linux itself rather than via the BIOS settings:

For now, post the output of this command

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

If I’m correct it will report hpet, under normal circumstances it should be tsc. But that’s not the only issue.

I’m also suspicious of this line:

[   71.268601] acpi_cpufreq: overriding BIOS provided _PSD data

Since I don’t think that code has yet been adapted for Ryzen

As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.

Try connecting to the machine via ssh from another system and view journalctl or dmesg with journalctl -f / dmesg -w to catch the output on another machine.

Just tested on my own ASRock X370 Gaming K4 platform with arch and 4.16.3 kernel and I can’t reproduce this Virtualbox behavior.
It just works, even when I break the clocksource into hpet mode. Hpet is real slow btw. :wink:

This btw is the full extent of iommi kernel events when dealing when starting a VBox VM in my system.

journalctl -k | grep vbox

Note how IOMMU domains are only created and freed as vbox VM’s are started and stopped.

I’m also concerned about the clocksource instability. I don’t see that on my machine (which is not overclocked).

root@deepthought:~# dmesg | egrep -i "clocksource|PSD"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.043420] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.310039] clocksource: Switched to clocksource hpet
[    0.322039] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.503318] tsc: Refined TSC clocksource calibration: 3600.282 MHz
[    1.503876] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x33e55d41bc4, max_idle_ns: 440795207871 ns
[    2.511970] clocksource: Switched to clocksource tsc
[    8.406697] acpi_cpufreq: overriding BIOS provided _PSD data

…but I do have the _PSD data line.

1 Like

Yeah, the output of

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Is indeed hpet. Can you tell me/link me to more information about this?

As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.

I just want to be clear that it’s the virtual machine that hangs, not the host machine. I’ve not been able to access any of the virtual machines over SSH, one of those demonstrated in the video is a vagrant machine which fails to provision as it always hangs before SSH is started.

Also I think that the IOMMU domains are being created/destroyed correctly, it was just hard to see in the earlier log, when looking at the output of journalctl you can see those actions were a few minutes apart, enough time to start the machine and for it to hang:

Apr 23 20:07:16 smeg kernel: VBoxNetFlt: attached to 'vboxnet0' / 0a:00:27:00:00:00
Apr 23 20:07:16 smeg kernel: device vboxnet0 entered promiscuous mode
Apr 23 20:07:16 smeg kernel: vboxdrv: 0000000001a6ec03 VBoxDDR0.r0
Apr 23 20:07:16 smeg kernel: vboxpci: created IOMMU domain 00000000746b3096
Apr 23 20:12:53 smeg kernel: vboxpci: freeing IOMMU domain 00000000746b3096
Apr 23 20:13:00 smeg kernel: device vboxnet0 left promiscuous mode
Apr 23 20:13:00 smeg kernel: vboxnetflt: 0 out of 4 packets were not sent (directed to host)
1 Like

Yours is about the same as mine on a stock 1700X

cat@jupiter:~$ journalctl -k | egrep -i "clocksource|PSD"
Apr 23 13:33:34 jupiter kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
Apr 23 13:33:34 jupiter kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
Apr 23 13:33:34 jupiter kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x30e974a77bc, max_idle_ns: 440795308615 ns
Apr 23 13:33:34 jupiter kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc-early
Apr 23 13:33:34 jupiter kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Apr 23 13:33:34 jupiter kernel: tsc: Refined TSC clocksource calibration: 3393.623 MHz
Apr 23 13:33:34 jupiter kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30eac649395, max_idle_ns: 440795252668 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc
Apr 23 13:33:34 jupiter kernel: acpi_cpufreq: overriding BIOS provided _PSD data

The fact that it’s the VM that hangs is crucial, I was under impression that the entire host crashed.

First lets find a way to get your clocksource onto tsc.
Set your system to baseline stock clocks. It MUST be stable on stock settings and default to tsc again.

If it is not you may well have a hardware (mainboard/cpu) issue at hand.

1 Like

Yep. I reset to UEFI defaults and enabled CPU virtualisation, now running stock CPU and memory clocks. All three virtual machines featured in the video booted fine first time.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

And here is the full dmesg output, including starting the virtual machines: dmesg-2.txt (74.6 KB)

2 Likes

Odd question:
Are you by any chance using this system with a PS/2 keyboard?

Nope, all USB peripherals.

Hmmm…

What output does this now give:

journalctl -k | egrep -i "tsc|clocksource"

Also @MarcT can you post your’s too as a comparison?

Apr 23 13:33:34 jupiter kernel: tsc: Fast TSC calibration failed
Apr 23 13:33:34 jupiter kernel: tsc: Using PIT calibration value
Apr 23 13:33:34 jupiter kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
Apr 23 13:33:34 jupiter kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
Apr 23 13:33:34 jupiter kernel: tsc: Fast TSC calibration failed
Apr 23 13:33:34 jupiter kernel: tsc: PIT calibration matches HPET. 1 loops
Apr 23 13:33:34 jupiter kernel: tsc: Detected 3393.265 MHz processor
Apr 23 13:33:34 jupiter kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x30e974a77bc, max_idle_ns: 440795308615 ns
Apr 23 13:33:34 jupiter kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc-early
Apr 23 13:33:34 jupiter kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Apr 23 13:33:34 jupiter kernel: tsc: Refined TSC clocksource calibration: 3393.623 MHz
Apr 23 13:33:34 jupiter kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30eac649395, max_idle_ns: 440795252668 ns
Apr 23 13:33:34 jupiter kernel: clocksource: Switched to clocksource tsc
Apr 23 13:33:34 jupiter kernel: vboxdrv: TSC mode is Invariant, tentative frequency 3393610709 Hz

Note how calibration fails twice and then matches first hpet and then acpi_pm.
There’s something up with Ryzen and the clocksource calibration, particularly when overclocking.

Apr 23 21:12:25 smeg kernel: tsc: Fast TSC calibration failed
Apr 23 21:12:25 smeg kernel: tsc: Using PIT calibration value
Apr 23 21:12:25 smeg kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
Apr 23 21:12:25 smeg kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
Apr 23 21:12:25 smeg kernel: tsc: Fast TSC calibration failed
Apr 23 21:12:25 smeg kernel: tsc: PIT calibration matches HPET. 1 loops
Apr 23 21:12:25 smeg kernel: tsc: Detected 2993.962 MHz processor
Apr 23 21:12:25 smeg kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b27fd42f6e, max_idle_ns: 440795240346 ns
Apr 23 21:12:25 smeg kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
Apr 23 21:12:25 smeg kernel: clocksource: Switched to clocksource tsc-early
Apr 23 21:12:25 smeg kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Apr 23 21:12:25 smeg kernel: tsc: Refined TSC clocksource calibration: 2994.373 MHz
Apr 23 21:12:25 smeg kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2b2981a153b, max_idle_ns: 440795305653 ns
Apr 23 21:12:25 smeg kernel: clocksource: Switched to clocksource tsc
Apr 23 21:12:25 smeg kernel: vboxdrv: TSC mode is Invariant, tentative frequency 2994354421 Hz
1 Like

Here’s mine (NB: Slackware doesn’t have systemd or journalctl, so dmesg for me):

root@deepthought:~# dmesg | egrep -i "tsc|clocksource"
[    0.000000] tsc: Fast TSC calibration failed
[    0.000000] tsc: Using PIT calibration value
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.000000] tsc: Fast TSC calibration failed
[    0.002000] tsc: PIT calibration matches HPET. 1 loops
[    0.002000] tsc: Detected 3600.011 MHz processor
[    0.043420] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.310039] clocksource: Switched to clocksource hpet
[    0.322039] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.503318] tsc: Refined TSC clocksource calibration: 3600.282 MHz
[    1.503876] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x33e55d41bc4, max_idle_ns: 440795207871 ns
[    2.511970] clocksource: Switched to clocksource tsc
[   31.673139] vboxdrv: TSC mode is Invariant, tentative frequency 3600241268 Hz
1 Like

As an aside, my system does have PS/2 KB & mouse.

Interesting how this has the same timer calibration problem. Then succeeds later.
Something is up with the TSC Calibration code for Ryzen, in particular when overclocking via the BIOS things go out of whack.

Unfortunately kernel timers is not my specialty, I know enough to poke around but this may require an expert to review what exactly is going on here.

Of Note:

On an Intel Ivy Bridge system, not even a single failure of any sort.

Apr 23 09:24:56 saturn kernel: tsc: Fast TSC calibration using PIT
Apr 23 09:24:56 saturn kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
Apr 23 09:24:56 saturn kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
Apr 23 09:24:56 saturn kernel: tsc: Fast TSC calibration using PIT
Apr 23 09:24:56 saturn kernel: tsc: Detected 2494.496 MHz processor
Apr 23 09:24:56 saturn kernel: TSC deadline timer enabled
Apr 23 09:24:56 saturn kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
Apr 23 09:24:56 saturn kernel: clocksource: Switched to clocksource hpet
Apr 23 09:24:56 saturn kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Apr 23 09:24:56 saturn kernel: tsc: Refined TSC clocksource calibration: 2494.333 MHz
Apr 23 09:24:56 saturn kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23f45085418, max_idle_ns: 440795285711 ns
Apr 23 09:24:56 saturn kernel: clocksource: Switched to clocksource tsc

If I look at the tsc.c code the calibration issues may have to do with alternative clock sources not being available yet for some reason?

1 Like

That’s progress…! :slight_smile: