So back in February this year I decided to update my BIOS in anticipation of new Ryzen CPUs launching. At the time I was on v3.20, so I did the upgrade process to v3.30 and then v3.40. This is when I first noticed something was wrong.
When powering on the machine for a regular boot, it would go past the BIOS screen and then lock up with the screen turning on and off every few seconds, the keyboard was non-responsive, tapping caps lock would not do anything. I rebooted and entered the BIOS setup and discovered that the screen would randomly turn off for a few seconds, sometimes minutes apart, other times in very rapid succession.
Thankfully I still had a few BIOS updates to apply, I figured this must just be a problem with this specific version so I upgraded all the way to v4.70. The problem still persisted however, and the only way to actually boot this machine now is to enter the BIOS and between the screen turning on and off try and select the device to boot from.
I’ve had a support request open with ASRock for this issue since mid March and have heard nothing back from them. So much for “We will have technical support personnel to contact you soon.”
Just this morning I noticed something else broken, virtual machines run in VirtualBox will most often crash while booting with an “rcu_sched detected stall on CPU” message. I managed to get a Fedora live CD to boot once and got screenshots of the dmesg output:
I also recorded a video demonstrating the frustration:
I’m not actually sure if this is a hardware level issue, or some kind of freak VirtualBox issue. Can anyone help?
My system is as follows:
Arch Linux
ASRock AB350 Pro4
AMD Ryzen 1700 @ 3.7GHz
Corsair Vengeance LPX 32GB DDR4 2666MHz CL16 KIT CMK32GX4M2A2666C16
MSI Radeon RX 580 Gaming X 8GB
Hey @stoatally I moved your thread to the Software & Operating Systems subforum because I believe this to be more appropriate for the issue at hand; I also added tags to help people search for your thread better.
Have you tried clearing CMOS?
Have you tried downgrading the BIOS back to the last known working version?
Have you tried to reseat the CPU, GPU, RAM, etc?
I’ve not reseated any hardware since this happened, before that it was running stable for the best part of a year and had not been moved or bumped, so I had not considered doing so.
The CMOS was cleared as part of the BIOS upgrade process, I re-applied my overclock between each upgrade and checked that it was stable with mprime.
I’ve not tried to downgrade the BIOS back to a known working version as I was worried that this would cause more issues. Is this safe to do? Because if it is, I’ll do it before I try reseating hardware.
It does sound like a hardware issue, BUT be aware the latest Ryzen BIOSes includes a CPU microcode update to add support for the “ibpb” CPU flag (indirect branch prediction barrier), which helps with Spectre mitigation.
You may need to be on a recent host kernel version, and latest VirtualBox hypervisor version to handle this.
There were lots of RCU kernel updates recently too.
Run a dmesg as root in the host OS and see what (if anything) is reported. You may see DMA timeouts or something.
Post the output here if you can.
If you have ECC RAM, ensure the EDAC module is loaded (modprobe amd64_edac_mod)
*As a data-point I have a Ryzen 7 1800x running in an ASRock x370 Taichi with bios 4.60, Slackware Linux 64bit-current (kernel 4.14.35) and VirtualBox 5.2.6. It’s stable to the extent that I know I have one of the pre-week 25 CPUs which crash under heavy compilation workload. I have an approved RMA to swap the CPU, but have not sent it off yet…
If I’m correct it will report hpet, under normal circumstances it should be tsc. But that’s not the only issue.
I’m also suspicious of this line:
[ 71.268601] acpi_cpufreq: overriding BIOS provided _PSD data
Since I don’t think that code has yet been adapted for Ryzen
As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.
Try connecting to the machine via ssh from another system and view journalctl or dmesg with journalctl -f / dmesg -w to catch the output on another machine.
Just tested on my own ASRock X370 Gaming K4 platform with arch and 4.16.3 kernel and I can’t reproduce this Virtualbox behavior.
It just works, even when I break the clocksource into hpet mode. Hpet is real slow btw.
This btw is the full extent of iommi kernel events when dealing when starting a VBox VM in my system.
Is indeed hpet. Can you tell me/link me to more information about this?
As for Vbox I’m not exactly sure what’s going on there. It’s creating IOMMU domains and freeing them before the log ends. Chances are that the log is also not fully written out to disk when the crash occurs.
I just want to be clear that it’s the virtual machine that hangs, not the host machine. I’ve not been able to access any of the virtual machines over SSH, one of those demonstrated in the video is a vagrant machine which fails to provision as it always hangs before SSH is started.
Also I think that the IOMMU domains are being created/destroyed correctly, it was just hard to see in the earlier log, when looking at the output of journalctl you can see those actions were a few minutes apart, enough time to start the machine and for it to hang:
Apr 23 20:07:16 smeg kernel: VBoxNetFlt: attached to 'vboxnet0' / 0a:00:27:00:00:00
Apr 23 20:07:16 smeg kernel: device vboxnet0 entered promiscuous mode
Apr 23 20:07:16 smeg kernel: vboxdrv: 0000000001a6ec03 VBoxDDR0.r0
Apr 23 20:07:16 smeg kernel: vboxpci: created IOMMU domain 00000000746b3096
Apr 23 20:12:53 smeg kernel: vboxpci: freeing IOMMU domain 00000000746b3096
Apr 23 20:13:00 smeg kernel: device vboxnet0 left promiscuous mode
Apr 23 20:13:00 smeg kernel: vboxnetflt: 0 out of 4 packets were not sent (directed to host)
The fact that it’s the VM that hangs is crucial, I was under impression that the entire host crashed.
First lets find a way to get your clocksource onto tsc.
Set your system to baseline stock clocks. It MUST be stable on stock settings and default to tsc again.
If it is not you may well have a hardware (mainboard/cpu) issue at hand.
Yep. I reset to UEFI defaults and enabled CPU virtualisation, now running stock CPU and memory clocks. All three virtual machines featured in the video booted fine first time.