KVM crashes when in VR

The crash starts with a bit of lag a minute or so beforehand. Then the main display shuts down, Windows just yeets everything, and a buzzing sound plays through the HMD speakers with a frozen display. At that point I have maybe 2 seconds to switch over to the host and kill the VM manually, or else the entire system crashes (nothing in the logs, it just stops).

I’ve tried to troubleshoot this for days without much success other than confirming that it’s when I’m in VR.

Non-VR performance looks about what you’d expect from a Vega, no overheating or anything like that.

I use FpsVR to monitor performance while in VR; it’s much like MSI Afterburner but with more details and graphs. What I see depends on the game, but whenever those lag spikes occur it’s the GPU having a seizure: the CPU stays at its normal percentage, while the GPU goes from 100% to 0% to 50% to 40%, etc. in the span of a few seconds. This is especially apparent in Boneworks, where it happens all the time and makes the game very much unplayable, yet Boneworks doesn’t crash the system. The crashes tend to happen in the lighter games (Beat Saber, Bigscreen, etc.), so I’m really clueless as to what exactly is causing this.

The Windows Event Viewer gives the most “useful” information in this case, but all it tells me is what I already know (that it crashed). I can’t find anything on the host that gives any information. The journal has a bunch of “amdgpu [powerplay]” entries, but that’s it.

I originally ran Manjaro with the 5.6 kernel (patched too), but I switched over to Arch a month ago. I had the same problems then as I do now, just that I didn’t bother too much with it since I didn’t play much VR back then.

Specs:

  • CPU: Ryzen 7 2700X (stock)
  • RAM: 32GB Ballistix 3200MHz
  • GPU (host): RX 580 (limited to 40W)
  • GPU (guest): RX Vega 64 Nitro+
  • PSU: EVGA Supernova G2 750W (it’s not a PSU issue, I’ve debunked that already)

Domain xml: https://pastebin.com/snaEEkiJ

3AM update:

I did dmesg -wL through SSH on my phone to catch any errors and I caught a vfio error.

[  861.135114] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e046260 flags=0x0030]
[  861.135120] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e05a000 flags=0x0030]
[  862.139415] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e046280 flags=0x0030]
[  862.139421] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e05a000 flags=0x0030]
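
In case it helps anyone comparing logs: below is a rough Python sketch for pulling the fields out of these IO_PAGE_FAULT lines and counting how often each 4 KiB page shows up. The function names (`parse_faults`, `faults_per_page`) and the 4 KiB page size are my own assumptions for illustration, and the regex is written only against the four lines above, so it may need adjusting for other kernels' message formats.

```python
import re
from collections import Counter

# Pattern written against the dmesg lines quoted above. \s* around "=" lets it
# also match when a terminal wraps the line between "domain" and "=0x001c".
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\s+"
    r"domain\s*=\s*(?P<domain>0x[0-9a-f]+)\s+"
    r"address\s*=\s*(?P<addr>0x[0-9a-f]+)\s+"
    r"flags\s*=\s*(?P<flags>0x[0-9a-f]+)\]"
)

SAMPLE = """\
[  861.135114] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e046260 flags=0x0030]
[  861.135120] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e05a000 flags=0x0030]
[  862.139415] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e046280 flags=0x0030]
[  862.139421] vfio-pci 0000:0d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xff7e05a000 flags=0x0030]
"""


def parse_faults(text):
    """Return one dict per IO_PAGE_FAULT event found in the given dmesg text."""
    faults = []
    for m in FAULT_RE.finditer(text):
        faults.append({
            "time": float(m["ts"]),          # seconds since boot
            "device": m["dev"],              # PCI address of the faulting device
            "domain": int(m["domain"], 16),  # IOMMU domain id
            "address": int(m["addr"], 16),   # address the device tried to access
            "flags": int(m["flags"], 16),
        })
    return faults


def faults_per_page(faults, page_size=4096):
    """Count faults per page (page size is an assumption, 4 KiB by default)."""
    return Counter(f["address"] // page_size * page_size for f in faults)


if __name__ == "__main__":
    for page, count in sorted(faults_per_page(parse_faults(SAMPLE)).items()):
        print(f"{page:#x}: {count} fault(s)")
```

Grouping by page just makes it easier to see whether the card keeps hitting the same couple of addresses (as it does in my four lines) or is faulting all over the place.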

I found this thread from 7 years ago when I looked up the error flag: https://bbs.archlinux.org/viewtopic.php?id=162768&p=36

I have seen this happen too - but I don’t have any VR. Similar hardware except I’m running a 2950X and ECC memory. It’s very rare for me though.

When did it happen for you? Just out of the blue or did you do something demanding?

I was recommended to run dmesg -wL through SSH to ensure that I could see the log on crash, and after doing that it suddenly stopped crashing. Maybe it’s a “heisenbug” and I just need to keep a log running.
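
If you want to keep a log running without staring at a phone, here is a minimal sketch of the same idea in Python, meant to run on a second machine: it follows the host's kernel log over SSH and forces every line to disk as it arrives, so whatever was printed right before a hard crash survives. The function names and the log path are hypothetical, and it assumes the SSH user is allowed to read `dmesg` on the host (some setups restrict that to root).

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path`, syncing to disk before reading the next one."""
    count = 0
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to write it to disk now
            count += 1
    return count


def follow_remote_dmesg(host, path="vfio-dmesg.log"):
    """Stream `dmesg --follow` from `host` over SSH into a local file.

    `host` and `path` are placeholders; this blocks until the SSH session
    ends, which is what happens when the host goes down.
    """
    proc = subprocess.Popen(
        ["ssh", host, "dmesg", "--follow"],
        stdout=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    try:
        return persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling something like `follow_remote_dmesg("user@vfio-host")` (placeholder hostname) from the second machine before starting the VM would leave the last kernel messages in `vfio-dmesg.log` after a crash.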

I still noticed some hitching with GPU spikes, but nothing weird in the logs.

Actually doing Visual Studio and database stuff, but nothing I consider very taxing. I’m on Arch as well. With me, however, there usually isn’t enough time to react and shut down the VM; the whole system goes down.

I double checked my card and I too have a Nitro+ Vega 64. I’m wondering if this is some sort of Windows graphics driver and VFIO interaction bug.