Updating AMD driver leads to VM not starting!

Hi folks,

some time ago I had trouble updating the AMD driver in my virtual machine, so I stayed on the old driver thinking maybe it is an issue that will get fixed in a future version, but it did not.

The issue I am experiencing is the following: running AMD driver version 20.4.2 in my Windows 10 VM causes no issues at all. VM start, shutdown, restart, everything works. Installing 20.4.2 works as on a bare metal system. The screen goes black during the installation but comes back online and I can see the installation finishing.
As soon as I clean install or update to any later version, from 20.5.1 up to the current driver version 20.9.2, the screen goes dark during installation and does not come back. After waiting a few minutes for the installation to finish, shutting down the VM via virt-manager, and turning it on again, the Tianocore boot screen shows a lot of graphical artifacts. The Windows loading icon has artifacts as well, and then the VM gets stuck on this screen.

Thanks in advance for any tips to resolving this problem.

Edit:

  • Manjaro with 5.8 kernel and Navi Reset Patch V2
  • Powercolor 5700XT

I attached two pictures of what I see when I start the VM:

By now I tried a few more things.

  1. I started the VM in safe mode with network drivers and tried to do a clean install of an up-to-date (WHQL) AMD driver in this mode. This leads to the installation failing with the message: “Oops! Something went wrong. Error 192 - AMD installer cannot continue due to an Operating System issue”.
    After that I get a messed up Tianocore logo like pictured above only not in pink but in white.

  2. I put the graphics card (5700XT by the way) into another computer and checked that installing the driver on native Windows works. On native Windows 10 the driver installs just fine!

I tried a few more things:

  1. With the old driver installed I attached a Spice display and after starting Windows set it up as primary display. With Spice as primary display and the GPU still passed through I started the driver installation of AMD 20.9.1 (WHQL) and was able to see via Spice that the installation was successfully finished.
    I then restarted and the Spice display came up successfully; however, the GPU output artifacts again and was stuck on the Tianocore screen, even though in Spice I could see that Windows had already fully booted. In Windows I was now unable to find the display output from the GPU; the only display output in the settings window was the Spice display.

looks to me like a sign of the reset failing. what does the dmesg of the host machine look like during this?

@mathew2214: I activated the virtual network interface, started the VM, let it stay stuck for a minute, and forcefully shut it off. This is the relevant part of dmesg:

[35958.623050] tun: Universal TUN/TAP device driver, 1.6
[35958.623739] virbr0: port 1(virbr0-nic) entered blocking state
[35958.623796] virbr0: port 1(virbr0-nic) entered disabled state
[35958.623876] device virbr0-nic entered promiscuous mode
[35958.754655] virbr0: port 1(virbr0-nic) entered blocking state
[35958.754658] virbr0: port 1(virbr0-nic) entered listening state
[35958.785269] virbr0: port 1(virbr0-nic) entered disabled state
[35965.067922] virbr0: port 2(vnet0) entered blocking state
[35965.067938] virbr0: port 2(vnet0) entered disabled state
[35965.068017] device vnet0 entered promiscuous mode
[35965.068276] virbr0: port 2(vnet0) entered blocking state
[35965.068278] virbr0: port 2(vnet0) entered listening state
[35965.109510] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[35966.123661] vfio-pci 0000:0a:00.0: enabling device (0002 -> 0003)
[35966.123856] vfio-pci 0000:0a:00.0: Navi10: SOL 0x0
[35966.123856] vfio-pci 0000:0a:00.0: Navi10: device doesn't need to be reset
[35966.124083] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap [email protected]
[35966.124094] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap [email protected]
[35966.124098] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap [email protected]
[35966.124099] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap [email protected]
[35966.124101] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap [email protected]
[35966.143663] vfio-pci 0000:0a:00.1: enabling device (0000 -> 0002)
[35966.210750] vfio-pci 0000:0a:00.0: Navi10: SOL 0x0
[35966.210752] vfio-pci 0000:0a:00.0: Navi10: device doesn't need to be reset
[35967.216980] virbr0: port 2(vnet0) entered learning state
[35969.350303] virbr0: port 2(vnet0) entered forwarding state
[35969.350305] virbr0: topology change detected, propagating
[35991.351111] [drm] SADs count is: 0, don't need to read it
[35998.251357] [drm] SADs count is: 0, don't need to read it
[36004.617643] virbr0: port 2(vnet0) entered disabled state
[36004.620556] device vnet0 left promiscuous mode
[36004.620567] virbr0: port 2(vnet0) entered disabled state
[36004.720357] vfio-pci 0000:0a:00.0: Navi10: SOL 0xc6af88b2
[36004.720359] vfio-pci 0000:0a:00.0: Navi10: performing BACO reset
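
For anyone comparing logs, the reset-related lines can be pulled out of a dmesg dump with a few lines of Python. This is only a sketch: the function name `navi_reset_events` is made up for illustration, and the pattern assumes the message format seen in the log above (a timestamp, `vfio-pci`, a PCI address, then a `NaviNN:` prefix). Other kernel or patch versions may word these messages differently.

```python
import re

# Assumed line format, taken from the log above:
# [36004.720359] vfio-pci 0000:0a:00.0: Navi10: performing BACO reset
_NAVI_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+vfio-pci\s+(?P<dev>\S+):\s+Navi\d+:\s+(?P<msg>.*)$"
)


def navi_reset_events(dmesg_text):
    """Return (timestamp, pci_address, message) for each Navi reset line."""
    events = []
    for line in dmesg_text.splitlines():
        match = _NAVI_LINE.match(line.strip())
        if match:
            events.append(
                (float(match.group("ts")), match.group("dev"), match.group("msg"))
            )
    return events
```

Passing it the text of `dmesg` saved to a file gives a short list showing when the SOL value was read and whether a reset was performed or skipped.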

Bump in the hopes that someone can help me out!

I had this same issue. The Arch Wiki fixed it for me.


Thanks, I’ve had this issue forever and this finally fixed it. It used to work fine on my card until I had to reset my config. I didn’t apply this afterwards since I wasn’t using an Nvidia card anymore and didn’t think it would cause issues on my AMD card.

@ken3: Flippin hell! You are a lifesaver! This actually fixed it for me too!

This, however, makes me quite unhappy, since I cannot believe that something like this lands in the driver by accident. The fact that all earlier driver versions worked without this mitigation tells me that this detection mechanism was not in there before. I do not know anything about driver development, but from my layman’s understanding it seems like AMD is heading in the opposite direction of where I want them to go!

I have the same problem with an RX590 with a 2700x, x470 Taichi, Debian host and a Win 10 guest. I was able to make it work thanks to this forum a year and a half ago. But now, if I update the GPU drivers, I get a black screen. I tried ken3’s solution but it doesn’t work: the VM hangs on the TianoCore screen if I add <vendor_id state='on' value='randomid'/> to the XML config.

Try the Pro series drivers instead. I think the issue relates to the AMD Link changes made for the 20.5.1 and up mainstream drivers. AMD Link does not appear to be present in the Pro Series drivers.

Have you also added this?

<kvm>
    <hidden state='on'/>
</kvm>

Yes I did.
With only

<kvm>
    <hidden state='on'/>
</kvm>

it works; indeed, that’s what I have right now. But when I also add <vendor_id state='on' value='randomid'/> it hangs.

That’s strange. Could you check whether the following works for you? It is my complete features block:

<features>
    <acpi/>
    <apic/>
    <hyperv>
      <vendor_id state="on" value="whatever"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
  </features>
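
A quick way to double-check that both settings actually ended up in the right place in a domain XML is a small Python check like the one below. It is a sketch under assumptions: `check_hiding_features` is a hypothetical helper name, and it only looks for `<hidden state="on"/>` inside `<kvm>` and a `<vendor_id>` with `state="on"` and a non-empty value inside `<hyperv>`, matching the features block above. It says nothing about whether the guest driver is satisfied with them.

```python
import xml.etree.ElementTree as ET


def check_hiding_features(domain_xml):
    """Report whether kvm/hidden and hyperv/vendor_id are switched on.

    Accepts either a full <domain> document or a bare <features> fragment.
    """
    root = ET.fromstring(domain_xml)
    features = root if root.tag == "features" else root.find("features")
    result = {"kvm_hidden": False, "hyperv_vendor_id": False}
    if features is None:
        return result

    hidden = features.find("kvm/hidden")
    vendor = features.find("hyperv/vendor_id")

    result["kvm_hidden"] = hidden is not None and hidden.get("state") == "on"
    result["hyperv_vendor_id"] = (
        vendor is not None
        and vendor.get("state") == "on"
        and bool(vendor.get("value"))
    )
    return result
```

Running it on the XML of the VM (for example the text shown by virt-manager's XML tab) makes it easy to spot a `<vendor_id>` line that was pasted outside the `<hyperv>` element.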

I changed my block:

  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
  </features>

I replaced it with your code and got the same result: it is stuck on the TianoCore screen.

I don’t get it. I’m just at a loss. It seems like other people are managing to resolve this, but I can’t fix mine. It’s just broken.

Ever since a Windows 10 update was forced on me last week, I’ve been experiencing this issue. My VM begins to start up, showing the TianoCore boot screen, then Windows loading (with a low resolution, before the driver is started). Then the Windows user login process starts, and this is normally where the graphics driver would kick in for me and I’d see my desktop in high resolution. Instead, I get a brief graphical glitch at the top of a black screen, and it just goes completely black. Completely dead.

I’ve tried rolling back to driver 20.4.2. I’ve tried rolling back to a November driver. I’ve tried installing the latest driver. I’ve tried the Radeon Pro 2020 Q4 driver. I’ve tried the KVM hidden line in my XML file, both on its own as well as together with the vendor_id line. None of these have made any difference.

I’m at a loss for what to do now. It’s just broken. I can’t use my virtual machine anymore. It works perfectly fine if I boot into the same Windows disk natively without the VM.

What can I do now?

Edit: I’m using an AMD Radeon 5700 XT by the way.

Sounds like something else might be wrong. The thing I find curious is why your driver comes on so late? For me as soon as the Windows loading icon would come on it got stuck!

BTW what distribution are you using?

Edit: It sounds like you pass through the whole block device to the VM and don’t use an image (file) to store Windows. Something that has helped me a few times already is to make a clean start. Create a new VM and at first only attach the image (in your case the block device) and start Windows with Spice. Make sure everything is set up right and works without passthrough. Then add the GPU and try again. Then slowly re-add all the specific settings, like CPU isolation, step by step. After every step, boot the VM and check whether it still works.

If this does not work either and you haven’t done so already, do the same as described above but with a fresh Windows install. Sometimes Windows is the problem in this situation.

Sorry I can’t be of more help, but this is what comes to my mind after reading your description.

That’s because of this workaround that I was using for the GPU reset bug. It basically involves using startup/shutdown scripts on Windows to enable/disable the GPU, and then suspending the entire PC to RAM between runs of the VM.

Just in case that was causing the issue, I removed the startup/shutdown scripts, but it still doesn’t work. That just causes it to hang immediately after the VM’s TianoCore boot screen. It doesn’t even go black anymore — it just freezes on the boot screen with the loading circle there at the bottom.

Fedora Silverblue 33.

That’s right.

Okay, I gave that a go. The VM works fine without GPU passthrough. I made a new VM configuration, using basically default settings except for telling it to use my Windows drive. It booted up just fine with Spice. Once the GPU is passed through, it no longer works.

I don’t know what else to try… Is that my only option left? I really don’t want to have to reinstall Windows.

Thanks.

Before you delete your VM, just create an additional new one, based on an image on your disk, to try it. If a fresh installation of Windows in another VM does not help, it is unlikely that the problem lies there, and you can keep using your old VM.

Can you please post your VM’s XML here?

I believe I’ve found the cause of the problem, although I don’t understand why.

I’d recently enabled the Windows Sandbox feature, in the Windows enable/disable features menu. I didn’t think of that as a potential cause as I don’t see why it would have anything to do with this… Yet after disabling the feature from a native Windows boot, I’m now able to launch my VM with GPU passthrough again! To confirm, I even re-enabled the Windows Sandbox feature and it again broke the VM.

I have no idea why this is causing an issue, but it seems like it is. It’s therefore unrelated to this issue, so I’m sorry for hijacking the thread.

I’m even able to run the latest AMD driver with no issues.

Thanks for the help.