Linux Host, Windows Guest, GPU passthrough reinitialization fix

In Wendell’s recent GPU passthrough live stream, he mentioned that there isn’t a reliable solution for the AMD GPU reinitialization problem: once a Windows guest VM on Linux is shut down, the AMD GPU refuses to reinitialize, and fixing this requires a reboot of the host machine.

In my personal experience, this actually seemed rather random. Sometimes the VM would boot back up, sometimes not. I found that the trigger was not the shutdown of the VM itself, but the host machine entering a sleep state while the VM was shut down after having run at least once.

The root issue, I suspect, has less to do with how Linux treats the GPU than with the behaviour of Windows while it is a guest. Mainly, the Windows guest does not signal the GPU to shut down or start up while it is running in a VM. I assume this is because on bare metal, when Windows shuts down, power to the GPU is actually cut by the motherboard/PSU, but in a guest VM this does not happen, which leaves the GPU in an initialized state. Then, if the host machine enters a low-power state and power to the GPU is significantly reduced, the GPU is non-responsive after the host wakes up.

Anyway, I found a fix/workaround. I’ve borrowed bits from around the web for this. I don’t write how-tos often, so bear with me.

(Note: all of these steps take place inside the Windows guest OS.)

Step 1

We are going to need the Windows Device Console (DevCon).

It is bundled with the Windows Driver Kit, which you can download and install like a chump, but you don’t need all the added bloat.

So instead, download and install Chocolatey, so you can install DevCon from the terminal like a bad-ass. Follow the instructions on the Chocolatey website to install it.

Once Chocolatey is installed, open a Windows command prompt and run this command to install DevCon:

choco install devcon.portable

DevCon is rather straightforward to use. It enables and disables hardware. It runs in a terminal and can easily be called from a bat script.

devcon64.exe enable|disable "<device_id>"

Step 2

Now go to Device Manager and make note of the Hardware ID for the GPU and for the GPU’s sound device. You will notice that each device lists multiple hardware IDs. We are only going to use the first half, ending where the IDs become unique. DevCon accepts wildcards, so this way we will be able to de-initialize the entire device. Do not forget the sound device. If you forget it, the Windows guest will become unstable over time after reboots, and games and 3D applications will start randomly crashing. I’ve included some example images.
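If you would rather cross-check the GPU's ID from the Linux host, the vendor:device pair that `lspci -nn` prints in square brackets (for example `[1002:67df]`) uses the same values as the `VEN_`/`DEV_` part of the Windows hardware ID. Here is a minimal sketch of a helper that turns that pair into a DevCon wildcard pattern. The function name `devcon_pattern` is made up for illustration, and it only covers the PCI function of the GPU. The `HDAUDIO\...` ID of the sound device still has to be read from Device Manager.

```shell
#!/usr/bin/env bash
# Sketch: build a DevCon wildcard pattern from an lspci-style vendor:device pair.
# devcon_pattern is a hypothetical helper name, not part of DevCon or lspci.
devcon_pattern() {
    local id=$1                 # e.g. 1002:67df, as shown by `lspci -nn`
    local vendor=${id%%:*}      # text before the colon
    local device=${id##*:}      # text after the colon
    # Windows writes the hex digits in upper case; the trailing * matches
    # the SUBSYS/REV suffixes so all variants of the ID are covered.
    printf 'PCI\\VEN_%s&DEV_%s*\n' "${vendor^^}" "${device^^}"
}
```

For example, `devcon_pattern 1002:67df` prints `PCI\VEN_1002&DEV_67DF*`, which is the pattern used in the bat files in Step 3.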

Step 3

We are now going to create two bat files. I created them under C:\ so they will be easy to find later. Maybe not the best practice, but whatever.

You can use Notepad to create the two separate bat files. The first one I called enablegpu.bat, and it contains these two lines:

devcon64.exe enable "PCI\VEN_1002&DEV_67DF*"

devcon64.exe enable "HDAUDIO\FUNC_01&VEN_1002&DEV_AA01*"

This bat file uses DevCon to enable the GPU. Use the first half of the GPU hardware ID that we made note of in Device Manager. This hardware ID will be different on your system, depending on your GPU model/AIB. Make sure to end it with a wildcard asterisk, so we get all the sub-devices. Do not forget to include the GPU’s sound device, which has a separate hardware ID.

Now create the second bat file. I called this one disablegpu.bat and saved it in the same location as the first one. It is exactly the same as the first bat file but it disables the GPU instead of enabling it.

devcon64.exe disable "PCI\VEN_1002&DEV_67DF*"

devcon64.exe disable "HDAUDIO\FUNC_01&VEN_1002&DEV_AA01*"

Step 4

Add the two bat files to the Windows Group Policy Startup and Shutdown scripts.

You can open the Local Group Policy Editor by running gpedit.msc in the Windows run prompt.

Then, navigate to Local Computer Policy > Computer Configuration > Windows Settings > Scripts (Startup/Shutdown) and select Startup. Click the Add button and select your enablegpu.bat script. Repeat this process for your disablegpu.bat script via the Shutdown item in the Scripts (Startup/Shutdown) section.

Conclusion

Now you should be set as far as user-initiated resets of your VM are concerned: whenever you explicitly shut down or reset the VM, it should start again without trouble, and it should stay stable over subsequent resets. I have noticed, however, that restarts triggered by Windows Update still seem to cause issues, perhaps because these scripts aren’t run. You’ll probably want to disable automatic updates to prevent these unwanted reboots.

Interesting read with pretty pictures :slight_smile:

So, as I understand it, this kills the power on the PCIe slot, like Wendell mentioned in some video recently?

Pretty much… It has the Windows guest turn off the GPU when it shuts down or resets, which more or less returns the GPU to the state it was in before the guest VM took control of it. This prevents it from entering a non-responsive state if the host machine enters any kind of low-power mode while the VM is shut down.

Without it, on most AMD GPUs made in the past 5 years and some Nvidia GPUs, shutting down or resetting the Windows guest VM even once leaves the GPU either unstable or unable to reinitialize when you try to start the VM. And, as Wendell has said, you then have to reboot the host machine.

This isn’t an issue with enterprise GPUs that are made for virtualization, but most consumer models seem to have it.

Remembering to add these startup/shutdown scripts has given me rock-solid performance with passed-through PCI-E GPUs. Before this, I wasn’t just seeing reinitialization problems; after the first VM reset, I would also start having lag spikes and game crashes.

So far this has fixed all stability and reinitialization issues. One thing to note: this only works if the shutdown/reset is initiated by the guest VM and not by the host machine. Otherwise the guest is not able to shut the GPU down gracefully using the scripts.
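If you do need to stop the VM from the host side, one option is to ask the guest to shut itself down rather than killing it. Here is a rough sketch, assuming a libvirt setup: `virsh shutdown` requests a guest shutdown (normally an ACPI power-button event) instead of pulling the plug the way `virsh destroy` does, which should give Windows a chance to run the shutdown script, provided the guest is set to shut down on the power button. The function name, the `VIRSH` override, the 120-second default, and the domain name `win10` are all just placeholders.

```shell
#!/usr/bin/env bash
# Sketch: request a graceful guest shutdown and wait for it, never forcing it.
# VIRSH may be overridden, e.g. VIRSH="virsh -c qemu:///system" (assumption).
VIRSH=${VIRSH:-virsh}

graceful_shutdown() {
    local domain=$1           # libvirt domain name, e.g. win10 (hypothetical)
    local timeout=${2:-120}   # seconds to wait before giving up
    local waited=0

    # Ask the guest OS to shut down; this does not cut power to the VM.
    $VIRSH shutdown "$domain" || return 1

    # Poll until libvirt reports the domain as "shut off".
    while [[ "$($VIRSH domstate "$domain")" != "shut off" ]]; do
        if (( waited >= timeout )); then
            echo "$domain still running after ${timeout}s; not forcing it off" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Usage would be something like `graceful_shutdown win10 180`. The sketch deliberately gives up instead of falling back to `virsh destroy`, since a forced stop skips the guest's shutdown script and is exactly the case described above.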

So now what I’m wondering… If the GPU is uninitialized properly and can be reinitialized properly the next time the VM starts up, couldn’t this also be used to reinitialize that very same GPU on the host instead?

As far as I understand, you currently need a dedicated GPU that cannot be used on the host when it’s supposed to be passed through, right? Or are those two separate issues?

Cool, will this also work with the Vega reset bug?

Yes, it will. I’ve tested it with the Fury X, R9 290, R9 390X, RX 480, RX 550, and RX Vega 64.

Oh god, you are a saint! Now I know what I’ll do this weekend!

Well… you set the host up to not initialize the GPU in the first place. I am pretty sure this wouldn’t work from the host to get an unresponsive card to start working again, so you don’t have to reboot the entire machine. I am pretty sure that once you get the card into an unresponsive state, by not de-initializing it properly, then letting the computer enter a low power state, your only choice is to reboot the whole system.

Reading through these forums, it seems like most people are having this problem. I have been surprised that everyone on the Level1Techs forums seems to be in the dark, because the ArchWiki actually goes into this in detail:

When the VM shuts down, all devices used by the guest are deinitialized by its OS in preparation for shutdown. In this state, those devices are no longer functional and must then be power-cycled before they can resume normal operation. Linux can handle this power-cycling on its own, but when a device has no known reset methods, it remains in this disabled state and becomes unavailable. Since Libvirt and Qemu both expect all host PCI devices to be ready to reattach to the host before completely stopping the VM, when encountering a device that won’t reset, they will hang in a “Shutting down” state where they will not be able to be restarted until the host system has been rebooted. It is therefore recommended to only pass through PCI devices which the kernel is able to reset, as evidenced by the presence of a reset file in the PCI device sysfs node, such as /sys/bus/pci/devices/0000:00:1a.0/reset.

Most consumer GPUs lack a way for the OS to reset them once they become unresponsive, while most enterprise GPUs have this ability.

You can run this bash command to see which devices expose a reset file, and so can be reset by the kernel.

for iommu_group in $(find /sys/kernel/iommu_groups/ -maxdepth 1 -mindepth 1 -type d);do echo "IOMMU group $(basename "$iommu_group")"; for device in $(\ls -1 "$iommu_group"/devices/); do if [[ -e "$iommu_group"/devices/"$device"/reset ]]; then echo -n "[RESET]"; fi; echo -n $'\t';lspci -nns "$device"; done; done

If you pass through a PCI-E device that isn’t listed as having a reset file in its PCI device sysfs node, you need to make sure to add startup/shutdown scripts in the guest OS for that device. Otherwise the passed-through device will become unresponsive and require the host machine to be power-cycled.
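If you only care about the one or two devices you actually pass through, the same check can be written as a small function instead of walking every IOMMU group. This is a minimal sketch: the function names and the `SYSFS_ROOT` variable are made up here (the override only exists so the check can be tried against a fake directory tree), and the addresses are placeholders for your own.

```shell
#!/usr/bin/env bash
# Sketch: check specific PCI devices for a sysfs "reset" file.
# SYSFS_ROOT is a hypothetical override; on a real host it stays at /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Succeeds (exit status 0) if the kernel exposes a reset file for the device.
has_reset_file() {
    local address=$1   # full PCI address, e.g. 0000:01:00.0
    [[ -e "$SYSFS_ROOT/bus/pci/devices/$address/reset" ]]
}

# Prints a one-line verdict for every address given on the command line.
report_reset_support() {
    local address
    for address in "$@"; do
        if has_reset_file "$address"; then
            echo "$address: reset file present"
        else
            echo "$address: no reset file, add guest startup/shutdown scripts"
        fi
    done
}
```

For a GPU and its audio function you would call something like `report_reset_support 0000:01:00.0 0000:01:00.1`, using the addresses that `lspci` shows for your card.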

Do Linux or Mac guests suffer from this problem too?

I didn’t post a guide for Linux/Mac guests because there are plenty of guides out there for adding startup/shutdown scripts for devices on UNIX-based systems. Anyway, a lot of the Linux distros that I’ve used with passed-through devices are actually well behaved and don’t have sloppy startup/shutdown procedures… Windows does.

Nono, that’s not what I meant :smiley: I didn’t mean to reset it from the host. What I mean is would you be able to deinitialise and reset the GPU in the guest and then be able to initialise it in the host. If you could then deinitialise and reset it in the host, would you be able to initialise it again in the guest?

I don’t see why you couldn’t. I have never tried it. But… I vaguely remember (while I was searching around for the root cause and possible solutions to the non-responsive GPU problem) coming across some docs and spreadsheets, on Google Docs of all places, where people were discussing doing exactly what you are describing on laptops that had only one GPU.

Hmmm… that might actually make it possible to play both native Linux games and Windows games on passthrough without needing to buy two expensive gaming cards and having one of them sit unused at any given time… mmmhhhh :thinking:

On a sidenote… @wendell ping

I thought the RX cards didn’t suffer from the bus reset issue. Mine doesn’t.

I think Wendell said in the NPT bug stream/video that it doesn’t affect every manufacturer. Supposedly you can sometimes even fix it by flashing a competing manufacturer’s vBIOS.

Nice. Ejecting the GPU via “safely remove devices” is a workaround that has worked at times; I used it some time ago with a Radeon 7790. This looks like the same principle, but you managed to automate it.

So I tried to follow these steps on my system, but apparently Windows 10 Home doesn’t even have gpedit.msc. So…yeah. I’m a Linux guy so I’m kind of stuck on that step. I have no idea how else you might do it.

Yes, Group Policies are a Pro thing.

I had issues with my Gigabyte G1 RX480, and I think flashing the RX580 mod helped there. I’m not sure whether this was directly related to the reset bug, because the VM continued to run for several seconds after reaching the GUI.

More interestingly, I encountered a different bug when upgrading to the beta drivers from earlier this month: the host locked up the second time I booted the VM, immediately when Windows switched from the boot screen to the GUI. The card is in the primary slot and is therefore already initialized by the UEFI before VFIO claims it, and booting the VM the first time still worked fine. Reverting to the stable driver fixed this. This tells us that even an apparently non-buggy card/vBIOS is left in a weird state after a reset.

Apparently. But it’s a bit ridiculous that you would need that just to have startup/shutdown scripts. Is there any other way to do it?

I mean, I only use Windows for gaming and…wait, there is no ‘and’. It’s literally just gaming. Thus I can’t really see much value in Pro…