Linux Host, Windows Guest, GPU passthrough reinitialization fix

Hmm.

I have 2x reference Vega 64s here.
1x Sapphire (purchased January 2018 - BECAUSE CRYPTO MADNESS - I got it for $100 above RRP. :smiley: )
1x XFX (purchased on release day for the Vega 64)

They shipped with different BIOS versions.

I flashed the XFX with the Sapphire BIOS when I got the Sapphire (it was newer; figured I'd put the same BIOS on both of them).

I don't have a setup at the moment to test with, but I believe I ran into the reset bug the last time I tried making it all work.

For what it’s worth, YMMV, etc.

Haven’t had a lot of time to do nerd stuff as of late unfortunately.

When I get time I should pull one of them out, replace it with my old GTX 760, and try to get it working with two different cards rather than starting out the hard way with two identical AMD cards…

I heard this was broken by the last Windows 10 update. Do you know if that is true?

Couldn't you get around this by flashing the BIOS?

I didn’t read the whole thread, I’m sorry if the answer to my questions is already in here.

As far as I understood, the cards that have the reset bug can only be reset with a full PCIe adapter reset. What exactly happens during such a reset? Would it be sufficient to toggle the PERST# pin? If so, one could build a simple adapter PCB to do this.

Cheers!

No, the bus must be ready for it; everything stalls if you just assert the pin.
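
For context, what software usually does instead of yanking PERST# is a secondary bus reset through the upstream bridge. A minimal sketch with setpci (the bridge address is a placeholder; use with care, and note this is exactly the kind of reset Vega fails to come back from):

# Find the bridge above the GPU first, e.g. with: lspci -t
BRIDGE=0000:00:03.1   # placeholder; substitute your own bridge address

# Set bit 6 (Secondary Bus Reset) of the Bridge Control register,
# hold it briefly, then restore the original value
VAL=$(setpci -s "$BRIDGE" BRIDGE_CONTROL)
setpci -s "$BRIDGE" BRIDGE_CONTROL=$(printf '%04x' $((0x$VAL | 0x40)))
sleep 0.2
setpci -s "$BRIDGE" BRIDGE_CONTROL="$VAL"
sleep 0.5   # give the link time to retrain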

I get the same issue with my PowerColor Vega 64, even after applying the scripts.

Which Vega 64 did you get to work? My PowerColor Red Devil V64 doesn't. Maybe it's AIB BIOS specific.

I hear kernel 4.20 is going to have a fix for this? At least that's what I remember from the last time I went googling.

No, there is no "fix" for this; it's simply impossible at this point to reset the AMD SoC. Even AMD admits that this functionality is incomplete.

I am hoping the next gen GPUs restore that functionality.

The amdgpu driver has the PSP mode1 reset.
Would that be enough for passing the GPU through?
I could try to port it to QEMU as a quirk.

@Pixo, no, I already implemented this as a quirk (https://gist.github.com/gnif/a4ac1d4fb6d7ba04347dcc91a579ee36) and worked with AMD to try to get it to function. The reset works, but the card cannot be re-initialized, as it ends up in a completely broken state.

If you check the amdgpu driver source, you will see that the mode1 reset code is even commented as incomplete, and you will also note the great lengths they have gone to trying to "recover" a GPU after it has been reset, which doesn't work.
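
As an aside for anyone poking at this on a newer kernel: since 5.15 there is a per-device reset_method attribute in sysfs, so you can at least see which reset mechanisms the kernel thinks it can use on a card. A minimal sketch, with the PCI address as a placeholder:

DEV=/sys/bus/pci/devices/0000:0a:00.0   # placeholder; your GPU's address

# List the reset methods the kernel considers usable for this device
cat "$DEV/reset_method"

# Trigger a reset with the selected method (needs root; device must be idle)
echo 1 | sudo tee "$DEV/reset"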

Yup, same here; I desperately hope the Navi GPUs don't have this issue, especially if it's true they will support PCIe 4.0. My R9 390 that I upgraded from works perfectly fine.
I wonder if it has to do with HBM2 and the two additional parts of the card that show up when listing the PCI devices (lspci -nnv). I read somewhere, probably here at Level1, that the RX 5## cards and below aren't affected, just Vega and Fury.
But I could be wrong.
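
For anyone who wants to check their own card, listing every function at the GPU's slot is just (slot address is a placeholder):

# Show all PCI functions the card exposes; replace 0a:00 with your slot
lspci -nnv -s 0a:00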

Thank you! This was really helpful. I can confirm this works with an AMD R9 280X with a Windows 10 guest and Gentoo dom0 on Xen 4.10.

Has anyone considered a software-controlled hardware solution between the PCIe socket and the GPU card? I suppose the cost would likely sink that idea. I am sure this is way more complex than just power-cycling the card.

I read that there was a workaround until MS killed it with a Win 10 update. If I remember correctly, it disabled the card before the VM shutdown. Maybe a watchdog program that responds when Windows is shut down? I suppose it would need to run inside the VM, detect the shutdown, and respond. I have other ideas tied into this line of thought; I just don't know enough to know if this would be possible or where to begin.

I ran across this post over in the Google Groups qubes-devel group. I am reposting it here in case it gives anyone ideas.

"AMD messed up time and again (in all cards) by not connecting its PCIe bridge reset to full card reset, instead relying on some command or a series of commands beforehand.

The reset workaround magic is not yet implemented for Vega and up.
I’ve been working on that for some time and am very slowly
implementing required MMIO logging to grab only command accesses to see how Windows driver installer hard resets the card, since that actually works.
But the “recovery” soft reset after GPU crash does not, so there is something missing.

As a nasty but secure workaround, you can enter S3 (suspend to ram) which cuts power to the card, then issue resets to card side (GPU and audio) of the PCIe bridge, and then rescan the card’s bridge.
I’m pretty sure S3 will fail with Xen in many interesting ways on many mainboards."

https://groups.google.com/forum/#!msg/qubes-devel/wbvr-_YrEFI/wCpxV3h-AwAJ
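
For reference, a rough host-side sketch of that S3 workaround, with placeholder addresses and entirely untested:

#!/bin/bash
GPU=0000:0a:00.0     # placeholder: the GPU function
AUDIO=0000:0a:00.1   # placeholder: its HDMI audio function

# Enter S3; power to the card is cut while suspended.
# This write blocks until the machine is woken back up.
echo mem > /sys/power/state

# Reset the card-side functions of the bridge
echo 1 > /sys/bus/pci/devices/$GPU/reset
echo 1 > /sys/bus/pci/devices/$AUDIO/reset

# Remove the functions and rescan so the card is re-enumerated from scratch
echo 1 > /sys/bus/pci/devices/$GPU/remove
echo 1 > /sys/bus/pci/devices/$AUDIO/remove
echo 1 > /sys/bus/pci/rescan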


Yeah, this is what I do. There was a Linux script for this posted here: Vega 64 VFIO Reset Issue Threadripper

Now this is a bit old … but it just popped up for me in a Google search, and I wanted to give the OP some love for the original post and suggest an easier approach.

For those who can't be bothered to download the driver development kit … you can get the same effect as devcon using the pnputil command (built into ALL versions of Win 10 … and earlier … I think … I've definitely used it on Win 7 and Win 10 at any rate). This does all you need. First you need to get the ID of the device you want to shut down and start up … e.g. for the graphics card I'm passing through to Win 10 (I don't pass through the audio card):

C:\WINDOWS\system32>pnputil /enum-devices /class Display
Microsoft PnP Utility

Instance ID: PCI\VEN_1002&DEV_66AF&SUBSYS_081E1002&REV_C1\4&23f93f77&0&00E0
Device Description: AMD Radeon VII
Class Name: Display
Class GUID: {4d36e968-e325-11ce-bfc1-08002be10318}
Manufacturer Name: Advanced Micro Devices, Inc.
Status: Started
Driver Name: oem1.inf

You can then pass the instance ID to pnputil /enable-device or pnputil /disable-device.

Thus my startup script is:

pnputil /enum-devices /class Display
pnputil /enable-device "PCI\VEN_1002&DEV_66AF&SUBSYS_081E1002&REV_C1\4&23f93f77&0&00E0"

And my shutdown script is:

pnputil /enum-devices /class Display
pnputil /disable-device "PCI\VEN_1002&DEV_66AF&SUBSYS_081E1002&REV_C1\4&23f93f77&0&00E0"

I keep the enum-devices command in the script in case something changes and the IDs need updating … so I can remember how. I have Win Pro, so I use the GPEdit.msc startup/shutdown scripts, but you can probably find something in Task Scheduler if you don't.

New here … so if people think I should post this as a new thread instead of pinging a super old one … let me know and I will.

The pnputil command is a must-have for all Win 10 users, as it lets you delete the old and unused drivers cluttering up your drive. After you uninstall a driver … Win 10 loves to auto-install these old uninstalled drivers onto your devices when you're trying to update to a newer version. It's always good to purge those previous versions (pnputil -e to find them and pnputil -d to delete them).

It's better than rolling back your VM. Though … frankly … I've no idea how you're supposed to cope with Windows 10 if you can't do that.

Switch on … BSOD or crash from failed auto-update … roll back Windows 10 to a week ago … Peace on earth.

Now back to ma Jonesin!

Good to have another solution, but all of this is pretty much unnecessary anymore, because the FLR/reset bug has been worked around with a DKMS module for the older cards. And on the new RX 6800 it has been fixed at last.
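
Presumably that DKMS module is gnif's vendor-reset (https://github.com/gnif/vendor-reset); if so, the install is roughly this (sketch; exact steps depend on your distro):

# Build and install the module via DKMS
git clone https://github.com/gnif/vendor-reset
cd vendor-reset
sudo dkms install .

# Load it at boot so its device-specific reset handlers register
echo vendor-reset | sudo tee /etc/modules-load.d/vendor-reset.conf

# On kernels with the reset_method attribute (5.15+), select it for the GPU
# (address is a placeholder)
echo device_specific | sudo tee /sys/bus/pci/devices/0000:0a:00.0/reset_method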

It seems there is another bug happening with at least the 7800 XT: the host locks up completely and eventually hard resets once you shut down the VM. After the reset, the GPU is not available anymore. See this bug report: https://gitlab.freedesktop.org/drm/amd/-/issues/2955

If someone stumbles over this: disabling the devices before shutdown (and then enabling them again on startup) as described above mostly fixes the issue; a good-enough workaround until there is a proper fix.

Thanks to OP for documenting this years ago :slight_smile: