Linux Host, Windows Guest, GPU passthrough reinitialization fix

Much appreciated.

You can download BIOSes from TechPowerUp.

As long as you NEVER init the card with its on-board BIOS, you can run lots of experiments to find the revision(s) that are less problematic by using an external UEFI ROM file with the card.
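If you’re doing this with QEMU/vfio-pci, here’s a minimal sketch of that workflow. The PCI address and file paths are placeholders, and dumping only works if the ROM BAR is readable on your board; a TechPowerUp download works just as well as a dumped image.

```
# Dump the card's ROM via sysfs as a baseline image (or substitute a
# download from TechPowerUp). 0000:0a:00.0 is a placeholder address.
echo 1 > /sys/bus/pci/devices/0000:0a:00.0/rom      # make the rom file readable
cat /sys/bus/pci/devices/0000:0a:00.0/rom > /usr/share/vbios/vega64.rom
echo 0 > /sys/bus/pci/devices/0000:0a:00.0/rom      # lock it again

# Then have QEMU load the external image instead of executing the
# on-board BIOS, e.g.:
#   -device vfio-pci,host=0a:00.0,romfile=/usr/share/vbios/vega64.rom
```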

e.g., I think the Sapphire cards have a fix?

According to our survey data, Sapphire has the fewest cards with the bug, but there are BIOS revisions of the Pulse that have it.

Our problem at the moment is that our staff only own WX cards, so testing others is problematic.

It’s something we’re working on.

Hmm.

I have 2x reference Vega 64s here.
1x Sapphire (purchased January 2018 - BECAUSE CRYPTO MADNESS - I got it for $100 above RRP. :smiley: )
1x XFX (purchased on Vega 64’s release day)

They shipped with different BIOS versions.

I flashed the XFX with the Sapphire BIOS when I got the Sapphire (it was newer; I figured I’d put the same BIOS on both of them).

I don’t have a setup to test with at the moment, but I believe I ran into the reset bug the last time I tried making it all work.

For what it’s worth, YMMV, etc.

Haven’t had a lot of time to do nerd stuff as of late unfortunately.

When I get time I should pull one of them out, replace it with my old GTX 760, and try to get it working with 2 different cards rather than starting out the hard way with 2 identical AMD cards…

I heard this was broken by the last Windows 10 update. Do you know if that is true?

Couldn’t you get around this by flashing the BIOS?

I didn’t read the whole thread; I’m sorry if the answer to my questions is already in here.

As far as I understood, the cards that have the reset bug can only be reset with a full PCIe adapter reset. What exactly happens during such a reset? Would it be sufficient to toggle the PERST# pin? If so, one could build a simple adapter PCB to do this.

Cheers!

No; the bus must be ready for it. Everything stalls if you just assert the pin.
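For context, the software-coordinated equivalent of toggling PERST# is the Secondary Bus Reset bit in the Bridge Control register of the port above the card, which is what the kernel’s bus-reset path flips. A hedged sketch with setpci; the bridge address 0000:00:03.1 is a placeholder:

```
# Assert, then release, Secondary Bus Reset (bit 6 of Bridge Control,
# config offset 0x3E) on the bridge above the GPU. The kernel's bus
# reset does this plus saving and restoring the device's config space,
# which is the "bus must be ready" part.
setpci -s 0000:00:03.1 3e.w=0040:0040   # assert the reset
sleep 0.01                              # hold it briefly
setpci -s 0000:00:03.1 3e.w=0000:0040   # release it
# Without restoring config space and letting the link retrain, every
# device below the bridge stalls, exactly as noted above.
```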

I get the same issue with my PowerColor Vega 64 even after applying the scripts.

Which Vega 64 did you get to work? My PowerColor Red Devil V64 doesn’t. Maybe it’s AIB-BIOS-specific.

I hear kernel 4.20 is going to have a fix for this? At least that’s what I remember from the last time I went googling.

No, there is no “fix” for this; it’s simply impossible at this point to reset the AMD SoC. Even AMD admits that this functionality is incomplete.

I am hoping the next gen GPUs restore that functionality.

The amdgpu driver has the PSP mode1 reset.
Would that be enough for passing the GPU through?
I could try to port it to QEMU as a quirk.

@Pixo, no, I already implemented this as a quirk (https://gist.github.com/gnif/a4ac1d4fb6d7ba04347dcc91a579ee36) and worked with AMD to try to get it to function. The reset works, but the card cannot be re-initialized, as it ends up in a completely broken state.

If you check the amdgpu driver source, you will see that the mode1 reset code is even commented as incomplete, and you will also note the great lengths they have gone to in order to “recover” a GPU after it has been reset, which doesn’t work.
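For anyone who wants to see this first-hand, you can poke the kernel’s reset path by hand through sysfs (the address is a placeholder for your card):

```
# Ask the kernel to reset the passed-through GPU function directly.
echo 1 > /sys/bus/pci/devices/0000:0a:00.0/reset
# On an affected Vega this returns without error, but the device
# cannot be re-initialized afterwards, as described above.
```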

Yup, same here; I desperately hope the Navi GPUs don’t have this issue, especially if it’s true they will support PCIe 4.0. My R9 390 that I upgraded from works perfectly fine.
I wonder if it has to do with HBM2 and the 2 additional parts of the card that show up when listing the PCI devices (lspci -nnv). I read somewhere, probably here at Level1, that the RX 5## cards and below aren’t affected, just Vega and Fury.
But I could be wrong.

Thank you! This was really helpful. I can confirm this works with an AMD R9 280X with a Windows 10 guest and Gentoo dom0 on Xen 4.10.

Has anyone considered a software-controlled hardware solution between the PCIe socket and the GPU card? I suppose the cost would likely sink that idea. I am sure this is way more complex than just power-cycling the card.

I read that there was a workaround until MS killed it with a Win 10 update. If I remember correctly, it disabled the card before the VM shut down. Maybe a watchdog program could respond when Windows shuts down: it would need to run inside the VM, detect the shutdown, and disable the card. I have other ideas tied into this line of thought; I just don’t know enough to know whether this would be possible or where to begin.
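As a rough illustration of that idea (hedged; the 1002:687F hardware ID is Vega 10 and is a placeholder for whatever card you pass through), the in-guest step could be a PowerShell script registered as a Group Policy shutdown script:

```
# PowerShell sketch: disable the passed-through GPU before the guest
# powers off. Match the hardware ID to your own card.
Get-PnpDevice |
  Where-Object { $_.InstanceId -like 'PCI\VEN_1002&DEV_687F*' } |
  Disable-PnpDevice -Confirm:$false
```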

I ran across this post over in the Google Groups qubes-devel group. I am reposting it here in case it gives anyone ideas.

"AMD messed up time and again (in all cards) by not connecting its PCIe bridge reset to full card reset, instead relying on some command or a series of commands beforehand.

The reset workaround magic is not yet implemented for Vega and up. I’ve been working on that for some time and am very slowly implementing the required MMIO logging to grab only command accesses, to see how the Windows driver installer hard-resets the card, since that actually works.
But the “recovery” soft reset after a GPU crash does not, so there is something missing.

As a nasty but secure workaround, you can enter S3 (suspend to RAM), which cuts power to the card, then issue resets to the card side (GPU and audio) of the PCIe bridge, and then rescan the card’s bridge.
I’m pretty sure S3 will fail with Xen in many interesting ways on many mainboards."

https://groups.google.com/forum/#!msg/qubes-devel/wbvr-_YrEFI/wCpxV3h-AwAJ
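If anyone wants to experiment with that suggestion, here is a hedged sketch of how those steps map to sysfs on the host. All addresses are placeholders (GPU function, its HDMI audio function), and as the author notes, S3 itself may misbehave under Xen:

```
# Sketch of the quoted S3 workaround.
rtcwake -m mem -s 15                               # enter S3; slot power drops
echo 1 > /sys/bus/pci/devices/0000:0a:00.0/reset   # reset the GPU function
echo 1 > /sys/bus/pci/devices/0000:0a:00.1/reset   # reset the audio function
echo 1 > /sys/bus/pci/rescan                       # rescan so the card re-enumerates
```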


Yeah, this is what I do. There was a Linux script for this posted here: Vega 64 VFIO Reset Issue Threadripper