Linux Host, Windows Guest, GPU passthrough reinitialization fix

fae76 · February 18, 2018, 4:05pm

From what I have read, 4.17 is more likely, and even that is questionable. If this is a HW bug we might be out of luck and never get a fix or workaround. And at some point I really have to migrate from my old rig, which has been running 24/7 for 8+ years now.

randomized · February 18, 2018, 4:27pm

One could test 4.16-rc to see if there is a fix, since AMD has done a lot of work for it.

Pixo · February 19, 2018, 7:40pm

Can you check the lspci -vv output for your Vega and its audio device?
I had a problem where, after a reboot, my PCIe header went to state 127 and the device was unusable.
If your audio is in a separate IOMMU group, you could try passing only the GPU while the audio stays bound to the snd_hda_intel driver.
During my debugging I found that this allowed me to reboot the guest.
I had performance degradation because of it, but you can at least test.
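
A minimal sketch of the two checks described above, reading sysfs directly instead of eyeballing lspci output. It assumes the standard Linux paths /sys/kernel/iommu_groups and /sys/bus/pci/devices, and it assumes that "state 127" refers to what lspci reports as "Unknown header type 7f" (0x7f = 127), which is what a device reads as once it stops responding on the bus. The PCI addresses are hypothetical placeholders; substitute the ones lspci shows for your card.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted PCI addresses} from the sysfs IOMMU tree."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or the path is not exposed
    for name in os.listdir(root):
        dev_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(dev_dir):
            groups[int(name)] = sorted(os.listdir(dev_dir))
    return groups


def group_of(addr, groups):
    """Return the group number that contains PCI address addr, or None."""
    for number, devices in groups.items():
        if addr in devices:
            return number
    return None


def header_type(addr, root="/sys/bus/pci/devices"):
    """Read config space byte 0x0E with the multi-function bit masked off."""
    with open(os.path.join(root, addr, "config"), "rb") as f:
        f.seek(0x0E)
        return f.read(1)[0] & 0x7F


if __name__ == "__main__":
    GPU, AUDIO = "0000:0a:00.0", "0000:0a:00.1"  # hypothetical addresses
    groups = iommu_groups()
    print("GPU group:", group_of(GPU, groups))
    print("audio group:", group_of(AUDIO, groups))
    for addr in (GPU, AUDIO):
        if os.path.exists(os.path.join("/sys/bus/pci/devices", addr, "config")):
            # 0 is a normal endpoint; 127 means the device reads as all ones
            print(addr, "header type:", header_type(addr))
```

If the GPU and its audio function report different group numbers, passing only the GPU is possible without the ACS patch; if they share a group, both have to be detached from host drivers.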

For me kernel 4.15 fixed this problem.
Or it may have been the new AGESA + 4.15 kernel that ultimately fixed it.
Also, reboot may depend on the card BIOS.
I have read that some RX 580s cannot be reset while others don't have this problem, all depending on the manufacturer.

My configuration is as follows:
kernel 4.15.3
Q35 emulated chipset
passing vBIOS from a file for GPU

Also using iommu=pt as a kernel parameter because of the "AMD-Vi: Completion-Wait loop timed out" error.
It may not be relevant, as I only need it because of the AMD GCN card in my chipset slot.
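
A small sketch for confirming that the running kernel was actually booted with iommu=pt, assuming the usual /proc/cmdline file. The parser is deliberately simple and does not handle quoted values that contain spaces.

```python
import os


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        params = read_cmdline()
        # Prints "pt" when passthrough mode was requested at boot
        print("iommu =", params.get("iommu", "(not set)"))
```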

fae76 · February 19, 2018, 8:38pm

Thanks a lot for your help Pixo.
I will try your suggestions as soon as I’m back from ski-holidays (probably Sun evening).

randomized · February 21, 2018, 6:39pm

I find that one would need the ACS patch, since ACS (at least on this motherboard) is bugged and won't allow the devices to be in their own IOMMU groups. Something I've noticed, though, with the Vega Frontier Edition: if I switch into gaming mode, I am able to reboot the VM without rebooting the host, but in professional mode it is random. I could try passing only the GPU and not the audio portion, but I do not believe it will make too much of a difference. Honestly, I hope we get a real solution, because the patch from gnif doesn't work for me either.

randomized · February 21, 2018, 6:44pm

Also, OP should mention that the scripts may cause a crash when updating the GPU drivers as well. I noticed that updating (or switching between modes) while the scripts are enabled could crash the computer before the update completes. Removing the scripts fixed the issue.

Pixo · February 28, 2018, 9:29pm

So I upgraded to a 1920X because of the PCIe lanes.
Using gnif's, ACS and Zen SMT kernel patches to pass all the devices I need.
QEMU patched for Zen SMT and cache CPUID.
No problems so far. I can reboot or shut down and start up as I see fit without problems.
All passed devices are connected to the first MCM, as is the chipset.
I wanted to have the devices connected to the second MCM, but could not get Vega working when not using CSM, booting in EFI mode.
When using CSM, the boot GPU is the one in the first slot, which is wired to the second MCM.
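
A rough sketch for checking which die each passed device hangs off. It assumes each MCM shows up as its own NUMA node (which depends on the BIOS memory interleaving setting) and reads the standard numa_node attribute under /sys/bus/pci/devices; a value of -1 means the kernel has no locality information for that device. The addresses are hypothetical placeholders.

```python
import os


def numa_node(addr, root="/sys/bus/pci/devices"):
    """Return the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    with open(os.path.join(root, addr, "numa_node")) as f:
        return int(f.read().strip())


def devices_by_node(addrs, root="/sys/bus/pci/devices"):
    """Group PCI addresses by the NUMA node they are attached to."""
    nodes = {}
    for addr in addrs:
        nodes.setdefault(numa_node(addr, root), []).append(addr)
    return nodes


if __name__ == "__main__":
    wanted = ["0000:0a:00.0", "0000:0a:00.1"]  # hypothetical addresses
    present = [a for a in wanted
               if os.path.exists(os.path.join("/sys/bus/pci/devices", a, "numa_node"))]
    print(devices_by_node(present))
```

Pinning the guest's vCPUs and memory to the same node as the passed devices is the usual reason to care about this mapping.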

I pass the following devices:

RX Vega 64 + audio
NVMe disk as pcie device
SOC integrated audio + USB controller
Renesas-based USB controller
Intel NIC connected to chipset

Using the Q35 emulated chipset and connecting all devices to emulated pcie-root-port devices.
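
A hedged sketch of what that topology looks like as QEMU arguments: a Q35 machine with one pcie-root-port per passed device and a vfio-pci device behind each port. This is not the attached config, just an illustration. The host addresses, the port ids (rp1, rp2, ...) and the ROM path are hypothetical, and a real setup also needs CPU, memory, firmware and disk options.

```python
def passthrough_args(hosts, romfiles=None):
    """Build QEMU arguments placing each host PCI device behind its own root port.

    hosts:    host PCI addresses to pass through, in order
    romfiles: optional {host address: path to a vBIOS dump}
    """
    romfiles = romfiles or {}
    args = ["-machine", "q35"]
    for index, host in enumerate(hosts, start=1):
        port = f"rp{index}"
        # Every root port needs its own chassis/slot pair
        args += ["-device",
                 f"pcie-root-port,id={port},bus=pcie.0,chassis={index},slot={index}"]
        device = f"vfio-pci,host={host},bus={port}"
        if host in romfiles:
            device += f",romfile={romfiles[host]}"
        args += ["-device", device]
    return args


if __name__ == "__main__":
    args = passthrough_args(
        ["0000:0a:00.0", "0000:0a:00.1"],           # hypothetical GPU + audio
        {"0000:0a:00.0": "/path/to/vega64.rom"},    # hypothetical vBIOS dump
    )
    print(" ".join(args))
```

Giving each device its own root port keeps the guest topology simple; putting the GPU and its audio function on one port as a multifunction device is another common layout.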

I wish my board BIOS would use a newer AGESA, if there is one for TR, that splits all the SoC devices into separate IOMMU groups like on Ryzen.
My second wish would be for PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b1] to support ACS so the chipset devices would be split.

Also, MS could fix its L3 topology decoding for the Zen architecture on Win10.

fae76 · March 1, 2018, 1:15pm

@Pixo what is your exact HW config? Do you have multiple GPUs? Could you please share your configs and kernel version? Thank you very much!

Pixo · March 1, 2018, 6:42pm

I completely forgot how I got the Vega to behave until I had to reinstall Windows.
Here is the link with the information that I followed:
[Linux Host, Windows Guest, GPU passthrough reinitialization fix] (https://forum.level1techs.com/t/linux-host-windows-guest-gpu-passthrough-reinitialization-fix/121097)

MB: ASRock Fatal1ty X399 Professional Gaming
CPU: TR 1920X
RAM: 2x8GB + 2x16GB G.Skill 3200 C14 Trident Z, memory interleaving set to channel to expose NUMA
GPU: R9 290 (primary), RX 550, Vega 64
kernel: 4.15.6
config: win.txt (6.6 KB)

mihawk90 · March 1, 2018, 9:10pm

What if I told you that is the exact same thread you were posting this in

Smerrills · March 2, 2018, 4:15am

^That, right there, is the best post on this thread! hahaha! And I read this thread every time I get a notification it’s been updated! Priceless!! Hahaha!

Pixo · March 2, 2018, 6:47am

Oops… got lost somewhere, or probably merged the 2 threads in my head.
Now I have to find the correct thread, something about Vega on Threadripper.

fae76 · March 6, 2018, 8:21am

Thank you very much for the HW specs and config file.

tkoham · July 29, 2018, 8:13pm

Sorry to necro (ish), but this seems to be the best place to ask rather than starting a new thread, as everyone here has AMD hardware with the reset bug.

Can anyone confirm that the 4.16 branch with updated QEMU ACTUALLY fixes the reinit issues you were having on Vega, or any other series?

in general, yes, but I’ve not seen any that replicate this same functionality.

If you know of any that do, could you point me to them?

gnif · July 29, 2018, 11:05pm

There is no fix for the Vega FLR bug… AMD’s engineers have confirmed that they know of this bug, and it is a firmware and/or hardware bug.

tkoham · July 29, 2018, 11:08pm

My suspicion is that it’s firmware, as we’ve seen BIOS revisions shipped with non-reference cards where the bug is mitigated or not present.

The reason I ask is that I’ve been seeing anecdotal reports of it being fixed in recent kernels. I doubted it was true, but I needed to independently verify that to some extent.

x3sphere · July 30, 2018, 12:41am

fwiw, my Vega 64 Liquid Cooled model didn’t have any reset issues on kernel 4.15. The LC model I think has a different BIOS than the other reference cards, however. This was on a Threadripper 1950X system with Zenith Extreme.

I ended up moving over to a 1080 Ti though.

tkoham · July 30, 2018, 12:53am

that’s what I’ve seen, and I can’t find good info on the reference bios differences other than power limit and mem straps

gnif · July 30, 2018, 2:03am

I will follow up with my contact at AMD and see if I can get an answer on this.

tkoham · July 30, 2018, 2:08am

much appreciated.