Linux Host, Windows Guest, GPU passthrough reinitialization fix

Well, it depends. I’m not sure at what point in the startup (and shutdown) sequence the Group Policy scripts get executed, and the answer hinges on that a bit.

The startup script is not a big deal, just put it in the autostart folder (that could work, not sure though).

With shutdown… it’s a little tricky. Maybe use a .bat to run the reset and then the shutdown from the command line? I assume the GUI will crash when the GPU is disconnected, so you wouldn’t be able to hit shutdown otherwise. Could try at least :confused:
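Something along these lines, maybe (completely untested; the hardware ID below is just a placeholder, grab the real one from Device Manager under Details > Hardware Ids):

```bat
@echo off
rem Untested sketch: disable the passed-through GPU, then power the guest off.
rem Run from an elevated command prompt inside the guest.
rem Replace the hardware ID with the one your card actually reports.
devcon64 disable "PCI\VEN_1002&DEV_687F*"
shutdown /s /t 0
```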

Ah, glad to hear. I’ve got the MSI 580.

Interesting. On a somewhat related note, I’m having lockups trying to rebind an initialized GPU to VFIO as well. I’ve been trying to write QEMU hooks that automatically read the PCI devices attached to the VM, unbind the driver, bind it to VFIO and then boot the VM, all through libvirt. It’s a process, but I think I’m going to have to abandon Fedora because of inconsistent results (and the absolutely shit initrd support). Never thought I’d move back to Arch for stability. Can someone look out the window to check for flying pigs?
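Roughly what I’m aiming for is something like this, except reading the devices out of the domain XML instead of hard-coding them (the VM name and PCI addresses here are just placeholders):

```bash
#!/bin/bash
# Sketch of /etc/libvirt/hooks/qemu - placeholders only, adjust to your setup.
GUEST="win10"                                # libvirt domain name (placeholder)
DEVICES="pci_0000_0d_00_0 pci_0000_0d_00_1"  # GPU + HDMI audio (placeholders)

if [ "$1" = "$GUEST" ]; then
    case "$2" in
        prepare)
            # Detach from the host driver and hand the devices to vfio before the VM starts
            for dev in $DEVICES; do virsh nodedev-detach "$dev"; done
            ;;
        release)
            # Give the devices back to the host once the VM is fully shut down
            for dev in $DEVICES; do virsh nodedev-reattach "$dev"; done
            ;;
    esac
fi
```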

I’m hoping someone comes up with a Windows service that does nothing except bring the GPU up and down at startup/shutdown.

This should allow it to work with updates and installs of things requiring restart.

The ability to recover after a VM is killed shouldn’t really be the use case worth optimizing for.

I’ve been looking for ways to use these scripts without gpedit.msc, and this has been the most solid info I could find:

But I have absolutely nothing under HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\System or HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Group Policy\State\Machine\Scripts, so I was hoping that somebody who has added these scripts could give me an example of what these keys should look like.

I guess I might also need the contents of C:\WINDOWS\SYSTEM32\GroupPolicy.
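From what I can piece together so far, the file side is supposed to look roughly like this, though I haven’t been able to verify it myself (which is why I’m asking); the script paths are just examples:

```ini
; C:\WINDOWS\System32\GroupPolicy\Machine\Scripts\scripts.ini (hidden file)
; Example paths only; point these at wherever your devcon scripts live.
[Startup]
0CmdLine=C:\scripts\gpu-enable.bat
0Parameters=

[Shutdown]
0CmdLine=C:\scripts\gpu-disable.bat
0Parameters=
```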

I read something along those lines as well, but someone else on the post I read said it wasn’t working for him because he is also only on Home. That’s why I didn’t link it :confused: Probably should have either way.

/edit
Luckily Google spat out the same link this time:

Did you try the manual thing with CMD shutdown I suggested earlier?

This is exactly why I stated in the NPT thread that I believe this is not a software issue. If I can get my hands on some hardware that exhibits this behavior, I believe I can fix it in KVM (VFIO), which would then remove the need for any kind of guest workarounds.

Not yet. I suppose I should, but I’m a little skeptical about the timing of the startup sequence and such. There are a few options for startup scripts, but I’m pretty sure most of them run pretty late…

I was thinking I might just clone the whole Windows VM and then I can mess with the clone with impunity. But I haven’t had time to do that yet.

Yeah that would be best :smiley:

Hey, I was going through your GPU passthrough reinitialization fix and I was wondering if you have hit this issue.

I recently got passthrough working the way I wanted via i440FX, but I was having the issue where I needed to reboot the host to boot the VM again. I tried your step-by-step and am hitting this new error when launching: Unknown PCI Header Type.

The only thing I changed from your step-by-step was the hardware IDs, to match my Vega 64 (which I’ve read you’ve got working).

Did you encounter this issue or know how to resolve it?

EDIT: I did not mean to revive this thread, but since I did I’ll leave it up in the hopes that someone has encountered (and hopefully fixed) this problem.

EDIT: Should probably post some specs, just in case:

Ryzen 1700X
ASRock AB350 Gaming K4
Vega 64 in PCIe x16
1050 Ti in PCIe x4
M.2 NVMe SSD through the SATA controller for the VM

That error is most definitely a side effect of incorrect reinitialization. My Fresco Logic USB 3.0 controller has the same reset bug as Vega and does this if you shut down the VM and try to start it again. You may need a lower-level solution than startup and shutdown scripts in the guest OS, something patched into KVM itself, as @gnif said earlier in the thread.

I see. Well, that’s unfortunate! Thanks for the response!

I have a Vega 64 that has the same bug.
It is resolved or worked around in the 4.15 rc series.

Hello,
I’m doing a Hyper-V 2016 build with GPU passthrough. Close to completion (will probably post a full guide when I’m done!), but this GPU reset thing is killing me.

First off, I have two GPUs to test: an R9 270X, where this is simply a hardware bug according to sources on the internet, so let’s ignore that one for now. I also have an AMD RX 480, and I swear, in the last days of testing there was no reset bug at all with this card! It worked just fine through countless VM reboots.

Today I reinstalled everything from scratch to get a clean system and document everything for the guide I want to post, and now I can’t reboot my VM anymore! Reinitialization bug: black screen, nothing happens (probably a BSOD, but I can’t see it).

One thing I noticed during extensive testing is that when the guest VM has the default Microsoft RX 480 drivers installed (those from 2017-04), the bug doesn’t happen. Once I install either the 2017-12-06 or the 2018-01-12 original AMD drivers, the bug appears. I just wanted to leave this piece of info here; maybe it helps someone.

EDIT: Here’s the guide if anyone cares

Just posting my results with a Fresco Logic controller that has a reset issue:

With the generic Windows built-in driver, devcon64 properly enables and disables the controller.

With the official Fresco Logic driver, the controller DOESN’T properly uninitialize and freezes the host when restarting.

So if you’re using Fresco Logic, use the Windows built-in driver rather than the official driver.
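For reference, the enable/disable test is nothing fancier than this (the hardware ID below is just an example; use whatever your controller shows under Device Manager > Details > Hardware Ids):

```bat
rem Example only; substitute your controller's hardware ID from Device Manager.
devcon64 disable "PCI\VEN_1B73&DEV_1100*"
devcon64 enable  "PCI\VEN_1B73&DEV_1100*"
```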

I can confirm that with the NPT fix everything works fine for me, with no compiling or applying any patch needed.
My specs:
ASUS F2A85-V Pro mobo (supports IOMMU)
AMD A10-5800K CPU
Radeon R7 260X

Debian 9 with kernel 4.14.0-0.bpo.3-amd64

Thanks, but the NPT issue is not the same as the reset problem.

Well, I had a working system capable of rebooting properly… until I added a second VirtIO disk and/or marked a VirtIO disk read-only. Then devcon64 completely stopped fixing the initialization bug. Only once Windows had figured out what to do with the changed hardware config did it start working again. Adding drives can make devcon64 over-sensitive to changes.

This is not working for me. I can initialize my Vega 64 once after I have put the host into suspend.
Once I shut down/reboot the Win10 guest (with the guest scripts in place as described above), my Vega refuses to display anything on the screen, even though I can reboot the VM successfully (blind login: I can ping it and I can Alt+F4 to shut the VM down). I have no clue why. Any ideas?
What chipset are you guys using for the Windows guest, i440FX or Q35? So far I have only tested i440FX (the default for Windows guests).

Here are my logs from the host during VM shutdown:

kernel: AMD-Vi: Completion-Wait loop timed out
kernel: AMD-Vi: Event logged [
kernel: IOTLB_INV_TIMEOUT device=0d:00.0 address=0x00000007fe18f880]
kernel: AMD-Vi: Event logged [
kernel: IOTLB_INV_TIMEOUT device=0d:00.0 address=0x00000007fe18f8a0]
kernel: AMD-Vi: Event logged [
kernel: IOTLB_INV_TIMEOUT device=0d:00.0 address=0x00000007fe18f8c0]

This is what I get once I try to restart the VM a second time without suspending the host first:

kernel: vfio-pci 0000:0d:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

running kernel: 4.15.4-300.vanilla.knurd.1.fc27.x86_64

Yeah, I’ve only tested on Intel motherboard chipsets.

I forgot to mention before: I’ve disabled S-states S3 and above in the mobo BIOS. Not all mobo BIOSes have this option. :disappointed:

To be safe I also make sure to disable suspend and monitor power-off in the GNOME power settings. Then I just use Win + L when I go AFK. The GNOME lock screen will blank/power off the monitors after about a minute, but not power off the GPUs. When combined with the devcon enable/disable startup/shutdown scripts, this consistently works. I have tested now with Ubuntu 16.04 LTS all the way to Fedora 27 with the 4.15+ kernels installed. Without all this, I found that on Ubuntu 16.04 LTS the guest GPU doesn’t just fail to reinitialize; the host sometimes actually freezes (hard locks) when the guest is rebooted.

I was really excited when I heard from a bunch of people that this was all fixed in kernels 4.15 onward. Broke out all my AMD cards and put myself through hell testing. Found this to be Fake News Bigly on most cards. Was a real letdown.

I’ve noticed in a bunch of AMD documentation that they consistently state they don’t test or validate S4 power states, even for the AMDGPU-PRO drivers and Radeon Pro GPU firmware.

kernel: vfio-pci 0000:0d:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

Yeah, I’ve seen things similar to this in the VM logs on my host machine a lot. I’ve actually seen similar messages in dmesg that don’t pertain to VFIO at all, just the PCI ROM header and the GPU.

It has sparked a glimmer of an idea for me. There is a mod/hack that GPU crypto miners who BIOS-mod their AMD cards use to get them to function properly after they modify the firmware/BIOS on their cards. They flash some code onto their cards to disable the hardware’s stupid checks/validation of the signature that AMD/AIBs sign their firmware/BIOSes with and that the cards check every time they’re initialized. No idea if this will improve things, but it might be worth a try.

I have turned off every power-saving option I could find in my BIOS. It doesn’t make any difference. I can only initialize my Vega once.
This is so frustrating; I am at the point where I wish I had gone with a team blue & green build instead of supporting the underdog and going with a pure team red build.

Unraid would have been the perfect solution for me. Any other hypervisor capable of GPU passthrough would have done the trick as well. I could have run Unraid or any other NAS distro in a VM, along with anything else I need, including Win10. Doesn’t work with my hardware, it seems.
At this point I’m basically out of options and have to seriously consider going Win10 + Storage Spaces + Hyper-V to run my server VMs instead of the other way around. Rebooting all the VMs because the Win10 host needs to reboot for updates is BS and is going to drive me mad…

I seriously hope someone is coming up with a solution.