[SOLVED] Testers Needed - PCI Passthrough with 4.19rcX - PCI Reset Regression?

Hi all,

I’ve been trying to narrow down a regression I’m experiencing with the 4.19 release candidates, and I need a few more people to test and report back before I dig into it further and submit a bug report with the findings.

- Arch Linux with linux-mainline-vfio 4.19rc3, but I tested rc1 and rc2 as well
- I start the VM via virt-manager, so I am using libvirtd/qemu to run it
- Last working version for me is 4.18.5 (linux-vfio package)
- Asus Z270E w/ i7-7700K
- Asus STRIX 1080Ti (guest)
- MSI Vega 64 (host)

I believe the issue is related to PCI device resets. I noticed some commits going into 4.19 in that area, mostly related to the new Vega cards, and I also saw a function related to PCI device reset being renamed, but I haven’t been able to narrow down yet whether it’s pci/kvm/iommu/vfio related exactly. I also don’t have a second PCI device to pass through, so I can’t confirm whether what I’m experiencing affects any PCI device or just a GPU.
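For anyone testing along, here is a rough sketch of the sysfs check I mean (Python only because it was handy, nothing official). The device addresses in it are placeholders from my box, so swap in the ones lspci shows for your passed-through devices. The presence of the `reset` file means the kernel found a per-function reset method for that device; if it’s missing, vfio has to fall back to a slot/bus reset, which is the code path I suspect changed.

```python
#!/usr/bin/env python3
"""Rough check: which passed-through devices expose a function-level reset?

The BDF addresses below are placeholders from my setup -- replace them with
the ones lspci shows for your passed-through devices.
"""
import os

DEVICES = {
    "GPU (STRIX 1080Ti)":  "0000:01:00.0",   # placeholder address
    "GPU audio function":  "0000:01:00.1",   # placeholder address
    "SATA controller":     "0000:00:17.0",   # placeholder address
    "USB controller":      "0000:00:14.0",   # placeholder address
}

for name, bdf in DEVICES.items():
    base = f"/sys/bus/pci/devices/{bdf}"
    if not os.path.isdir(base):
        print(f"{name:22s} {bdf}: not found (wrong address?)")
        continue

    # Bound driver, if any (should be vfio-pci for passthrough devices).
    drv_link = os.path.join(base, "driver")
    driver = os.path.basename(os.readlink(drv_link)) if os.path.islink(drv_link) else "none"

    # The 'reset' attribute only exists when the kernel found a
    # function-level reset method (FLR, PM reset, device-specific, ...).
    has_reset = os.path.exists(os.path.join(base, "reset"))

    print(f"{name:22s} {bdf}: driver={driver:10s} function-reset={'yes' if has_reset else 'no'}")
```

If the GPU turns out to be the only device without a function-level reset, that would at least fit the symptoms, but treat it as a data point rather than a conclusion.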

What I’m experiencing: when booting the VM for the first time, it boots fine. I do notice it takes longer than on previous kernels to show the TianoCore logo (4.18.5 was actually the fastest), but it seems roughly on par with 4.17.11. However, when either shutting down and starting the VM again or rebooting the running VM, it hangs with one core pinned at 100% and the logo never appears. The kicker is that if I suspend the host system and then try to start the VM, it starts fine. Without suspending, after forcing the VM off, it will boot if I remove the GPU from the config. I also pass through a SATA controller and a USB controller, and those seem to reset properly, since passing them through across multiple boots causes no issues, so in my case it seems to be specific to the GPU.

While suspending is a workaround, it’s certainly not how previous kernels behaved, so I don’t think it’s related to anything in user space such as libvirt or qemu, though I did try git versions of both in case there were code changes around releasing/grabbing PCI devices, and it made no difference. I’ve also looked at the power states of the device, and it appears to be changing states properly: it’s in D3 when the VM is off, switches to D0 after the VM is powered on, and goes back to D3 after shutdown. I think vfio is handling the device correctly (libvirt won’t force-reset it when vfio is the driver), but that’s what I’m trying to narrow down, so if at least a couple of others can confirm the same behaviour, I’ll know it’s broader than just my setup.
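For reference, this is roughly how I’ve been double-checking the D-state: reading the PowerState field of the PM capability straight out of config space rather than trusting a tool’s summary. Again just a sketch, the default address is a placeholder for the GPU, and it needs root because unprivileged reads of the config file stop at the 64-byte header.

```python
#!/usr/bin/env python3
"""Print the PCI power state (D0/D1/D2/D3hot) of one device.

Walks the capability list in config space and decodes the PowerState
field of the PM Control/Status register. Run as root; pass the BDF of
the device you care about (the default below is just a placeholder).
"""
import sys

BDF = sys.argv[1] if len(sys.argv) > 1 else "0000:01:00.0"  # placeholder BDF

with open(f"/sys/bus/pci/devices/{BDF}/config", "rb") as f:
    cfg = f.read(256)

if len(cfg) < 256:
    sys.exit("need root: unprivileged reads stop at the 64-byte header")

# Status register (offset 0x06), bit 4 = capability list present.
status = int.from_bytes(cfg[0x06:0x08], "little")
if not status & 0x10:
    sys.exit(f"{BDF}: no capability list")

# Walk the list starting at the capabilities pointer (offset 0x34).
ptr, seen = cfg[0x34] & 0xFC, set()
while ptr and ptr not in seen:
    seen.add(ptr)
    cap_id, nxt = cfg[ptr], cfg[ptr + 1]
    if cap_id == 0x01:  # Power Management capability
        pmcsr = int.from_bytes(cfg[ptr + 4:ptr + 6], "little")
        state = pmcsr & 0x3
        print(f"{BDF}: D{state}" + ("hot" if state == 3 else ""))
        break
    ptr = nxt & 0xFC
else:
    print(f"{BDF}: no PM capability found")
```

If you run it before starting the VM, after shutting it down, and after a failed restart and see the same D3 → D0 → D3 transitions I do, that’s one more data point.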

Thanks in advance! Also, if anyone has ideas about what I should look at more specifically, let me know. I’d like to try bisecting the kernel to narrow it down to the exact commit, but I need both time to learn how to do that and time to actually do it. :slight_smile:

EDIT: I’m at work right now and can’t recall whether I tested with 4.18.7 just to verify the problem started with 4.19rc1, but I will check tonight. I think I tested it.

EDIT2: I hadn’t tested with 4.18.7, but I had a build compiled from before that I didn’t end up using. I tested it and it made no difference (it behaves like 4.18.5), so the regression is still down to something that changed between 4.18.7 and 4.19rc1.

Ah, ignore me, it appears this will be resolved in a later release: the tagged release pci-v4.19-fixes-1 contains a couple of PCI fixes, notably a pci_reset_bus fix. I’ve just tested that build (again with the ACS patch as well) and I can now reboot my VM without having to suspend or reboot the host. :slight_smile:

https://www.spinics.net/lists/kernel/msg2903702.html