I’m having some issues with GPU passthrough to a Linux VM for desktop use. I’m running Unraid 6.4.0_rc21b on a Ryzen system (R7 1700X + Asus Prime X370-Pro), and my GPU is an RX 580 (this one: https://www.newegg.ca/Product/Product.aspx?Item=N82E16814202293). When I pass the GPU through to a Windows 10 VM, I have almost no issues, except that the BIOS and boot screens (the one with the swirly dots) don’t show up; I just land at the login screen a moment later. I can reboot, shut down, etc. with no problem.
When I pass it through to a Linux VM, though (Ubuntu 17.10 for the sake of documenting this issue), things are a little different. I first install the VM with a VNC display to confirm that the base system installs correctly, choosing to install updates and third-party drivers as I go, and everything installs fine. I then pass through my graphics card, alongside some other PCI devices such as audio and USB controllers. On the first boot, everything works as well as it does with a Windows VM. I install any available updates via apt update and apt dist-upgrade. When I go to reboot or shut down, though, something fails: instead of shutting down, the VM gets stuck. If I run lspci on the host looking for my graphics card at this point, before the VM has actually shut down (from what I can tell), all I find is this:
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
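For anyone who wants to check for the same state, here is a minimal sketch of how I’d test for it from the host shell. It assumes that "rev ff" and "Unknown header type 7f" both mean the card has stopped answering config-space reads (every read comes back as all ones), which is my understanding rather than something I have confirmed. The function name device_looks_wedged is my own, and 0a:00.0 is the address on my system.

```shell
#!/bin/sh
# Reads `lspci -v` output for a single device on stdin.
# Returns 0 if the output looks like a device that no longer answers
# config-space reads (all-ones values), and 1 otherwise.
# Assumption: "(rev ff)" or "Unknown header type 7f" indicates that state.
device_looks_wedged() {
    grep -q -e 'Unknown header type 7f' -e '(rev ff)'
}

# Usage (adjust the PCI address for your own card):
#   lspci -v -s 0a:00.0 | device_looks_wedged && echo "GPU looks wedged"
```

The function only inspects text, so it can also be run against a saved copy of the lspci output.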
Looking in /var/log/syslog, I see that it has been flooded with messages like the following, in no consistent order:
Jan 11 18:32:43 core kernel: IOTLB_INV_TIMEOUT device=0a:00.0 address=0x00000007fb631070]
Jan 11 18:32:43 core kernel: AMD-Vi: Completion-Wait loop timed out
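To put a number on "flooded", something like the following sketch should work. It counts the IOTLB_INV_TIMEOUT lines for one device in whatever log text is fed to it. The function name count_iotlb_timeouts is my own, and the match string is taken from the messages above; I’m assuming the format is the same on other kernels.

```shell
#!/bin/sh
# Reads syslog text on stdin and prints how many lines report an
# IOTLB invalidation timeout for the PCI device given as $1.
# -F matches the string literally, so the dots in the address are not
# treated as regex wildcards.
count_iotlb_timeouts() {
    grep -F -c "IOTLB_INV_TIMEOUT device=$1"
}

# Usage (adjust the PCI address for your own card):
#   count_iotlb_timeouts 0a:00.0 < /var/log/syslog
```

Note that grep exits with a non-zero status when the count is 0, so the function does too.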
If I force stop the VM and then try to start it again, I get this error:
internal error: Unknown PCI header type '127'
The only way I can get the graphics card usable again is to reboot the entire system.
Now, I’m no expert, but from some research this looks very much like the infamous PCI reset bug. It’s strange, though, that it only manifests with Linux guests.
Has anyone out there had similar issues? Does anyone know a fix, or at least why Windows works and Linux doesn’t?