I just got hardware passthrough working on Arch Linux (Kaby Lake, linux-vfio kernel, KVM/QEMU) with an AMD RX 580 and a Windows 10 VM. It was working fine, but then I went to install the AMD drivers and midway through the install it locked up both the VM and my Linux box (the host). Now I can’t keep the VM on for more than a couple of minutes, and when the VM crashes my Linux host locks up with it. I did some research and I think enabling MSI might do the trick because it is indeed disabled (probably by the AMD installer), but I can’t seem to get it enabled. Has anybody had a similar experience?
Any error messages when issuing dmesg in a terminal?
There is also a chance it’s not a VM-related issue. I had kernel lockups when I used a Linksys WiFi USB stick on Linux. It was working fine until there was network traffic.
Did you check the respective hardware IDs in Windows Device Manager? These are the ones you need to edit in the registry. Also important: you have to enable MSI interrupts on both the GPU and its audio device.
MSI interrupts will only be activated after you have rebooted the host and started the VM you have edited. Run lspci -v and look for something like MSI: Enable+, then check whether the interrupt ID in Windows Device Manager changed to a negative number, e.g. -21.
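For the lspci check above, here is a rough helper rather than anything official. It only classifies text piped in from lspci, and the function name msi_state is made up for this post. Note that lspci hides the capability list from non-root users, so run it with sudo. 02:00.0 is the address of the AMD card mentioned in this thread; use your own.

```shell
#!/bin/sh
# Classify the MSI state from `lspci -vv` text read on stdin.
# Prints "enabled", "disabled", or "no-msi-capability".
# (msi_state is a hypothetical helper name, not part of pciutils.)
msi_state() {
    awk '
        # " MSI: " with a leading space does not match "MSI-X:" lines.
        / MSI: Enable[+]/ { state = "enabled" }
        / MSI: Enable-/   { if (state == "") state = "disabled" }
        END { print (state == "" ? "no-msi-capability" : state) }
    '
}

# Usage on the host while the VM is running (address is an example):
#   sudo lspci -vv -s 02:00.0 | msi_state
#   sudo lspci -vv -s 02:00.1 | msi_state
```

Check both functions, since the GPU and its audio device each carry their own MSI capability.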
If it didn’t work despite having it done correctly, your GPU is likely to not support MSI interrupts although I can’t imagine this to be the case with RX 580.
Prior to that, double-check if you passed through GPU + audio. You should try a fresh install of Windows once again and install the driver before activating MSI. The legacy interrupt handling should work out of the box (eventual stuttering aside).
Another tip: Check energy settings in your host’s UEFI and disable ASPM if enabled. My host always froze upon starting the VM when this was enabled (Nvidia card).
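If you want to see what the host kernel thinks about ASPM before digging through the UEFI menus, something like this might help. It is only a sketch: it assumes your kernel exposes /sys/module/pcie_aspm/parameters/policy, where the active policy is shown in square brackets, and the function name is made up.

```shell
#!/bin/sh
# Print the ASPM policy the kernel marks as active with [brackets].
# Reads the contents of the policy file on stdin.
# (active_aspm_policy is a hypothetical helper name.)
active_aspm_policy() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage on the host:
#   active_aspm_policy < /sys/module/pcie_aspm/parameters/policy
```

This only reports the kernel-side policy; it does not tell you what the firmware setting is, so the UEFI option is still worth checking.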
Try following this guide. Maybe you have overlooked something. Other than that I can’t really help you further. Wish you good luck.
Thank you for your help. Yeah I checked dmesg and nothing. Yeah the lspci stuff is while it is running (or both really). Yeah I can pass through audio, if anything audio installs fully, but the PCI drivers I’m having difficulty installing. I found out if I hotplug the gfx cards in after the VM boots, then I can seem to get away with a driver install, but then it junks out again. I think I’m going to try to reinstall Windows again, good call…
Okay, so I fresh installed Windows 10 over again, with the proper RX 5xx drivers. Like before, I used hardware passthrough with the gfx card and the audio. Everything works beautifully up until the AMD installer tries to install the graphics driver; then the VM crashes and freezes the Arch Linux host. I didn’t touch anything either, just like before. Should I enable MSI first? Can I even do this? Maybe I’m better off just using the card with no drivers? haha.
I’m using virtual machine manager. Basically if I boot the VM without the graphics card / audio controller passed through, it’s fine. But as soon as I pass them through, VM will either freeze during boot or within a minute or two afterwards. I found that if I hotplug (passthrough) the gfx/audio PCI in after the VM boots (while it’s on), then it will work for at least long enough to install the AMD drivers, and it installs the audio driver, but once the AMD graphics driver goes to install, it freezes and dies and once that driver is in (managed to get them in fully with a hotplug), it still freezes and dies.
Everything here is default except I added/modified:
...
user = "admin"
# The group for QEMU processes run by the system instance. It can be
# specified in a similar way to user.
group="78"
...
nvram = [
"/usr/share/ovmf/ovmf_code_x64.bin:/usr/share/ovmf/ovmf_vars_x64.bin"
]
...
I downloaded the most recent AMD drivers (they just came out with new ones) and those installed fully (with hotplugging of the cards after boot).
Once installed, I still cannot fully boot with cards on passthrough, but I did notice the VM puts BOTH the video AND audio portion of the graphics card into MSI+, right before freezing. Not sure that this is even the cause yet because it still crashed with MSI+ enabled.
I’ve tried this with NO USB devices. It ONLY happens with the card passed through. Just for the heck of it I’ve just tried it with only the video component and it still fails. Wondering if this is even MSI-related/fixable.
I’m not sure why you have all these PCI root ports in your XML. I’m not familiar with this method of passthrough. Not to mention that it appears you’re passing through an awful lot of devices; I should be seeing 4 devices at most in your XML.
What are the PCIe addresses of the devices you’re trying to pass through?
This is what I’m familiar with for passing through PCIe devices:
You also need to make sure that you pass through the GPU and HDMI audio segments together so that GPU shows up as xx:yy.0 and HDMI audio shows up as xx:yy.1
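To double-check on the host which functions sit in the same slot (so you know what has to be passed through together), a small sketch like this could be used. It assumes the usual sysfs layout under /sys/bus/pci/devices with the 0000: domain prefix; the function name and the optional second argument are made up for illustration.

```shell
#!/bin/sh
# List every function that shares a slot with the given device,
# e.g. 0000:02:00.0 -> 0000:02:00.0 and 0000:02:00.1.
# $1 = full PCI address, $2 = sysfs device directory (optional).
# (slot_functions is a hypothetical helper name.)
slot_functions() {
    addr=$1
    root=${2:-/sys/bus/pci/devices}
    slot=${addr%.*}              # strip the ".N" function suffix
    for dev in "$root/$slot".*; do
        if [ -e "$dev" ]; then   # skip the unexpanded glob if nothing matched
            basename "$dev"
        fi
    done
}

# Usage (address is an example):
#   slot_functions 0000:02:00.0
```

For a typical GPU this should print the .0 (video) and .1 (HDMI audio) functions, which are the two that belong in the VM together.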
Thanks for the help. Yeah I was questioning all those PCI root ports too. This was all generated from the Virtual Machine Manager. Maybe that’s why it doesn’t work, I think I’ll start playing around with removing those. My AMD card is: 02:00.0/1. I separated it from another Nvidia card which is 01:00.0/1. Really, all I’m passing is the GPU (video/audio), a USB mouse/keyboard and a NIC (I have dual NICs on my mobo). I took it as just these were passthrough:
I’ve also made some changes since I posted that. I’m going to screw with virsh and just try to rip out all unnecessary garbage. Maybe it’s a Windows thing? I’m using Winblowz 10 Education Edition. I’ll repost after another 30+ minutes of testing.
I tried to remove those “pcie-root-port” controllers in many ways, but it kept saying malformed XML, even though I was careful not to mess up the tag pairing. I think it was just malformed because it became misconfigured. Here are screenshots of every setting in virtual machine manager:
Note: CPU/NIC changes I made after, it was breaking before this was added, still doesn’t work with them defaulted/removed respectively. I also only add Spice display / Video when I want to see video, Guest OS still has issues even if I remove it. All that mouse/keyboard/controller garbage is added by Virt Manager.
I just noticed that it doesn’t show “Ellesmere [Radeon RX 470/480/570/580]” on the Audio device, but LSPCI shows it as vfio loaded as the driver.
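For anyone wanting to confirm the driver binding without reading through lspci -k output ("Kernel driver in use"), here is a minimal sketch that reads the driver symlink in sysfs. It assumes the standard /sys/bus/pci/devices layout; the function name and the optional second argument are my own invention.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1 = full PCI address, $2 = sysfs device directory (optional).
# (bound_driver is a hypothetical helper name.)
bound_driver() {
    link=${2:-/sys/bus/pci/devices}/$1/driver
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage (addresses are examples):
#   bound_driver 0000:02:00.0
#   bound_driver 0000:02:00.1
```

Both the video and the audio function should report vfio-pci before the VM is started.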
Oh also… so when the two PCI devices for the graphics card are NOT passed through, it works fine. But when I pass them through, at first the monitor turns on and it works, but then video ceases to the monitor, but the VM keeps going (I see it on the Spice display), but shortly after the VM dies and will kill/freeze my host computer unless I force off the VM when the video signal from the monitor cuts out.
Not sure if that’s got anything to do with it. I’m not a windows guy, but it may be worth trying a regular version of W10.
That’s not a problem, the audio device is probably the same across multiple generations of card.
That sounds like an IOMMU issue to me.
Run this and paste the output please? It’s going to display the IOMMU groups that all your devices are in.
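#!/bin/bash
# Print every PCI device together with the IOMMU group it belongs to.
shopt -s nullglob   # an empty glob expands to nothing instead of itself
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}   # extract the group number from the path
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"                 # describe the device at this PCI address
done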
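Once you have that listing, the interesting part is what else shares a group with the GPU. This is a small sketch that filters the listing for one device; it assumes the exact "IOMMU Group N address ..." line format the script above prints, and the function name is made up.

```shell
#!/bin/sh
# From the IOMMU listing on stdin, print every device that is in the
# same group as the given PCI address (as printed by lspci, e.g. 01:00.0).
# (group_members is a hypothetical helper name.)
group_members() {
    awk -v addr="$1" '
        $1 == "IOMMU" && $2 == "Group" {
            line[NR] = $0; grp[NR] = $3
            if ($4 == addr) { want = $3; found = 1 }
        }
        END {
            # Print nothing at all if the address never appeared.
            for (i = 1; i <= NR; i++)
                if (found && (i in grp) && grp[i] == want) print line[i]
        }
    '
}

# Usage, with the listing script saved as iommu.sh (name is an example):
#   ./iommu.sh | group_members 01:00.0
```

If this prints anything besides the GPU, its audio function, and the PCI bridge above them, those extra devices are not isolated from the card.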
From the screenshots, everything appears to be working properly. What version of libvirt do you have and have you enabled the ACS override in kernel options?
Yeah normally when testing I will boot the VM without the Spice display / virtio video. It’s just so I can see video output because I can’t get the card working for more than a minute or two.
Interestingly the card completely takes over when in repair mode, and I’m using it. Separately, it seems like the VM is having trouble initializing the card when booting Windows. The card is on at first, but then output ceases.
For pcie_acs_override, the ID option, do you specify the “pch pci express root port”? If so, how do I find which root port to use? I’m wondering if I try that instead of the downstream option, if that’ll fix it. I just realized I have the AMD card on the 8x bus interface and the nvidia on the 16x, it probably should be the other way around, not sure if that would cause it to fail.
I updated Windows 10 to 1703, updated some issue with a HID driver and it suddenly worked, but the graphics card had artifacts. So I updated to the latest AMD driver, and it will boot, but will crash after about a minute. Definitely going to try to mess around with updates/downgrading graphics drivers maybe?
The reason the driver crashes is because it’s not true pcie isolation and the drivers are seeing something… somewhere… and causing an issue.
Thems the breaks when using ACS override. Sometimes it’s k. Sometimes not.
Kernel CLI should be intel_iommu=on iommu=1
Nothing about ACS. Does your board require ACS? If so, try physically rearranging your cards. Your symptom is exactly what I ran into on Ryzen before AGESA 1006. ACS was no hope there because “one big group”.
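To verify what the running kernel was actually booted with (bootloader edits are easy to get wrong), a quick sketch like this can check /proc/cmdline for the parameters mentioned above. The function name is made up; it only does an exact whole-word match.

```shell
#!/bin/sh
# Succeed if parameter $1 appears as a whole word in the kernel
# command line read from stdin.
# (has_kernel_param is a hypothetical helper name.)
has_kernel_param() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage on the host:
#   has_kernel_param intel_iommu=on < /proc/cmdline && echo "intel_iommu=on is set"
#   grep -o 'pcie_acs_override=[^ ]*' /proc/cmdline   # shows any leftover ACS override
```

The second usage line is handy for confirming that the ACS override really is gone after you remove it from the bootloader config.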
Okay, I finally swapped the cards (and added iommu=1), thanks. I disabled ACS override, but the cards were still in “one big group” (both Nvidia and AMD in Group 1). The difference though is when I boot now, the AMD card starts first, but when loading the kernel, it switches over to the Nvidia.
With ACS override disabled, “one big group”:
IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:591f] (rev 05)
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation 200 Series PCH LPC Controller (Z270) [8086:a2c5]
IOMMU Group 10 00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
IOMMU Group 10 00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0]
IOMMU Group 10 00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
IOMMU Group 11 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
IOMMU Group 12 05:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
IOMMU Group 13 06:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:2142]
IOMMU Group 14 07:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)
IOMMU Group 1 00:01.1 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) [8086:1905] (rev 05)
IOMMU Group 1 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] [1002:67df] (rev e7)
IOMMU Group 1 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:aaf0]
IOMMU Group 1 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] [10de:1c03] (rev a1)
IOMMU Group 1 02:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
IOMMU Group 2 00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
IOMMU Group 3 00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
IOMMU Group 4 00:17.0 RAID bus controller [0104]: Intel Corporation SATA Controller [RAID mode] [8086:2822]
IOMMU Group 5 00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
IOMMU Group 6 00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
IOMMU Group 7 00:1c.1 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #2 [8086:a291] (rev f0)
IOMMU Group 8 00:1c.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #5 [8086:a294] (rev f0)
IOMMU Group 9 00:1d.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #9 [8086:a298] (rev f0)
With ACS override enabled, now I get:
IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:591f] (rev 05)
IOMMU Group 10 00:1d.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #9 [8086:a298] (rev f0)
IOMMU Group 11 00:1f.0 ISA bridge [0601]: Intel Corporation 200 Series PCH LPC Controller (Z270) [8086:a2c5]
IOMMU Group 11 00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
IOMMU Group 11 00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0]
IOMMU Group 11 00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
IOMMU Group 12 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
IOMMU Group 13 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] [1002:67df] (rev e7)
IOMMU Group 13 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:aaf0]
IOMMU Group 14 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] [10de:1c03] (rev a1)
IOMMU Group 14 02:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
IOMMU Group 15 05:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
IOMMU Group 16 06:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:2142]
IOMMU Group 17 07:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)
IOMMU Group 2 00:01.1 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) [8086:1905] (rev 05)
IOMMU Group 3 00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
IOMMU Group 4 00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
IOMMU Group 5 00:17.0 RAID bus controller [0104]: Intel Corporation SATA Controller [RAID mode] [8086:2822]
IOMMU Group 6 00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
IOMMU Group 7 00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
IOMMU Group 8 00:1c.1 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #2 [8086:a291] (rev f0)
IOMMU Group 9 00:1c.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #5 [8086:a294] (rev f0)
Then when I try to boot my VM, it takes over the AMD again, but graphics output ceases. I figured I’d keep messing with the VM now. Can I target ACS override to just the card? (or do I not get it?).
I had the Nvidia in slot 1 (closest to my CPU) and the AMD in slot 2, but I just reversed it. Slot 3 is max x4 mode and it doesn’t look like I want to put a gfx card in there.
So the ACS override just tells the kernel “it’s cool, everything’s isolated” even though the hardware doesn’t guarantee it. It’s like having a party where two people that don’t get along will be there. With ACS override, you lie to the people and they see each other, maybe, and bad things happen. Without ACS override and with proper isolation, the two people that hate each other are at the party but never see each other, even though everyone else can see them, e.g. one is sequestered to the living room, the other to the dining room.
It sucks, but there is probably no easy fix with that mobo. Pick up a Z170 mobo? Ironically, most of those are fine.