Cannot start VM with any PCI device being passed through

So I had my GPU passthrough working for about a day, then something happened and it broke. I tried a fresh install of Fedora on another drive and ran into the same problem. My IOMMU groups are separated and my guest GPU is bound to the vfio-pci driver.

Specs are:
Ryzen 9 3900X
Vega 56 (host)
GTX 1080 Ti (guest)
MSI MAG B550
Fedora 33 (with ACS patch)

This is the error message I get:


Error starting domain: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 101, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1330, in startup
    self._backend.create()
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 1234, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)


This happens as soon as I try to pass through a PCI device. VMs start fine without any PCI device being passed through. I’ve read quite a few Bugzilla posts about similar issues now, and what I’ve gathered is that something is locking up libvirt, which is why it cannot start. A lot of them also had their issue resolved by updating libvirt/QEMU. I don’t know how to find what process is locking up libvirt. I can see the traceback is all virt-manager Python code, but I don’t know how to proceed further.

Just out of curiosity, have you tested this setup without the ACS patch? Are you sure it is required for your board? I’m not sure it would be a problem in this case, but ACS override should probably be your last resort. I’ve read that the MSI MAG B550 Tomahawk has the primary GPU slot separated into its own group, so you could try making sure your passthrough card is in the primary slot.
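One quick way to tell whether the override is even active is to look at the kernel command line; the patch only does anything when the pcie_acs_override parameter is set:

# If this prints nothing, the ACS override is not being applied at boot
cat /proc/cmdline | grep -o "pcie_acs_override=[^ ]*"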

Probably need more info, because otherwise it’s just guessing.
Attach your ls-iommu output, lspci output, and the VM XML. If you were messing with something in /etc/libvirt then attach that too. What does the libvirt log for the domain say?
You can change the log level in /etc/libvirt/libvirtd.conf.
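As a rough sketch (the filter levels below are just a starting point, and VMNAME is a placeholder for your domain name):

# Turn up daemon logging, then restart libvirtd
sudo tee -a /etc/libvirt/libvirtd.conf <<'EOF'
log_filters="1:qemu 1:libvirt 3:security 3:event"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
sudo systemctl restart libvirtd

# The per-domain log is usually the most telling part
sudo tail -n 50 /var/log/libvirt/qemu/VMNAME.log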

A quick search shows that most people blame this on a lock held by a VM that’s already running. So maybe some autostart on boot?
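That’s easy to rule out from the shell (VMNAME is a placeholder):

# Is another copy of the domain already running, or marked for autostart?
virsh list --all
virsh dominfo VMNAME | grep -i autostart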

It was already working for him… But yes, I would check the IOMMU groups on a stock kernel, because it seems B550 boards have at least the GPU separated.
Here is an MSI B550-A Pro for reference:

(screenshot of its IOMMU groups)

If that’s your case without ACS, then you can easily pass the GPU and one USB controller.
Also check whether you have the Vega set as the primary display device in UEFI.
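Something like this dumps the groups on a stock kernel; it’s the usual sysfs loop, nothing board-specific:

#!/bin/bash
# Print every IOMMU group and the devices that share it
shopt -s nullglob
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done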

That is true, but in my case my host GPU has an aftermarket cooler and only fits in the top slot, due to interference with the NVMe drive at the lower x16 slot.

Bummer, so your 1080 Ti is going through the chipset with x4 lanes.

In this case I would just pass the Vega. FLR reset is supposed to work now, and your Vega might be on par with that GTX if it’s so handicapped.

Did you try to benchmark both cards in Linux?

IOMMU groups:
IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 10 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 11 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 12 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
IOMMU Group 12 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 13 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
IOMMU Group 13 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
IOMMU Group 13 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
IOMMU Group 13 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
IOMMU Group 13 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
IOMMU Group 13 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
IOMMU Group 13 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
IOMMU Group 13 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
IOMMU Group 14 01:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E12 NVMe Controller [1987:5012] (rev 01)
IOMMU Group 15 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ee]
IOMMU Group 16 02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43eb]
IOMMU Group 17 02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43e9]
IOMMU Group 18 20:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 19 20:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 20 20:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 21 20:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 22 20:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 23 20:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 24 20:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 25 21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
IOMMU Group 26 21:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
IOMMU Group 27 23:00.0 USB controller [0c03]: VIA Technologies, Inc. VL805 USB 3.0 Host Controller [1106:3483] (rev 01)
IOMMU Group 28 28:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
IOMMU Group 29 2a:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 04)
IOMMU Group 2 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 30 2b:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge [1022:1470] (rev c3)
IOMMU Group 31 2c:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge [1022:1471]
IOMMU Group 32 2d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c3)
IOMMU Group 33 2d:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
IOMMU Group 34 2e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 35 2f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 36 2f:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU Group 37 2f:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU Group 38 2f:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
IOMMU Group 3 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 4 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 5 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 6 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 7 00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 8 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 9 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
lspci -vnn for the guest GPU:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3602]
	Flags: fast devsel, IRQ 10
	Memory at fb000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at b0000000 (64-bit, prefetchable) [disabled] [size=256M]
	Memory at c0000000 (64-bit, prefetchable) [disabled] [size=32M]
	I/O ports at e000 [disabled] [size=128]
	Expansion ROM at fc000000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

21:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3602]
	Flags: bus master, fast devsel, latency 0, IRQ 5
	Memory at fc080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

The VM XML is nothing special. Nothing added other than my GPU. No disk, CPU pinning, huge pages, etc. When it was working, the only changes I made were related to NVIDIA and their DRM.

I’ve tried uncommenting relaxed_acs_check = 1 in /etc/libvirt/qemu.conf, but it made no difference.
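For anyone following along, that setting only takes effect after a daemon restart; roughly:

# Confirm the line is uncommented, then restart libvirtd so qemu.conf is re-read
grep relaxed_acs_check /etc/libvirt/qemu.conf
sudo systemctl restart libvirtd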

I only have the one VM and it is definitely not set to start on boot, so I think it is safe to rule that out.

Unless I missed something in the BIOS, I don’t believe MSI lets you specify the primary display device. When my GPU passthrough was working, I had an issue where a warning during Fedora boot would show up on the guest GPU and never go away, preventing the GPU from being passed through correctly. It appeared that the second slot (guest GPU) was treated as the primary GPU during boot, and I was able to change this by setting the BIOS to CSM mode (it was UEFI previously). That made the boot warning output on the correct GPU, but then the VM wouldn’t start due to my current issue.
The first thing I tried was going back to UEFI mode in the BIOS. I updated the BIOS, then rolled back to the previous version I was on, then tried a fresh Fedora install on another disk, and I haven’t had any success since.

I benchmarked the 1080 Ti in Windows using 3DMark and it scores about 40% faster than the Vega.
I’d rather avoid using NVIDIA drivers on Linux and the possible AMD reset issues, so even though the 1080 Ti is gimped by the PCIe lanes, I still think this is the way to go.

Well, if you cannot change the primary GPU slot, and in UEFI mode it treats the lower slot as primary, then it’s even more logical to pass the Vega and use the GTX on the host.

But in this case I think it’s possible that the GTX is being initialized during boot, and it shouldn’t be. Maybe resetting that card through FLR before booting the VM will help.
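A rough sketch of what I mean, using the 1080 Ti’s address from your listing; the sysfs reset node only exists if the kernel thinks the device is resettable:

# Does the card advertise Function Level Reset?
sudo lspci -vvs 21:00.0 | grep -i flreset
# Kick the device through a reset before starting the VM
echo 1 | sudo tee /sys/bus/pci/devices/0000:21:00.0/reset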

Also try adding <ioapic driver="kvm"/> to <features>.
Those direct QEMU commands can also be replaced with XML. Here’s my <features> block with a GTX 1070:

 <features>
   <acpi/>
   <apic/>
   <hyperv>
     <relaxed state="on"/>
     <vapic state="on"/>
     <spinlocks state="on" retries="8191"/>
     <vendor_id state="on" value="AuthenticAMD"/>
   </hyperv>
   <kvm>
     <hidden state="on"/>
   </kvm>
   <vmport state="off"/>
   <ioapic driver="kvm"/>
 </features>

One more thing: are you sure you made a UEFI VM after reinstalling? Your config might be nothing special, but it would show me a few things without asking :slight_smile:
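Quickest way to check without posting the whole XML (VMNAME is a placeholder):

# A UEFI guest has an OVMF <loader> element; a SeaBIOS guest does not
virsh dumpxml VMNAME | grep -i -e loader -e ovmf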

Just a quick note: Error 43 on NVIDIA GPUs will not prevent the VM from starting. It will just prevent the driver from loading in Windows.

Another thing: make sure the modules for the IOMMU and VFIO are loaded. IOMMU support (VT-d on Intel, AMD-Vi here) has a tendency to reset to default in some BIOS releases.
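A quick way to confirm both, assuming a standard setup:

# IOMMU actually enabled? (look for AMD-Vi messages on this platform)
sudo dmesg | grep -i -e iommu -e amd-vi
# VFIO modules loaded?
lsmod | grep vfio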

Just to confirm, we are still talking about the physical motherboard’s BIOS, right? The GTX was getting initialized when booting Fedora if I had a monitor plugged into it (while in UEFI mode). I could disconnect the monitor from the GTX during boot and reconnect it afterwards, and it passed through fine.

Setting the BIOS to CSM mode fixed this though. Is there any reason not to use CSM mode? I am using the Q35 chipset and OVMF firmware for the VM.

I’ll look into this. I tried enabling libvirt logging but haven’t gotten it to work yet. I also tried capturing a log with gdb and was able to record one, but I don’t think I did it right; there doesn’t seem to be much useful information in it.

Yes, the mobo. Weird thing about plugging in the monitor. I have two monitors plugged into both cards, two ports each, all the time, so I can select on the monitors which input is displayed.
But like I said, I have an option in the BIOS (X470 Aorus something) to choose which card is initialized during boot.

When cards boot in CSM mode, various things don’t work. AFAIR last year I tried to do that on an X370 and then on my current X470; it took me days and led me nowhere. I found an explanation for it back then, but it’s been a while, so I don’t remember exactly.
Maybe there have been some developments with CSM in the meantime that I’m not aware of, but I haven’t been looking, since an all-UEFI setup has worked for me for at least 2-3 years, through 2 mobos and 3 CPUs.

I was able to get GPU passthrough working almost two weeks ago. I swapped which slots the guest and host GPUs were in, and got a PCIe extension cable to fit the Vega card and its cooler so it doesn’t interfere with the M.2 beneath it. Then I set my BIOS to UEFI mode, did a fresh install of Fedora 33, and everything has been working as it should. I no longer need the ACS override patch either, which is nice.

Everything has been working great so far (with the exception of League of Legends for some reason, but that’s another issue for another day).
