Need help setting up vfio-pci for passthrough

I am trying to set up Windows passthrough on my new Threadripper system and I have run into a problem. I didn’t realize when purchasing my components that having two identical video cards would give me such a headache.

Anyway, I have a working install of Win7 running on QEMU through virt-manager (using the guide from the Arch Linux wiki), and I know that PCI passthrough in general must be working, as I was able to pass through my NVMe drive for the install and it works.

However, I am failing to get PCI passthrough to work with my video cards. I have two identical cards (Radeon Pro WX 3100), and with some frustration I have even managed to get the system running where one card is bound to the amdgpu driver and the other is bound to the vfio-pci driver.

When I edit the VM in virt-manager to pass through the card bound to vfio-pci (and the associated audio device), things fail.

Virt-manager seems to show something running and there is a visible qemu process, but nothing comes up on the display. X becomes much less responsive and my logs fill with errors:

Output
Jun  1 14:07:13 threadripper kernel: br0: port 3(vnet0) entered disabled state
Jun  1 14:07:13 threadripper kernel: device vnet0 left promiscuous mode
Jun  1 14:07:13 threadripper kernel: br0: port 3(vnet0) entered disabled state
Jun  1 14:07:13 threadripper kernel: input: Belkin Corporation Flip CC as /devices/pci0000:40/0000:40:07.1/0000:44:00.3/usb5/5-4/5-4.1/5-4.1:1.0/0003:050D:3201.0008/input/input17
Jun  1 14:07:13 threadripper kernel: belkin 0003:050D:3201.0008: input,hiddev96,hidraw0: USB HID v1.10 Device [Belkin Corporation Flip CC] on usb-0000:44:00.3-4.1/input0
Jun  1 14:07:14 threadripper kernel: nvme nvme1: pci function 0000:42:00.0
Jun  1 14:07:14 threadripper kernel: nvme 0000:42:00.0: enabling device (0400 -> 0402)
Jun  1 14:07:14 threadripper kernel:  nvme1n1: p1 p2 p3
Jun  1 14:07:14 threadripper ntpd[5133]: Deleting interface #12 vnet0, fe80::fc54:ff:fe64:842e%9#123, interface stats: received=0, sent=0, dropped=0, active_time=33 secs
Jun  1 14:09:41 threadripper kernel: nvme nvme1: failed to set APST feature (-19)
Jun  1 14:09:41 threadripper kernel: br0: port 3(vnet0) entered blocking state
Jun  1 14:09:41 threadripper kernel: br0: port 3(vnet0) entered disabled state
Jun  1 14:09:41 threadripper kernel: device vnet0 entered promiscuous mode
Jun  1 14:09:41 threadripper kernel: br0: port 3(vnet0) entered blocking state
Jun  1 14:09:41 threadripper kernel: br0: port 3(vnet0) entered forwarding state
Jun  1 14:09:43 threadripper kernel: vfio_ecap_init: 0000:42:00.0 hiding ecap 0x19@0x168
Jun  1 14:09:43 threadripper kernel: vfio_ecap_init: 0000:42:00.0 hiding ecap 0x1e@0x190
Jun  1 14:09:43 threadripper kernel: vfio-pci 0000:08:00.0: enabling device (0002 -> 0003)
Jun  1 14:09:43 threadripper kernel: vfio_ecap_init: 0000:08:00.0 hiding ecap 0x19@0x270
Jun  1 14:09:43 threadripper kernel: vfio_ecap_init: 0000:08:00.0 hiding ecap 0x1b@0x2d0
Jun  1 14:09:43 threadripper kernel: vfio_ecap_init: 0000:08:00.0 hiding ecap 0x1e@0x370
Jun  1 14:09:43 threadripper kernel: vfio-pci 0000:08:00.1: enabling device (0000 -> 0002)
Jun  1 14:09:44 threadripper kernel: usb 5-4.1: reset low-speed USB device number 3 using xhci_hcd
Jun  1 14:09:44 threadripper ntpd[5133]: Listen normally on 13 vnet0 [fe80::fc54:ff:fe64:842e%10]:123
Jun  1 14:09:45 threadripper kernel: usb 5-4.1: reset low-speed USB device number 3 using xhci_hcd
Jun  1 14:09:45 threadripper kernel: vfio_bar_restore: 0000:08:00.1 reset recovery - restoring bars
Jun  1 14:09:45 threadripper kernel: vfio_bar_restore: 0000:08:00.0 reset recovery - restoring bars
Jun  1 14:09:45 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:45 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:45 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Event logged [
Jun  1 14:09:46 threadripper kernel: IOTLB_INV_TIMEOUT device=08:00.0 address=0x000000083d4fbdb0]
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:46 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Event logged [
Jun  1 14:09:47 threadripper kernel: IOTLB_INV_TIMEOUT device=08:00.0 address=0x000000083d4fbde0]
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:47 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Event logged [
Jun  1 14:09:48 threadripper kernel: IOTLB_INV_TIMEOUT device=08:00.0 address=0x000000083d4fbe10]
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:48 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:49 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:49 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:49 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:49 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:09:49 threadripper kernel: AMD-Vi: Completion-Wait loop timed out

.....
Jun  1 14:10:06 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:10:06 threadripper kernel: AMD-Vi: Event logged [
Jun  1 14:10:06 threadripper kernel: IOTLB_INV_TIMEOUT device=08:00.0 address=0x000000083d4fa170]
Jun  1 14:10:06 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:10:06 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun  1 14:10:07 threadripper kernel: AMD-Vi: Command buffer timeout
Jun  1 14:10:07 threadripper kernel: WARNING: CPU: 15 PID: 22340 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0xe1/0xf0
Jun  1 14:10:07 threadripper kernel: Modules linked in: rfcomm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm amdkfd amdgpu mfd_core chash gpu_sched ttm vfio_pci vfio_virqfd ip_set_hash_net xt_set xt_recent xt_comment xt_addrtype ipt_rpfilter ip_set_hash_ip xt_hashlimit xt_CT xt_multiport xt_LOG nf_conntrack_sane nf_nat_snmp_basic nf_conntrack_snmp nf_conntrack_broadcast nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre devlink bnep iwlmvm mac80211 uas iwlwifi btusb btrtl btbcm usbhid btintel cfg80211 bluetooth igb rfkill efivarfs
Jun  1 14:10:07 threadripper kernel: CPU: 15 PID: 22340 Comm: qemu-system-x86 Not tainted 4.16.3-gentoo #4
Jun  1 14:10:07 threadripper kernel: Hardware name: Gigabyte Technology Co., Ltd. X399 DESIGNARE EX/X399 DESIGNARE EX-CF, BIOS F1 09/06/2017
Jun  1 14:10:07 threadripper kernel: RIP: 0010:__domain_flush_pages+0xe1/0xf0
Jun  1 14:10:07 threadripper kernel: RSP: 0018:ffff904ec79bfc30 EFLAGS: 00010282
Jun  1 14:10:07 threadripper kernel: RAX: 00000000fffffffb RBX: ffff8a12c2264210 RCX: 0000000000000000
Jun  1 14:10:07 threadripper kernel: RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffff8a13bd409814
Jun  1 14:10:07 threadripper kernel: RBP: ffff904ec79bfc78 R08: 0000000000000780 R09: 0000000000000001
Jun  1 14:10:07 threadripper kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffff8a12c2264210
Jun  1 14:10:07 threadripper kernel: R13: 00000000fffffffb R14: 0000000000000000 R15: 7fffffffffffffff
Jun  1 14:10:07 threadripper kernel: FS:  00007f0154012b80(0000) GS:ffff8a13bddc0000(0000) knlGS:0000000000000000
Jun  1 14:10:07 threadripper kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  1 14:10:07 threadripper kernel: CR2: 0000555731935000 CR3: 0000000704722000 CR4: 00000000003406e0
Jun  1 14:10:07 threadripper kernel: Call Trace:
Jun  1 14:10:07 threadripper kernel:  amd_iommu_unmap+0x6c/0xa0
Jun  1 14:10:07 threadripper kernel:  __iommu_unmap+0xed/0x190
Jun  1 14:10:07 threadripper kernel:  vfio_unmap_unpin+0xfe/0x1d0
Jun  1 14:10:07 threadripper kernel:  vfio_remove_dma+0x1d/0x50
Jun  1 14:10:07 threadripper kernel:  vfio_iommu_type1_ioctl+0x74a/0x9b0
Jun  1 14:10:07 threadripper kernel:  do_vfs_ioctl+0xba/0x6c0
Jun  1 14:10:07 threadripper kernel:  ? syscall_trace_enter+0x181/0x320
Jun  1 14:10:07 threadripper kernel:  SyS_ioctl+0x91/0xa0
Jun  1 14:10:07 threadripper kernel:  do_syscall_64+0x5b/0x100
Jun  1 14:10:07 threadripper kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jun  1 14:10:07 threadripper kernel: RIP: 0033:0x7f014e2d2607
Jun  1 14:10:07 threadripper kernel: RSP: 002b:00007fff7d55d498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun  1 14:10:07 threadripper kernel: RAX: ffffffffffffffda RBX: 00007fff7d55d590 RCX: 00007f014e2d2607
Jun  1 14:10:07 threadripper kernel: RDX: 00007fff7d55d4a0 RSI: 0000000000003b72 RDI: 000000000000002b
Jun  1 14:10:07 threadripper kernel: RBP: 00000000000c0000 R08: 000000007ff40000 R09: 000000007fffffff
Jun  1 14:10:07 threadripper kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000555732416270
Jun  1 14:10:07 threadripper kernel: R13: 000000007ff40000 R14: 0000555732416260 R15: 000000007fffffff
Jun  1 14:10:07 threadripper kernel: Code: 8b 1b 4c 39 e3 75 dd 45 85 ed 75 1f 48 8b 44 24 18 65 48 33 04 25 28 00 00 00 75 13 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b eb dd e8 96 67 95 ff 66 0f 1f 44 00 00 53 48 8d 9f c0 fe 
Jun  1 14:10:07 threadripper kernel: ---[ end trace 0b9754d6ffe42242 ]---
Jun  1 14:10:07 threadripper kernel: AMD-Vi: Command buffer timeout
Jun  1 14:10:07 threadripper kernel: AMD-Vi: Command buffer timeout


This goes on for quite a while, more or less repeating until I kill the system.

I don’t even know where to begin troubleshooting this problem.

Any suggestions?

I have edited your output for clarity.

I’m about to get busy, so I won’t be able to respond for a while, but could you share all your config? I’m specifically looking for lspci -vnn, any /etc/modprobe.d changes, your GRUB command line and your libvirt XML.

What are your outputs from

and are you able to rebind the driver post boot as shown in: https://wiki.debian.org/VGAPassthrough#Driver_association ?
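
For what it’s worth, a post-boot rebind of that kind boils down to three sysfs writes. Here is a minimal sketch, assuming the standard Linux sysfs PCI layout. The function name and the SYSFS variable are made up for illustration (the variable only exists so the function can be pointed at a scratch directory for a dry run), and the addresses in the example are the ones from the log above. It needs root and the vfio-pci module loaded.

```sh
#!/bin/sh
# Sketch: move one PCI function over to vfio-pci after boot.
# SYSFS is a hypothetical override for dry runs; normally it is /sys.
SYSFS="${SYSFS:-/sys}"

rebind_to_vfio() {
    addr="$1"                                  # e.g. 0000:08:00.0
    dev="$SYSFS/bus/pci/devices/$addr"

    # Detach the current driver, if there is one.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi

    # Ask the kernel to use vfio-pci for this device only, then re-probe it.
    echo "vfio-pci" > "$dev/driver_override"
    echo "$addr" > "$SYSFS/bus/pci/drivers_probe"
}

# Example (as root):
#   rebind_to_vfio 0000:08:00.0
#   rebind_to_vfio 0000:08:00.1
```

If the unbind step hangs or errors out, that already tells you the current driver will not let go of the card, which is useful to know before involving qemu at all.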

lspci -vnn output

output.txt (36.2 KB)

I run a script for modprobe, slightly modified from the one in the Arch Linux tutorial:

#!/bin/sh

for i in /sys/devices/pci*/*/*/boot_vga; do
    if [ $(cat "$i") -eq 0 ]; then
        GPU="${i%/boot_vga}"
        AUDIO="$(echo "$GPU" | sed -e "s/0$/1/")"
        echo "vfio-pci" > "$GPU/driver_override"
        if [ -d "$AUDIO" ]; then
            echo "vfio-pci" > "$AUDIO/driver_override"
        fi
    fi
done

modprobe -i vfio-pci

I can unbind the driver, though the driver is already set to vfio-pci, so I don’t think I should need to.
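
One way to double-check which driver each function actually ended up with is to read the driver symlink in sysfs. This is a small sketch rather than anything from the tutorial; the function name is made up, and the device path in the example assumes the 0000:08:00.x addresses from the log.

```sh
#!/bin/sh
# Print the driver currently bound to a PCI device directory, or "none".
bound_driver() {
    dev="$1"    # e.g. /sys/bus/pci/devices/0000:08:00.0
    if [ -e "$dev/driver" ]; then
        # "driver" is a symlink into /sys/bus/pci/drivers/<name>
        basename "$(readlink -f "$dev/driver")"
    else
        echo "none"
    fi
}

# Example:
#   for d in /sys/bus/pci/devices/0000:08:00.*; do
#       echo "${d##*/}: $(bound_driver "$d")"
#   done
```

lspci -nnk reports the same thing on its "Kernel driver in use" line.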

Maybe it all must be done by boot?
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Using_identical_guest_and_host_GPUs

Amdgpu cannot be unbound.

I don’t think OP will need to.

Oh wait, @drc, are you using the ACS override patch? That could be causing problems.

I am not using the ACS override patch, to the best of my knowledge.
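
Since the ACS question is really about IOMMU grouping, it may help to look at the groups directly. The sketch below walks /sys/kernel/iommu_groups and prints one line per device. The function name is made up, and the optional argument exists only so it can be tried against a scratch directory.

```sh
#!/bin/sh
# List every PCI device together with its IOMMU group number.
# The optional argument defaults to the real sysfs location.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group="${dev%/devices/*}"          # strip "/devices/<addr>"
        echo "group ${group##*/}: ${dev##*/}"
    done
}

# Example:
#   list_iommu_groups | grep 0000:08:00
```

What I would expect to see, as a rule of thumb, is 0000:08:00.0 and 0000:08:00.1 sharing a group with nothing else the host still needs.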

I tried using QEMU without libvirt and ran into the same problem. Sifting through the logs, I found the following lines, one of which is similar to an error qemu printed on the command line:

Jun 2 07:59:14 threadripper kernel: AMD-Vi: Completion-Wait loop timed out
Jun 2 07:59:14 threadripper kernel: vfio_bar_restore: 0000:08:00.0 reset recovery - restoring bars
Jun 2 07:59:14 threadripper kernel: vfio-pci 0000:08:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
Jun 2 07:59:14 threadripper kernel: vfio_bar_restore: 0000:08:00.0 reset recovery - restoring bars
Jun 2 07:59:14 threadripper kernel: AMD-Vi: Completion-Wait loop timed out

On the command line, qemu printed:

Device option ROM contents are probably invalid (check dmesg)
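
For context, the signature the kernel is complaining about is just the first two bytes of the option ROM. The sketch below checks a ROM dump for it. The function name and the dump file name are made up, and the dump procedure in the comment assumes the usual sysfs rom attribute, which may not be readable while the card is in the state shown in the log.

```sh
#!/bin/sh
# Return success if FILE starts with the PCI expansion ROM signature.
# The kernel message quotes it as the 16-bit value 0xaa55, which is
# stored little-endian, so the file begins with the bytes 55 aa.
rom_signature_ok() {
    sig="$(od -An -tx1 -N2 "$1" | tr -d ' \n')"
    [ "$sig" = "55aa" ]
}

# Example (as root; gpu.rom is a hypothetical file name):
#   cd /sys/bus/pci/devices/0000:08:00.0
#   echo 1 > rom && cat rom > /tmp/gpu.rom; echo 0 > rom
#   rom_signature_ok /tmp/gpu.rom && echo "signature looks fine"
```

A read that comes back as ffff, as in the log, means all bits set, which is also what a device returns when it is not responding at all.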

Any ideas?

These two errors make me think that you are suffering from the Radeon reset bug. I was having the same errors on my Ryzen 7 system until I stopped trying to pass through my RX 580, and instead made it my host GPU and passed through a GTX 1050 Ti.

EDIT: Also, it is strongly advised to have two different GPUs instead of identical models. They can be slightly different, but most importantly they need to have different PCI IDs. I’ve heard that what you’re doing is possible, but I haven’t seen anyone actually succeed at it (except on Unraid).
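
To make the PCI ID point concrete: binding by ID (for example the ids= option of vfio-pci) matches on vendor:device, so two identical cards are indistinguishable to it, which is why the script earlier in the thread sets driver_override per address instead. The sketch below lists IDs that occur more than once. Both function names are made up, and the optional argument exists only so the functions can be tried against a scratch directory.

```sh
#!/bin/sh
# Print "vendor:device address" for every PCI device, so identical cards
# show up as repeated IDs at different addresses.
list_pci_ids() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        vendor="$(cat "$dev/vendor")"      # e.g. 0x1002
        device="$(cat "$dev/device")"
        echo "${vendor#0x}:${device#0x} ${dev##*/}"
    done
}

# IDs that occur more than once cannot be told apart by an ID-based match.
duplicate_pci_ids() {
    list_pci_ids "$1" | cut -d' ' -f1 | sort | uniq -d
}

# Example:
#   duplicate_pci_ids
```

On a real system this will also list things like identical bridges and USB controllers, so the interesting lines are the ones for the GPU and its audio function.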

Maybe it’s time to bite the bullet and invest in a new card :( I hate to give up, though.

Ok, so someone suggested I try my setup with Fedora 28. I had been using Debian Stable. But now that I’ve switched to Fedora 28, the IOTLB and AMD-Vi errors are completely gone with the 580 as guest. If you’re using Debian, or anything with an older kernel, you might want to give Fedora a spin.

@imrazor
Could you upload your kernel config? I can start a comparison to see if there is something about my config that is causing this issue. FYI, I am running Gentoo and have no plans to change distros, as I am sure that if it can run on Red Hat then I can make it run on Gentoo.

Also, if anyone is interested, here is my kernel config

drc-kernel.txt (120.9 KB)

It’s a bog standard Fedora 28 kernel config. However, it seems I spoke too soon. Even though I am no longer receiving the IOTLB and AMD-Vi events, the Windows VM with the RX 580 was unstable after the first reboot. Sometimes it would boot up, sometimes I’d see a black screen, other times I’d get the Windows Repair prompt. Rebooting the host would resolve the problem.

For now I’m just using the RX 580 as my host GPU. I’m considering how I might be able to get my GTX 970 as the guest GPU, but thanks to the card, case and motherboard design it’s complicated.

OK, a BIOS flash seems to have resolved that particular issue. The card is now being passed through to Windows, but Windows is not able to access it. Moving on, though: this issue itself seems to be resolved.

Did you flash the BIOS on the GPU or your motherboard?

Even as the host GPU, my RX 580 made the system slightly unstable. It would just power off or become unresponsive randomly every day or two. I switched to a GTX 970 and it’s been stable for a week. Not sure if I’ve got a flaky card, or the amdgpu driver is squirrely.

I flashed the motherboard BIOS.