Avoid amdgpu attaching to one specific Radeon card for GPU passthrough?

Hey everyone.

I’ve been battling this Windows 10 VM setup with GPU passthrough on my Gentoo host for the past couple of days. I’m at the point where I know it will work if I can bind the vfio-pci driver at boot instead of letting the amdgpu driver attach to my guest’s graphics card, but I can’t figure out the proper method. I am using two Navi 10 graphics cards: a 5600XT for the Gentoo host and a 5700XT for the Windows guest. Because the host card also requires the amdgpu driver, I can’t just blacklist amdgpu like many posts online recommend.

Here’s a little background on the setup:
MSI X570 Ace motherboard (excellent IOMMU group support, everything is basically in its own group so no issues here)
1x Radeon 5600XT (Gentoo card, PCIe slot 1)
1x Radeon 5700XT (Windows card, PCIe slot 2)
64GB RAM
Virt-manager handling VM setup and management
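
For anyone wanting to check the IOMMU grouping claim on their own board, here is a minimal sketch that walks sysfs and prints each PCI device with its group number. The `SYSFS_ROOT` variable is a made-up override (not a kernel or distro convention) that defaults to /sys and only exists so the function can be pointed at a test tree.

```shell
#!/bin/bash
# Print every PCI device together with the IOMMU group it belongs to.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
list_iommu_groups() {
    local root="${SYSFS_ROOT:-/sys}" group dev
    for group in "$root"/kernel/iommu_groups/*; do
        # Skip the unexpanded glob when the IOMMU is disabled or absent
        [ -d "$group" ] || continue
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
        done
    done
}

list_iommu_groups
```

A GPU and its HDMI audio function normally share a group; what matters for passthrough is that nothing the host still needs sits in the same group as the guest card.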

Up until this point I have been trying to manually unbind the guest’s card from the amdgpu driver by echoing the card’s PCI address into the driver’s ‘unbind’ file and then echoing its vendor and device IDs into the vfio-pci driver’s ‘new_id’ file. Here is the script I was using:

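#!/bin/bash
# Arguments: PCI addresses as they appear under /sys/bus/pci/devices
modprobe vfio-pci
for dev in "$@"; do
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        # Detach the device from whatever driver currently owns it
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
                echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        # Register the vendor/device pair with vfio-pci so it claims the device
        echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done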

The PCI addresses of the Windows guest graphics card, the 5700XT, are 32:00.0 and 32:00.1 (the card and its audio device). Both of these are passed through to the VM.

$ lspci -nn
32:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ff)
32:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38] (rev ff)

I would call the script above with these addresses as arguments. The script would complete, but in ‘dmesg’ I would get:

$ dmesg
[   24.217402] VFIO - User Level meta-driver version: 0.3
[   24.219567] vfio_pci: add [1002:731f[ffffffff:ffffffff]] class 0x000000/00000000
[   24.219572] vfio_pci: add [1002:ab38[ffffffff:ffffffff]] class 0x000000/00000000
**** [   24.221441] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Hotplug removal is not supported ****
[   24.221851] [drm] amdgpu: finishing device.
[   24.237020] [drm] free PSP TMR buffer
[   24.297696] ------------[ cut here ]------------
[   24.297739] WARNING: CPU: 6 PID: 2652 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3191 amdgpu_vm_manager_fini+0x19/0x50 [amdgpu]
[   24.297740] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio kvm_amd amdgpu iwlmvm mfd_core gpu_sched ttm iwlwifi efivarfs
[   24.297747] CPU: 6 PID: 2652 Comm: vfio-bind Not tainted 5.8.9-gentoo #7
[   24.297748] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 ACE (MS-7C35), BIOS 1.80 01/16/2020
[   24.297787] RIP: 0010:amdgpu_vm_manager_fini+0x19/0x50 [amdgpu]
[   24.297788] Code: a0 4e 00 00 02 00 00 00 eb a3 0f 0b eb e4 0f 1f 00 55 48 89 fd 48 81 c7 a8 4e 00 00 48 83 ec 08 48 83 bd b0 4e 00 00 00 74 14 <0f> 0b e8 e0 16 27 d0 48 83 c4 08 48 89 ef 5d e9 53 a4 00 00 31 f6
[   24.297789] RSP: 0018:ffff9ed088727da8 EFLAGS: 00010282
[   24.297791] RAX: ffffffffc0862be0 RBX: ffff8a6462020010 RCX: ffff8a647600af80
[   24.297791] RDX: 0000000000000001 RSI: ffff8a647600a780 RDI: ffff8a6462024ea8
[   24.297792] RBP: ffff8a6462020000 R08: 0000000000000001 R09: ffffffffc0308300
[   24.297793] R10: ffff8a646b95c400 R11: 0000000000000000 R12: 0000000000000001
[   24.297793] R13: ffff8a6462036940 R14: ffff9ed088727f10 R15: ffff8a647793c7a0
[   24.297794] FS:  00007fa30c6d1b80(0000) GS:ffff8a647e980000(0000) knlGS:0000000000000000
[   24.297795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.297796] CR2: 00007f107be03448 CR3: 0000000fd6396000 CR4: 0000000000340ee0
[   24.297797] Call Trace:
[   24.297841]  gmc_v10_0_sw_fini+0x9/0x30 [amdgpu]
[   24.297884]  amdgpu_device_fini+0x270/0x439 [amdgpu]
[   24.297920]  amdgpu_driver_unload_kms+0x44/0x80 [amdgpu]
[   24.297955]  amdgpu_pci_remove+0x3c/0x70 [amdgpu]
[   24.297958]  pci_device_remove+0x44/0xb0
[   24.297961]  device_release_driver_internal+0xdf/0x1c0
[   24.297962]  unbind_store+0xfa/0x130
[   24.297965]  kernfs_fop_write+0xd3/0x1c0
[   24.297968]  vfs_write+0xf0/0x220
[   24.297970]  ksys_write+0x6b/0x100
[   24.297973]  do_syscall_64+0x3e/0x70
[   24.297975]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   24.297977] RIP: 0033:0x7fa30c804133
[   24.297979] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[   24.297979] RSP: 002b:00007fff35468a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   24.297980] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fa30c804133
[   24.297981] RDX: 000000000000000d RSI: 0000558a2be2d580 RDI: 0000000000000001
[   24.297982] RBP: 0000558a2be2d580 R08: 000000000000000a R09: 000000000000000c
[   24.297982] R10: 0000558a2b3adbc2 R11: 0000000000000246 R12: 000000000000000d
[   24.297983] R13: 00007fa30c8cf6a0 R14: 000000000000000d R15: 00007fa30c8ca8a0
[   24.297984] ---[ end trace dc5177fd4508bc7d ]---
[   24.297989] ------------[ cut here ]------------
[   24.297990] Still active user space clients!
[   24.298048] WARNING: CPU: 6 PID: 2652 at drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:105 amdgpu_gem_force_release+0xf9/0x110 [amdgpu]
[   24.298049] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio kvm_amd amdgpu iwlmvm mfd_core gpu_sched ttm iwlwifi efivarfs
[   24.298058] CPU: 6 PID: 2652 Comm: vfio-bind Tainted: G        W         5.8.9-gentoo #7
[   24.298060] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 ACE (MS-7C35), BIOS 1.80 01/16/2020
[   24.298099] RIP: 0010:amdgpu_gem_force_release+0xf9/0x110 [amdgpu]
[   24.298102] Code: 48 c7 c7 5d 87 b4 c0 c6 05 70 f0 3c 00 01 e8 6e 79 e9 cf 0f 0b eb 8a 48 c7 c7 68 d5 b1 c0 c6 05 5a f0 3c 00 01 e8 57 79 e9 cf <0f> 0b e9 50 ff ff ff e8 3b 7b a4 d0 66 66 2e 0f 1f 84 00 00 00 00
[   24.298105] RSP: 0018:ffff9ed088727d80 EFLAGS: 00010282
[   24.298108] RAX: 0000000000000000 RBX: ffff8a6475968400 RCX: 0000000000000027
[   24.298111] RDX: 0000000000000027 RSI: 0000000000000092 RDI: ffff8a647e997368
[   24.298113] RBP: ffff8a6462020000 R08: ffff8a647e997360 R09: 000000000000052d
[   24.298114] R10: ffffffff922df938 R11: 0000000000000001 R12: 0000000000000001
[   24.298115] R13: ffff8a646bad70f0 R14: ffff8a646bad70d0 R15: ffff8a647793c7a0
[   24.298116] FS:  00007fa30c6d1b80(0000) GS:ffff8a647e980000(0000) knlGS:0000000000000000
[   24.298117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.298117] CR2: 00007f107be03448 CR3: 0000000fd6396000 CR4: 0000000000340ee0
[   24.298118] Call Trace:
[   24.298159]  gmc_v10_0_sw_fini+0x21/0x30 [amdgpu]

In other words, the unbind wasn’t completely successful: “Hotplug removal is not supported”, so this approach is not going to work. I can still see the card in Windows, but it doesn’t work properly and the drivers won’t completely recognize it. Device Manager shows the card had a failure (error 43).

The guest card does show the ‘vfio-pci’ driver in use after running the script:

$ lspci -nnk
32:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ff)
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
32:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38] (rev ff)
        Kernel driver in use: vfio-pci
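
The same check can be scripted without parsing lspci output by reading the ‘driver’ symlink that sysfs exposes for each bound device. This is only a sketch; `bound_driver` and `SYSFS_ROOT` are names made up for illustration, and addresses must be given in the full form used under /sys/bus/pci/devices (for example 0000:32:00.0).

```shell
#!/bin/bash
# Print the name of the kernel driver bound to a PCI device, or "none".
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
bound_driver() {
    local root="${SYSFS_ROOT:-/sys}" link
    # The 'driver' entry is a symlink that only exists while a driver is bound
    link=$(readlink "$root/bus/pci/devices/$1/driver") || { echo none; return 0; }
    echo "${link##*/}"
}

# Report each address given on the command line
for addr in "$@"; do
    printf '%s: %s\n' "$addr" "$(bound_driver "$addr")"
done
```

Note that seeing vfio-pci here only says which driver is bound now; it says nothing about whether amdgpu touched the card earlier in the boot.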

If I could get the amdgpu driver to avoid attaching to this specific card at boot, I should be fine. Any suggestions on what approach to take? I’ve been reading dozens of posts online of course but some of them are over 8 years old and confusing since the methods are not the same as today. Any help is appreciated. Thanks!

I’m on Ubuntu and I used this guide.

I basically put the PCI device IDs in the grub file like this:

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=10de:1b83,10de:10f0"

My Radeon RX550 binds perfectly to the vfio-pci driver (and so do my USB controller and NVMe drive). Something similar might work with Gentoo?
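
If it helps, the value for that kernel argument can be built from sysfs rather than typed by hand. The sketch below reads the vendor and device files for each PCI address and prints a ready-made `vfio-pci.ids=` string. The function name and the `SYSFS_ROOT` override are made up for illustration, and matching by ID claims every device with that vendor:device pair, not just one card.

```shell
#!/bin/bash
# Build a "vfio-pci.ids=..." kernel argument from one or more PCI addresses.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
vfio_ids_arg() {
    local root="${SYSFS_ROOT:-/sys}" addr vendor device ids=""
    for addr in "$@"; do
        vendor=$(cat "$root/bus/pci/devices/$addr/vendor") || return 1
        device=$(cat "$root/bus/pci/devices/$addr/device") || return 1
        # sysfs reports values such as 0x1002; the kernel argument omits the 0x
        ids+="${ids:+,}${vendor#0x}:${device#0x}"
    done
    echo "vfio-pci.ids=$ids"
}

# Example: ./vfio-ids.sh 0000:32:00.0 0000:32:00.1
if [ "$#" -gt 0 ]; then
    vfio_ids_arg "$@"
fi
```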

Like @Pleytos said, try binding via kernel arguments from your bootloader.

The other thing to try is to make the amdgpu module depend on the vfio-pci module, so that vfio-pci can grab the card before the normal driver loads and screws things up. On Debian-based systems this is a softdep in your initramfs config, but idk what Gentoo would use.
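
For reference, a softdep is a line in a modprobe.d config file, and modprobe honors it whenever it loads amdgpu as a module, with or without an initramfs. Here is a rough sketch that writes such a file; the function name and the target path in the example are placeholders, not something any distro mandates. The softdep only controls load order, so the `options` line is what tells vfio-pci which devices to claim, and it matches by vendor:device ID, meaning every unbound device with those IDs gets claimed.

```shell
#!/bin/bash
# Write a modprobe.d fragment that loads vfio-pci ahead of amdgpu.
# $1 = output file (hypothetical example: /etc/modprobe.d/vfio.conf)
# $2 = comma-separated vendor:device IDs for vfio-pci to claim
write_vfio_modprobe_conf() {
    local conf="$1" ids="$2"
    printf '%s\n' \
        '# Load vfio-pci before amdgpu so it can claim matching devices first' \
        'softdep amdgpu pre: vfio-pci' \
        "options vfio-pci ids=$ids" > "$conf"
}

# Example: ./write-vfio-conf.sh /etc/modprobe.d/vfio.conf 1002:731f,1002:ab38
if [ "$#" -eq 2 ]; then
    write_vfio_modprobe_conf "$1" "$2"
fi
```

If amdgpu is loaded from an initramfs, the same file has to be included in that image for the ordering to take effect.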

I was about to try this, but then I realized that the PCI vendor:device IDs are the same for both cards, host and guest. I’ll still give it a try right now and see what happens, though.

2f:00.0 and 2f:00.1 are the host card; 32:00.0 and 32:00.1 are the guest card. The vendor:device IDs listed at the end of each line are identical for both cards.

$ lspci -nn
2f:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca)
2f:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
32:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ff)
32:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38] (rev ff)
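
When two cards share the same vendor:device IDs, one way to tell them apart is to match on PCI address instead, using the per-device ‘driver_override’ file in sysfs. Below is a minimal sketch of that idea, not a tested recipe for this exact setup. It assumes it runs as root before amdgpu has been loaded (where to hook that in on a given init system is left open), and `SYSFS_ROOT` is a made-up override used only for testing.

```shell
#!/bin/bash
# Reserve specific PCI addresses for vfio-pci, leaving identical cards alone.
# Assumption: runs as root before amdgpu is loaded.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
override_to_vfio() {
    local root="${SYSFS_ROOT:-/sys}" addr dev
    for addr in "$@"; do
        dev="$root/bus/pci/devices/$addr"
        if [ ! -e "$dev/driver_override" ]; then
            echo "no driver_override for device: $addr" >&2
            return 1
        fi
        # From now on only a driver with this exact name may bind the device
        echo vfio-pci > "$dev/driver_override"
    done
}

# Example: ./vfio-override.sh 0000:32:00.0 0000:32:00.1
if [ "$#" -gt 0 ]; then
    override_to_vfio "$@" || exit 1
    modprobe vfio-pci
    # Ask the PCI core to probe each still-unbound device again
    for addr in "$@"; do
        echo "$addr" > /sys/bus/pci/drivers_probe
    done
fi
```

Because the override is keyed on the address, the host card at 2f:00.0 is never touched and amdgpu can bind to it as usual when it loads later.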

Alright, I’ll look more into that. I don’t use an initramfs but can set one up if needed. I haven’t used one before, so I will have to do some research on what’s required and how the softdep option works.