Passthrough doesn't work with 5700 XT

Hi,

This is my first time posting here so sorry if I’ve posted in the wrong category.

I have a 3700x, an x570 Aorus Elite, a 5700 XT, and a HD 2400 PRO (just to drive my secondary monitor which is VGA only) in my system and wanted to try passthrough with the 5700XT instead of dual booting.

I tried it first with the 2400 just to test things: I added the PCI IDs to /etc/modprobe.d/vfio.conf, restarted X with a config that doesn’t use the card, unloaded radeon, and started the VM. It worked really well. I saw the VM’s login prompt on the display, and I could very easily stop the VM and use the card with X, then stop X and use it with the VM again, so I moved on to the 5700XT.

Since the 5700XT is my boot GPU, I ssh’ed into my machine from a laptop, stopped X, unbound the virtual console, unloaded amdgpu, and loaded vfio-pci. lspci showed that the 5700XT and its HDMI audio (which are the only two devices in their IOMMU group) were indeed using vfio-pci instead of their original drivers, and I got these messages in dmesg:

[  109.723121] [drm] amdgpu: finishing device.
[  109.767358] Evicting PASID 0x8002 queues
[  109.820732] [drm] amdgpu: ttm finalized
[  185.521544] vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
[  185.533054] vfio_pci: add [1002:731f[ffffffff:ffffffff]] class 0x000000/00000000
[  185.545501] vfio_pci: add [1002:ab38[ffffffff:ffffffff]] class 0x000000/00000000 
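
The lspci check described above can also be done by reading sysfs directly. Here is a minimal sketch, assuming the standard sysfs layout (/sys/bus/pci/devices/<address>/driver and .../iommu_group). The function names are made up for illustration, and 0000:0b:00.1 as the address of the HDMI audio function is an assumption; only 0000:0b:00.0 appears in the log.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None
    # The "driver" entry is a symlink to the driver's directory.
    return os.path.basename(os.readlink(link))


def iommu_group_members(pci_addr, sysfs_root="/sys"):
    """List the PCI addresses that share an IOMMU group with pci_addr."""
    group_devices = os.path.join(
        sysfs_root, "bus", "pci", "devices", pci_addr, "iommu_group", "devices"
    )
    if not os.path.isdir(group_devices):
        return []
    return sorted(os.listdir(group_devices))


if __name__ == "__main__":
    # 0000:0b:00.1 (HDMI audio) is assumed, not taken from the log.
    for addr in ("0000:0b:00.0", "0000:0b:00.1"):
        print(addr, bound_driver(addr), iommu_group_members(addr))
```

If both functions report vfio-pci and the group contains nothing else, the binding step itself probably went as intended.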

I’m not sure if ffffffff:ffffffff is normal, but whenever I tried starting the VM I got these kernel errors (https://pastebin.com/D7qdbL3i) and had to restart the machine. I really have no idea what is causing this. The only thing I can think of is that I haven’t done anything related to the vbios, but that shouldn’t make the host kernel print these errors.

Additional info about my setup:

  • Distro: Gentoo with Linux 5.6.2 (navi reset patch applied)

  • GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on mem_encrypt=off iommu=pt text video=efifb:off"

  • VM uses Q35 and OVMF

I’m really lost at this point, I couldn’t find anything similar on the net either. Thanks in advance.

Welcome to the forums @09d08!

I’ve got a really similar setup to yours. 3700x, x570 AsRock Steel Legend (should’ve gotten the elite), 5700XT for passthru, and a rx580 for the host OS (arch with kernel 5.6.2).

I’m not using modprobe to pass the device IDs though. I decided to pass the IDs as a kernel parameter instead. My 5700XT and its audio device are in separate IOMMU groups, fwiw.

I am seeing almost identical dmesg output when grep’ing for ‘vfio’

[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=b10e8e6d-deb2-4bc5-a1c6-6f030ace0c7d rw amd_iommu=on iommu=pt vfio-pci.ids=1002:731f,1002:ab38 loglevel=3
[    2.504497] vfio-pci 0000:12:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    2.520055] vfio_pci: add [1002:731f[ffffffff:ffffffff]] class 0x000000/00000000
[    2.536690] vfio_pci: add [1002:ab38[ffffffff:ffffffff]] class 0x000000/00000000
[    7.387363] vfio-pci 0000:12:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
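
To double-check which IDs the kernel was actually told to claim, the vfio-pci.ids parameter can be pulled out of the command line shown in the log above. This is only a sketch; the function name is made up for illustration.

```python
def vfio_ids_from_cmdline(cmdline):
    """Return the vendor:device pairs listed in vfio-pci.ids=..., or [] if absent."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names as equivalent,
        # so accept both vfio-pci.ids and vfio_pci.ids.
        if key.replace("-", "_") == "vfio_pci.ids":
            return [pair for pair in value.split(",") if pair]
    return []


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(vfio_ids_from_cmdline(f.read()))
```

Fed the command line from the log above, it should return the two IDs 1002:731f and 1002:ab38.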

I’ve been trying to boot off of an existing Windows disk, with no luck so far. I didn’t want to get too far down the rabbit hole while I am still using this computer for work, but after today I’ll try to get my passthrough working and will hopefully report back tonight.

EDIT: This is just a stab in the dark, but I wonder if you need to modify your initramfs to load the vfio-pci module earlier in the boot process?

Thanks for the friendly welcome. If you don’t want to use your 5700XT with the host you can try adding

 softdep amdgpu pre: vfio-pci

to a file in your /etc/modprobe.d to make sure vfio-pci gets loaded before amdgpu. I don’t know if Arch configures vfio-pci or vfio as built-in or as a module, but from what I understand that’s how most people do it. Also, you might want to try it with a Linux guest first so that you can ssh into it and see what the problem is.
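
For reference, the softdep line and the device IDs can live in the same modprobe.d file. The sketch below just renders that file’s contents as text. The function name is made up, and writing the result to /etc/modprobe.d/vfio.conf is an assumption based on the filename mentioned in the first post.

```python
def render_vfio_conf(ids, gpu_drivers=("amdgpu",)):
    """Build the text of a modprobe.d fragment that hands devices to vfio-pci."""
    # Tell vfio-pci which vendor:device IDs to claim when it loads.
    lines = ["options vfio-pci ids=" + ",".join(ids)]
    # Ask modprobe to load vfio-pci before each GPU driver.
    for driver in gpu_drivers:
        lines.append(f"softdep {driver} pre: vfio-pci")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_vfio_conf(["1002:731f", "1002:ab38"]), end="")
```

Note that modprobe.d options only matter when vfio-pci is built as a module. If it is built into the kernel, the IDs would have to go on the kernel command line instead.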

I have made some progress. I blacklisted amdgpu out of curiosity, used the second card/display to start the VM, and it worked just fine! dmesg didn’t show any errors in either the guest or the host.

It works just fine if amdgpu never gets loaded, but the issue is that I want to use the 5700XT with the host as well, and blacklisting amdgpu or loading vfio-pci via the initramfs makes that impossible.

I also noticed something else while trying: the kernel errors are not caused by starting the VM. I stopped X, unloaded amdgpu, loaded vfio-pci, and restarted X so that it only uses the 2400 PRO, but virt-manager was stuck at connecting. I couldn’t stop libvirtd, and the same errors as the ones in the pastebin link were in dmesg.

I then tried stopping libvirt and starting it again after loading vfio-pci. The VM did start without any errors in the host, but I couldn’t see the EFI splash screen and it got stuck at starting services. SSH revealed that the guest couldn’t initialize the GPU: “PSP firmware init failed” or something similar, I think.

For some reason libvirtd doesn’t like amdgpu being unloaded at all. Good luck with your own system, I hope you can make it work.

I might play around with that later tonight.

Arch’s default kernel builds vfio-pci as a module from what I understand, and the recommended way to force it to load before amdgpu is to modify /etc/mkinitcpio.conf (is this an Arch-specific util?)

MODULES=(... vfio_pci vfio vfio_iommu_type1 vfio_virqfd ...)
HOOKS=(... modconf ...)

and then regenerate the initramfs.
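
A quick way to sanity-check the ordering before regenerating is to parse the MODULES line and confirm vfio_pci comes ahead of the GPU driver. This is a rough sketch with made-up function names. It assumes MODULES sits on a single uncommented line, which a real mkinitcpio.conf (a sourced bash file) does not guarantee.

```python
import re


def modules_from_mkinitcpio(conf_text):
    """Return the module names from the first uncommented MODULES=(...) line."""
    for line in conf_text.splitlines():
        match = re.match(r"\s*MODULES=\((.*?)\)", line)
        if match:
            return match.group(1).split()
    return []


def vfio_loads_first(modules, gpu_driver="amdgpu"):
    """True if vfio_pci is listed, and listed before gpu_driver when both appear."""
    if "vfio_pci" not in modules:
        return False
    if gpu_driver not in modules:
        return True
    return modules.index("vfio_pci") < modules.index(gpu_driver)


if __name__ == "__main__":
    with open("/etc/mkinitcpio.conf") as f:
        print(vfio_loads_first(modules_from_mkinitcpio(f.read())))
```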

I’ve still got to figure out which usb controller to pass through and then I’ll give this a go. I’ll report back on how things go.

Small update…

I was able to get GPU pass through working… sort of… After a fresh boot things work fine, and I am able to use my windows VM. If I reboot the VM, I get the following error:
[image: screenshot of the error, not preserved]
I am starting to think that it might have something to do with the navi reset bug. I thought the fix had been in the kernel since 5.3, but I am going to try patching to see if that fixes anything.

@09d08, it seems like the [ffffffff:ffffffff] output in dmesg is normal, considering my setup prints the same thing and is semi-working. Maybe you’re missing the vfio_iommu_type1 module?
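
On the ffffffff:ffffffff part: in the kernel’s PCI ID matching, 0xffffffff is PCI_ANY_ID, i.e. a wildcard, so the inner pair would just mean “any subsystem vendor and device”. Here is a small sketch that parses those dmesg lines under that reading. The regex is based only on the lines quoted in this thread, and the function name is made up.

```python
import re

PCI_ANY_ID = 0xFFFFFFFF  # wildcard value used in kernel PCI ID tables

ADD_RE = re.compile(
    r"vfio_pci: add \[([0-9a-f]{4}):([0-9a-f]{4})\[([0-9a-f]+):([0-9a-f]+)\]\]"
)


def parse_vfio_add(line):
    """Parse a 'vfio_pci: add [...]' dmesg line into its ID fields, or None."""
    match = ADD_RE.search(line)
    if not match:
        return None
    vendor, device, subvendor, subdevice = (int(g, 16) for g in match.groups())
    return {
        "vendor": vendor,
        "device": device,
        # True when both subsystem fields are the wildcard.
        "any_subsystem": subvendor == PCI_ANY_ID and subdevice == PCI_ANY_ID,
    }


if __name__ == "__main__":
    sample = "[  185.533054] vfio_pci: add [1002:731f[ffffffff:ffffffff]] class 0x000000/00000000"
    print(parse_vfio_add(sample))
```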

That is the reset bug. Some people also get it working by suspending the host, so perhaps that can make it work for you. From what I understand, the fixes in the mainline kernel only make it possible for the amdgpu driver to recover from GPU crashes and errors by resetting the GPU. They don’t apply to VFIO, since you need to unload amdgpu to use VFIO. The VFIO module needs to be modified to fix the error.

By the way, I switched from Gentoo to Arch yesterday (same kernel) and it worked perfectly if I unloaded and detached amdgpu, loaded vfio, suspended the system, and then started the VM. Naively thinking that I had finally got it working, I converted my Windows installation into a VM, and now I get kernel errors again whenever I issue

virsh nodedev-detach
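
For anyone following along: virsh nodedev-detach takes a libvirt node device name rather than a raw PCI address. libvirt names PCI devices like pci_0000_0b_00_0, so the conversion is mechanical. A small sketch, with a made-up function name, using the 0000:0b:00.0 address from the dmesg output earlier in the thread as an example:

```python
import re

PCI_ADDR_RE = re.compile(
    r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def libvirt_nodedev_name(pci_addr):
    """Convert a PCI address like 0000:0b:00.0 to libvirt's pci_0000_0b_00_0 form."""
    match = PCI_ADDR_RE.fullmatch(pci_addr)
    if not match:
        raise ValueError(f"not a full PCI address: {pci_addr!r}")
    domain, bus, slot, function = (g.lower() for g in match.groups())
    return f"pci_{domain}_{bus}_{slot}_{function}"


if __name__ == "__main__":
    print(libvirt_nodedev_name("0000:0b:00.0"))
```

The names libvirt actually knows about can be listed with virsh nodedev-list.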

Also, I have vfio_iommu_type1 built into the kernel. These are the VFIO-related options of my kernel:

  • CONFIG_KVM_VFIO=y
  • CONFIG_VFIO_IOMMU_TYPE1=y
  • CONFIG_VFIO_VIRQFD=m
  • CONFIG_VFIO=y
  • CONFIG_VFIO_NOIOMMU is not set

  • CONFIG_VFIO_PCI=m
  • CONFIG_VFIO_PCI_VGA=y
  • CONFIG_VFIO_PCI_MMAP=y
  • CONFIG_VFIO_PCI_INTX=y
  • CONFIG_VFIO_PCI_IGD=y
  • CONFIG_VFIO_MDEV=y
  • CONFIG_VFIO_MDEV_DEVICE=y
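
To tell at a glance which of these are built in (=y) and which still have to be loaded as modules (=m), the kernel config can be parsed. This is a sketch with made-up function names. Reading /proc/config.gz assumes the kernel was built with CONFIG_IKCONFIG_PROC; otherwise point it at the .config in the kernel source tree.

```python
import gzip


def parse_kconfig(text):
    """Map CONFIG_* names to 'y', 'm', another value, or 'n' for 'is not set'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            options[key] = value
    return options


def built_as_modules(options, prefix="CONFIG_VFIO"):
    """Return the options under prefix that are built as loadable modules."""
    return sorted(k for k, v in options.items() if k.startswith(prefix) and v == "m")


if __name__ == "__main__":
    with gzip.open("/proc/config.gz", "rt") as f:
        print(built_as_modules(parse_kconfig(f.read())))
```

With the options listed above, only CONFIG_VFIO_PCI and CONFIG_VFIO_VIRQFD are modules. vfio_iommu_type1 is built in, so it cannot be the missing piece.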

The patch hasn’t made its way into the kernel. You need to patch the kernel yourself. Also, the patch isn’t perfect. E.g. rebooting macOS puts the GPU into a state it cannot recover from.

If you are also running a patched kernel, do you also need to suspend the host before starting the VM? I can’t get it to work without suspending, patched or not.