Can't unbind 5700 XT from amdgpu - always hangs and then prevents system shutdown

Hello.

I have a newly built PC and have been working on setting up GPU passthrough for the first time - so I am a beginner. I’ve managed to get a basic passthrough setup working, by binding the guest GPU to vfio-pci on boot. Using the suspend-to-RAM trick, I am able to work around the AMD reset bug and launch the VM multiple times. So far, so good.

However, I don’t really want to have my guest GPU dedicated exclusively to the Windows guest. I want to be able to hot-swap it between the host and guest, making use of it from the host via DRI_PRIME=1. This means I need to be able to bind/unbind it to/from amdgpu. And I have been having a nightmare of a time trying to accomplish this. I can bind and unbind from vfio-pci just fine with no problems. I can bind to amdgpu just fine with no problems. But attempting to unbind from amdgpu always hangs with the command never returning, and I either get a kernel panic or just become unable to properly shut down the PC.

My system info:

  • Distro: Fedora Silverblue
  • Kernel: 5.3.16-300.fc31.x86_64
  • DE: GNOME with Wayland
  • Motherboard: MSI X570 Gaming Pro Carbon
  • Host GPU: AMD Radeon RX 550
  • Guest GPU: AMD Radeon RX 5700 XT
  • IOMMU groups: Both GPUs, and their HDMI audio components, are all in separate IOMMU groups (i.e. four separate groups).
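
For anyone wanting to check the same grouping on their own machine, here is a minimal sketch of how it can be read from sysfs. It assumes the kernel exposes groups under /sys/kernel/iommu_groups, which is the usual layout when the IOMMU is enabled. The function name `iommu_groups` and the `sysfs_root` parameter are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path


def iommu_groups(sysfs_root="/sys"):
    """Map each IOMMU group number to a sorted list of PCI addresses in it.

    sysfs_root is a hypothetical parameter so the function can be pointed
    at a fake directory tree; on a real system the default is used.
    """
    groups = {}
    base = Path(sysfs_root) / "kernel" / "iommu_groups"
    if not base.is_dir():
        # IOMMU disabled, or sysfs not laid out as assumed here.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"group {number}: {', '.join(devices)}")
```

A GPU and its HDMI audio function showing up in separate groups, as described above, would appear as two different group numbers each holding a single address.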

My current setup:

  • I am using libvirt with virsh and Virtual Machine Manager.
  • I’ve followed the basic steps to set up a VM, pass through an NVMe drive, route audio to PulseAudio, use evdev passthrough for the keyboard and mouse, and am passing through my guest GPU and its HDMI audio component.
  • The guest GPU and its HDMI audio component are both using vfio-pci from boot, accomplished via kernel arguments.

What I’ve tried to get hot-swap working:

  • I’ve tried allowing the system to boot with amdgpu, and then simply launching the VM so that it switches the driver over automatically. This causes a kernel panic immediately.
  • I’ve tried booting the system with vfio-pci, and then manually switching over to amdgpu. Upon doing so, the system detects the new graphics card and it becomes usable with DRI_PRIME=1. Using lsof I can see that both GPUs are in use by systemd, systemd-logind, and gnome-shell. However, attempting to unbind from amdgpu by echoing 1 to /sys/bus/pci/devices/ID/driver/unbind causes the command to hang. I am not even able to cancel/kill it. It just hangs forever and prevents system shutdown.
  • I’ve tried switching to a tty, killing GNOME Shell and GDM entirely, and then manually swapping from vfio-pci to amdgpu. This time, lsof shows that nothing is using the guest GPU. Even so, attempting to unbind from amdgpu still causes the command to hang forever, and also prevents system shutdown.
  • I’ve tried both Xorg and Wayland. It works even worse in Xorg, immediately freezing the display when attempting to unbind the guest GPU. At least in Wayland the display will continue to work, even if I can no longer shut down the PC.
  • I’ve tried using /remove instead of /unbind, and there’s no difference there - I assume it tries to unbind first anyway so this is probably to be expected.
  • I’ve tried suspending to RAM after the unbind attempt. No difference.
  • I’ve probably tried other things that I’m forgetting to mention here.
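
For reference, the manual driver swap described in the list above can be sketched roughly as follows. This is only an illustration of the generic sysfs mechanism (unbind, then `driver_override`, then `drivers_probe`), not the exact commands used in this thread. The helper names `current_driver` and `rebind`, the `sysfs_root` parameter, and the PCI address in the usage comment are all hypothetical. Note that the `unbind` file is normally written with the device's PCI address, while a bare `1` is what the `remove` file takes.

```python
from pathlib import Path


def current_driver(address, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None."""
    link = Path(sysfs_root) / "bus" / "pci" / "devices" / address / "driver"
    # The "driver" entry is a symlink to the driver's directory when bound.
    return link.resolve().name if link.exists() else None


def rebind(address, new_driver, sysfs_root="/sys"):
    """Unbind a PCI device from its current driver and hand it to new_driver.

    sysfs_root is a hypothetical parameter for testing against a fake tree.
    On a real system this needs root privileges.
    """
    pci = Path(sysfs_root) / "bus" / "pci"
    device = pci / "devices" / address
    if (device / "driver").exists():
        # This is the write that reportedly never returns when the
        # current driver is amdgpu.
        (device / "driver" / "unbind").write_text(address)
    # Restrict which driver is allowed to claim the device next.
    (device / "driver_override").write_text(new_driver)
    # Ask the PCI core to probe drivers for this device again.
    (pci / "drivers_probe").write_text(address)


# Usage (hypothetical address, run as root):
#   rebind("0000:0b:00.0", "amdgpu")    # vfio-pci -> amdgpu
#   rebind("0000:0b:00.0", "vfio-pci")  # amdgpu -> vfio-pci (the step that hangs)
```

Going through `driver_override` rather than writing to a specific driver's `bind` file is a design choice in this sketch; it keeps the same code path for both directions of the swap.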

I’d greatly appreciate any help you guys could offer. Thanks.

What you’re trying to do is currently unsupported by the AMDGPU driver, as it’s not something that AMD sees as a priority. And even if you could unbind it, it very likely won’t work, as the AMD reset sequence is broken.

Please see this thread for more information:

Note that even the v2 patch further down in this thread is still not a complete solution.


Huh. I was told by another person that this was definitely not the reset bug, so I believed it to be a separate issue. I am a bit confused then as to why the suspend-to-RAM trick lets me work around the issue when using vfio-pci and starting/stopping the VM, but doesn’t work for unbinding from amdgpu. If it’s the reset bug in both cases, then why does the workaround only work for one of the two? I guess it just depends on the specific state the card is left in?

Well, thanks for letting me know. I’ll just wait for the kernel patch to get upstreamed, and have my GPU dedicated exclusively to the VM in the meantime. Thanks a lot for all your hard work on this stuff. It is appreciated!

It isn’t exactly the reset bug or a separate issue; it’s a combination of both. In your case it’s more that the amdgpu driver’s unbind code path isn’t complete and doesn’t work properly.

The suspend-to-RAM trick works in the vfio-pci case because it resets the GPU by literally powering it down.
