GPU Passthrough: Virtual Machine Manager crashing when trying to start VM

This is my first time trying to set up GPU passthrough on a desktop system, but it is not working as it should.
I mostly followed this tutorial:

With two exceptions: I only passed through the GPU's video and audio functions, no NVMe; and I didn't apply any Error Code 43 mitigations, since the second card I am passing through is an RTX 4000 Ada, which shouldn't need them.

But when I try to start the VM, it does nothing for a few minutes and then Virtual Machine Manager crashes. After restarting it, it just says “QEMU/KVM: Connecting…”

In case it has any relevance:
I am running an AMD Ryzen Threadripper PRO 5955WX on an ASUS WRX80E (Version 2) with 256GB RAM and two GPUs: an RTX 3090 (currently used for the host system) and an RTX 4000 Ada which I am trying to hand to my Windows 11 VM.

The RTX 4000 Ada is in its own IOMMU group, so there should be no problems from that side.

Happy for any pointers to get this working.


Can you please check with the following command whether the kernel driver in use for the GPU you want to pass through is in fact vfio-pci:

sudo lspci -nnk

Also, can you please check with the following command whether hardware virtualization is actually enabled:

sudo lscpu | grep -i virtualization

I think you might be onto something:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104GL [RTX 4000 SFF Ada Generation] [10de:27b0] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:16fa]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:16fa]
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

even though the tutorial had instructions to detach and reattach the GPU through hooks on start/shutdown of the VM.

The virtualization check returns nothing, though it is turned on in the UEFI.

Yes, the GPU is still bound to the nvidia driver, and when you try to start the VM it crashes because it cannot attach the PCI device. That is definitely a problem. In the tutorial you linked, the author uses custom scripts to attach and detach the GPU on demand, but it seems those are not working correctly in your case. If you want to follow that setup, I'll leave debugging these scripts to you.

The other method is to attach the vfio-pci driver on boot. This means the GPU will be unavailable to the host, but can be attached freely to VMs at will.
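For reference, binding on boot usually comes down to telling vfio-pci which device IDs to claim and making sure it wins over the NVIDIA driver. A minimal sketch, assuming the IDs from your lspci output and Debian/Ubuntu-style tooling (exact file paths and initramfs commands vary by distro):

```shell
# /etc/modprobe.d/vfio.conf
# Claim the RTX 4000 Ada's video and audio functions by vendor:device ID
options vfio-pci ids=10de:27b0,10de:22bc
# Ensure vfio-pci is loaded before the NVIDIA driver can grab the card
softdep nvidia pre: vfio-pci
```

After that, rebuild the initramfs (e.g. `sudo update-initramfs -u` on Debian/Ubuntu derivatives) and reboot, then verify with `lspci -nnk` that the kernel driver in use is vfio-pci.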

That is interesting, for me it returns a line like the following to signal that AMD virtualization is enabled for my CPU.

Virtualization:                       AMD-V

It confused me too, but sudo kvm-ok returns

INFO: /dev/kvm exists
KVM acceleration can be used

Have you placed the scripts in the directories that correspond to the name of your VM? The win10 folder in qemu.d is a placeholder that needs to be renamed to your VM's name.
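With the common libvirt hook-helper layout, the resulting paths would look roughly like this. A sketch, assuming a VM named win11 (substitute your actual VM name):

```shell
/etc/libvirt/hooks/qemu.d/win11/prepare/begin/bind_vfio.sh
/etc/libvirt/hooks/qemu.d/win11/release/end/unbind_vfio.sh
```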

Yes, the names match; I guess I will try to attach the vfio-pci driver on boot instead for now.

Do you have the correct addresses in /etc/libvirt/hooks/kvm.conf? If there is an error or typo in there, everything else will fail.

This tutorial is generally interesting to me, so I might try it out myself, but this is all I can advise as of right now just from looking at it.

Double checked, but it should be correct:

IOMMU Group 74 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104GL [RTX 4000 SFF Ada Generation] [10de:27b0] (rev a1)
IOMMU Group 74 01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)

which becomes

## Virsh devices
VIRSH_GPU_VIDEO=pci_0000_01_00_0
VIRSH_GPU_AUDIO=pci_0000_01_00_1

I am trying this on my end right now, but I'll need a little while, possibly a day or so, because I have other tasks on my to-do list as well.


So I won’t be able to investigate this as much as I had hoped, because I cannot execute bind_vfio.sh at all. I removed the unused variables from the script, but when I run it my GNOME desktop environment completely freezes and I have to reboot the machine. I have a separate Intel GPU that the screen is attached to, but it seems GNOME really does not like attaching and detaching GPUs on the fly.

What OS are you on? Maybe we can at least get vfio-pci loaded on boot.

I’m on Pop!_OS. I already did some basic setup, to the point where my VM can be started, but all I get is a black screen… I’m just trying to find out what causes that.

Do you have a monitor or dummy plug attached to one of the HDMI or DisplayPort ports of the guest GPU?

I edited /etc/modules-load.d/vfio.conf and /etc/modprobe.d/nvidia.conf to attach my second GPU to VFIO and to ensure VFIO is loaded before the Nvidia driver.
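For anyone following along, those two files look roughly like this in my setup (the IDs are taken from the lspci output above; exact contents may differ on other distros):

```shell
# /etc/modules-load.d/vfio.conf — load the VFIO modules at boot
vfio
vfio_iommu_type1
vfio_pci

# /etc/modprobe.d/nvidia.conf — claim the guest GPU with vfio-pci
# before the NVIDIA driver can bind it
options vfio-pci ids=10de:27b0,10de:22bc
softdep nvidia pre: vfio-pci
```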
I now have this output for sudo lspci -nnk

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104GL [RTX 4000 SFF Ada Generation] [10de:27b0] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:16fa]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:16fa]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

A monitor via DisplayPort.

This looks good!