Disabling NVidia driver for single PCI device for VM Passthrough

Hello!

I’ve got 2 GPUs in my system (a 3070 and a 1050 Ti), and I’m looking to pass the 1050 Ti through to a VM while using the 3070 on the host system.

So far, I seem to have everything else working (KVM working, IOMMU enabled, the cards are in separate IOMMU groups, all that jazz).

My problem is that the host OS’s (Ubuntu 20.04) nvidia driver is grabbing the 1050 Ti, stopping the passthrough from working.

Every guide I’ve found just recommends disabling the nvidia kernel module, but that stops my 3070 from working.

Is there any way I can leave the nvidia driver enabled, but stop it from touching my 1050 PCI device?

This might help you:

Step 1: Enable IOMMU

Configuring the grub

Open the grub config file:

nano /etc/default/grub

Change the GRUB_CMDLINE_LINUX_DEFAULT line to one of the following (on an Intel CPU, use intel_iommu=on in place of amd_iommu=on):

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"

or

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

PT Mode

Both Intel and AMD chips can use the additional parameter iommu=pt, added in the same way as above. This enables the IOMMU translation only when necessary, and can thus improve performance for PCIe devices not used in VMs.

Save the changes, update grub and reboot.

update-grub
reboot
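
After the reboot, it can be worth confirming that the kernel actually booted with the new parameter. The helper below is only a sketch: cmdline_has is a made-up name, and it simply looks for an exact, space-separated parameter in /proc/cmdline (or in a string you pass in).

```bash
# Succeeds if the kernel command line contains the given parameter.
# The second argument is optional and defaults to the running kernel's
# command line, so the function can also be tried on a sample string.
cmdline_has() {
    local param="$1"
    local cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Example:
#   cmdline_has amd_iommu=on && echo "amd_iommu=on is active"
```

Matching on whole words avoids false positives such as iommu=on matching inside amd_iommu=on.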

Step 2: VFIO Modules

Add VFIO Kernel Modules

We have to make sure the following modules are loaded. This can be achieved by adding them to /etc/modules.

Open the /etc/modules file:

nano /etc/modules

Add the following content to the /etc/modules file:

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

Then save and exit.

Update initramfs

After changing anything related to modules, we need to refresh our initramfs. Then reboot to bring the changes into effect.

update-initramfs -u -k all
reboot

Verify that IOMMU is enabled

dmesg | grep -e DMAR -e IOMMU -e AMD-Vi

This should display that IOMMU, Directed I/O or Interrupt Remapping is enabled.

Depending on hardware and kernel the exact message can vary.

Step 3: IOMMU Interrupt Remapping

It is not possible to use PCI passthrough without interrupt remapping.

Systems with an Intel processor and chipset that support Intel Virtualization Technology for Directed I/O (VT-d) but not interrupt remapping will fail with one of the errors shown later in this step. Interrupt remapping support is provided in newer processors and chipsets (both AMD and Intel).

Identify Interrupt Remapping support

Identify if the system has support for Interrupt Remapping:

dmesg | grep 'remapping'

If we see one of the following lines then Interrupt Remapping is supported:

  • AMD-Vi: Interrupt remapping enabled
  • DMAR-IR: Enabled IRQ remapping in x2apic mode (x2apic can be different on old CPUs, but should still work)
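
If you want to script that check, here is a small sketch. The function name remapping_enabled is made up, and it only looks for the two messages quoted above, so a kernel that words the message differently would be reported as unsupported.

```bash
# Reads kernel log text on stdin and succeeds if it contains one of the
# two "remapping enabled" messages listed above.
remapping_enabled() {
    grep -Eq 'AMD-Vi: Interrupt remapping enabled|DMAR-IR: Enabled IRQ remapping'
}

# Example:
#   dmesg | remapping_enabled && echo "Interrupt remapping is enabled"
```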

If we at some point encounter one of the following errors, it means that we have an Interrupt Remapping issue:

Failed to assign device "[device name]": Operation not permitted

or

Interrupt Remapping hardware not found, passing devices to unprivileged domains is insecure.

If the system doesn’t support Interrupt Remapping, we can allow unsafe interrupts with:

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf

Step 4: Verify IOMMU Isolation

For PCI passthrough to work, we need a dedicated IOMMU group for all PCI devices we want to assign to a VM.

Verify that we have dedicated IOMMU groups by running:

find /sys/kernel/iommu_groups/ -type l

Our output should look something like this:

/sys/kernel/iommu_groups/23/devices/0000:08:00.1
/sys/kernel/iommu_groups/13/devices/0000:00:08.2
/sys/kernel/iommu_groups/31/devices/0000:0c:00.0
/sys/kernel/iommu_groups/31/devices/0000:0c:00.1
/sys/kernel/iommu_groups/3/devices/0000:00:02.0
/sys/kernel/iommu_groups/21/devices/0000:03:05.0
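
The raw find output is hard to read when there are many devices. As a sketch (list_iommu_groups is a hypothetical helper, not a standard tool), the same sysfs tree can be printed as one "group: address" pair per line:

```bash
# Prints "group <n>: <pci address>" for every device under an
# iommu_groups tree. The root defaults to the real sysfs location;
# another directory can be passed in to try the function out.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # the glob matched nothing
        group="${dev%/devices/*}"      # drop "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

# Example: add the lspci description for each address
#   list_iommu_groups | while read -r _ grp addr; do
#       printf '%s ' "$grp"; lspci -nns "$addr"
#   done
```

The GPU you want to pass through (and its audio function) should not share a group number with devices the host still needs.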

Step 5: Blacklisting Drivers

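We don’t want the host system (Proxmox in the original guide) utilizing the GPU(s) we pass through, so we need to blacklist the drivers.

Depending on our GPU card(s) vendor, run the following. Note that blacklisting nvidia disables that driver for every NVIDIA card in the system, so skip that line if the host still needs a second NVIDIA GPU: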

NVIDIA

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidiafb" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf

AMD

echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf
echo "blacklist amdgpu" >> /etc/modprobe.d/blacklist.conf

Step 5 (continued): Determine the PCI Card Address

Locate the PCI card(s) using lspci. The address should be in the form of 01:00.0

lspci -nn

To locate AMD-specific cards, use the following command:

lspci -nn | grep Radeon

To locate NVIDIA specific cards, use the following command:

lspci -nn | grep NVIDIA

Our shell window should output a bunch of stuff. Look for the line(s) that show the video card. It’ll look something like this:

0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
0b:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
0b:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
0b:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)

Make note of the first set of numbers (e.g. 0b:00.0 and 0b:00.1). We’ll need them for the next step.

Run the command below. Replace 0b:00 with whatever number was next to your GPU when you ran the previous command:

lspci -n -s 0b:00

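Doing this should output the vendor:device ID pair for each function of your GPU card, usually one for the GPU and one for the audio device (newer cards may also list USB and UCSI controllers, as in this example). It’ll look a little something like this: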

0b:00.0 0300: 10de:1e07 (rev a1)
0b:00.1 0403: 10de:10f7 (rev a1)
0b:00.2 0c03: 10de:1ad6 (rev a1)
0b:00.3 0c80: 10de:1ad7 (rev a1)

What we want to keep are these vendor:device ID pairs: 10de:1e07, 10de:10f7, 10de:1ad6 and 10de:1ad7.
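
To avoid copying the IDs by hand, the third column of the lspci -n output can be joined into the comma-separated form that vfio-pci expects. This is just a sketch, and ids_from_lspci is a made-up name:

```bash
# Reads `lspci -n` output on stdin and prints the vendor:device IDs
# (third column) as a single comma-separated list.
ids_from_lspci() {
    awk 'NF >= 3 { print $3 }' | paste -sd, -
}

# Example:
#   lspci -n -s 0b:00 | ids_from_lspci
```

With the sample output above, this prints 10de:1e07,10de:10f7,10de:1ad6,10de:1ad7.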

Now we add the GPU’s IDs to the vfio-pci module options, using the commands in the next step (remember to replace the IDs with your own!):

Step 6: GPU OVMF PCI Passthrough (recommended)

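Using OVMF, we can add disable_vga=1 to the vfio-pci module options, which tries to opt the devices out of VGA arbitration if possible. Run only the line that matches your card (the 1002:... IDs are AMD examples, the 10de:... IDs are NVIDIA; the first line is the AMD variant without disable_vga):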

echo "options vfio-pci ids=1002:67df,1002:aaf0" > /etc/modprobe.d/vfio.conf
echo "options vfio-pci ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7 disable_vga=1" > /etc/modprobe.d/vfio_nvidia.conf
echo "options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1" > /etc/modprobe.d/vfio_amd.conf
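
For the situation in the question (the nvidia driver has to stay loaded for the 3070), binding by ID only helps if vfio-pci claims the 1050 Ti before nvidia gets to it. Since the two cards have different device IDs, listing only the 1050 Ti’s IDs leaves the 3070 alone. One commonly used way to influence the load order is a softdep line in modprobe.d. The sketch below assumes that approach; vfio_conf is a hypothetical helper name, and whether the softdep is enough depends on how your initramfs loads modules.

```bash
# Prints modprobe.d lines that bind the given IDs to vfio-pci and ask
# modprobe to load vfio-pci before the nvidia module.
# vfio_conf is a hypothetical helper, not part of any package.
vfio_conf() {
    local ids="$1"    # comma-separated vendor:device IDs
    printf 'options vfio-pci ids=%s disable_vga=1\n' "$ids"
    printf 'softdep nvidia pre: vfio-pci\n'
}

# Example (as root), using the sample IDs from this post:
#   vfio_conf 10de:1e07,10de:10f7,10de:1ad6,10de:1ad7 > /etc/modprobe.d/vfio.conf
```

If the host also loads nouveau or other NVIDIA-related modules, they may need an equivalent softdep line.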

NVIDIA Tips

Some Windows applications like GeForce Experience, Passmark Performance Test and SiSoftware Sandra can crash the VM. To avoid this we need to add:

echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf

If we see a lot of warning messages in the dmesg system log, add the following instead:

echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf

Step 7: Update and Reboot

update-grub
update-initramfs -u -k all
reboot
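
Once the system is back up, lspci -nnk shows which driver ended up owning each device in its "Kernel driver in use" line. A minimal sketch for pulling that out (driver_in_use is a made-up name):

```bash
# Reads `lspci -k` output on stdin and prints the driver name from each
# "Kernel driver in use:" line.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Example:
#   lspci -nnk -s 0b:00 | driver_in_use
```

For the card being passed through, every function should report vfio-pci; the card kept on the host should still report nvidia.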

I had the same issue with AMD cards. The solution was to add the following

vfio-pci.ids=1002:67df,1002:aaf0

to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.

Replace 1002:67df,1002:aaf0 with the appropriate ids for your card. After doing that, run sudo update-grub and reboot.

Here’s an example of what it should look like:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ... vfio-pci.ids=1002:67df,1002:aaf0"