Can’t get vfio-pci to attach to GPU

I’ve tried everything I can think of, but I can’t get vfio-pci to show up as the driver for an Nvidia Tesla M2090. Here is an SE question that shows all my config files. I’ve tried blacklisting nouveau as described in other threads here, but then it just shows no driver loaded. Any ideas?

unix stackexchange com/questions/681018/cant-get-vfio-pci-driver-to-load-for-nvidia-gpu

One thing I’ve noticed is that disabling CSM in UEFI settings can cause issues when passing through a GPU. Enabling CSM can be worth a shot.

If that doesn’t help, then posting info about your config and what steps you took to isolate the GPU will help with next troubleshooting steps.

Thanks, I’ll try the CSM enable.

All the config printouts are in the StackExchange link above; just add two periods where the spaces are, or where the underscores are below. I’ll copy them over below to see how they get formatted.

unix_stackexchange_com/questions/681018/cant-get-vfio-pci-driver-to-load-for-nvidia-gpu


Okay, I’m not getting any further, so I’m asking for help… I’ve tried everything I can think of or find online. I’m trying to get GPU passthrough working so I can use the card in a VM with virt-manager/KVM.

I mainly followed the guide below: I set up all the files, updated the kernel, and set the GRUB lines. I can’t get any output from dmesg | grep vfio, following another question (below), so maybe that’s a clue. One answer said the vfio modules are built into the kernel, so lsmod won’t show them, and my kernel config file shows vfio entries. I’ve used softdep pre: lines to try to load vfio-pci before the nvidia driver. I was able to use a blocklist.conf to block the nvidia driver, but my display card is also Nvidia, and I couldn’t get to a shell in recovery mode.

github_com/NVIDIA/deepops/blob/master/virtual/README.md#bootloader-changes

askubuntu_com/questions/1247058/how-do-i-confirm-that-vfio-is-working-in-20-04

---
lspci -nn | grep NVIDIA
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF108GL [Quadro 600] [10de:0df8] (rev a1)
03:00.1 Audio device [0403]: NVIDIA Corporation GF108 High Definition Audio Controller [10de:0bea] (rev a1)
08:00.0 3D controller [0302]: NVIDIA Corporation GF110GL [Tesla M2090] [10de:1091] (rev a1)
08:00.1 Audio device [0403]: NVIDIA Corporation GF110 High Definition Audio Controller [10de:0e09] (rev a1)
---
lspci -nnk -d 10de:1091
08:00.0 3D controller [0302]: NVIDIA Corporation GF110GL [Tesla M2090] [10de:1091] (rev a1)
        Subsystem: NVIDIA Corporation GF110GL [Tesla M2090] [10de:0887]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia
"linux /boot/vmlinuz root=UUID=$uuid acpi=noirq intel_iommu=on iommu=pt vfio-pci ids=10de:1091,10de:0e09  vfio_iommu_type1 allow_unsafe_interrupts=1"

I tried both vfio_iommu_type1 allow_unsafe_interrupts=1 and vfio_iommu_type1.allow_unsafe_interrupts=1 .

CONFIG_VFIO_IOMMU_TYPE1=y
CONFIG_VFIO_VIRQFD=y
CONFIG_VFIO=y
CONFIG_VFIO_NOIOMMU=y
CONFIG_VFIO_PCI=y
CONFIG_VFIO_PCI_VGA=y
CONFIG_VFIO_PCI_MMAP=y
CONFIG_VFIO_PCI_INTX=y
CONFIG_VFIO_PCI_IGD=y
CONFIG_VFIO_MDEV=m
CONFIG_VFIO_MDEV_DEVICE=m
---
grep -oE 'svm|vmx' /proc/cpuinfo | uniq
vmx
---
cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

bonding
pci_stub
vfio
vfio_iommu_type1
vfio_pci
kvm
kvm_intel
---
cat /etc/modules-load.d/vfio-pci.conf
vfio-pci
---
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1091,10de:0e09
options vfio_iommu_type1 allow_unsafe_interrupts=1
---
cat /etc/modprobe.d/nvidia.conf 
softdep nvidia_384 pre: vfio-pci
#softdep radeon pre: vfio-pci
#softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
softdep nouveau pre: vfio-pci
softdep nvidia pre: vfio-pci
softdep nvidia* pre: vfio-pci
#softdep drm pre: vfio-pci
#softdep xhci_hdc pre: vfio-pci
#options kvm_amd avic=1
---
modprobe -c | grep vfio
options vfio_pci ids=10de:1091,10de:0e09
options vfio_iommu_type1 allow_unsafe_interrupts=1
softdep mdev post: vfio_mdev
softdep nvidia_384 pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
softdep nouveau pre: vfio-pci
softdep nvidia pre: vfio-pci
softdep nvidia* pre: vfio-pci
---
cat /etc/initramfs-tools/modules
# List of modules that you want to include in your initramfs.
# They will be loaded at boot time in the order below.
#
# Syntax:  module_name [args ...]
#
# You must run update-initramfs(8) to effect this change.
#
# Examples:
#
# raid1
# sd_mod

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
vhost-net
---
journalctl -b | grep vfio
Dec 10 19:35:17 osboxes kernel: Command line: BOOT_IMAGE=/boot/vmlinuz root=UUID=ef2ecb3b-8e9a-4b20-bf15-47e0c7c98a1f acpi=noirq intel_iommu=on iommu=pt vfio-pci ids=10de:1091,10de:0e09 vfio_iommu_type1 allow_unsafe_interrupts=1
Dec 10 19:35:17 osboxes kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz root=UUID=ef2ecb3b-8e9a-4b20-bf15-47e0c7c98a1f acpi=noirq intel_iommu=on iommu=pt vfio-pci ids=10de:1091,10de:0e09 vfio_iommu_type1 allow_unsafe_interrupts=1
Dec 10 19:35:17 osboxes systemd-modules-load[518]: Module 'vfio' is built in
Dec 10 19:35:17 osboxes systemd-modules-load[518]: Module 'vfio_iommu_type1' is built in
Dec 10 19:35:17 osboxes systemd-modules-load[518]: Module 'vfio_pci' is built in
Dec 10 19:35:17 osboxes systemd-modules-load[518]: Module 'vfio_pci' is built in
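
The “is built in” lines above are worth a second look. When vfio-pci is compiled into the kernel (CONFIG_VFIO_PCI=y), modprobe never loads it, so the options and softdep lines under /etc/modprobe.d do not reach it; its parameters have to come from the kernel command line in module.parameter=value form. Below is a minimal sketch for checking whether a module is built in, assuming the usual /lib/modules/$(uname -r)/modules.builtin file is present. The name module_is_builtin is a hypothetical helper, not an existing command, and the optional second argument exists only so the function can be pointed at a sample file.

```shell
# module_is_builtin NAME [FILE]
# Succeeds if NAME is listed in the kernel's modules.builtin file.
# Hypothetical helper; FILE defaults to the running kernel's list.
module_is_builtin() {
    # Dashes and underscores are interchangeable in module names.
    name=$(printf '%s' "$1" | tr '-' '_')
    file=${2:-/lib/modules/$(uname -r)/modules.builtin}
    # Entries look like kernel/drivers/vfio/pci/vfio-pci.ko,
    # so strip the directory and the .ko suffix before comparing.
    sed -e 's#.*/##' -e 's#\.ko$##' "$file" | tr '-' '_' | grep -x "$name" >/dev/null
}

# Example (not run here):
#   module_is_builtin vfio-pci && echo "built in: pass ids on the kernel command line"
```

If the check succeeds for vfio-pci, the files under /etc/modprobe.d and /etc/modules are not where the device IDs get picked up.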

EDIT: Yeah, after only blacklisting nouveau, which still caused no driver to be loaded, I removed all the settings except the nouveau blacklist, and even the nvidia driver doesn’t show… Take off that blacklist and everything is fine.

There wasn’t a CSM setting on this machine. It’s a Dell T5600 workstation. However, there was a BIOS setting to specify the primary graphics slot, so I set that, figuring the firmware might be scanning early and blocking vfio-pci somehow. Didn’t help… but worth a try.

The only hint I can find is that I’ve also added a USB device to the vfio-pci ids, and its driver doesn’t change either. So my problem seems to be getting vfio-pci to attach to any device at all. I don’t know where to start troubleshooting. It is a built-in module, so it doesn’t show in lsmod. My dmesg is flooded with evbug messages, possibly related, but they give no further info besides an input number, which seems to be for my keyboard and mouse. I’ll try to silence them or check right after another reboot. No hints in the boot log either, that I could find. I’ll post:

pastebin_com/D4NcJm0p

How do I figure out which PCI bridge and IOMMU group the GPU sits behind? Maybe I need to include the bridge in the ids.

IOMMU Group 0:
        00:00.0 Host bridge [0600]: Intel Corporation Xeon E5/Core i7 DMI2 [8086:3c00] (rev 07)
IOMMU Group 1:
        00:01.0 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a [8086:3c02] (rev 07)
IOMMU Group 2:
        00:01.1 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1b [8086:3c03] (rev 07)
IOMMU Group 3:
        00:02.0 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a [8086:3c04] (rev 07)
IOMMU Group 4:
        00:03.0 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode [8086:3c08] (rev 07)
IOMMU Group 5:
        00:05.0 System peripheral [0880]: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)
IOMMU Group 6:
        00:05.2 System peripheral [0880]: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors [8086:3c2a] (rev 07)
IOMMU Group 7:
        00:05.4 PIC [0800]: Intel Corporation Xeon E5/Core i7 I/O APIC [8086:3c2c] (rev 07)
IOMMU Group 8:
        00:11.0 PCI bridge [0604]: Intel Corporation C600/X79 series chipset PCI Express Virtual Root Port [8086:1d3e] (rev 05)
IOMMU Group 9:
        00:16.0 Communication controller [0780]: Intel Corporation C600/X79 series chipset MEI Controller #1 [8086:1d3a] (rev 05)
IOMMU Group 10:
        00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
IOMMU Group 11:
        00:1a.0 USB controller [0c03]: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2 [8086:1d2d] (rev 05)
IOMMU Group 12:
        00:1b.0 Audio device [0403]: Intel Corporation C600/X79 series chipset High Definition Audio Controller [8086:1d20] (rev 05)
IOMMU Group 13:
        00:1c.0 PCI bridge [0604]: Intel Corporation C600/X79 series chipset PCI Express Root Port 4 [8086:1d16] (rev b5)
IOMMU Group 14:
        00:1c.2 PCI bridge [0604]: Intel Corporation C600/X79 series chipset PCI Express Root Port 3 [8086:1d14] (rev b5)
IOMMU Group 15:
        00:1d.0 USB controller [0c03]: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 [8086:1d26] (rev 05)
IOMMU Group 16:
        00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev a5)
IOMMU Group 17:
        00:1f.0 ISA bridge [0601]: Intel Corporation C600/X79 series chipset LPC Controller [8086:1d41] (rev 05)
        00:1f.2 SATA controller [0106]: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller [8086:1d02] (rev 05)
        00:1f.3 SMBus [0c05]: Intel Corporation C600/X79 series chipset SMBus Host Controller [8086:1d22] (rev 05)
IOMMU Group 18:
        03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF108GL [Quadro 600] [10de:0df8] (rev a1)
        03:00.1 Audio device [0403]: NVIDIA Corporation GF108 High Definition Audio Controller [10de:0bea] (rev a1)
IOMMU Group 19:
        04:00.0 3D controller [0302]: NVIDIA Corporation GF110GL [Tesla M2090] [10de:1091] (rev a1)
        04:00.1 Audio device [0403]: NVIDIA Corporation GF110 High Definition Audio Controller [10de:0e09] (rev a1)
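
A listing like the one above can be produced by walking /sys/kernel/iommu_groups, and the bridge a device sits behind is the parent directory of its resolved sysfs path. The following is a minimal sketch under those assumptions. list_iommu_groups and parent_bridge are hypothetical helper names, the output prints bare PCI addresses rather than lspci descriptions, and the optional directory arguments exist only so the functions can be pointed at a test tree.

```shell
# list_iommu_groups [ROOT]
# Print every IOMMU group and the PCI addresses of its member devices.
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for group in $(ls "$root" | sort -n); do
        echo "IOMMU Group $group:"
        for dev in "$root/$group"/devices/*; do
            [ -e "$dev" ] || continue   # skip groups with no devices
            echo "        ${dev##*/}"
        done
    done
}

# parent_bridge ADDR [SYSFS]
# Print the PCI address of the bridge directly upstream of ADDR.
parent_bridge() {
    sysfs=${2:-/sys}
    # The devices entry is a symlink into the PCI hierarchy; once it is
    # resolved, the parent directory is the upstream bridge.
    real=$(readlink -f "$sysfs/bus/pci/devices/$1")
    basename "$(dirname "$real")"
}

# Example (not run here):
#   list_iommu_groups
#   parent_bridge 0000:04:00.0
```

In the listing above, the Tesla and its audio function are alone in group 19, so no bridge shares their group. As far as the VFIO documentation describes it, bridges stay with the kernel’s own port driver and are not added to the vfio-pci ids.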

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz root=UUID=ef2ecb3b-8e9a-4b20-bf15-47e0c7c98a1f intel_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci ids=10de:1091,10de:0e09,1028:0496 vfio_iommu_type1.allow_unsafe_interrupts=1 video=vesafb:off,efifb:off video=efifb:off

I silenced the “kernel: evbug Dev: input” messages by blacklisting the evbug module in the kernel boot args, adding modprobe.blacklist=evbug.

I can see the nouveau messages better now, including those for my GPU of interest:

[ 2.012573] nouveau 0000:03:00.0: NVIDIA GF108 (0c1c00a1)
[ 2.248271] nouveau 0000:03:00.0: bios: version 70.08.c1.00.02
[ 2.250175] nouveau 0000:03:00.0: fb: 1024 MiB DDR3
[ 3.560162] nouveau 0000:03:00.0: DRM: VRAM: 1024 MiB
[ 3.560164] nouveau 0000:03:00.0: DRM: GART: 1048576 MiB
[ 3.560168] nouveau 0000:03:00.0: DRM: TMDS table version 2.0
[ 3.560170] nouveau 0000:03:00.0: DRM: DCB version 4.0
[ 3.560172] nouveau 0000:03:00.0: DRM: DCB outp 00: 02000300 00000000
[ 3.560174] nouveau 0000:03:00.0: DRM: DCB outp 01: 01000302 00020030
[ 3.560176] nouveau 0000:03:00.0: DRM: DCB outp 02: 028113a6 0f220010
[ 3.560178] nouveau 0000:03:00.0: DRM: DCB outp 03: 02011362 00020010
[ 3.560180] nouveau 0000:03:00.0: DRM: DCB conn 00: 00001030
[ 3.560182] nouveau 0000:03:00.0: DRM: DCB conn 01: 00010146
[ 3.561127] nouveau 0000:03:00.0: DRM: MM: using COPY0 for buffer copies
[ 3.786940] nouveau 0000:03:00.0: DRM: allocated 1920x1080 fb: 0x60000, bo (ptrval)
[ 3.787002] fbcon: nouveaudrmfb (fb0) is primary device
[ 4.174695] nouveau 0000:03:00.0: fb0: nouveaudrmfb frame buffer device
[ 4.188204] [drm] Initialized nouveau 1.3.1 20120801 for 0000:03:00.0 on minor 0
[ 4.188441] nouveau 0000:04:00.0: enabling device (0000 -> 0003)
[ 4.189197] nouveau 0000:04:00.0: NVIDIA GF110 (0c8880a1)
[ 4.334603] nouveau 0000:04:00.0: bios: version 70.10.46.00.01
[ 4.453139] nouveau 0000:04:00.0: fb: 5376 MiB GDDR5
[ 4.532067] nouveau 0000:04:00.0: DRM: VRAM: 5376 MiB
[ 4.532249] nouveau 0000:04:00.0: DRM: GART: 1048576 MiB
[ 4.532835] nouveau 0000:04:00.0: DRM: TMDS table version 2.0
[ 4.533511] nouveau 0000:04:00.0: DRM: DCB version 4.0
[ 4.534182] nouveau 0000:04:00.0: DRM: DCB outp 00: 02000300 00000000
[ 4.534784] nouveau 0000:04:00.0: DRM: DCB conn 00: 00000000
[ 4.536729] nouveau 0000:04:00.0: DRM: MM: using COPY0 for buffer copies
[ 4.538512] [drm] Initialized nouveau 1.3.1 20120801 for 0000:04:00.0 on minor 1

dmesg | grep 04:00.0
[ 0.811773] pci 0000:04:00.0: [10de:1091] type 00 class 0x030200
[ 0.811797] pci 0000:04:00.0: reg 0x10: [mem 0xf4000000-0xf4ffffff]
[ 0.811813] pci 0000:04:00.0: reg 0x14: [mem 0xd8000000-0xdfffffff 64bit pref]
[ 0.811829] pci 0000:04:00.0: reg 0x1c: [mem 0xe0000000-0xe1ffffff 64bit pref]
[ 0.811841] pci 0000:04:00.0: reg 0x24: [io 0x7000-0x707f]
[ 0.811852] pci 0000:04:00.0: reg 0x30: [mem 0xf5000000-0xf507ffff pref]
[ 0.811864] pci 0000:04:00.0: enabling Extended Tags
[ 0.811886] pci 0000:04:00.0: Enabling HDA controller
[ 0.910129] pci 0000:04:00.1: D0 power state depends on 0000:04:00.0
[ 1.265912] pci 0000:04:00.0: Adding to iommu group 19

Vfio* only has one message:

dmesg | grep -i vfio
[ 1.577009] VFIO - User Level meta-driver version: 0.3

So how can I find out more about what it’s doing?

If I blacklist the nouveau module, no other driver binds to the card, so I’m trying to focus on getting vfio-pci to work on something. Should I be seeing any more messages in dmesg for vfio?
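
One way to learn more than dmesg shows is to look at sysfs directly: the driver symlink under a PCI device names whatever is bound to it, and writing to driver_override followed by drivers_probe asks the kernel to bind a specific driver at runtime. Below is a minimal sketch of both steps. bound_driver and rebind_to_vfio are hypothetical helper names, the optional second argument only redirects the functions to a test tree, and the rebind needs root and should not be pointed at the card driving your display.

```shell
# bound_driver ADDR [SYSFS]
# Print the name of the driver bound to a PCI device, or "none".
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -e "$link" ]; then
        basename "$(readlink -f "$link")"
    else
        echo "none"
    fi
}

# rebind_to_vfio ADDR [SYSFS]
# Ask the kernel to hand one device to vfio-pci at runtime.
rebind_to_vfio() {
    sysfs=${2:-/sys}
    dev="$sysfs/bus/pci/devices/$1"
    # Only vfio-pci may claim this device from now on.
    echo vfio-pci > "$dev/driver_override"
    # Release the device from its current driver, if it has one.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$1" > "$dev/driver/unbind"
    fi
    # Re-run driver matching for this device.
    echo "$1" > "$sysfs/bus/pci/drivers_probe"
}

# Example (not run here, as root):
#   bound_driver 0000:04:00.0
#   rebind_to_vfio 0000:04:00.0 && bound_driver 0000:04:00.0
```

If the manual rebind works but the boot-time binding does not, that points at how the ids reach vfio-pci during boot rather than at the driver itself.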

I finally gave up and tried it on my Proxmox partition, and vfio-pci worked. First progress in about a week of working on it. Now I’ll try installing on a fresh Kubuntu partition to see if that works, or if something I installed, like Docker, was grabbing the GPU early during boot.

vfio-pci ids=10de:1091,10de:0e09

That caught my eye. There’s a space in there, and that might be causing the issue: the kernel expects the module name and the parameter joined by a dot, so vfio-pci.ids=... should work. Since you changed your setup, maybe you can check the kernel parameters to see how the device IDs are specified in the setup that works for you.

I’ve had similar issues with my VFIO setups as well, so if it turns out to be caused by something as simple as that, then don’t worry; it happens to the best of us.
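
To check a command line for the stray space described above, here is a minimal sketch that scans the tokens for a properly attached ids parameter. check_vfio_ids is a hypothetical helper name, not an existing tool. It accepts both vfio-pci.ids= and vfio_pci.ids=, since the kernel treats dashes and underscores in parameter names as equivalent.

```shell
# check_vfio_ids "CMDLINE"
# Report whether the vfio-pci device IDs are attached to the module name.
check_vfio_ids() {
    cmdline=$1
    # Unquoted on purpose: split the command line on whitespace.
    for tok in $cmdline; do
        case $tok in
            vfio-pci.ids=*|vfio_pci.ids=*)
                echo "ok: ${tok#*=}"
                return 0 ;;
            ids=*)
                echo "suspect: bare '$tok' is not attached to a module name"
                return 1 ;;
        esac
    done
    echo "missing: no vfio-pci.ids= parameter found"
    return 1
}

# Example (not run here):
#   check_vfio_ids "$(cat /proc/cmdline)"
```

Run against the command lines posted earlier in the thread, this sketch would flag ids=10de:1091,10de:0e09 as a bare token, which matches the diagnosis above.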
