Proxmox EPYC Milan GPU passthrough: Ubuntu VM can't load nvidia module

hi all,

I'm running the latest Proxmox with all non-paid updates on a new Gigabyte 2U box with dual Milan CPUs and NVIDIA GPUs (A40 RTX).

This is my first experience with AMD and Proxmox! I run my other dual CPU/GPU systems with Unraid and Xeon CPUs, and passthrough has always worked fine there.

On proxmox I have done the usual “tutorial” steps:

PROXMOX:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"

01:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation Device 145a
Flags: fast devsel, IRQ 542, NUMA node 3
Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 46000000000 (64-bit, prefetchable) [disabled] [size=64G]
Memory at 48040000000 (64-bit, prefetchable) [disabled] [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] #00 [0080]
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Capabilities: [bb0] #15
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] #26
Capabilities: [d00] #27
Capabilities: [e00] #25
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "options vfio-pci ids=10de:1b81,10de:10f0 disable_vga=1" > /etc/modprobe.d/vfio.conf
update-initramfs -u

I have built VMs with several Ubuntu versions (20.04 LTS, 21.04, and the server editions of both), but nvidia-smi never works…

Any ideas?
I have tried the steps from many posts that apt purge and reinstall things… maybe not the correct ones, though.

Does Proxmox keep some elements of the GPU from the VM?
I like proxmox a lot and I would like to make it work!

k@u2104serv01:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
k@u2104serv01:~$ sudo lspci -s 01:00 -v
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A40] (rev a1)
Subsystem: NVIDIA Corporation Device 145a
Physical Slot: 0
Flags: fast devsel, IRQ 16
Memory at (32-bit, non-prefetchable) [disabled]
Memory at (64-bit, prefetchable) [disabled]
Memory at (64-bit, prefetchable) [disabled]
Capabilities: [60] Power Management version 3
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

k@u2104serv01:~$ lspci -n -s 01:00
01:00.0 0302: 10de:2235 (rev a1)
k@u2104serv01:~$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
k@u2104serv01:~$ grep nouv /etc/modprobe.d/* /lib/modprobe.d/*
/lib/modprobe.d/nvidia-graphics-drivers.conf:blacklist nouveau
/lib/modprobe.d/nvidia-graphics-drivers.conf:blacklist lbm-nouveau
/lib/modprobe.d/nvidia-graphics-drivers.conf:alias nouveau off
/lib/modprobe.d/nvidia-graphics-drivers.conf:alias lbm-nouveau off
k@u2104serv01:~$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
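
One detail in the output above may be worth a quick check: the guest reports the card as 10de:2235, while the vfio.conf line on the host lists ids=10de:1b81,10de:10f0. This may not be the cause, since the host lspci output already shows vfio-pci in use, but the comparison is cheap. A minimal sketch, where ids_cover_device is a hypothetical helper name and the two values are copied from this thread:

```shell
#!/bin/sh
# ids_cover_device IDS DEVICE
# Succeeds (exit 0) if DEVICE (vendor:device, lowercase hex) appears in the
# comma-separated IDS list as written in "options vfio-pci ids=...".
ids_cover_device() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Values copied from this thread.
configured="10de:1b81,10de:10f0"   # from /etc/modprobe.d/vfio.conf on the host
device="10de:2235"                 # from "lspci -n -s 01:00"

if ids_cover_device "$configured" "$device"; then
    echo "vfio.conf lists $device"
else
    echo "vfio.conf does not list $device"
fi
```

With the values from this thread the check takes the second branch, so the ids= list looks like it was copied from a guide written for a different card.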

I don’t have enough time to thoroughly understand your issue, but one thing that catches a lot of people off guard is that when Proxmox is installed on an EFI/GPT system with ZFS, it boots with systemd-boot instead of GRUB, so any traditional GRUB arguments you set are ignored.

See Host Bootloader - Proxmox VE for more details
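
A quick way to see whether the boot arguments reached the running kernel, regardless of bootloader, is to look at /proc/cmdline on the host. A minimal sketch, where cmdline_has is a hypothetical helper name and the options in the loop are just two of the ones from the first post:

```shell
#!/bin/sh
# cmdline_has CMDLINE TOKEN
# Succeeds if TOKEN is one of the space-separated words in CMDLINE.
# Whole-word match, so "iommu=on" does not match "amd_iommu=on".
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Command line of the running kernel; empty if /proc/cmdline is unreadable.
running=$(cat /proc/cmdline 2>/dev/null || true)

for opt in iommu=pt amd_iommu=on; do
    if cmdline_has "$running" "$opt"; then
        echo "active:  $opt"
    else
        echo "missing: $opt"
    fi
done
```

If the options show up as missing on a systemd-boot install, my understanding of the Host Bootloader page is that they belong in /etc/kernel/cmdline, followed by proxmox-boot-tool refresh, but check the page for your version.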

Also, don’t use ACS override unless you actually need it. From what I hear, it can cause practical issues as well as theoretical security issues.

I know it’s a pain in the ass with server boot times being minutes long, but you should seek to only set the minimum boot options you need. Most guides are “the strange hacks that I desperately tried on my particular machine without any scientific methodology” notes, rather than good recommendations.
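
As a rough idea of what a minimal starting point could look like on a GRUB install, here is a sketch. It assumes the AMD IOMMU driver is enabled by default in the kernel, so amd_iommu=on is left out, and it keeps iommu=pt only as an optional extra. Treat it as a baseline to test from, not a known-good line for this board:

```shell
# /etc/default/grub (sketch): start from the smallest set and add back one
# option at a time only if passthrough fails without it.
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# After editing, regenerate the GRUB config and reboot:
#   update-grub
```

Options such as pcie_acs_override, nofb, nomodeset and the video= settings from the first post can then be reintroduced one at a time if something actually breaks without them.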

My install is very basic: no ZFS, just a normal Debian ext4 install. GRUB seems to work, and the IOMMU command worked.