Nvidia Passthru GPU Stops Working After VM Reboot

The problem:
On the first boot of the physical host, the card shows up under nvidia-smi and other tools, applications can use it, and there is never an issue as long as the VM does NOT reboot. On the first (if I’m lucky, second or so) reboot of the VM that has the GPU assigned, the issues start, and the only “resolution” is to reboot the ESXi host. Right now I suspect the GPU may simply be on its way out, but I’d like to be certain if at all possible.

Once triggered, the Nvidia Quadro P4000 no longer shows up in “nvidia-smi” output or in other tools that poll the GPU, and it is not usable by any application (Plex, Folding, etc.), even though it still appears in lspci, so the VM at least knows the device is there. The issue does not happen with a Quadro M2000, although I have a different problem with that card where it is sometimes not detected between host reboots; that one isn’t much of an issue for me. I also have other PCIe cards (NICs) passed through to other VMs, with no issues there either.

Host System Specs:
Mobo: Gigabyte C246-WU4
CPU: Intel Xeon E-2278G
Memory: 128GB DDR4 ECC memory
Hypervisor OS: ESXi 8.0a [just upgraded from 7.0u3i]
VM OS tested: Ubuntu 20.04 and 22.04, Windows Server 2019

What I’ve tried/additional info:

  • ESXi 7.0u3i and 8.0a - Same issue across both, fully patched (no change)
  • Persists across VMs but the card still shows up in ESXi as Passthru and in lspci on the VMs
  • I’ve tried clean/fresh installs of VMs (no change)
  • I’ve removed it from the VM, toggled passthru off and back on, and re-added the card to the VM (no change; a host-side esxcli sketch follows this list)
  • Passing the card through as DirectPath IO and Dynamic DirectPath IO (no change)
  • I’ve disabled the svga adapter and made other parameter modifications (no change)
  • I’ve tried BIOS and EFI on multiple VMs (no change)
  • Tried different Nvidia drivers, rolling back, etc. (no change)
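
For anyone wanting to toggle passthru from the ESXi host shell rather than the UI, I believe it’s roughly the following on 7.0+; the pcipassthru namespace and the placeholder device address are from memory rather than copied off my host, so double-check the exact flags and use the address that the list command reports:

# esxcli hardware pci pcipassthru list
# esxcli hardware pci pcipassthru set --device-id=<host PCI address> --enable=false
# esxcli hardware pci pcipassthru set --device-id=<host PCI address> --enable=true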

Debug output:

~$ sudo dmesg | grep NVRM
[ 16.415298] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.78.01 Mon Dec 26 05:58:42 UTC 2022
[ 16.899405] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 16.899778] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 27.781972] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 27.782157] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 27.911073] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 27.911477] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0

~$ sudo nvidia-smi
No devices were found

~$ lspci

00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01)
02:01.0 SATA controller: VMware SATA AHCI controller
03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000] (rev a1)

I’ve got the same problem with a TrueNAS Core host and a Tesla P4 with Linux driver 535.98.

Same here with FreeBSD 13 and bhyve with a P4000. Did you ever get that sorted?

Unfortunately I never did solve it, but I moved to Proxmox with the same setup and haven’t had a single issue since.

Hmm, thought about that too. Maybe someday I’ll give that a go.

PCI passthrough and GPUs in VMs! It can be so messy! I had to go away to chop some veg for dinner and think back ten years to when I used to do this, although that was CentOS KVM with virt-manager. Once I learned about PCI addresses and started automating VMs with Vagrant, life got a lot easier. So I feel your pain.

I don’t know VMware, but there are some parallels to problems I had back then.

  1. When starting a VM, you had to detach the GPU from the host, then attach it to the VM. I don’t think that is the case here as you can see the GPU with lspci. But maybe the GPU is still attached or allocated to the host in some way.

  2. I don’t think the card is faulty as it works on a clean boot. But could be.

  3. nvidia-smi requires the Nvidia driver to be loaded in the kernel, so a few things there. The lspci output shows another VGA card; if the VM is using that at boot, the Nvidia driver won’t load, so try without that VGA card / disable it in VMware. There is also the nouveau driver in Linux; if that gets loaded, the Nvidia driver again won’t load, so try blacklisting nouveau if you haven’t already (a minimal example is below this list). It might also be some sort of hardware/driver timing issue, which would explain why it sometimes works and sometimes doesn’t, and the different behaviours with different GPUs. That’s harder to track down and fix, but reading through the entire dmesg can be useful to see what loads in which order.
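
For the nouveau blacklisting, the usual Ubuntu approach is something like the below; the file name is just a convention, any .conf under /etc/modprobe.d is read:

~$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
~$ sudo update-initramfs -u
~$ sudo reboot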

My money is on the Nvidia driver not actually loading into the kernel at VM boot. Have a look through the output of lsmod and see which kernel modules are actually loaded, something like the below.
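
This is a rough check from inside the VM; if the nvidia modules are missing, loading one by hand and then reading dmesg usually tells you why it failed (module names are the ones the standard driver packages install):

~$ lsmod | grep -Ei 'nvidia|nouveau'
~$ sudo modprobe nvidia
~$ sudo dmesg | tail -n 20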

After that, is there any way to reset the card hardware in VMware? In a VM, or on the host if you have a terminal and nvidia-smi is working, you can do something like nvidia-smi --gpu-reset (rough example below). Maybe there is something similar in VMware?
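
From a Linux guest the resets I’d try look roughly like this; note that --gpu-reset isn’t supported on every GPU so it may just refuse, and the sysfs remove/rescan is a generic PCI trick rather than anything Nvidia-specific (the 0b:00.0 address matches the lspci output above):

~$ sudo nvidia-smi --gpu-reset -i 0
~$ echo 1 | sudo tee /sys/bus/pci/devices/0000:0b:00.0/remove
~$ echo 1 | sudo tee /sys/bus/pci/rescan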

Good luck.

Edit: just realised the date of your post. Oh well, glad Proxmox worked for you.
