The problem:
On the first boot of the VM after the physical host starts, the card shows up in nvidia-smi and other tools and applications can use it without issue, provided the VM does NOT reboot. After the first reboot (the second or so, if I’m lucky) of the VM that has the GPU assigned, the issues start, and the only “resolution” is to reboot the ESXi host. Right now I suspect the GPU may simply be on its way out, but I’d like to be certain if at all possible.
Once triggered, the Nvidia Quadro P4000 doesn’t show up in “nvidia-smi” output or in other tools that poll the GPU, and it is not usable by any application (Plex, Folding, etc.), even though it still shows up in lspci as a device, so the VM at least knows it’s there and sees it. The issue does not happen with a Quadro M2000, although I have a different issue with that card where it is sometimes not detected between host reboots, but that’s not much of a problem for me. I also have other PCIe cards (NICs) passed through to other VMs, with no issues on those either.
Host System Specs:
Mobo: Gigabyte C246-WU4
CPU: Intel Xeon E-2278G
Memory: 128GB DDR4 ECC memory
Hypervisor OS: ESXi 8.0a [just upgraded from 7.0u3i]
VM OS tested: Ubuntu 20.04 and 22.04, Windows Server 2019
What I’ve tried/additional info:
- ESXi 7.0u3i and 8.0a - Same issue across both, fully patched (no change)
- Persists across VMs but the card still shows up in ESXi as Passthru and in lspci on the VMs
- I’ve tried clean/fresh installs of VMs (no change)
- I’ve removed it from the VM, toggled passthru off and back on, and re-added the card to the VM (no change)
- I’ve tried passing the card through as both DirectPath IO and Dynamic DirectPath IO (no change)
- I’ve disabled the SVGA adapter and made other parameter modifications (no change)
- I’ve tried BIOS and EFI on multiple VMs (no change)
- Tried different Nvidia drivers, rolling back, etc. (no change)
Debug output:
~$ sudo dmesg | grep NVRM
[ 16.415298] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.78.01 Mon Dec 26 05:58:42 UTC 2022
[ 16.899405] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 16.899778] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 27.781972] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 27.782157] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 27.911073] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[ 27.911477] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
~$ sudo nvidia-smi
No devices were found
~$ lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01)
02:01.0 SATA controller: VMware SATA AHCI controller
03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000] (rev a1)
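
For anyone who wants to compare against their own setup, the three checks above (dmesg, nvidia-smi, lspci) can be wrapped into a few small shell helpers. This is only a rough sketch for a Linux guest: the function names are made up for illustration, and it assumes grep, cut, tr, lspci, and nvidia-smi are installed and that sudo is available. It only summarizes the state; it does not fix anything.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of any tool.

# Count "RmInitAdapter failed" lines in kernel log text read from stdin.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_rm_init_failures() {
    grep -c 'RmInitAdapter failed' || true
}

# Print the PCI address (first field) of every NVIDIA device
# found in lspci output read from stdin.
nvidia_pci_addresses() {
    grep -i 'NVIDIA' | cut -d' ' -f1
}

# Run the live checks and print a short summary.
# Not called automatically; invoke it by hand inside the guest.
report_gpu_state() {
    echo "NVIDIA devices seen by lspci: $(lspci | nvidia_pci_addresses | tr '\n' ' ')"
    echo "RmInitAdapter failures in dmesg: $(sudo dmesg | count_rm_init_failures)"
    echo "GPUs listed by nvidia-smi:"
    sudo nvidia-smi -L || true
}
```

After sourcing the file, run `report_gpu_state` once on the first boot and again after a VM reboot. In the failing state shown above, lspci still lists 0b:00.0 while the failure count is non-zero.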