Does your RTX 5090 (or RTX 50 series in general) have the reset bug in VM passthrough?

Just got my two 5090s and passed them through in Proxmox. After the initial excitement, I found that my host became unresponsive. Further debugging showed that the host CPU soft-locked after an FLR timeout, which happens after a Linux VM shuts down. I had no such issue with my previous 4080. Both of my cards show the reset bug, so it could be a driver issue.

I’d like to see if anyone has had a similar experience, or no issue at all, with their RTX 50 series. The RTX 50 launch has been a disaster in both supply and software reliability, and Nvidia really needs to turn things around.

Environment:
Proxmox 8.3.5
Nvidia driver 570.124.06 (open kernel modules) in the VM
Guest OS: Ubuntu 24.04.2 / Debian 12.10

I am running into similar issues with PCIe passthrough of an RTX 5070 to Linux VMs. I’m running Manjaro as the host OS in my case, but it sounds like the same issue. The 5070 resets properly when shutting down or restarting a Windows 10 VM with ROM BAR disabled and the latest 576.02 Windows drivers from Nvidia. No such luck with a Linux VM, though. Sometimes the Linux VM restarts successfully, but other times I see “not ready 65535ms after FLR; giving up” errors in the host’s dmesg output, and the card is hung so hard it takes a power cycle to straighten it out.
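
In case it helps anyone watch for the failure live, the host-side errors land in the kernel log, so something like this (standard kernel tooling, nothing GPU-specific) catches them as they happen:

    sudo dmesg -w | grep -i -e flr -e "not ready"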

So far I have tried the latest 575.51.02 beta drivers from Nvidia and the matching release of the open GPU modules from GitHub in the Linux VM, with no difference. I have also tried messing around with various Nvidia driver module parameters, again with no difference.

Sadly, it looks to me like PCIe passthrough of consumer-grade GPUs is dying out. The latest Nvidia and AMD cards both have serious reset bugs, and the fact that nobody responded to your post for over three weeks until I started looking for answers myself tells me that most people have given up on it.

Windows VMs do work for me, so in principle Linux should also work, but only if there’s enough demand to motivate Nvidia to fix the bugs.

5060 Ti over here. I can’t even get the TianoCore boot screen to show up.

VBIOS patched or not, all I get is a black screen.
I’m using Proxmox, and my only “solution” was to install the Nvidia driver on Proxmox itself and untick ROM-Bar on the VM that will use the GPU. When the Nvidia driver loads in the (previously prepared) VM, I get an image. This happens with both single-GPU and dual-GPU (iGPU) setups.
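
For anyone on Proxmox wanting to try the same workaround, the relevant VM config line looks roughly like this (a sketch; the VMID and PCI address are placeholders for your own):

    # /etc/pve/qemu-server/<vmid>.conf -- pass the GPU with ROM-Bar unticked
    hostpci0: 0000:01:00,pcie=1,rombar=0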

I compiled some kernels to try on Proxmox, and anything above 6.11 gives me an instant guest VM crash when Chromium-based browsers try to play certain videos (the same thing happens with a 4060 Ti). So something is definitely deeply broken.

When rebooting or shutting down the guest VM, you have almost a 50% chance of a host crash.

What a sh*t show …


also how is it being rich, and knowing you’re part of the elite few to get a top tier GPU before ww3 kicks off and we all wind up gaming on mud puddles in our irradiated hovels trading bottlecaps for boxes of macaroni and cheese?

the peak of human gaming graphics technology, just for you, right before it’s all GONE :rofl:

Just got my 5080 up and running in my Windows VM, and I can reboot it as many times as I want without hitting the reset bug. It performs horribly in games, though; utilization is very low, but I am pretty sure that’s because of how few PCIe lanes were ultimately assigned to my GPU:

$ sudo lspci -s 42:00.0 -vvvvv | grep LnkSta
		LnkSta:	Speed 16GT/s (downgraded), Width x4 (downgraded)

Other than that, because I am somehow restricted to 24.04 for corporate reasons, it was an immense pain in the butt to get the card bound to the NVIDIA driver in the first place. After several hours of troubleshooting, I got it to work by disabling PCI reallocation:

GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc=off"
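
(For anyone following along: that line goes in /etc/default/grub, and on Ubuntu it only takes effect after regenerating the GRUB config and rebooting; standard steps, nothing 5080-specific:)

    sudo update-grub      # regenerates /boot/grub/grub.cfg
    sudo reboot
    cat /proc/cmdline     # after reboot, confirm pci=realloc=off is present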

Now let’s see if I can assign the proper amount of PCIe lanes on my TRX50-Sage-WiFi (right now the card is in slot 4/Gen4 for clearance reasons); hopefully once the waterblocks are on, it can work as expected in the 2nd slot.

How is utilization under a synthetic load or benchmark? Gaming performance can be bad for many other reasons, e.g. VBS making the game run under nested virtualization. If I read this correctly, you have PCIe 4.0 x4, which is a limitation, but it shouldn’t be completely terrible.

Following this thread since I was interested in adding a 5060ti to supplement my 3060 which has been working great for passthrough. Hope other users can add fail/success reports…

I haven’t tested synthetic benchmarks yet, only games; I’ll give it a try later today.
Older games like Battlefield 4 run pretty well (some FPS dips, but very playable). Newer titles are another story: in Lies of P I can’t get past 60 FPS even on low settings, and Elden Ring is the same, with lows of 2 FPS and an average of 40 at max settings without RT.

I have pinned 16 of my CPU threads (same L3 cache group). I haven’t dug into how Windows treats it as a VM yet, but I’ll give it a try; the only thing I’m doing to hide virtualization is disabling the KVM state with:

    <kvm>
      <hidden state="on"/>
    </kvm>

EDIT: Your suggestion was right! It was VBS. I added the following lines and now it works much better, even at x4! (Disabling svm in the guest CPU removes nested virtualization, so Windows can’t enable VBS.) Before, my GPU sat at about 60-70% utilization with the CPU at 30-50%, so neither a CPU bottleneck nor storage (NVMe passthrough) was the problem. Now I get a steady 60 FPS in Elden Ring at 3840x1600 with max settings + max RT (some drops when loading scenes, but lowering RT fixes them), and 144+ FPS in Lies of P, where before I was getting nowhere close.


    <hyperv mode="passthrough">
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
  <cpu mode="host-passthrough" check="full" migratable="on">
    <topology sockets="1" dies="1" cores="8" threads="2"/>
    <cache mode="passthrough"/>
    <feature policy="require" name="topoext"/>
    <feature policy="disable" name="svm"/>
    <feature policy="require" name="hypervisor"/>
  </cpu>

I am happy to share any more of my experience with the 5080 Passthrough.

Current Build is:

Ubuntu 24.04 Kernel 6.8.0-52, Secure Boot Enabled
Looking-Glass B7
Motherboard - TRX50 Sage WiFi
CPU: AMD Threadripper 7980X
Host GPU: NVIDIA RTX A2000 12GB (2nd slot) - NVIDIA Open Drivers 570.133.07
Guest GPU: NVIDIA RTX 5080 - Windows Drivers 576.02

BIOS Settings:

CSM Enabled
Resizable BAR Disabled
SR-IOV Disabled
SVM Enabled

Relevant GRUB Lines:

GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc=off amd_iommu=on iommu=pt vfio-pci.ids=10de:2c02,10de:22e9 vfio_iommu_type1.allow_unsafe_interrupts=1 "

/etc/modprobe.d/vfio.conf

softdep nvidia pre: vfio-pci
softdep nouveau pre: vfio-pci
options vfio-pci ids=10de:2c02,10de:22e9
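
(For completeness: on Ubuntu the modprobe config gets baked into the initramfs, so after editing vfio.conf you generally need to regenerate it and reboot, then confirm the card is actually bound to vfio-pci:)

    sudo update-initramfs -u
    lspci -nnk -s 42:00.0     # "Kernel driver in use" should read vfio-pci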

Relevant Virsh config:

  <features>
    <acpi/>
    <apic/>
    <hyperv mode="passthrough">
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
    <smm state="on"/>
  </features>
  <cpu mode="host-passthrough" check="full" migratable="on">
    <topology sockets="1" dies="1" cores="8" threads="2"/>
    <cache mode="passthrough"/>
    <feature policy="require" name="topoext"/>
    <feature policy="disable" name="svm"/>
    <feature policy="require" name="hypervisor"/>
  </cpu>

Default Settings for GPU passthrough (ROM BAR Enabled)

<hostdev mode="subsystem" type="pci" managed="yes">
  <driver name="vfio"/>
  <source>
    <address domain="0x0000" bus="0x42" slot="0x00" function="0x0"/>
  </source>
  <alias name="hostdev0"/>
  <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
</hostdev>

Thanks a lot!


I just confirmed that 570.133.07 has worked well for nearly a month without the reset issue. 570.133.20 also works, but I haven’t fully tested it. I think Nvidia has fixed at least part of the problem in the 570.133 releases.

I don’t have the NV driver enabled on the host, so I’m not sure if my experience applies to yours. Something worth noting: I encountered a strange CUDA bug related to the UVM driver with both 570.124.06 and 570.133.20, but not 570.133.07. The only difference is that the first two were installed through the APT repository, while 133.07 came from the standalone installer. I suggest you try drivers from different sources (I know both of them are official Nvidia channels!) to see if anything changes.
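
(A quick way to double-check which driver build a guest actually loaded, regardless of install source:)

    cat /proc/driver/nvidia/version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader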

Note that if a working driver is uninstalled, any shutdown will lead to a failed reset until you do a clean shutdown with a working driver installed again.

I got my RTX 5080 this week and had almost no issues with passthrough, but I have only tested Windows VMs. The only problem I had was apparently caused by an Nvidia driver issue: sometimes after a Windows reboot I got a grey screen. The 576.28 driver changelog lists that as fixed, so hopefully this is a thing of the past now.

I’m using an almost default libvirt VM config. I only added some CPU pinning with host-passthrough, SMBIOS host data, and passthrough of the SATA and USB controllers and an NVMe disk. No hiding of KVM or the hypervisor.

I also don’t have issues with a downgraded PCIe link state. At idle it goes down to 2.5 GT/s, but when I use the GPU it comes back up to 32 GT/s.

Currently I load the vfio module at boot (a minimal sketch of that below). I haven’t tried using the GPU on the Linux host with PRIME yet.
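
(On a systemd-based distro that can be as simple as a modules-load.d entry; a sketch, with the device IDs as placeholders for your own card’s output from lspci -nn:)

    # /etc/modules-load.d/vfio.conf
    vfio-pci

    # /etc/modprobe.d/vfio.conf
    options vfio-pci ids=10de:2c02,10de:22e9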

Kernel: 6.13.8


I have a 5070 Ti that cannot reset in a Linux VM but resets fine in a Windows VM.
Very frustrating, as I wanted to use the GPU for local LLMs when not gaming on the VFIO VM. Guess I have to do that on the Windows VM for now.

Anyway, I opened a GitHub issue here:

(My host is Fedora 42. The reset failed on Fedora Server 41/42 and Debian 12 VMs.)


This thread isn’t motivating.
I currently have two 7900 XTXs and can switch between Linux with two GPUs for local LLMs and Windows with one GPU, and the GPU reset always works.
To get better GPU scaling, I now want to install the AMD GPUs in an EPYC system and buy a 5090 for the desktop instead.
Does anyone have a 5090 that works flawlessly with Windows and Linux VMs and also has no black-screen problem when switching between OSes with a KVM?

2x RTX 5090 on a WRX90 system. No issues with passthrough. Time Spy: 13390, Steel Nomad: 13599. I don’t really play any games other than Flight Sim 2024, so I have no idea if these are good results. I could only get it to work by loading vfio at boot and using a script to switch the card between host and VM.
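
(For anyone curious, the usual shape of such a switch script is something like this; a sketch with a placeholder PCI address, not my exact setup:)

    #!/bin/bash
    # Hand the GPU from its host driver to vfio-pci via sysfs (run as root).
    GPU=0000:41:00.0
    modprobe vfio-pci
    echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind 2>/dev/null
    echo vfio-pci > /sys/bus/pci/devices/$GPU/driver_override
    echo "$GPU" > /sys/bus/pci/drivers_probe
    # To hand it back to the host: unbind again, clear the override,
    # then reprobe so the default driver claims it:
    #   echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind
    #   echo > /sys/bus/pci/devices/$GPU/driver_override
    #   echo "$GPU" > /sys/bus/pci/drivers_probe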


Do you use Looking Glass or a KVM switch to move between the systems?
The reason I ask: Wendell had problems switching between Windows and Linux in his video about the RTX 6000, which he attributed to driver problems, but maybe I misunderstood that.
Which 5090 do you have?

I pass my 6000 Blackwell through to VMs with VFIO without any issues. Not doing anything fancy either. Using the 576 driver on Windows and 575 on Linux.


I don’t have any Linux VM issues with my RTX 5080 either. I can freely switch between my Windows and Linux VMs. My Linux VMs are running Arch Linux with the current kernel (6.14.6) and nvidia-open (570.144).

What distro is your VM host running? I will try installing an Arch VM later (never used Arch before, lol)

I can’t deploy a Windows 11 VM with a 5080 in passthrough on Proxmox (PVE). Can somebody point me to a how-to for PVE and the host setup?
I’ve read instructions in forums, videos, and the PVE how-to, but I can’t get it working.

When I did the setup (written up here in a forum thread), I had to set the audio as a multifunction device in KVM or else I had trouble, FWIW. I also needed the Hyper-V helpers.

I always do it this way; I remember that AMD ROCm support recommends it, because the driver expects the subdevice to be on the same bus.
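
(For anyone who hasn’t set this up before, the multifunction layout in libvirt XML looks roughly like this; a sketch with placeholder addresses, GPU on function 0 with multifunction on and the audio device on function 1 of the same guest slot:)

    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x42" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0" multifunction="on"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x42" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x1"/>
    </hostdev>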

Just tried passing the GPU through to an Arch VM (with UEFI). It resets properly now.
I also tested on a new Fedora Server 42 machine (also with UEFI) and it works too.

Maybe using BIOS instead of UEFI is what was causing the reset issue? :person_shrugging:
(Note: remember to disable Secure Boot in the UEFI firmware menu of the VM.)
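
(If anyone wants to verify their guests the same way, this just checks for the EFI sysfs directory inside a Linux guest:)

    [ -d /sys/firmware/efi ] && echo UEFI || echo BIOS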
