My server murders GPUs (switching Docker <-> VM)

Hello community,

I come to you in great pain.

TLDR: Is it possible to brick/destroy a GPU by switching it between a Docker container and a VM?

My home server setup has killed not one but two GPUs in the last month, and I am now trying to find out why.

I am running Unraid with my single GPU passed through to a VM for some light gaming. I was working on getting my GPU set up in a way that allows me to switch manually between the VM and Ollama (I was playing around with local DeepSeek at the time). I got that to work, but after switching back to the VM (without rebooting; a few hours of normal operation had passed at that point), the VM startup process hung the whole system, so I had to force the server off.
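To clarify what I mean by "switching manually": the rough flow was along these lines (just a sketch with placeholder container/VM names, not the exact commands I ran):

```bash
# 1. Free the GPU on the host side
docker stop ollama           # stop the container that was holding the GPU
nvidia-smi                   # sanity check: no processes should still be using the card

# 2. Hand the card to the VM (Unraid runs its VMs via libvirt/KVM)
virsh start "Windows 10"     # placeholder VM name
```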

This is when my first GPU became unresponsive at boot (NVIDIA RTX 2060S), and at first I was fairly convinced that it died of natural causes because I had used it for mining in the past. The motherboard was stuck in a boot loop, with the indicator LED pointing to an issue with the VGA adapter. EDIT: Unraid boots normally without detecting the GPU.

Not making much of it (given the other card's history), I set out to get another budget GPU (RX 7600 XT) so I could continue working, only to have it die in a very similar way: again, starting the VM (after having used the GPU in the Ollama container) made the server unresponsive, forcing me to reset it too.

This is how GPU 2 became unresponsive at boot.

I was then fairly convinced that it might have something to do with the vBIOS because of the GPU symptoms. Both cards spin their fans up during boot, stop after about 5 s, and are nowhere to be found on the system (undetected in BIOS, Windows, Linux, DOS). So I looked into reflashing them while attached as a secondary GPU, but of course the cards are both still not recognized. The kicker was that the new RX 7600 XT has a switchable dual BIOS, which did nothing! That's when I decided to send the new card back while it was still under warranty and not mess around with it. I am currently debugging the old card for electrical issues, since reflashing the vBIOS directly onto the chip did not change the situation, except that the fans no longer spin up during boot.

I am now quite reluctant to buy yet another GPU only to sacrifice it.

I would now like to know whether anyone has an idea if the system I have described is even able to brick cards that way, or if I need to look for other issues.

I am out of ideas at this point, so I am really looking forward to some input.

Specs:
AMD Ryzen 9 5900X 12-Core
ASUSTeK COMPUTER INC. TUF GAMING B550M-PLUS (WI-FI)
64 GB DDR4 ECC RAM
LSI SAS2308 PCI-Express Fusion-MPT SAS-2
Renesas Electronics Corp. uPD720201 USB 3.0 Host Controller
Custom case mod utilizing 3D-printed parts that are not PCIe compliant regarding frame grounding
Doubled-up GPU riser cable

Do you happen to have another system that you could test the offending GPUs in? Are you able to try them directly in the PCIe slots without riser cables?

  1. Check your PSU, the caps in it, and the caps on the MoBo.
  2. Check the flash on both cards, ideally with a cheap ISP programmer (see the sketch below).
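For point 2, a dump with a cheap CH341A-style programmer and a test clip would look roughly like this (a sketch; it assumes flashrom recognizes the card's SPI chip, and the filenames are just examples):

```bash
# Read the vBIOS SPI flash twice and confirm the reads are consistent
flashrom -p ch341a_spi -r dump1.bin
flashrom -p ch341a_spi -r dump2.bin
sha256sum dump1.bin dump2.bin   # matching hashes = stable dump worth analyzing
```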

First off, thank you for your input.

@stin187 This has been done, yielding the same results; that's why I pronounced both cards at least hard-bricked. Tested on a different system, without a riser, and with a different PSU: undetected in BIOS, Windows, Linux.

@Fallon_Gray I have not checked the mobo or PSU caps but will do that next. I am currently diagnosing the 2060S on a test bench.

But more importantly, and I guess I did not make myself clear enough in the initial post: I am interested to know whether the software switchover in my setup could actually be the reason for the GPU faults. Also keep in mind that I only still have one of the cards in question.


I may be wrong but AFAIK the only way to brick a GPU via software is by corrupting the BIOS.

That's what I thought. The only reason for me to suspect something else is that the dual vBIOS switch on my XFX RX 7600 XT didn't do anything.

OK, maybe I misread something: are you getting a POST error during bootup with the cards, or something else during OS boot? My understanding is that even with missing or bad graphics, a motherboard should POST and make it to the BIOS. Basically, what I'm trying to get at is: is the problem at POST, during the bootloader, or during the OS boot process?

Where it fails determines which of a number of different things the issue could be. And considering you messed with GPU passthrough, I have a feeling it could be software related.

Sorry, my mistake! There was no boot loop; Unraid started normally. I only thought it was looping during POST because I was impatient. I edited the original post to clarify.

Pretty damning that neither the backup BIOS nor trying both GPUs in a different system worked. It would be worth carefully inspecting the GPU slot directly, checking for any debris or damaged/bent pins, especially the pins before the notch, as those carry the power. I'd also consider the PSU a suspect, but I'm not sure of a good way to verify that.

So in the following I'm working under the assumption that neither the vfio-pci driver nor the nvidia or amdgpu driver can brick a GPU BIOS.

OK, so in Linux, when you say the card does not show up, do you mean it's nowhere to be found when using lspci?

OK, so it's been about 2 years since I last did VFIO passthrough using IOMMU groups to pass a GPU through to a VM, so forgive me if I get something wrong here or if something has changed. However:

Theoretically, when you set the card up for VFIO passthrough, you would have had to target its IOMMU group at boot so that the vfio-pci driver gets loaded for the device instead of the nvidia or amdgpu driver. The graphics card cannot be used by two drivers at once.
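On a generic distro (not necessarily how Unraid's GUI does it), that targeting is usually done by handing the card's vendor:device IDs to vfio-pci before the GPU driver can claim them, along these lines (the IDs below are placeholders):

```bash
# Find the [vendor:device] IDs of the GPU and its HDMI audio function
lspci -nn | grep -iE 'vga|audio'

# Either add them to the kernel command line, e.g.
#   vfio-pci.ids=10de:1f06,10de:10f9     (placeholder IDs)
# or set them via a modprobe config file:
echo "options vfio-pci ids=10de:1f06,10de:10f9" > /etc/modprobe.d/vfio.conf
```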

If I recall, once set up, the GPU was effectively not "booted" by the host OS when bound to vfio-pci, so that a guest OS could do this and take control of the GPU using the guest OS's GPU driver. I also recall there was an annoying bug at the time where, if a guest OS crashed, it could leave the card "booted" and in an incorrect state for vfio-pci to hand the device off to the VM on VM restart. This had something to do with how the GPU BIOS interacted with vfio-pci at the time. The way to fix it was to fully reboot the host, which power-cycled the PCIe devices and reset the GPU back to its "unbooted" state.

So this leaves me with another question: how are you handling switching between the vfio-pci and nvidia drivers on the fly?

To leave the GPU in the correct state for vfio-pci, you would have to have the nvidia driver tell the GPU to effectively shut down before handing it over to vfio-pci.
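If it is being done on the fly, it usually boils down to unbinding the card from one driver and rebinding it to the other through sysfs, roughly like this (placeholder bus address; this assumes nothing on the host still has the GPU open):

```bash
GPU=0000:01:00.0   # placeholder PCI address, check with lspci

# Release the card from the host driver (nvidia/amdgpu)
echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind

# Steer it to vfio-pci and let the kernel re-probe it
echo vfio-pci > /sys/bus/pci/devices/$GPU/driver_override
echo "$GPU" > /sys/bus/pci/drivers_probe

# To go back, write the GPU driver's name (or an empty string) into
# driver_override instead and re-probe the same way.
```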

Honestly I can't imagine that happening. The only way would be if you somehow bricked the GPUs while flashing inside the Docker container or VM. In a VM I even think you can pass a vBIOS as a ROM file; that file is not flashed onto the card, but just used INSTEAD of the GPU ROM.

Try to find yet another PC that already works with some GPU and has a PSU capable of supporting one of those GPUs, and test them again. If they don't POST, you usually get told what the error is (e.g. that the graphics card isn't being recognized).

Unless you broke the cards while using them in Docker or a VM (just like you could on a bare-metal system), there's no way they should've bricked.

If you had a mind to, you could read the card BIOS directly off the flash chip and compare it with a known-good file to confirm whether or not it's corrupted.
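Comparing the dump against a known-good image is then just a matter of hashing or byte-diffing the files (filenames are examples):

```bash
sha256sum dump.bin known_good.rom       # do they match at all?
cmp -l dump.bin known_good.rom | head   # if not, list the first differing byte offsets
```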

@Kougar I have not inspected the PCIe sockets, but since the incidents occurred I have used an old K2200 and now a 1050 (pulling power only from the socket) in the server. I have had no problems, but I have also only used it in the VM. Regarding the PSU, I agree that there could be an issue, but I have no idea what to do about it (beyond measuring the PCIe power pins, which are good) other than buy a new one. It's an RM750x from Corsair.

@zmezoo While reading your post I remembered that I might never have managed to switch from the VM to the container, only ever the other way around. So basically I rebooted, tested Docker, switched back, and then it died, or I had to reboot while tinkering because the host became unresponsive. But to be fair, it has been a while since… On another note, I had totally forgotten that I messed around with the power states of the GPU following this spaceinvaderone video, but note that I only used the script at array startup and not hourly as he does (and only for the 2060S).
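As far as I understand it, the gist of that kind of power-state script is just asking the NVIDIA driver to keep the card idling properly, something like (a rough sketch, not the actual script from the video):

```bash
# Keep the driver initialized so the card can drop to a low power state while idle
nvidia-smi --persistence-mode=1

# Check what the card is actually doing
nvidia-smi --query-gpu=power.draw,pstate --format=csv
```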

@IgorDe I am currently sourcing a new test bench, since my junk test PC does not support the GPU at all. The card POSTed once (and only once in a thousand boot attempts) but could not be used because the old mobo showed just "B2". If anyone is interested in the GPU electrical diagnosis, you can follow along in this discord thread.

@wizarddata I saved the BIOS that was on the card after it happened, but could only compare it to the ones in the GPU-Z database (I was not smart enough to save it beforehand). My VBIOS version is 90.06.44.40.AD (VGA Bios Collection: Gigabyte RTX 2060 Super 8 GB | TechPowerUp) … Initially I did not check it, but I have now, and would you look at that, it was corrupted:

The left one is the original.

Sadly this changes nothing for the card, since it is still dead. But it leaves us again with the question of why the BIOS switch on the XFX RX 7600 XT didn't do jack. I have heard somewhere on YT that there are some cards around with only a performance switch, but the XFX website clearly states "Giving a backup BIOS in the case of a failed BIOS update". I was not able to read the XFX BIOS, since removing the shroud would have voided the warranty.

OK, so this is a video on installing and using the nvidia driver to set GPU power states. It has nothing to do with how to set up VFIO for a GPU and use the vfio-pci driver to pass a GPU through to a VM on Unraid.

Next, let me make a disclaimer that I don't use Unraid. However, Unraid is Linux-based, so GPU passthrough should work the same as on any other Linux distro.

My understanding is that Unraid has GUI options to set up GPU passthrough. I'm assuming that when you pick this GUI option, Unraid sets all the IOMMU/VFIO options at the OS level to mark the GPU at boot for loading of the vfio-pci driver. I also understand that, just like on other Linux-based distros, this requires a reboot to take effect.

The thing here is that, without unsetting the above and rebooting, on every subsequent boot your GPU will not have the nvidia driver loaded, just vfio-pci, and it would be unusable in a container on Unraid, since the nvidia driver that containers need to communicate with to use the GPU is not running.
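A quick way to see which world the card currently lives in from the container side (assuming the NVIDIA container runtime is set up; the image tag is just an example):

```bash
# If this prints the GPU, the nvidia driver owns the card and containers can use it.
# If the card is bound to vfio-pci, this fails because no nvidia driver is attached to it.
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```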

Last point here: setting up VFIO and GPU passthrough is not exactly the same with every GPU, and some GPUs require special extra options to be set. I have a hard time imagining that Unraid gets this right every time with the script the GUI is triggering. If GPU passthrough is not set up correctly, a few things can happen:

  1. If the script got the IOMMU/vfio-pci part correct, Unraid will not have loaded the nvidia driver, meaning fan control, power control, and use of the GPU with containers running on Unraid will not work.
  2. If an option that is required for the GPU to work with a VM was not set, the VM will fail to acquire/use the GPU, with a variety of different errors.

Again I will ask: from Unraid, have you opened a terminal session and looked at the output of lspci? Does the GPU show up in the output? If it's there, try lspci -v: what driver is Unraid currently using with the GPU?
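Concretely, something along these lines (the bus address is a placeholder and will differ on your box):

```bash
lspci -nn | grep -iE 'vga|3d'   # is the card enumerated on the bus at all?
lspci -nnk -s 01:00.0           # "Kernel driver in use:" shows nvidia/amdgpu vs vfio-pci
```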

Since you were able to read the old BIOS off, how about writing a known-good version back? It won't address the question of how it happened in the first place, but it might bring your cards back.
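With the same kind of hardware programmer used for reading, a write would look roughly like this (filenames are examples; the image has to match the exact board revision):

```bash
# Flash a known-good image, then read it back and verify
flashrom -p ch341a_spi -w known_good.rom
flashrom -p ch341a_spi -r verify.bin
sha256sum known_good.rom verify.bin   # hashes should match after the write
```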

I did that with a hardware programmer (the same way I read the BIOS out, since the card is not detectable), and the card remains dead. At the moment I am debugging why it is not getting any PEX or memory voltage.
