Hello community,
I come to you in great pain.
TLDR: Is it possible to brick/destroy a GPU by switching it from a Docker container to a VM?
My home server setup has killed not one but two GPUs in the last month, and I am now trying to find out why.
I am running Unraid with my single GPU passed through to a VM for some light gaming. I was working on setting the GPU up in a way that lets me switch it manually between the VM and Ollama (I was playing around with local DeepSeek at the time). I got that working, but after switching back to the VM (without rebooting; a few hours of normal operation had passed by then), the VM startup process hung the whole system and I had to force the server off.
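For context, what I mean by "switching manually" is essentially rebinding the card between the host driver (for the Ollama container) and vfio-pci (for the VM) through sysfs. Below is a rough Python sketch of that general mechanism, not my exact script; the PCI address and details are placeholders for illustration.

```python
#!/usr/bin/env python3
"""Rough sketch of a manual GPU driver rebind via sysfs.

Assumptions (placeholders, not my literal setup): the GPU sits at PCI address
0000:01:00.0, the vfio-pci module is already loaded, nothing on the host is
still using the card (Ollama container stopped), and this runs as root.
"""
from pathlib import Path

GPU_ADDR = "0000:01:00.0"                       # placeholder PCI address of the GPU
DEV = Path(f"/sys/bus/pci/devices/{GPU_ADDR}")

def write(path: Path, value: str) -> None:
    """Write a value to a sysfs attribute."""
    path.write_text(value)

def bind_to(driver: str | None) -> None:
    """Detach the GPU from its current driver and hand it to `driver`.

    driver="vfio-pci"  -> prepare the card for VM passthrough
    driver=None        -> clear the override so the host driver (nvidia/amdgpu)
                          can claim it again for the container side
    """
    current = DEV / "driver"
    if current.exists():
        write(current / "unbind", GPU_ADDR)     # release the card from its current driver

    # driver_override pins which driver is allowed to probe this device;
    # writing a bare newline clears the override again
    write(DEV / "driver_override", driver if driver else "\n")

    # ask the PCI core to re-probe the device with the (now overridden) driver
    write(Path("/sys/bus/pci/drivers_probe"), GPU_ADDR)

if __name__ == "__main__":
    bind_to("vfio-pci")    # before starting the VM
    # bind_to(None)        # before handing the card back to the Ollama container
```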
This is when my first GPU became unresponsive at boot (NVIDIA RTX 2060S). At first I was fairly convinced it had died of natural causes, since I had used it for mining in the past. The motherboard was stuck in a boot loop, with the indicator LED pointing to an issue with the VGA adapter. EDIT: Unraid boots normally but does not detect the GPU.
Not making much of it (given the old card's history), I picked up another budget GPU (RX 7600 XT) so I could continue working, only to have it die in a very similar way: again, starting the VM (after having used the GPU in the Ollama container) made the server unresponsive, forcing me to hard-reset it too.
This is how GPU 2 became unresponsive at boot.
I was then fairly convinced it might have something to do with the vBIOS, because of the symptoms: both cards spin their fans up during boot, stop after about 5 s, and are nowhere to be found on the system (undetected in BIOS, Windows, Linux, and DOS). So I looked into reflashing them while attached as a secondary GPU, but of course neither card is recognized there either. The kicker: the new RX 7600 XT has a switchable dual BIOS, and switching it did nothing! That's when I decided to send the new card back while it was still under warranty rather than mess around with it. I am currently debugging the old card for electrical issues, since reflashing the vBIOS directly onto the chip did not change the situation, except that the fans no longer spin up during boot.
I am now quite reluctant to buy yet another GPU only to sacrifice it.
I would like to know whether the setup I have described can even brick cards this way, or whether I need to look for issues elsewhere.
I am out of ideas at this point, so I am really looking forward to some input.
Specs:
AMD Ryzen 9 5900X 12-Core
ASUSTeK COMPUTER INC. TUF GAMING B550M-PLUS (WI-FI)
64 GB DDR4 ECC RAM
LSI SAS2308 PCI-Express Fusion-MPT SAS-2
Renesas Electronics Corp. uPD720201 USB 3.0 Host Controller
Custom case mod using 3D-printed parts that are not PCIe compliant with regard to frame grounding
Doubled-up GPU riser cable