With the recent advancements in Linux gaming (DXVK/Proton, good open source drivers for AMD GPUs) I have started to migrate my gaming over to a Linux workstation. For the most part this works amazingly well using Manjaro on a Threadripper 1950X render box with a Radeon Vega 56. (Gaming is not the primary purpose of this machine.)
However, maybe once or twice a month I see a hard crash, presumably graphics related. It only happens when a (3D accelerated) game is running; the screen displays a solid uniform color (of random hue) and the machine becomes completely unresponsive (no hotkeys, no remote shell access) until hard reset. So I assume it is GPU related.
Googling around I find old advice on disabling automatic power management. But that discussion mentioned kernel version 3.13, so I don’t think this is the same issue.
Some other discussion recommends feeding the graphics card’s two PCIE power plugs from two different PSU rails. I am not an electrical engineer, but I hesitate to try this … I am having nightmares of electrons circling around loops that were never meant to be connected.
Is this a known problem? Would it help to switch from RADV to one of AMD's own Vulkan drivers (AMDVLK or AMDGPU-PRO)? Does anybody have similar issues with Vega under Linux?
Manjaro with XFCE desktop. Kernel version 5.3.6-1.
Vulkaninfo says (among other things):
GPU id : 0 (AMD RADV VEGA10 (LLVM 9.0.0))
So apparently I am running the stock RADV driver.
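For anyone who wants to run the same check on their own machine, here is a minimal sketch that pulls the device name out of a "GPU id" line and tests whether it mentions RADV. The line format is assumed from the output I quoted above (other vulkaninfo versions may print it differently), and the function name parse_gpu_line is just something I made up for illustration.

```python
import re

# Pattern assumed from the single line quoted above:
#   GPU id : 0 (AMD RADV VEGA10 (LLVM 9.0.0))
GPU_LINE = re.compile(r"GPU id\s*:\s*(\d+)\s*\((.*)\)\s*$")

def parse_gpu_line(line):
    """Return (gpu_index, device_name, is_radv), or None if the line does not match."""
    match = GPU_LINE.search(line)
    if match is None:
        return None
    index = int(match.group(1))
    name = match.group(2)
    # RADV is the Mesa Vulkan driver; it puts "RADV" into the device name.
    return index, name, "RADV" in name
```

Feeding it the line from my machine gives index 0, the name "AMD RADV VEGA10 (LLVM 9.0.0)", and True for the RADV check.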
The GUI “About” dialog box has this to say about the GPU:
Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.6-1-MANJARO, LLVM 9.0.0)
Games are mostly older Blizzard games: WoW, Diablo 3, Hearthstone; all through Lutris. The first two use DXVK; Hearthstone, in contrast, seems to use D9VK. Up until a very recent patch, Hearthstone used to be mapped to OpenGL and had abysmal frame rates (but that didn’t matter for this game).
I also explored a few games from Steam using the built-in Proton: Path of Exile and Warframe. I was basically checking out a few of the free-to-play ones, and these two held my attention for a while.
All of the mentioned games except Hearthstone do sporadically crash. All games are rendered with VSYNC on (I value a quieter graphics card more than the lowest possible latency).
Huh, I’m experiencing this exact thing with my Vega 56 on Windows. I have been trying to debug it for a while (nothing appears in Event Viewer), but the fact that you experience it on Linux makes me think this might actually be a firmware or hardware problem :S
Unfortunately the model I own is a reference card (branded by Gigabyte). Yet another reason not to let it run too hot.
If this is independent of OS, then it could indeed be an issue with power (management, or spikes). Any electrical engineers here who can shed light on this idea of powering the GPU from more than one single PSU rail?
The power rail thing is a bit more complicated. Essentially, some PSUs (the multi-rail ones) are internally “compartmentalized”; that is, they have more than one separate transformer, each of which converts electricity from the power grid to some lower voltage for use in the computer.
The alternative is to have only a single transformer bringing down power grid voltage to the highest of the required voltages on the PSU’s output side. In a PC that would be 12 volts, I think. Lower voltages, say 5 volts and 3.3 volts, are then derived from that same power rail in a second step. But remember that my understanding of the matter is limited, and you had better ask somebody who actually knows this stuff.
So this isn’t about the number of cables running to the Vega 56 (which is usually two, because power consumption is well above 150 watts). It is about whether the two cables draw their power from a single transformer or from two different transformers in the PSU.
At least that’s how I understand it. And I don’t dare to try splitting power delivery up between two different source transformers, unless an expert explains to me why that is in fact not problematic. I am afraid of melting my computer in a spectacular fashion.
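A less scary way to look at the power/heat idea first is to log what the card itself reports and see what the last values were before a freeze. Below is a minimal sketch under some assumptions: that the amdgpu driver exposes temp1_input (millidegrees Celsius) and power1_average (microwatts) in a hwmon directory such as /sys/class/drm/card0/device/hwmon/hwmon*, which may be named differently on your system. The function names are hypothetical. Each line is flushed and fsync'ed because the machine hard-locks and buffered output would otherwise be lost.

```python
import os
import time

def read_sensor(path):
    """Return the integer stored in a sysfs sensor file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def sample(hwmon_dir):
    """Read temperature and average power from one hwmon directory.

    File names and units are assumptions (see above); missing files give None.
    """
    temp = read_sensor(os.path.join(hwmon_dir, "temp1_input"))      # millidegrees C
    power = read_sensor(os.path.join(hwmon_dir, "power1_average"))  # microwatts
    return {
        "temp_c": None if temp is None else temp / 1000.0,
        "power_w": None if power is None else power / 1_000_000.0,
    }

def log_samples(hwmon_dir, log_path, count, interval_s=1.0):
    """Append `count` timestamped samples to log_path, syncing after every line."""
    with open(log_path, "a") as log:
        for _ in range(count):
            s = sample(hwmon_dir)
            log.write(f"{time.time():.0f} temp_c={s['temp_c']} power_w={s['power_w']}\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible freeze
            time.sleep(interval_s)
```

To use it, find the hwmon directory of the card (for example with glob.glob on the path pattern above), then call log_samples with a large count while a game is running. After the next crash the tail of the log shows the last temperature and power readings that made it to disk.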
My workstation is using unbuffered ECC DIMMs at their nominal 2400 speed. Gaming is really the only workload that has these seemingly spontaneous freezes with a blank screen of random solid color.
Writing here I remembered that the first thing I had done with this GPU was switching it to its “quiet” BIOS (it is a dual BIOS graphics card, with a “full speed” and a “low noise” setting). Now I flipped it back to full speed mode just to test over the next few days.
Researching a bit more, I found a quirk of some (not all!) dual-BIOS GPUs. It is not always sufficient to flip the physical switch on the card. Some models require the drivers to be uninstalled and re-installed after changing the GPU BIOS. That is under Windows.
I now suspect that the Linux drivers aren’t really prepared (and, consequently, not tested) for the low noise GPU BIOS.
To the person having this problem under Windows: it might help to completely uninstall the AMD drivers (there seems to be a specific tool for that, but I don’t remember what it was called), and then re-install them from scratch. Good luck!
And I keep my fingers crossed for my Linux use case.
Wayland is an experiment I have not yet done. First I had to wait for the next crash after switching the GPU BIOS. The heavy-handed jinxing was effective, so to say, and the machine has crashed in the meantime. So now I am free to try Wayland next.
I have seen a lot of sporadic issues with Blizzard games over the past year due to their launcher forcing different versions of DirectX than the one you actually select. I have recently removed their software completely, so I cannot test anymore, but if I recall correctly there was a setting to force the launcher to run in DX9 mode, which seemed to fix most of the flakiness issues.
Kernel and userland are up to date here. Just switched to Wayland (was a lot easier than I had thought). I will see if I can find more detail on the Blizzard launcher. However, I was also having the described crashes with the Steam games mentioned earlier.
BTW, thanks for all the comments and hints, everybody!