Vega 56 Sporadic Crashes (Manjaro/Wine/Proton Gaming)

Hi Level1 Community,

With the recent advancements in Linux gaming (DXVK/Proton, good open source drivers for AMD GPUs) I have started to migrate my gaming over to a Linux workstation. For the most part this works amazingly well using Manjaro on a Threadripper 1950X render box with a Radeon Vega 56. (Gaming is not the primary purpose of this machine.)

However, maybe once or twice a month I see a hard crash, presumably graphics related. It only happens when a (3D accelerated) game is running; the screen displays a solid uniform color (of random hue) and the machine becomes completely unresponsive (no hotkeys, no remote shell access) until hard reset. So I assume it is GPU related.

Googling around I find old advice on disabling automatic power management. But that discussion mentioned kernel version 3.13, so I don’t think this is the same issue.

Some other discussion recommends feeding the graphics card’s two PCIe power plugs from two different PSU rails. I am not an electrical engineer, but I hesitate to try this … I am having nightmares of electrons circling around loops that were never meant to be connected.

Is this a known problem? Would it help to switch from RADV to AMDGPU? Does anybody have similar issues with Vega under Linux?
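Since the machine only comes back after a hard reset, one place to look for traces of the hang is the kernel log of the previous boot. The following is a minimal sketch, assuming a systemd journal that persists across reboots; the keyword list and both function names are illustrative assumptions, not an exhaustive set of GPU error markers.

```python
import subprocess

# Illustrative keywords (an assumption); extend as needed.
KEYWORDS = ("amdgpu", "drm")


def filter_gpu_lines(log_text, keywords=KEYWORDS):
    """Return the log lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(k in lowered for k in keywords):
            hits.append(line)
    return hits


def previous_boot_kernel_log():
    """Return the kernel log of the previous boot, or "" if unavailable.

    `journalctl -k -b -1` prints kernel messages from the boot before the
    current one; it only works if the journal is stored persistently.
    """
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "-1", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return ""  # journalctl not installed or not executable
    return result.stdout if result.returncode == 0 else ""
```

After a crash and reboot, `filter_gpu_lines(previous_boot_kernel_log())` would list whatever graphics-related messages made it to disk. With a complete freeze it is quite possible that nothing was written at all.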

I had a lot of issues with Xorg and Vega. Logged in with Wayland and no crashes or artifacting for several months.

Searching the errors returned nothing helpful.

Give it a try, maybe it’ll help identify the issue.

would need more details. it can happen but is rare on my system

Some more random details:

  • it makes no difference if there is a (lower priority) background job running on the CPU or not

  • no crashes on desktop or during video playback, no artifacting during gaming

  • the GPU is usually frame capped to 60Hz; the GPU fans never rev up to 100%

  • GPU temperature (as far as I am able to monitor it) is always well below 85 C

  • I am currently using audio over HDMI as a temporary makeshift solution, so the GPU is involved in audio, too
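For logging the GPU temperature over a gaming session, the amdgpu kernel driver exposes its sensors through the hwmon interface in sysfs, with values in millidegrees Celsius. A minimal sketch, assuming the card is `card0` (the index may differ on other systems); the function names are made up for illustration:

```python
import glob
import os

# Assumed sysfs location; adjust the card index if the GPU is not card0.
HWMON_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/temp*_input"


def parse_millidegrees(raw):
    """Convert a hwmon temperature string (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def read_gpu_temps(pattern=HWMON_GLOB):
    """Return {sensor_file_name: degrees_C} for every readable sensor file."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                temps[os.path.basename(path)] = parse_millidegrees(fh.read())
        except (OSError, ValueError):
            continue  # sensor vanished or returned something unexpected
    return temps
```

Calling `read_gpu_temps()` once per second and appending the result to a file would show whether the last reading before a freeze was anywhere near the limit. On a machine without the sysfs path it simply returns an empty dict.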

I’ll research the option of using Wayland over Xorg; thanks for this hint!

still need to know what games, what settings, dxvk or opengl, Desktop environment etc.

there is a lot of stuff that can cause crashes and it really depends on a lot of factors

also are you running amdgpu-pro or stock?

Manjaro with XFCE desktop. Kernel version 5.3.6-1.

Vulkaninfo says (among other things):
GPU id : 0 (AMD RADV VEGA10 (LLVM 9.0.0))
So apparently I am running the stock RADV driver.
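The same check can be scripted. Note that RADV is Mesa's Vulkan driver in userspace, while amdgpu is the kernel driver underneath it, so a stock setup uses both at once. The sketch below parses a `GPU id` line of the form quoted above; the regular expression is an assumption based on that single line, and the helper names are hypothetical.

```python
import re

# Pattern derived from the one line quoted in this thread (an assumption);
# other vulkaninfo versions may format this line differently.
GPU_LINE = re.compile(r"GPU id\s*:\s*(\d+)\s*\((.*)\)\s*$")


def parse_gpu_line(line):
    """Return (gpu_index, device_name) or None if the line does not match."""
    match = GPU_LINE.search(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def is_radv(device_name):
    """True if the reported Vulkan device name mentions RADV."""
    return "RADV" in device_name.upper()
```

For the line from this post, `parse_gpu_line("GPU id : 0 (AMD RADV VEGA10 (LLVM 9.0.0))")` gives `(0, "AMD RADV VEGA10 (LLVM 9.0.0)")`, and `is_radv` on that name is true.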

The GUI “About” dialog box has this to say about the GPU:
Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.6-1-MANJARO, LLVM 9.0.0)

Games are mostly older Blizzard games: WoW, Diablo3, Hearthstone; through Lutris. The former two use DXVK, Hearthstone in contrast seems to use D9VK. Up until a very recent patch, Hearthstone used to be mapped to OpenGL and had abysmal frame rates (but that didn’t matter for this game).

Explored a few games from Steam using the built-in Proton: Path of Exile, Warframe; basically checking out a few of the free to play ones, and these two held my attention for a while.

All of the mentioned games except Hearthstone do sporadically crash. All games are rendered with VSYNC on (I value a quieter graphics card more than the lowest possible latency).

Huh, I’m experiencing this exact thing with my Vega 56 on Windows, been trying to debug it for a while (nothing appears in Event Viewer), but the fact you experience it on Linux makes me think this might actually be a firmware or hardware problem :S

Is this also the model you have? ROG-STRIX-RXVEGA56-O8G-GAMING

Unfortunately the model I own is a reference card (branded by Gigabyte). Yet another reason not to let it run too hot. :slight_smile:

If this is independent of OS, then it could indeed be an issue with power (management, or spikes). Any electrical engineers here who can shed light on this idea of powering the GPU from more than one single PSU rail?

I’m running two cables from the PSU to the GPU, if that is what you mean, by using more than a single PSU rail.

Wonder if it has to do with the PSU having a too low wattage rating, tho I don’t see how mine could be that since it is a 650 watt PSU which can deliver 650 watts over 12v

The power rail thing is a bit more complicated. Essentially, some PSUs (the multi rail ones) are internally “compartmentalized”; that is, they have more than one transformer, each of which converts electricity from the power grid to some lower voltage for use in the computer.

The alternative is to have only a single transformer bringing down power grid voltage to the highest of the required voltages on the PSU’s output side. In a PC that would be 12 Volts, I think. And then lower voltages are derived from that same power rail in a second step, say 5 Volts and 3.3 Volts. But remember that my understanding of the matter is limited and you better ask somebody who actually knows this stuff. :slight_smile:

So this isn’t about the number of cables running to the Vega 56 (which is usually two, because power consumption is well above 150 Watts). Rather, it is about whether the two cables are bringing energy from one single transformer or from two different transformers in the PSU.

At least that’s how I understand it. And I don’t dare to try splitting power delivery up between two different source transformers, unless an expert explains to me why that is in fact not problematic. I am afraid of melting my computer in a spectacular fashion. :slight_smile:

I was getting several GPU crashes every day with my Vega 56 and 3900X. I turned the RAM speed back down to 2,400 and it all went away. Now I’m running ECC at 2666 and no crashes. So far.

Your crashes are extremely rare compared to that, but I wonder if you’re running RAM at higher than normal speeds.

In my case the RAM at 3,600 passed Memtest perfectly. Errors only showed up when running all-core compiler tests, and apparently in games.

PS: The PCIe and RAM controllers are pretty tightly linked.

My workstation is using unbuffered ECC DIMMs at their nominal 2400 speed. Gaming is really the only workload that has these seemingly spontaneous freezes with blank screen of random solid color.

Writing here I remembered that the first thing I had done with this GPU was switching it to its “quiet” BIOS (it is a dual BIOS graphics card, with a “full speed” and a “low noise” setting). Now I flipped it back to full speed mode just to test over the next few days.

I think I have to try and jinx it. Ahem.

“Hey, ever since I switched to the default (full speed) BIOS, there were no crashes anymore.”

(The time hasn’t really been long enough, but if I can provoke a crash with this voodoo, then I can try other things without tangling the variables.)

Researching a bit more, I found a quirk of some (not all!) dual-BIOS GPUs. It is not always sufficient to flip the physical switch on the card. Some models require the drivers to be uninstalled and re-installed after changing the GPU BIOS. That is under Windows.

I now suspect that the Linux drivers aren’t really prepared (and, consequently, not tested) for the low noise GPU BIOS.

To the person having this problem under Windows: it might help to completely uninstall the AMD drivers (there seems to be a specific tool for that, but I don’t remember what it was called), and then re-install them from scratch. Good luck!

And I keep my fingers crossed for my Linux use case.

DDU - don’t know a trusty source though

EDIT: if you do this, consider having a bootable USB drive with Windows on it; broken drivers might be a pain in the ass

OP have you tried wayland as one of the first posts mentioned?

Wayland is an experiment I have not yet done. First I had to wait for the next crash after switching the GPU BIOS. The heavy handed jinxing was effective, so to say, and the machine has crashed in the meantime. So now I am free to try wayland next.

I was having some sporadic issues the past few weeks and the most recent updates seem to have fixed most issues. the 5.3.8 kernel is pretty fantastic with the latest mesa and firmware updates.

Have you done a pacman -Syu lately?
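To check quickly whether a machine already runs at least a given kernel, the release string can be compared numerically. A small sketch, assuming release strings shaped like the `5.3.6-1-MANJARO` quoted earlier in the thread; the helper names are hypothetical.

```python
import platform
import re


def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a kernel release string.

    A missing patch level is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_at_least(release, minimum):
    """True if the release is at or above the (major, minor, patch) minimum."""
    return kernel_version_tuple(release) >= tuple(minimum)
```

On the running system, `is_at_least(platform.release(), (5, 3, 8))` answers the question for the 5.3.8 kernel mentioned above.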

I just saw this . . .

I have seen a lot of sporadic issues with Blizz games over the past year due to their launcher forcing different versions of DirectX than you actually select. I have recently removed their software completely so cannot test anymore, but if I recall there was a setting to force the launcher to run in DX9 mode which seemed to fix most of the flakiness issues.

Kernel and userland are up to date here. Just switched to Wayland (was a lot easier than I had thought). I will see if I can find more detail on the Blizzard launcher. However, I was having the described crashes also with the Steam games mentioned earlier.

BTW, thanks for all the comments and hints, everybody!

I do know that in some games (especially No Man’s Sky) turning off motion blur seemed to reduce chugging quite a bit on my Vega 56. Perhaps that’s something you could mess with.