Howdy,
I’ve been running into crashes with my Vega 64 this morning. I believe the problem has occurred intermittently ever since I got the card about two years ago, and I have seen it on both Windows 10 and Linux. I don’t have any logs from the last time it happened on Windows 10, though.
The freeze can happen on the desktop, in a web browser, or in a game. The video output freezes but the audio continues, and the system stays powered on. I don’t have a second machine to SSH into the box with, but I presume the system is not fully hung. Switching to a virtual terminal or restarting the desktop environment does not bring the display back; the only way to recover is a hard reset.
I’m running openSUSE Tumbleweed with kernel 5.10.16-1-default.
kernel-firmware-amdgpu version: 20210208-2.1-noarch
libdrm_amdgpu1 version: 2.4.104-12-x86_64
libdrm_amdgpu1-32bit version: 2.4.104-12-x86_64
xf86-video-amdgpu version: 19.1.0-3.4-x86_64
Running sudo journalctl --system | grep amdgpu > amdgpu.txt
gives me 3.2 MiB of output, the end of which is shown below. The other crashes from today were a mix of ‘ring page0 timeout’ and ‘ring gfx timeout’.
Mar 03 11:17:17 linux-r70n kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring page0 timeout, signaled seq=11577, emitted seq=11582
Mar 03 11:17:17 linux-r70n kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process pid 0 thread pid 0
Mar 03 11:17:17 linux-r70n kernel: amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
Mar 03 11:17:17 linux-r70n kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=46798, emitted seq=46800
Mar 03 11:17:17 linux-r70n kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process chrome pid 2752 thread chrome:cs0 pid 2777
Mar 03 11:17:17 linux-r70n kernel: amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
Mar 03 11:17:17 linux-r70n kernel: amdgpu 0000:0b:00.0: amdgpu: Bailing on TDR for s_job:a03b, as another already in progress
Mar 03 12:23:15 linux-r70n kernel: snd_hda_intel 0000:0b:00.1: bound 0000:0b:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
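Since 3.2 MiB is too much to read through by hand, here is a rough sketch of how the saved file could be boiled down to a count of timeouts per ring. The function name summarize_ring_timeouts is made up for illustration, and the pattern assumes every timeout line has the same ‘ring <name> timeout’ form as the lines above; it relies on grep -o, which GNU grep supports.

```shell
#!/bin/sh
# Summarise amdgpu ring timeouts from a saved journal extract.
# Assumption: the file was produced with
#   sudo journalctl --system | grep amdgpu > amdgpu.txt
# summarize_ring_timeouts is a hypothetical helper, not part of any tool.
summarize_ring_timeouts() {
    # $1: path to the saved log file
    # Extract only the "ring <name> timeout" fragment from each matching line,
    # then count identical fragments and list the most frequent ring first.
    grep -o 'ring [^ ]* timeout' "$1" | sort | uniq -c | sort -rn
}
```

Calling summarize_ring_timeouts amdgpu.txt should print one line per ring, each prefixed with the number of timeouts logged for it, which would show whether page0 or gfx is the more common culprit.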
I’ve found a thread on bugzilla.kernel.org in which users are reporting the same issue, but it hasn’t led to a fix.
Could anyone offer guidance on resolving this issue or on reporting it to the developers? Apologies in advance if I’ve missed something painfully obvious.