RX 580 amdgpu crashes

rbsil28 · December 15, 2019, 12:34pm

After doing a clean install of the Fedora 31 XFCE spin a few days ago, I’m seeing amdgpu driver crashes once or twice a day, always during mild desktop usage such as browsing. I’m using Firefox and have tried disabling hardware acceleration, with no effect whatsoever. On the contrary, I had a feeling that crashes are more frequent with hardware acceleration disabled, but it’s pretty hard to tell, since they happen so randomly.

I haven’t seen a single crash while playing games. I have been playing mostly Pillars of Eternity lately, which is not the most GPU demanding game, but I have seen it mentioned in several older threads about AMD GPU bugs as the game that was able to trigger some of those bugs.

After the crash, the desktop environment becomes either totally unresponsive or extremely slow, and screen artifacts appear. After switching the keyboard to default mode (SysRq + R), I’m able to switch to another virtual console or reboot my machine through a combination of magic SysRq keys.
It might be worth mentioning that occasionally the same screen artifacts that appeared in the graphical user interface are also visible on the text-mode virtual consoles.

I’m using the Xorg display server.

Everything in my system is running at stock frequencies, including the CPU, RAM and GPUs.

Before Fedora 31, I was using the Fedora 29 XFCE spin for almost a year and I hadn’t experienced a single graphics-related crash. I’m not exaggerating, the GPU was rock stable. The crashes started happening as soon as I installed Fedora 31.

The notable difference might be that Fedora 29 used Mesa packages with major version 18, while Fedora 31 uses Mesa packages with major version 19. Of course, other packages in the graphics stack changed version too, such as libdrm 2.4.97 vs. libdrm 2.4.100, but the Mesa version change seems to be the most significant one, at least to an uneducated eye.

I’ve filed a bug in the Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1780949 but I’m not sure what to do next. I don’t understand the logs and, although I don’t consider myself a beginner Linux user, I have little knowledge of or experience with the low-level workings of the graphics stack and GPU drivers.

I have the following packages installed:

[username@hostname ~]$ uname -r
5.3.15-300.fc31.x86_64

[username@hostname ~]$ dnf list '*mesa*' --installed
Installed Packages
mesa-dri-drivers.x86_64                                                                          19.2.7-1.fc31                                                                        @updates
mesa-filesystem.x86_64                                                                           19.2.7-1.fc31                                                                        @updates
mesa-libEGL.x86_64                                                                               19.2.7-1.fc31                                                                        @updates
mesa-libGL.x86_64                                                                                19.2.7-1.fc31                                                                        @updates
mesa-libGLU.x86_64                                                                               9.0.1-1.fc31                                                                         @fedora 
mesa-libOSMesa.x86_64                                                                            19.2.7-1.fc31                                                                        @updates
mesa-libOpenCL.x86_64                                                                            19.2.7-1.fc31                                                                        @updates
mesa-libgbm.x86_64                                                                               19.2.7-1.fc31                                                                        @updates
mesa-libglapi.x86_64                                                                             19.2.7-1.fc31                                                                        @updates
mesa-libxatracker.x86_64                                                                         19.2.7-1.fc31                                                                        @updates
mesa-vulkan-drivers.x86_64                                                                       19.2.7-1.fc31                                                                        @updates

[username@hostname ~]$ dnf list '*drm*' --installed
Installed Packages
libdrm.x86_64                                                                             2.4.100-1.fc31                                                                              @updates

I’m using an RX 580 as my primary graphics card:

2d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 0519
	Flags: bus master, fast devsel, latency 0, IRQ 64
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at f000 [size=256]
	Memory at f7a00000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] Resizable BAR <?>
	Capabilities: [270] Secondary PCI Express <?>
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [370] L1 PM Substates
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

2d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
	Subsystem: ASUSTeK Computer Inc. Device aaf0
	Flags: bus master, fast devsel, latency 0, IRQ 67
	Memory at f7a60000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

I also have an NVIDIA GTX 1070 Ti intended for GPU passthrough to VMs, but I haven’t configured any virtual machine since the reinstall, so right now it’s just sitting there idle:

2e:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 861e
	Flags: bus master, fast devsel, latency 0, IRQ 62
	Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at f7000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express <?>
	Kernel driver in use: nouveau
	Kernel modules: nouveau

2e:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
	Subsystem: ASUSTeK Computer Inc. Device 861e
	Flags: bus master, fast devsel, latency 0, IRQ 68
	Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

Each crash logs slightly different messages in the journal. Examples (newest entries first):
13.12.2019

Dec 13 18:33:35 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 13 18:33:29 hostname kernel: sysrq: Keyboard mode set to system default
Dec 13 18:33:25 hostname kernel: amdgpu 0000:2d:00.0: GPU reset end with ret = -110
Dec 13 18:33:25 hostname kernel: amdgpu 0000:2d:00.0: GPU reset(2) failed
Dec 13 18:33:25 hostname kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Dec 13 18:33:25 hostname kernel: amdgpu 0000:2d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Dec 13 18:33:24 hostname kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Dec 13 18:33:24 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Dec 13 18:33:24 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:23 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:22 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:21 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:20 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:19 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:18 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:17 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 13 18:33:14 hostname kernel: [drm] DM_MST: starting TM on aconnector: 000000003df1e1fc [id: 63]
Dec 13 18:33:14 hostname kernel: [drm] VRAM is lost due to GPU reset!
Dec 13 18:33:14 hostname kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
Dec 13 18:33:14 hostname kernel: amdgpu 0000:2d:00.0: GPU reset succeeded, trying to resume
Dec 13 18:33:13 hostname kernel: amdgpu 0000:2d:00.0: GPU pci config reset
Dec 13 18:33:13 hostname kernel: rlc is busy, skip halt rlc
Dec 13 18:33:13 hostname kernel: cp is busy, skip halt cp
Dec 13 18:33:13 hostname kernel: amdgpu 0000:2d:00.0: GPU reset begin!
Dec 13 18:33:13 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1485 thread Xorg:cs0 pid 1543
Dec 13 18:33:13 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=248250, emitted seq=248252
Dec 13 18:33:13 hostname kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Dec 13 18:33:02 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x01, vmid 2, pasid 32770) at page 50336203, read from 'TC4' (0x54433400) (72)
Dec 13 18:33:02 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04048001
Dec 13 18:33:02 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x030011CB
Dec 13 18:33:02 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x0e584801 for process Xorg pid 1485 thread Xorg:cs0 pid 1543
Dec 13 18:33:02 hostname kernel: gmc_v8_0_process_interrupt: 59 callbacks suppressed
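
A reading aid for the VM fault lines: the decimal and quoted values in the "VM fault (...)" message appear to be the same raw registers printed next to it, just decoded. The Python sketch below reproduces that decoding for the fault logged at 18:33:02. The bit positions are an assumption inferred by matching the fault lines in this post, not taken from a register specification, and the function name is made up for illustration.

```python
def decode_vm_fault(status: int, addr: int, client: int) -> dict:
    """Decode the raw values printed alongside an amdgpu VM fault.

    Bit positions are inferred from the log lines in this thread
    (assumption), not from hardware documentation.
    """
    return {
        "protections": status & 0xFF,    # first value in "VM fault (0x01, ..."
        "mc_id": (status >> 12) & 0xFF,  # trailing "(72)"
        "vmid": (status >> 25) & 0xF,    # "vmid 2"
        "page": addr,                    # "at page 50336203" is FAULT_ADDR in decimal
        # 0x54433400 is the ASCII text "TC4" padded with a zero byte
        "client": client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }


# Values from the 18:33:02 entries above.
fault = decode_vm_fault(status=0x04048001, addr=0x030011CB, client=0x54433400)
print(fault)
```

The result matches the human-readable line (protections 0x01, vmid 2, page 50336203, client 'TC4', id 72), so the two register lines carry no extra information beyond the "VM fault" line itself.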

11.12.2019

Dec 11 21:37:38 hostname kernel: amdgpu 0000:2d:00.0: GPU reset begin!
Dec 11 21:37:38 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:37:38 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=9620, emitted seq=9620
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:32 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:28 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:28 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:28 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:28 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dec 11 21:37:27 hostname kernel: ---[ end trace 439ca7333ff82ac2 ]---
Dec 11 21:37:27 hostname kernel:  ret_from_fork+0x22/0x40
Dec 11 21:37:27 hostname kernel:  ? kthread_park+0x80/0x80
Dec 11 21:37:27 hostname kernel:  ? drm_sched_cleanup_jobs+0x130/0x130 [gpu_sched]
Dec 11 21:37:27 hostname kernel:  kthread+0xfb/0x130
Dec 11 21:37:27 hostname kernel:  ? finish_wait+0x80/0x80
Dec 11 21:37:27 hostname kernel: Call Trace:
Dec 11 21:37:27 hostname kernel: CR2: 00007f401ff1b37c CR3: 00000007f4022000 CR4: 00000000003406e0
Dec 11 21:37:27 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 11 21:37:27 hostname kernel: FS:  0000000000000000(0000) GS:ffff9edd9ed00000(0000) knlGS:0000000000000000
Dec 11 21:37:27 hostname kernel: R13: 0000000000000000 R14: ffff9edd90794d00 R15: ffff9edd8a166bb8
Dec 11 21:37:27 hostname kernel: R10: 0000000000022e40 R11: 0000000000000003 R12: ffff9edd8a166a30
Dec 11 21:37:27 hostname kernel: RBP: ffff9edcd92ef458 R08: 0000000000000001 R09: 0000000000000624
Dec 11 21:37:27 hostname kernel: RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9edd9ed17900
Dec 11 21:37:27 hostname kernel: RAX: 0000000000000024 RBX: ffff9edcd92ef400 RCX: 0000000000000006
Dec 11 21:37:27 hostname kernel: RSP: 0018:ffffade500993e90 EFLAGS: 00010246
Dec 11 21:37:27 hostname kernel: Code: 5b 5d 41 5c 41 5d c3 48 89 ef e8 f1 45 00 ce 8b 43 20 85 c0 75 87 45 31 ed eb c2 45 31 ed 48 c7 c7 90 e5 68 c0 e8 64 9e ab cd <0f> 0b e9 69 ee f>
Dec 11 21:37:27 hostname kernel: RIP: 0010:drm_sched_main.cold+0xf/0x2c [gpu_sched]
Dec 11 21:37:27 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Taichi, BIOS P4.40 02/07/2018
Dec 11 21:37:27 hostname kernel: CPU: 12 PID: 676 Comm: gfx Tainted: G        W         5.3.15-300.fc31.x86_64 #1
Dec 11 21:37:27 hostname kernel:  drm_kms_helper crc32_pclmul crc32c_intel mxm_wmi drm ghash_clmulni_intel igb ccp uas dca i2c_algo_bit usb_storage wmi pinctrl_amd fuse
Dec 11 21:37:27 hostname kernel: Modules linked in: rfcomm ccm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp xt_CT tun bridge stp llc ip6t_REJECT nf_reject_ipv6 ip6t_rpfilte>
Dec 11 21:37:27 hostname kernel: WARNING: CPU: 12 PID: 676 at include/linux/dma-fence.h:513 drm_sched_main.cold+0xf/0x2c [gpu_sched]
Dec 11 21:37:27 hostname kernel: ------------[ cut here ]------------
Dec 11 21:37:27 hostname kernel: [drm] Skip scheduling IBs!
Dec 11 21:37:27 hostname kernel: ---[ end trace 439ca7333ff82ac1 ]---
Dec 11 21:37:27 hostname kernel:  ret_from_fork+0x22/0x40
Dec 11 21:37:27 hostname kernel:  ? kthread_park+0x80/0x80
Dec 11 21:37:27 hostname kernel:  ? drm_sched_cleanup_jobs+0x130/0x130 [gpu_sched]
Dec 11 21:37:27 hostname kernel:  kthread+0xfb/0x130
Dec 11 21:37:27 hostname kernel:  ? finish_wait+0x80/0x80
Dec 11 21:37:27 hostname kernel: Call Trace:
Dec 11 21:37:27 hostname kernel: CR2: 00007f401ff1b37c CR3: 00000007f4022000 CR4: 00000000003406e0
Dec 11 21:37:27 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 11 21:37:27 hostname kernel: FS:  0000000000000000(0000) GS:ffff9edd9ed00000(0000) knlGS:0000000000000000
Dec 11 21:37:27 hostname kernel: R13: 0000000000000000 R14: ffff9edd90795b00 R15: ffff9edd8a166bb8
Dec 11 21:37:27 hostname kernel: R10: 000000000002223c R11: 0000000000000003 R12: ffff9edd8a166a30
Dec 11 21:37:27 hostname kernel: RBP: ffff9edcd92eb858 R08: 0000000000000001 R09: 0000000000000609
Dec 11 21:37:27 hostname kernel: RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9edd9ed17900
Dec 11 21:37:27 hostname kernel: RAX: 0000000000000024 RBX: ffff9edcd92eb800 RCX: 0000000000000006
Dec 11 21:37:27 hostname kernel: amdgpu 0000:2d:00.0: GPU reset end with ret = -110
Dec 11 21:37:27 hostname kernel: RSP: 0018:ffffade500993e90 EFLAGS: 00010246
Dec 11 21:37:27 hostname kernel: Code: 5b 5d 41 5c 41 5d c3 48 89 ef e8 f1 45 00 ce 8b 43 20 85 c0 75 87 45 31 ed eb c2 45 31 ed 48 c7 c7 90 e5 68 c0 e8 64 9e ab cd <0f> 0b e9 69 ee f>
Dec 11 21:37:27 hostname kernel: RIP: 0010:drm_sched_main.cold+0xf/0x2c [gpu_sched]
Dec 11 21:37:27 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Taichi, BIOS P4.40 02/07/2018
Dec 11 21:37:27 hostname kernel: CPU: 12 PID: 676 Comm: gfx Not tainted 5.3.15-300.fc31.x86_64 #1
Dec 11 21:37:27 hostname kernel:  drm_kms_helper crc32_pclmul crc32c_intel mxm_wmi drm ghash_clmulni_intel igb ccp uas dca i2c_algo_bit usb_storage wmi pinctrl_amd fuse
Dec 11 21:37:27 hostname kernel: Modules linked in: rfcomm ccm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp xt_CT tun bridge stp llc ip6t_REJECT nf_reject_ipv6 ip6t_rpfilte>
Dec 11 21:37:27 hostname kernel: amdgpu 0000:2d:00.0: GPU reset(23) failed
Dec 11 21:37:27 hostname kernel: WARNING: CPU: 12 PID: 676 at include/linux/dma-fence.h:513 drm_sched_main.cold+0xf/0x2c [gpu_sched]
Dec 11 21:37:27 hostname kernel: ------------[ cut here ]------------
Dec 11 21:37:27 hostname kernel: [drm] Skip scheduling IBs!
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Dec 11 21:37:27 hostname kernel: amdgpu 0000:2d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Dec 11 21:37:27 hostname kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Dec 11 21:37:27 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Dec 11 21:37:27 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:26 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:25 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:24 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:23 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
...
Dec 11 21:37:22 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:21 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:20 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:19 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Dec 11 21:37:18 hostname kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
...
Dec 11 21:37:17 hostname kernel: [drm] DM_MST: starting TM on aconnector: 000000003a33a46b [id: 63]
Dec 11 21:37:16 hostname kernel: [drm] VRAM is lost due to GPU reset!
Dec 11 21:37:16 hostname kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
Dec 11 21:37:16 hostname kernel: amdgpu 0000:2d:00.0: GPU reset succeeded, trying to resume
Dec 11 21:37:16 hostname kernel: amdgpu 0000:2d:00.0: GPU pci config reset
Dec 11 21:37:16 hostname kernel: amdgpu 0000:2d:00.0: GPU reset begin!
Dec 11 21:37:16 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:37:16 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:37:16 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=9617, emitted seq=9620
Dec 11 21:37:06 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:55 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x04, vmid 2, pasid 32770) at page 1049514, read from 'TC0' (0x54433000) (8)
Dec 11 21:36:55 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008004
Dec 11 21:36:55 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001003AA
Dec 11 21:36:55 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 146 0x0d500804 for process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:36:55 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:45 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:35 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:25 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x01, vmid 3, pasid 32770) at page 108008385, read from 'TC7' (0x54433700) (132)
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084001
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x067013C1
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x0e088401 for process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x01, vmid 3, pasid 32770) at page 108008369, read from 'TC7' (0x54433700) (132)
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084001
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x067013B1
Dec 11 21:36:14 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x0d888401 for process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:36:14 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:36:04 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:54 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:44 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:33 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:23 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:13 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:35:03 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:34:52 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
...
Dec 11 21:34:32 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:34:22 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:34:11 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:34:11 hostname kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Dec 11 21:34:01 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:33:51 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x01, vmid 3, pasid 32769) at page 108003841, read from 'TC7' (0x54433700) (132)
Dec 11 21:33:51 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084001
Dec 11 21:33:51 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06700201
Dec 11 21:33:51 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x00088401 for process xfwm4 pid 2039 thread xfwm4:cs0 pid 2056
Dec 11 21:33:51 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:33:41 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x01, vmid 3, pasid 32769) at page 248513025, read from 'TC6' (0x54433600) (136)
Dec 11 21:33:41 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088001
Dec 11 21:33:41 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0ED00201
Dec 11 21:33:41 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x00088801 for process xfwm4 pid 2039 thread xfwm4:cs0 pid 2056
Dec 11 21:33:41 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 11 21:33:41 hostname kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Dec 11 21:33:31 hostname kernel: amdgpu 0000:2d:00.0: VM fault (0x02, vmid 6, pasid 32770) at page 5858917, read from 'TC1' (0x54433100) (4)
Dec 11 21:33:31 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C004002
Dec 11 21:33:31 hostname kernel: amdgpu 0000:2d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00596665
Dec 11 21:33:31 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x03280402 for process Xorg pid 1525 thread Xorg:cs0 pid 1565
Dec 11 21:33:31 hostname kernel: gmc_v8_0_process_interrupt: 59 callbacks suppressed
...
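
Because both excerpts are listed newest first, the order of events is easier to see once the lines are reversed and grouped. The following is a minimal Python sketch that does this for text pasted in the same format. The category labels are made up, and the match strings are simply copied from the messages above, so it only recognizes what already appears in this post.

```python
import re
from collections import Counter

# One journal line: "Dec 13 18:33:13 hostname kernel: message"
LINE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: (.*)$")

# Message fragments copied from the excerpts above; the labels are hypothetical.
KINDS = [
    ("vm_fault", re.compile(r"GPU fault detected")),
    ("soft_recovered", re.compile(r"ring \w+ timeout, but soft recovered")),
    ("ring_timeout", re.compile(r"ring \w+ timeout, signaled seq=")),
    ("reset_begin", re.compile(r"GPU reset begin!")),
    ("reset_failed", re.compile(r"GPU reset\(\d+\) failed")),
    ("uvd_stuck", re.compile(r"UVD not responding")),
]


def summarize(newest_first_text):
    """Count message kinds and note when each kind first appeared."""
    counts = Counter()
    first_seen = {}
    matches = [m for m in map(LINE.match, newest_first_text.splitlines()) if m]
    for m in reversed(matches):  # walk the log oldest entry first
        stamp, message = m.groups()
        for label, pattern in KINDS:
            if pattern.search(message):
                counts[label] += 1
                first_seen.setdefault(label, stamp)
                break
    return counts, first_seen


# A few lines taken from the 13.12.2019 excerpt, still newest first.
SAMPLE = """\
Dec 13 18:33:35 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Dec 13 18:33:25 hostname kernel: amdgpu 0000:2d:00.0: GPU reset(2) failed
Dec 13 18:33:13 hostname kernel: amdgpu 0000:2d:00.0: GPU reset begin!
Dec 13 18:33:13 hostname kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=248250, emitted seq=248252
Dec 13 18:33:02 hostname kernel: amdgpu 0000:2d:00.0: GPU fault detected: 147 0x0e584801 for process Xorg pid 1485 thread Xorg:cs0 pid 1543
"""

counts, first_seen = summarize(SAMPLE)
for label, stamp in first_seen.items():
    print(stamp, label, counts[label])
```

Read this way, both crashes start with a VM fault attributed to Xorg or xfwm4, followed by ring timeouts and finally a GPU reset that fails while resuming the UVD block.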

rbsil28 · December 15, 2019, 3:02pm

Follow-up question: how do you actually go about debugging GPU driver or graphics stack errors, so that you can give developers everything they need to reproduce and fix a bug? From what I found online, all those journalctl logs are the consequence of a GPU hang, but they don’t contain any information about its cause, so they’re not very useful. I was thinking about using something like this: https://apitrace.github.io/ but I can’t reproduce the bug reliably and sometimes there are several hours between crashes.
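
One small piece of this that can be automated is saving the complete kernel log of the boot in which the hang happened, so the bug report has more than hand-picked excerpts. The sketch below is in Python and assumes the journal is persistent across reboots and that the machine was restarted right after the hang. The helper names are hypothetical. It only captures symptoms; it does not find the cause.

```python
import subprocess


def kernel_log_command(boot_offset=-1):
    """journalctl arguments for the kernel log of an earlier boot.

    boot_offset=-1 means the boot before the current one, i.e. the boot
    in which the hang happened if the machine was restarted right after it.
    """
    return ["journalctl", "--dmesg", f"--boot={boot_offset}", "--no-pager"]


def save_kernel_log(path, boot_offset=-1):
    """Write that log to `path` (hypothetical helper, needs a persistent journal)."""
    result = subprocess.run(
        kernel_log_command(boot_offset),
        capture_output=True, text=True, check=True,
    )
    with open(path, "w") as out:
        out.write(result.stdout)


print(" ".join(kernel_log_command()))
```

Calling `save_kernel_log("amdgpu-hang.log")` after rebooting would produce a file that can be attached to the Bugzilla entry as-is.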

StrY · December 15, 2019, 4:37pm

I have an RX 580 and mine is running rock stable without issues, and I’m running 19.3.0.

This sounds like an overly aggressive power-saving issue with your card: the voltage is too low for the frequency at the lower power states.

You can try editing the power table for your card and see if stability improves.

https://wiki.archlinux.org/index.php/AMDGPU#Overclocking

This should provide some help with that. I have made several profiles with different custom performance states, so that my card runs cooler and more efficiently and only draws as much power as the game I’m running needs.

[Image: Screenshot_20191215_183526]

This is my most efficient profile - you can try something like this for desktop use and see if it delivers better reliability.
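
For anyone who wants to look at their own table before changing anything: on Polaris cards the Arch wiki page linked above describes the table as readable from the `pp_od_clk_voltage` file under `/sys/class/drm/card0/device/`. A rough Python sketch for parsing text in that layout follows. The sample values are invented for illustration and are not the profile from the screenshot, the `card0` path is an assumption, and the layout is assumed from the wiki rather than verified on this hardware.

```python
import re

# A state line looks like "0:        300MHz        750mV".
STATE = re.compile(r"^(\d+):\s+(\d+)MHz\s+(\d+)mV$")


def parse_od_table(text):
    """Return {section: [(state, mhz, mv), ...]} from pp_od_clk_voltage text."""
    tables = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":"):  # section header such as "OD_SCLK:"
            section = line[:-1]
            tables[section] = []
            continue
        match = STATE.match(line)
        if match and section:
            tables[section].append(tuple(int(g) for g in match.groups()))
    return tables


# Invented sample in the assumed layout; real values differ per card.
SAMPLE = """\
OD_SCLK:
0:        300MHz        750mV
1:        600MHz        769mV
2:        900MHz        887mV
OD_MCLK:
0:        300MHz        750mV
1:       1000MHz        800mV
"""

for state, mhz, mv in parse_od_table(SAMPLE)["OD_SCLK"]:
    print(f"sclk state {state}: {mhz} MHz at {mv} mV")
```

On a real system the input would be the contents of that sysfs file. If the suggestion above is right, the low states (0 and 1) are the ones worth comparing against a known-stable card.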

rbsil28 · December 16, 2019, 6:49pm

Thanks for pointing out the possible cause, and thanks also for the power tables. I’ll test these and report back.
