Ryzen/Vega laptop PCIe Bus Error

catsay · February 26, 2018, 11:23am

Oh I very much fear it is or is related in some way to this bug:

https://www.spinics.net/lists/dri-devel/msg161183.html

As of February 4th no activity

https://www.spinics.net/lists/dri-devel/msg164353.html

noenken · February 26, 2018, 11:26am

Doesn’t look like it.

[Edit] And just as I post that it stops once again.

The bug you posted does fit I guess. seq=offbytwo looks similar.
And I don’t have to run a video in the back to have OpenGL going, it is already rendering my KDE.

Oh, that’s new:

[  +0,000017] INFO: task amdgpu_cs:0:683 blocked for more than 120 seconds.
[  +0,000003]       Not tainted 4.16.0-1-MANJARO #1
[  +0,000002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0,000003] amdgpu_cs:0     D    0   683    622 0x00000000
[  +0,000003] Call Trace:
[  +0,000005]  ? __schedule+0x24b/0x8a0
[  +0,000004]  schedule+0x32/0x90
[  +0,000002]  schedule_timeout+0x202/0x470
[  +0,000060]  ? amdgpu_cs_bo_validate+0x8f/0x140 [amdgpu]
[  +0,000005]  dma_fence_default_wait+0x1ea/0x280
[  +0,000004]  ? dma_fence_default_wait+0x280/0x280
[  +0,000004]  dma_fence_wait_timeout+0x38/0x110
[  +0,000063]  amdgpu_ctx_wait_prev_fence+0x46/0x80 [amdgpu]
[  +0,000060]  amdgpu_cs_ioctl+0x223/0x1b70 [amdgpu]
[  +0,000023]  ? dequeue_entity+0x38d/0x970
[  +0,000064]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  +0,000022]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[  +0,000019]  drm_ioctl+0x2d5/0x370 [drm]
[  +0,000059]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  +0,000058]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  +0,000007]  do_vfs_ioctl+0xa4/0x630
[  +0,000005]  ? SyS_futex+0x12d/0x180
[  +0,000003]  SyS_ioctl+0x74/0x80
[  +0,000005]  do_syscall_64+0x74/0x190
[  +0,000005]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0,000006] RIP: 0033:0x7fe78b6add87
[  +0,000002] RSP: 002b:00007fe781772af8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0,000003] RAX: ffffffffffffffda RBX: 00007fe781772bd8 RCX: 00007fe78b6add87
[  +0,000002] RDX: 00007fe781772b60 RSI: 00000000c0186444 RDI: 0000000000000018
[  +0,000002] RBP: 00007fe781772b60 R08: 00007fe781772c00 R09: 00007fe781772b40
[  +0,000002] R10: 00007fe781772c00 R11: 0000000000000246 R12: 00000000c0186444
[  +0,000001] R13: 0000000000000018 R14: 000000000000000a R15: 0000000000000000

catsay · February 26, 2018, 11:34am

I found another report
https://bugs.freedesktop.org/show_bug.cgi?id=104817

And an older possibly less related one
https://www.spinics.net/lists/kernel/msg2679428.html

But at this point I assume there are plenty more people out there.

Might need to scour the lkml and bug tracker for something related to this and then open a bug if one doesn’t exist.

But yeah, either they are aware of this bug and it’s getting fixed in an upcoming release, or the kernel & amdgpu developers are unaware of it, which is likely since these laptops aren’t very widespread yet.

That said, didn’t phoronix recently try to test some 2500U laptops under linux and failed because something was horribly wrong?

catsay · February 26, 2018, 11:39am

OK yes, not the mobile parts, but the amdgpu stack for Ryzen APUs in general still seems to be a broken work in progress.

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Raven-Ridge-Mobo-Linux

Fudge. You might as well use Windows until that is fixed or become an alpha kernel tester

noenken · February 26, 2018, 11:46am

Hard no to that.

I could disable amdgpu but … a laptop I can’t even play a video on is a bit pointless.
I have other devices, I guess I’ll just have to wait for some time. ¯\_(ツ)_/¯

Is there a way to kill that driver when it crashes? Just so I don’t have to force it off.
This time it doesn’t seem to lose the connection. I can still do stuff over SSH.

And one more question:
The pro driver would do nothing for me on this one, right?
[Edit] I think I just read the answer to that in that phoronix article.

Trying out AMDGPU-PRO 17.50 on Ubuntu 16.04 LTS also hadn’t worked out nor did an Antergos 18.2 live image.

noenken · March 11, 2018, 12:19am

So, the USB thing…

This happens when I boot with my mouse connected and then disconnect it.

[Mär10 17:45] usb 1-2: USB disconnect, device number 2
[  +8,067789] xhci_hcd 0000:03:00.3: WARN: xHC save state timeout
[  +0,000023] suspend_common(): xhci_pci_suspend+0x0/0xc0 [xhci_pci] returns -110
[  +0,000016] xhci_hcd 0000:03:00.3: can't suspend (hcd_pci_runtime_suspend [usbcore] returned -110)

I don’t think it has any connection to the AMDGPU problems.
So maybe there is a fix for this at least?

Back to GPU … This looks a bit different, the outcome is the same. Freeze.
While running valley at minimum settings. Manjaro testing, latest 4.16rc.

[Mär18 05:42] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:155 vmid:3 pas_id:0)
[  +0,000011] amdgpu 0000:03:00.0:   at page 0x00000000c0500000 from 27
[  +0,000005] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301537
[  +0,000009] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:155 vmid:3 pas_id:0)
[  +0,000005] amdgpu 0000:03:00.0:   at page 0x00000000c0500000 from 27
[  +0,000004] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ +10,207853] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=89474, last emitted seq=89476
[  +0,000010] [drm] IP block:psp is hung!
[  +0,000002] [drm] GPU recovery disabled.
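For anyone comparing logs across reports, here is a minimal sketch for pulling the ring name and the signaled/emitted gap (the "off by two" pattern) out of the timeout lines. The regex only assumes the message format quoted in this thread; the helper name `pending_fences` is made up for illustration and is not part of any tool.

```python
import re

# Matches the amdgpu "ring ... timeout" lines as quoted in this thread.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a ring timeout line, else None."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    gap = int(match.group("emitted")) - int(match.group("signaled"))
    return match.group("ring"), gap


# Sample line copied from the dmesg output above.
sample = (
    "[ +10,207853] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=89474, last emitted seq=89476"
)
print(pending_fences(sample))  # ('gfx', 2)
```

Feeding it the output of `dmesg` line by line would show whether every hang has the same gap, which might be worth mentioning in a bug report.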

brauliobo · April 13, 2018, 10:51pm

Same as in

noenken · April 27, 2018, 7:36pm

Ubuntu shows the same old behavior: a freeze, a bunch of errors, and if I’m lucky I can shut it off remotely.

[  +0,000005] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vm_id:6 pas_id:0)
[  +0,000008] amdgpu 0000:03:00.0:   at page 0x0000000115b46000 from 27
[  +0,000005] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00601135

And Manjaro is still doing it on 4.17rc.

This is getting frustrating to be honest.
Can I (as a pure user) do anything about this?

framspreat · May 12, 2018, 6:22am

Is there any update on this thread?

I have the same issue on my ideapad 720s Ryzen platform.
After nothing worked, I was compelled to switch to Windows.

catsay · May 12, 2018, 11:08am

Gee-whiz I can’t believe this is still an issue.

At this point I’m wondering what we can do to get testing hardware/funding to someone (a kernel/PCI/amdgpu dev) who can figure this out.

I’m still convinced that this is yet another PCI-e problem.
Same as the ASPM issue that STILL affects a number of Ryzen desktop machines.

In my dual-GPU system, running the GPUs under load without ASPM=off triggers an avalanche of PCI-e AERs.

I would actually drop money on someone to do more dedicated Ryzen related kernel fixes and optimizations.

noenken · May 12, 2018, 11:42am

I’m fine with that, I also would love to, like, run shit if that helps.
I just need to know where to go / who to talk to.

Also isn’t AMD interested in this being fixed?
I mean, their pro driver is based on the same thing, right?

So far I have a kubuntu 18.04 installed with blacklisted amdgpu.
But to me it is basically not useful that way and I also don’t like ubuntu.
So I’m not using the machine, which is a shame.

catsay · May 12, 2018, 12:11pm

Have you tried this ?
https://aur.archlinux.org/packages/linux-amd-staging-drm-next-git

noenken · May 12, 2018, 12:24pm

Nope. I only used the manjaro 4.17rc. Gonna try it when I get back home, probably tomorrow.

Grim_Reaper · May 12, 2018, 2:29pm

You can go back to Windows LOL.

FaunCB · May 12, 2018, 2:45pm

I’ll have my spec’d out X360 in my paws soon. If this is still a problem by then, I’ll attack it too. Personally I was gunna wait till kernel 5 dropped.

noenken · May 18, 2018, 12:51am

Goddammit, I almost thought they fixed it.
It ran valley for over an hour but then…

[Mai18 02:31] amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:153 vmid:0 pasid:0)
[  +0,000009] amdgpu 0000:03:00.0:   at page 0x0000000000000000 from 18
[  +0,000005] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000133

…and repeating for all eternity. And thrown in there just for fun … this:

[  +0,140038] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=819500, last emitted seq=819502
[  +0,000008] [drm] No hardware hang detected. Did some blocks stall?
[  +4,862865] gmc_v9_0_process_interrupt: 107644 callbacks suppressed

Different messages, same behavior. Display freeze, no input option, can do stuff over ssh but that’s it.

I am testing out the Manjaro 18 XFCE Beta but it didn’t help with this stuff.
On my big rig it is actually pretty good, running AMDGPU on Vega64 with DC.
Gonna have to test the 2200G before I make that one into a router.

Laptops, man… they are my kryptonite.

thro · May 18, 2018, 2:13am

I think you’d be very surprised at how small the sample size for people running Ryzen 2xxxG with Linux on a release kernel is. I reckon you’d be one of a handful of people to be honest…

Pretty sure they’ve been kryptonite for Linux and other open source platforms forever

Shining_Ace · May 29, 2018, 11:18pm

So I’m running Arch (Antergos) on my HP Envy x360 with Ryzen 5 2500U and I’ve had the same issue that everyone seems to be talking about. Everything works perfectly, CPU, Graphics, Sleep, Hibernate, until suddenly, BAM, the whole system hangs.
I’ve tried Ubuntu 18.04, Fedora 27 and 28, and Antergos, and all of them have the same issue; I settled on Antergos for now.
I’ve recently found a way to solve this issue for the most part. The easiest way for me to reproduce it is to compile something under Intel Quartus version 17 and up; during fitter operations it should crash within 4 tries.
Usually, like I said, it crashes within four tries. After my last attempt at fixing the issue it crashed at about the 40th try, which is much better. I am experimenting with disabling the C6 state now as well.

Here it goes:

1- Get the source of the latest kernel and compile it using any guide online. For Arch users, just use the linux-mainline AUR package, as it has the RCU feature enabled by default. For other Linux distros, before compiling run “make menuconfig”, navigate to RCU, and enable the last feature, “offload RCU callback processing…”. After that, compile the kernel using whatever your distro recommends and install it.
2- Edit the kernel command line in the grub config (/etc/default/grub): where it says GRUB_CMDLINE_LINUX_DEFAULT, add the option rcu_nocbs=0-7 between the quotes. It is 0-7 because we have 4 cores and 8 threads, so the CPUs are numbered 0 to 7.
3- Update grub and your initramfs depending on your distro, reboot and choose the new kernel and test if things work well.
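
If your machine has a different CPU count, here is a rough sketch for working out the matching option and checking after the reboot that it actually took effect. It only assumes that `os.cpu_count()` reports the logical CPUs and that `/proc/cmdline` holds the running kernel’s command line; the helper names `rcu_nocbs_option` and `has_rcu_nocbs` are made up for illustration.

```python
import os


def rcu_nocbs_option(cpu_count):
    """Build an rcu_nocbs= option covering CPUs 0..cpu_count-1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{cpu_count - 1}"


def has_rcu_nocbs(cmdline):
    """True if a kernel command line string already carries rcu_nocbs=."""
    return any(arg.startswith("rcu_nocbs=") for arg in cmdline.split())


# Option to paste into GRUB_CMDLINE_LINUX_DEFAULT for this machine.
print(rcu_nocbs_option(os.cpu_count() or 1))

# After rebooting, confirm the running kernel was booted with the option.
if os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as f:
        print("rcu_nocbs active:", has_rcu_nocbs(f.read()))
```

On an 8-thread machine the first line printed is rcu_nocbs=0-7, matching step 2. The option only does anything if the kernel was built with RCU callback offloading enabled, as in step 1.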

noenken · May 30, 2018, 3:48am

I’ll be damned! Just had valley running for four hours straight.

I took your suggestion and googled for rcu_nocbs since I wanted to know what that was doing and I found this:

https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocbsMeaning

I know, it is not exactly efficient but it didn’t freeze up.

Gonna run a test now with just your suggestion, gonna report later.

Shining_Ace · June 1, 2018, 12:37am

It does seem to freeze less now with me but there are still freezes when compiling Quartus Projects. I will test with other workloads and see as well.