
Ryzen/Vega laptop PCIe Bus Error

amd
helpdesk
laptops

#47

WHAAAAT?


I didn’t even think of that. Mostly because suspend always broke something for me. So the first thing I do is just disable that. But it does make sense of course, the system was “suspended” anyway and kicks back into gear when you resume. … Damn, should have thought of that.

Oh also, on my machine no input was working when the freeze happened.
How did you suspend it? SSH?


#48

The power button is set up to suspend.


#49

Huh, the power button had no effect on my machine; it was set up to power off.
What machine was that? I’m on the Lenovo Ideapad 720s.


#50

You might have to configure it on your distro… this is umm raw motherboardish.


#51

Yeah, like I said, that did not work for me. It was configured to power off without any dialog or confirmation and it did nothing. Must be device specific then.


#52

I actually tried that before. Didn’t work with my Envy x360 at least. The power button is indeed set to sleep. If I remove the rcu_nocbs and max_cstate parameters, it sleeps, wakes up, and even hibernates with no issue, but crashes randomly. With the parameters in place I lose power management and sleep, but at least the laptop works. So far everything looks good with rcu_nocbs and processor.max_cstate=1.
Moving from an Acer Aspire S7 (great laptop, but dual-core Haswell and 8 GB RAM) to this has been heavenly. I’m hoping for a fix for this issue soon, but that’s all.


#53

Linux 4.17 is now mainline.
I doubt it’s gonna fix the issue since none of the rc kernels did, but I am compiling as we speak and will test and update everyone.

EDIT: It crashed as expected.
Today’s my birthday so I’m gonna have some fun instead of testing lol.
But when I’m done I wanna see which C-State is causing the issue by going from max_cstate=1 to 6. 6 will probably fail but I do wanna test out. Will update as soon as I find anything.

EDIT 2:
The forum is not letting me post because I am a new user :man_facepalming:
I can only edit.

Note that for the rcu_nocbs parameter to work, the corresponding option must be enabled when the kernel is compiled.
Look here for reference.
https://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Arch Kernels have it enabled by default AFAIK
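
For anyone who wants to check this before rebooting, here is a minimal sketch. It assumes the option in question is CONFIG_RCU_NOCB_CPU (that name is not stated in this thread) and that the kernel config is available under /boot/config-<release> or /proc/config.gz, which depends on the distro. The function names are made up for illustration.

```python
import gzip
import platform
from pathlib import Path


def rcu_nocb_enabled(config_text: str) -> bool:
    """Return True if the kernel config text sets CONFIG_RCU_NOCB_CPU=y.

    Assumption: CONFIG_RCU_NOCB_CPU is the option that rcu_nocbs depends on.
    """
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_RCU_NOCB_CPU=y":
            return True
    return False


def read_kernel_config() -> str:
    """Read the running kernel's config from the usual locations, if present."""
    boot_config = Path(f"/boot/config-{platform.release()}")
    if boot_config.exists():
        return boot_config.read_text()
    proc_config = Path("/proc/config.gz")
    if proc_config.exists():
        return gzip.decompress(proc_config.read_bytes()).decode()
    return ""  # config not exposed on this system
```

Calling `rcu_nocb_enabled(read_kernel_config())` on the affected machine should tell you whether the parameter has any chance of taking effect; an empty config just means the kernel does not expose it in either place.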

EDIT 3:
I tested processor.max_cstate values from 5 down to 1. The only stable one is 1; any other value crashes.


#54

Add “rcu_nocbs=0-7” and “processor.max_cstate=1” in grub.

For me it doesn’t work =( Before, it hung only under high load; now it also hangs randomly while just browsing the web.


#55

What are your hardware and software specs exactly?


#56

Ryzen 3 2200G on MSI B350 Motherboard
Ubuntu 18.04 (Kernel 4.17.0)

dmesg gives this:
[ 3240.929502] pcieport 0000:00:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000a(Transmitter ID)
[ 3240.929505] pcieport 0000:00:01.2: device [1022:15d3] error status/mask=00001000/00006000
[ 3240.929507] pcieport 0000:00:01.2: [12] Replay Timer Timeout
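
A side note on reading these lines: the status word 00001000 has only bit 12 set, which is why the next line reports "[12] Replay Timer Timeout". The sketch below decodes that field. The bit names follow the Correctable Error Status register of the PCIe AER capability, the table is deliberately partial, and the function name is made up for illustration. It only makes sense for severity=Corrected messages.

```python
import re

# Partial table of Correctable Error Status bits (PCIe AER capability).
CORRECTED_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Num Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_status(line: str) -> list[str]:
    """Return the names of the known status bits set in an AER dmesg line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    return [
        name
        for bit, name in sorted(CORRECTED_ERROR_BITS.items())
        if status & (1 << bit)
    ]


line = "pcieport 0000:00:01.2: device [1022:15d3] error status/mask=00001000/00006000"
print(decode_status(line))  # ['Replay Timer Timeout']
```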

The system boots successfully only about one time in four.
Errors during boot:
[ 0.079553] ACPI BIOS Error (bug): Failure creating [_SB.SMIC], AE_ALREADY_EXISTS (20180313/dswload2-316)
[ 0.079764] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180313/psobject-220)
[ 0.079946] ACPI Error: Method parse/execution failed , AE_ALREADY_EXISTS (20180313/psparse-516)
[ 0.080001] ACPI Error: Invalid zero thread count in method (20180313/dsmethod-760)
[ 0.080174] ACPI: Marking method ___ as Serialized because of AE_ALREADY_EXISTS error
[ 0.080175] ACPI Error: Invalid OwnerId: 0x00 (20180313/utownerid-156)
[ 0.080350] ACPI Error: AE_ALREADY_EXISTS, (SSDT: AMD PT) while loading table (20180313/tbxfload-197)
[ 0.081148] ACPI Error: 1 table load failures, 7 successful (20180313/tbxfload-215)
[ 0.697188] AMD-Vi: Unable to write to IOMMU perf counter.
[ 4.100036] usb 1-8: device not accepting address 5, error -71
[ 4.100139] usb usb1-port8: unable to enumerate USB device
[ 16.801249] kvm: disabled by bios

Thanks for any help


#57

Huh, so this also happens on the desktop APUs? Didn’t realize that.
I am guessing you are on the latest UEFI?

Can’t you just disable C-states in the BIOS?

Oh and since the 2200G has only 4 threads it must be rcu_nocbs=0-3
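
To spell out the arithmetic: CPUs are numbered from 0, so the range ends at the thread count minus one. A tiny sketch (the helper name is made up for illustration):

```python
def rcu_nocbs_param(thread_count: int) -> str:
    """Build an rcu_nocbs value covering every hardware thread (0-based)."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    return f"rcu_nocbs=0-{thread_count - 1}"


print(rcu_nocbs_param(4))  # rcu_nocbs=0-3 (e.g. 2200G, 4 threads)
print(rcu_nocbs_param(8))  # rcu_nocbs=0-7 (8 threads)
```

On the machine itself, `os.cpu_count()` should give the thread count to pass in.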


#58

Yes, I have the latest UEFI and I already tried rcu_nocbs=0-3; it stays the same.

I’ll try to disable C-states in the BIOS. Thanks


#59

So I installed Fedora and am testing kernel 4.18-rc1.
It seems the issue has not been fixed yet. I will start testing which kernel parameters are needed to work around it.
In the meantime, 4.17 works perfectly with the two parameters I posted two weeks ago.


#60

Thank you guys for investigating this issue when AMD doesn’t.
I recently bought an Acer Swift 3 with a Ryzen 5 2500U and installed Linux Mint 19 on it.
I’m using the latest kernel (4.17) and Mesa version, but I’m still having the same freeze issues when watching YouTube videos. I will try the suggested workaround and report on it.


#61

I was able to work on my notebook this evening without a single freeze. It seems that in my case just adding processor.max_cstate=1 in /etc/default/grub was enough:

Afterwards I ran sudo update-grub, rebooted, and no freezes yet!
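
For reference, a rough sketch of what that edit amounts to, written as a text transformation. It assumes the file has a double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line, as on stock Ubuntu and Mint installs; the sample contents and the helper name are made up for illustration. It does not write to disk, and `sudo update-grub` plus a reboot are still needed afterwards.

```python
import re

CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_params(grub_text: str, params: list[str]) -> str:
    """Append any missing params to GRUB_CMDLINE_LINUX_DEFAULT.

    Returns the text unchanged if the line is absent or not double-quoted.
    """

    def merge(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    return CMDLINE_RE.sub(merge, grub_text, count=1)


# Hypothetical file contents, for illustration only.
example = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_params(example, ["processor.max_cstate=1"]))
```

Running it twice gives the same result, so parameters are not duplicated.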

For anyone interested in my system information:


#62

Yeah, at this point it is obvious that it is a C-State thing again.
Glad to hear that the processor.max_cstate=1 seems to work across the board.


#63

I have the same machine and I hope this will help me.

At the same time I am wondering, how did you manage to install LM on it? Mine wouldn’t play ball at all with LM. It didn’t even get into the desktop environment from the USB.

I am currently running Antergos Cinnamon and liking it a lot.

Thanks for the solution, will let you know if my system holds.


#64

So under Kubuntu 18.04 my machine is actually freezing up again.
I can ssh into it and dmesg tells me this:

[   42.821380] amdgpu: [powerplay] pp_dpm_get_temperature was not implemented.
[ 5488.951160] gmc_v9_0_process_interrupt: 21 callbacks suppressed
[ 5488.951167] amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:153 vm_id:0 pas_id:0)
[ 5488.951175] amdgpu 0000:03:00.0:   at page 0x0000000600000000 from 18
[ 5488.951178] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000132
[ 5680.091606] INFO: task Xorg:991 blocked for more than 120 seconds.
[ 5680.091612]       Tainted: G        W        4.15.0-24-generic #26-Ubuntu
[ 5680.091615] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5680.091618] Xorg            D    0   991    985 0x00400004
[ 5680.091622] Call Trace:
[ 5680.091632]  __schedule+0x291/0x8a0
[ 5680.091636]  schedule+0x2c/0x80
[ 5680.091639]  schedule_preempt_disabled+0xe/0x10
[ 5680.091641]  __ww_mutex_lock.isra.3+0x204/0x670
[ 5680.091645]  __ww_mutex_lock_slowpath+0x16/0x20
[ 5680.091647]  ? __ww_mutex_lock_slowpath+0x16/0x20
[ 5680.091649]  ww_mutex_lock+0x5a/0x70
[ 5680.091671]  drm_modeset_backoff+0x47/0xc0 [drm]
[ 5680.091687]  drm_mode_obj_set_property_ioctl+0x14b/0x280 [drm]
[ 5680.091704]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
[ 5680.091719]  drm_mode_connector_property_set_ioctl+0x3f/0x60 [drm]
[ 5680.091731]  drm_ioctl_kernel+0x5f/0xb0 [drm]
[ 5680.091743]  drm_ioctl+0x31b/0x3d0 [drm]
[ 5680.091757]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
[ 5680.091798]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[ 5680.091803]  do_vfs_ioctl+0xa8/0x630
[ 5680.091807]  ? vfs_read+0x115/0x130
[ 5680.091809]  SyS_ioctl+0x79/0x90
[ 5680.091813]  do_syscall_64+0x73/0x130
[ 5680.091816]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5680.091819] RIP: 0033:0x7f4fa03325d7
[ 5680.091821] RSP: 002b:00007ffe84240258 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
[ 5680.091823] RAX: ffffffffffffffda RBX: 000055a8ab08d6f0 RCX: 00007f4fa03325d7
[ 5680.091824] RDX: 00007ffe84240290 RSI: 00000000c01064ab RDI: 0000000000000017
[ 5680.091826] RBP: 00007ffe84240290 R08: 0000000000000001 R09: 0000000000000000
[ 5680.091827] R10: 00007f4fa03bacc0 R11: 0000000000003246 R12: 00000000c01064ab
[ 5680.091828] R13: 0000000000000017 R14: 000055a8ab08db10 R15: 000055a8a9c88601
[ 5680.091913] INFO: task kworker/u32:4:5622 blocked for more than 120 seconds.
[ 5680.091915]       Tainted: G        W        4.15.0-24-generic #26-Ubuntu
[ 5680.091917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5680.091919] kworker/u32:4   D    0  5622      2 0x80000000
[ 5680.091933] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 5680.091934] Call Trace:
[ 5680.091937]  __schedule+0x291/0x8a0
[ 5680.091941]  schedule+0x2c/0x80
[ 5680.091943]  schedule_timeout+0x1cf/0x350
[ 5680.092003]  ? tgn10_get_crtc_scanoutpos+0x6b/0xa0 [amdgpu]
[ 5680.092007]  dma_fence_default_wait+0x1c7/0x260
[ 5680.092009]  ? dma_fence_release+0xa0/0xa0
[ 5680.092011]  dma_fence_wait_timeout+0x3e/0xf0
[ 5680.092014]  reservation_object_wait_timeout_rcu+0x17d/0x370
[ 5680.092072]  amdgpu_dm_do_flip+0x12c/0x390 [amdgpu]
[ 5680.092126]  amdgpu_dm_atomic_commit_tail+0x92c/0xa50 [amdgpu]
[ 5680.092131]  ? dequeue_entity+0xe4/0x470
[ 5680.092135]  ? __switch_to+0x182/0x500
[ 5680.092143]  commit_tail+0x42/0x70 [drm_kms_helper]
[ 5680.092149]  commit_work+0x12/0x20 [drm_kms_helper]
[ 5680.092153]  process_one_work+0x1de/0x410
[ 5680.092155]  worker_thread+0x32/0x410
[ 5680.092158]  kthread+0x121/0x140
[ 5680.092160]  ? process_one_work+0x410/0x410
[ 5680.092163]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 5680.092165]  ? do_syscall_64+0x73/0x130
[ 5680.092168]  ? SyS_exit+0x17/0x20
[ 5680.092170]  ret_from_fork+0x22/0x40
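
If the log is long, a small filter can pull out just the hung-task reports. This is a sketch that assumes the message wording shown above ("INFO: task NAME:PID blocked for more than N seconds"); the function name is made up for illustration.

```python
import re

HUNG_TASK_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(dmesg_text: str) -> list[tuple[str, int, int]]:
    """Return (task name, pid, seconds) for each hung-task report in the text."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_TASK_RE.finditer(dmesg_text)
    ]


sample = (
    "[ 5680.091606] INFO: task Xorg:991 blocked for more than 120 seconds.\n"
    "[ 5680.091913] INFO: task kworker/u32:4:5622 blocked for more than 120 seconds.\n"
)
print(hung_tasks(sample))
# [('Xorg', 991, 120), ('kworker/u32:4', 5622, 120)]
```

Here both Xorg and a DRM commit worker are stuck, which fits the amdgpu page fault logged just before.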

#65

Mine also started showing CPU#3 soft lockups again, after 2 weeks of working like a charm with @Deflaktor’s solution.

The GRUB config changed a lot after the update (the one that came after the fix) and I am not sure how to tackle this.


#66

I forgot where I found this, but appending “idle=nomwait pcie_aspm=off” seems to keep the laptop running even better than my previous solution. Can everyone please try and confirm my findings? I have not run into any crashes yet on kernel 4.18.
AFAIK this seems to be a BIOS issue.
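
When testing this, it helps to confirm the parameters were really picked up after the reboot, since a forgotten update-grub is easy to miss. A minimal sketch that compares the wanted parameters against the contents of /proc/cmdline (the sample command line and the helper name are made up for illustration):

```python
def missing_params(cmdline: str, wanted: list[str]) -> list[str]:
    """Return the wanted kernel parameters that are absent from cmdline."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


# Hypothetical /proc/cmdline contents, for illustration only.
cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet idle=nomwait"
print(missing_params(cmdline, ["idle=nomwait", "pcie_aspm=off"]))
# ['pcie_aspm=off']
```

On a real system, pass in `open("/proc/cmdline").read()`; an empty list means everything is active.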