Ryzen/Vega laptop PCIe Bus Error

I even threw out the C-state thing again and it still has not crashed.
Manjaro 18 beta with kernel 4.17rc and that rcu_nocbs=0-7. Nothing else.

Can I help testing your stuff?
Warning: you’d need to explain veeery sloooowly.
I’m more of a filthy casual in the terminal.

I am testing now with both rcu_nocbs=0-7 and processor.max_cstate=1
No crashes yet at 12 compiles :slight_smile:

I could really use some help, but unfortunately I cannot release this code. I also don’t know whether it only hangs while compiling my code or whether any Quartus compile hangs it.

EDIT: 50 compiles and no crashes, time to transfer my installs and home folder and see if general usage and gaming crashes it.
Happy with the results so far. Less battery life for now is not a big issue.

Huh, so it seems like a power management issue from what I’ve read so far, like XFR and powersave options for the Ryzen platform just plain aren’t working.

You’d probably want to run that laptop at full clocks consistently until they get the power save features fixed.

OK, finally it froze. Now trying processor.max_cstate=3.


That froze after a couple hours video playback.

[Jun 1 13:33] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [TaskSchedulerBa:1727]
[  +0,000006] Modules linked in: ccm cmac rfcomm fuse bnep btusb btrtl btbcm btintel bluetooth ecdh_generic uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media arc4 amdkfd amd_iommu_v2 amdgpu iwlmvm chash gpu_sched edac_mce_amd i2c_algo_bit ttm ccp mac80211 rng_core joydev drm_kms_helper mousedev kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel irqbypass hid_multitouch crct10dif_pclmul crc32_pclmul hid_generic snd_hda_codec drm iwlwifi ghash_clmulni_intel wmi_bmof pcbc snd_hda_core nls_iso8859_1 snd_hwdep nls_cp437 snd_pcm aesni_intel agpgart aes_x86_64 crypto_simd syscopyarea snd_timer vfat cryptd sysfillrect fat cfg80211 glue_helper snd sysimgblt fb_sys_fops input_leds led_class soundcore pcspkr sp5100_tco ideapad_laptop sparse_keymap
[  +0,000053]  i2c_piix4 k10temp rfkill battery shpchp rtc_cmos ucsi_acpi typec_ucsi typec i2c_hid hid wmi evdev mac_hid pinctrl_amd ac acpi_cpufreq uinput crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto serio_raw atkbd libps2 xhci_pci xhci_hcd usbcore crc32c_intel usb_common i8042 serio
[  +0,000028] CPU: 7 PID: 1727 Comm: TaskSchedulerBa Not tainted 4.17.0-1-MANJARO #1
[  +0,000001] Hardware name: LENOVO 81BR/LNVNB161216, BIOS 6KCN29WW 01/03/2018
[  +0,000006] RIP: 0010:smp_call_function_many+0x1f4/0x250
[  +0,000001] RSP: 0018:ffff9d564380fd00 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  +0,000003] RAX: 0000000000000005 RBX: ffff8e5fdede2e40 RCX: ffff8e5fded68c60
[  +0,000002] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e5fdede2e48
[  +0,000001] RBP: ffff8e5fdede2e48 R08: 0000000000000003 R09: ffff8e5fdede2e70
[  +0,000001] R10: ffff8e5fdede2e48 R11: 0000000000000005 R12: ffff8e5fdede2e70
[  +0,000001] R13: ffffffff9a06f220 R14: ffff9d564380fd40 R15: 0000000000000140
[  +0,000002] FS:  00007f5814431700(0000) GS:ffff8e5fdedc0000(0000) knlGS:0000000000000000
[  +0,000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0,000001] CR2: 00007f0e000a8000 CR3: 000000020527c000 CR4: 00000000003406e0
[  +0,000002] Call Trace:
[  +0,000008]  flush_tlb_mm_range+0x10d/0x170
[  +0,000005]  zap_page_range+0x126/0x150
[  +0,000003]  ? do_futex+0x2a8/0xa90
[  +0,000003]  ? find_vma+0x60/0x70
[  +0,000003]  __se_sys_madvise+0x635/0x810
[  +0,000005]  ? do_syscall_64+0x5b/0x170
[  +0,000003]  ? __ia32_sys_gettid+0x17/0x20
[  +0,000002]  do_syscall_64+0x5b/0x170
[  +0,000005]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0,000004] RIP: 0033:0x7f584146d477
[  +0,000001] RSP: 002b:00007f5814430c78 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[  +0,000002] RAX: ffffffffffffffda RBX: 00007f5813c31000 RCX: 00007f584146d477
[  +0,000001] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007f5813c31000
[  +0,000001] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  +0,000001] R10: 0000000000000001 R11: 0000000000000206 R12: 00007ffc5a4701ee
[  +0,000001] R13: 00007ffc5a4701ef R14: 00007f5814431700 R15: 0000000000000000
[  +0,000002] Code: 89 c7 e8 d0 aa 62 00 3b 05 8e 87 02 01 0f 83 8f fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 00 97 f0 9a 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c8 48 c7 c2 e0 ba 13 9b 4c 89 e6 89 

Now testing processor.max_cstate=1 again
… with the one and only burn-in test.

OK, pretty much a full day of uptime: 12 hours of video playback (three of those Ryan’s PGP stream), a couple of hours of mprime -t, a couple of hours of unigine-valley, and now idling most of the time.

I wouldn’t call this completely fixed because it still needs the flags to work and it also drains more power than it would without them. But for now I would say this is the solution for anyone on mobile Raven Ridge (RR).

Add "rcu_nocbs=0-7" and "processor.max_cstate=1" to the kernel command line in grub.
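
For anyone unsure what that looks like in practice, here is a minimal sketch. It assumes a distro that reads /etc/default/grub, and the "quiet" entry is only a placeholder for whatever is already on that line on your system. The regenerate command differs per distro, so the common variants are listed as comments.

```shell
# /etc/default/grub (excerpt). "quiet" stands in for whatever is already
# on this line on your system; only the last two parameters are new.
GRUB_CMDLINE_LINUX_DEFAULT="quiet rcu_nocbs=0-7 processor.max_cstate=1"

# After saving the file, regenerate the grub config and reboot:
#   sudo update-grub                            # Ubuntu, Mint, Manjaro
#   sudo grub-mkconfig -o /boot/grub/grub.cfg   # Arch
#
# After the reboot, confirm that the kernel actually received them:
#   cat /proc/cmdline
```

If the two parameters do not show up in /proc/cmdline after the reboot, the grub config was most likely not regenerated.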

You know, I was working on this myself and found that when I suspend and then resume the hung machine, it recovers, which I thought was interesting.

WHAAAAT?

Whaaat

I didn’t even think of that. Mostly because suspend always broke something for me. So the first thing I do is just disable that. But it does make sense of course, the system was “suspended” anyway and kicks back into gear when you resume. … Damn, should have thought of that.

Oh also, on my machine no input was working when the freeze happened.
How did you suspend it? SSH?

The power button is set up to suspend.

Huh, the power button had no effect on my machine; it was set up to power off.
What machine was that? I’m on the Lenovo Ideapad 720s.

You might have to configure it on your distro… this is, umm, raw motherboardish.

Yeah, like I said, that did not work for me. It was configured to power off without any dialog or confirmation and it did nothing. Must be device specific then.

I actually tried that before. Didn’t work with my Envy x360 at least. Power button is indeed set to sleep. If I remove the rcu and cstate parameters it sleeps, wakes up, and even hibernates with no issue, but crashes randomly. If the parameters are in there I lose power management and sleep, but at least the laptop works. So far everything is looking good with rcu_nocbs and max_cstate=1.
Moving from an Acer Aspire S7 (great laptop, but dual-core Haswell and 8 GB RAM) to this has been heavenly. I’m hoping for a fix for this issue soon, but that’s all.

Linux 4.17 is now mainline.
I doubt it’s gonna fix the issue since none of the rc kernels did, but I am compiling as we speak and will test and update everyone.

EDIT: It crashed as expected.
Today’s my birthday so I’m gonna have some fun instead of testing lol.
But when I’m done I wanna see which C-state is causing the issue by going from max_cstate=1 up to 6. 6 will probably fail but I do wanna test it out. Will update as soon as I find anything.

EDIT 2:
The forum is not letting me post because I am a new user :man_facepalming:
I can only edit.

Note that for the rcu_nocbs parameter to work, the matching option (CONFIG_RCU_NOCB_CPU) has to be enabled when the kernel is compiled.
Look here for reference.
https://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Arch Kernels have it enabled by default AFAIK
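
A quick way to check a running kernel without rebuilding anything is sketched below. It assumes the kernel config is available either as /proc/config.gz or as /boot/config-<version>, which depends on the distro, and the function name is made up for this example.

```shell
# Succeeds if the given kernel config file (plain or gzip-compressed)
# has RCU callback offloading built in.
rcu_nocb_enabled() {
    # gzip -cdf decompresses gzip input and passes plain text through unchanged
    gzip -cdf "$1" | grep -q '^CONFIG_RCU_NOCB_CPU=y'
}

# Try the usual locations; skip the ones this system does not have.
for cfg in /proc/config.gz "/boot/config-$(uname -r)"; do
    [ -r "$cfg" ] || continue
    if rcu_nocb_enabled "$cfg"; then
        echo "$cfg: CONFIG_RCU_NOCB_CPU=y, rcu_nocbs= should take effect"
    else
        echo "$cfg: CONFIG_RCU_NOCB_CPU not set, rcu_nocbs= will be ignored"
    fi
done
```

If neither file exists the loop prints nothing, and the config has to be looked up in the distro's kernel package instead.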

EDIT 3:
I tested processor.max_cstate from 5 down to 1. The only stable value is 1; with any other value it crashes.
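
To see which idle states the kernel exposes on a given machine, and so what a max_cstate limit actually cuts off, the cpuidle directory in sysfs can be listed. This is only a rough sketch: the function name is made up, and the state names and count differ per machine and idle driver.

```shell
# Print every idle state found under a cpuidle directory,
# together with its name and exit latency in microseconds.
list_cstates() {
    for s in "$1"/state*; do
        # Also skips the literal, unexpanded glob when no states exist
        [ -r "$s/name" ] || continue
        printf '%s: %s (exit latency %s us)\n' \
            "${s##*/}" "$(cat "$s/name")" "$(cat "$s/latency")"
    done
}

# States of the first CPU; the other CPUs normally list the same ones.
list_cstates /sys/devices/system/cpu/cpu0/cpuidle
```

Running it once without the parameter and once with processor.max_cstate=1 shows which states disappear.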

Add “rcu_nocbs=0-7” and “processor.max_cstate=1” in grub.

For me it doesn’t work =( Before, it used to hang only under high load; now it also hangs randomly while web surfing.

What are your hardware and software specs exactly?

Ryzen 3 2200G on MSI B350 Motherboard
Ubuntu 18.04 (Kernel 4.17.0)

dmesg gives this:
[ 3240.929502] pcieport 0000:00:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000a(Transmitter ID)
[ 3240.929505] pcieport 0000:00:01.2: device [1022:15d3] error status/mask=00001000/00006000
[ 3240.929507] pcieport 0000:00:01.2: [12] Replay Timer Timeout
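
The status/mask words in that output can be decoded by hand. The sketch below assumes the bit layout of the PCIe AER Correctable Error Status register; the function name is made up for this example.

```shell
# Print the name of every bit set in an AER correctable error
# status (or mask) word, e.g. 0x00001000.
decode_aer_correctable() {
    status=$(( $1 ))
    for entry in "0:Receiver Error" "6:Bad TLP" "7:Bad DLLP" \
                 "8:REPLAY_NUM Rollover" "12:Replay Timer Timeout" \
                 "13:Advisory Non-Fatal Error" \
                 "14:Corrected Internal Error" "15:Header Log Overflow"; do
        bit=${entry%%:*}    # part before the colon: bit position
        name=${entry#*:}    # part after the colon: error name
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "[$bit] $name"
        fi
    done
}

decode_aer_correctable 0x00001000   # status word from the log: [12] Replay Timer Timeout
decode_aer_correctable 0x00006000   # mask word from the log
```

So the status word carries only bit 12, matching the "[12] Replay Timer Timeout" line, and the mask word covers bits 13 and 14. These are corrected link-level errors, so on their own they do not prove that they cause the hangs.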

The system only boots successfully about one time in four.
Errors during boot:
[ 0.079553] ACPI BIOS Error (bug): Failure creating [_SB.SMIC], AE_ALREADY_EXISTS (20180313/dswload2-316)
[ 0.079764] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180313/psobject-220)
[ 0.079946] ACPI Error: Method parse/execution failed , AE_ALREADY_EXISTS (20180313/psparse-516)
[ 0.080001] ACPI Error: Invalid zero thread count in method (20180313/dsmethod-760)
[ 0.080174] ACPI: Marking method ___ as Serialized because of AE_ALREADY_EXISTS error
[ 0.080175] ACPI Error: Invalid OwnerId: 0x00 (20180313/utownerid-156)
[ 0.080350] ACPI Error: AE_ALREADY_EXISTS, (SSDT: AMD PT) while loading table (20180313/tbxfload-197)
[ 0.081148] ACPI Error: 1 table load failures, 7 successful (20180313/tbxfload-215)
[ 0.697188] AMD-Vi: Unable to write to IOMMU perf counter.
[ 4.100036] usb 1-8: device not accepting address 5, error -71
[ 4.100139] usb usb1-port8: unable to enumerate USB device
[ 16.801249] kvm: disabled by bios

Thanks for any help

Huh, so this also happens on the desktop APUs? Didn’t realize that.
I am guessing you are on the latest UEFI?

Can’t you just disable C-states in the BIOS?

Oh, and since the 2200G has only 4 threads, it must be rcu_nocbs=0-3.
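
The range simply runs from 0 to the number of hardware threads minus one, so it can be derived instead of guessed. A small sketch (the function name is made up):

```shell
# Build the rcu_nocbs parameter for a CPU with the given thread count.
rcu_nocbs_param() {
    echo "rcu_nocbs=0-$(( $1 - 1 ))"
}

rcu_nocbs_param 4                  # Ryzen 3 2200G: rcu_nocbs=0-3
rcu_nocbs_param 8                  # 8-thread mobile parts: rcu_nocbs=0-7
rcu_nocbs_param "$(nproc --all)"   # whatever machine this runs on
```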

Yes, I have the latest UEFI and I already tried rcu_nocbs=0-3; it stays the same.

I’ll try to disable C-states in the BIOS. Thanks

So I installed Fedora and am testing kernel 4.18rc1.
It seems the issue has not been fixed yet. I will start testing which kernel parameters are needed for this.
In the meantime, 4.17 works perfectly with the two parameters I posted two weeks ago.

Thank you guys for investigating this issue when AMD doesn’t.
I recently bought an Acer Swift 3 with a Ryzen 5 2500U and installed Linux Mint 19 on it.
I’m using the latest kernel (4.17) and Mesa version, but I’m still having the same freeze issues when watching YouTube videos. I will try the suggested workaround and report back.
