Threadripper & PCIe Bus Errors

@wendell Have you tried looking up the speakers at the annual KVM Forum event?
I mean finding the speakers, visiting their sites/blogs, and asking there.
It could be that someone has encountered, or maybe even resolved, the issues at hand, or can at least give more insight.

https://www.youtube.com/channel/UCRCSQmAOh7yzgheq-emy1xA

https://www.linux-kvm.org/page/KVM_Forum

Have there been any fixes for the PCIe bus errors in recent UEFIs (from Gigabyte/Asus)? CC @wendell @ryan @kreestuh

Hi, All,

Just a quick update, I updated to 4.14.0, but the error still exists with ASPM enabled.


Hi, I built a Gentoo system on a 1950X with a Gigabyte Designare EX motherboard, and I get the same error you mentioned.
After booting, the error floods my console. The only option for now is to remove ASPM support from the kernel, but I know that is not the right solution. Please keep us updated on the progress of the fix.

Thanks

Same issue on CentOS 7.4, specs:

  • AMD TR 1950X
  • Asus Zenith Extreme
  • Nvidia GT 710

So is there any news yet?

No, although I haven’t had any side effects from disabling ASPM at boot.

Thanks, adding pcie_aspm=off to grub mitigated the issue for me, no more kernel messages since.

So in theory if I understand it correctly, there is no negative impact to setting this, apart from “higher” power consumption at idle. So not a big deal on a GT 710.
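
For anyone who wants to script the workaround rather than edit the file by hand, here is a minimal Python sketch of appending pcie_aspm=off to a GRUB command-line variable. It only transforms text. The helper name add_kernel_param is made up for illustration, and it assumes the usual /etc/default/grub layout with a double-quoted value (on CentOS the variable is typically GRUB_CMDLINE_LINUX instead). The GRUB config still has to be regenerated afterwards with your distro's tool.

```python
import re


def add_kernel_param(grub_text, param="pcie_aspm=off",
                     key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `param` appended to the quoted value of `key`.

    Hypothetical helper: assumes a line of the form KEY="..." and leaves the
    text unchanged if the key is missing or the parameter is already there.
    """
    pattern = re.compile(
        r'^(' + re.escape(key) + r'=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def _append(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(_append, grub_text)


# Sample file content (an assumption, not taken from any poster's system).
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(sample))
```

Running it twice on the same text does not add the parameter a second time, so it should be safe to re-run.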

Still hoping to see AMD and the Linux guys get together to fix this flaw. Granted, the intersection of people buying Threadripper and people running Linux is quite small, but still, come on, implementing the PCIe spec correctly can’t be that hard, @AMD. smh

It may actually be Nvidia at fault with the PCIe spec. ASPM can be left on with the Polaris and Vega cards, and with PCIe 10 gig cards. Not sure.

Can confirm that it’s not just Nvidia cards that cause the issue. My Magewell Pro Capture AIO also appears to trigger the errors with ASPM.

dmesg

[   19.746245] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   19.746257] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
[   19.746260] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
[   19.746263] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
[   19.746266] pcieport 0000:00:03.1:    [ 6] Bad TLP
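
If the log is flooded, it can help to see which port is actually reporting. Below is a rough Python sketch that tallies "PCIe Bus Error" lines per device, severity and layer. The regular expression is written against the lines quoted above and is an assumption about the format; other kernel versions may print these messages differently. The function name count_bus_errors is made up for illustration.

```python
import re
from collections import Counter

# Written against lines such as:
# pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def count_bus_errors(log_text):
    """Count AER bus-error lines, keyed by (device, severity, layer)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            key = (match.group("dev"), match.group("severity"),
                   match.group("layer"))
            counts[key] += 1
    return counts


# Sample input copied from the dmesg excerpt in this post.
sample = """\
[   19.746257] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
[   19.746260] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
[   19.746266] pcieport 0000:00:03.1:    [ 6] Bad TLP
"""

for key, n in count_bus_errors(sample).items():
    print(key, n)
```

To use it on a live system you could feed it the text of dmesg or a saved journal instead of the sample string.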

lspci -tv

-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.1-[41]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO GL [FirePro W7100]
 |           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.1-[42]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           \-08.1-[43]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
             +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
             +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-01.1-[01-07]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-07]--+-00.0-[03]----00.0  Device 1d6a:d107
             |                               +-04.0-[04]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0  Intel Corporation Device 24fb
             |                               +-06.0-[06]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-01.2-[08]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
             +-01.3-[09-0a]----00.0-[0a]----00.0  Creative Labs CA0108/CA10300 [Sound Blaster Audigy Series]
             +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.1-[0b]----00.0  Nanjing Magewell Electronics Co., Ltd. Device 0002
             +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.1-[0c]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
             +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-08.1-[0d]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Device 1457
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-18.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-18.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-18.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-18.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-18.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-18.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             +-18.7  Advanced Micro Devices, Inc. [AMD] Device 1467
             +-19.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-19.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-19.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-19.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-19.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-19.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-19.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             \-19.7  Advanced Micro Devices, Inc. [AMD] Device 1467

I did add a GeForce 6600 to the system and it didn’t make the problem any worse. The error follows the slot that the capture card is installed in. The card seems to work fine regardless though. Using ASRock X399 Professional Gaming.

Hi, There.

I replaced my Gigabyte Designare EX motherboard with an Asus Zenith Extreme last Friday, and today I restored my Gentoo Linux install on the new motherboard. I also updated to the BIOS that Asus released on Dec 7.

It seems the PCIe bus error is gone now. I’m not sure of the reason; maybe the new BIOS update includes a fix for this.

I may do some tests in the next few days to confirm this, but if someone has some time, they can test it too. The latest BIOS for the Zenith Extreme is 0804.

Make sure ASPM is not disabled on the new board?

How can I check whether the BIOS enables or disables it? I compiled ASPM into the kernel, not as a module. I haven’t seen the annoying messages in dmesg for the past 2 days.
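
One way to look at this from the running system, sketched in Python: the kernel exposes the ASPM policy in /sys/module/pcie_aspm/parameters/policy with the active entry in square brackets, and /proc/cmdline shows whether pcie_aspm=off was passed at boot. This reports the kernel's view rather than the BIOS setting itself; per-device link state can also be read from the LnkCtl line of lspci -vv run as root. The function names here are made up for illustration, and the paths are simply skipped if they do not exist on your kernel.

```python
import re
from pathlib import Path


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry of the policy string, or None."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def aspm_disabled_on_cmdline(cmdline):
    """True if pcie_aspm=off appears as a boot parameter."""
    return "pcie_aspm=off" in cmdline.split()


def read_text_or_none(path):
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    policy = read_text_or_none("/sys/module/pcie_aspm/parameters/policy")
    cmdline = read_text_or_none("/proc/cmdline")
    if policy is not None:
        print("active ASPM policy:", active_aspm_policy(policy))
    if cmdline is not None:
        print("pcie_aspm=off on cmdline:", aspm_disabled_on_cmdline(cmdline))
```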

I can also confirm it’s not just Nvidia cards. I am currently building a Threadripper system with the ASUS ROG Zenith Extreme X399 motherboard. My test bench uses an AMD HD7950 card and a Samsung 960 Pro NVMe, and I am also getting the “Bad DLLP” and “Bad TLP” PCI errors.

As others have said, using pcie_aspm=off does help tremendously. So far I have not seen any errors with pcie_aspm=off, though I’ve only really run Prime95 on this bench so far.

I also have the ASRock Fatal1ty with an Nvidia card. I applied pcie_aspm=off and the PCI errors disappeared, but I get a blanking screen and messages like the following:

Jan 09 21:32:30 ripper kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Jan 09 21:32:46 ripper kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000857d:0:0

I am just trying to run a stable build at this point, not even trying virtualization with GPU passthrough.
Latest BIOS installed, reset to defaults.

Antergos running the latest kernel, 4.14.12-1-ARCH, with Nvidia drivers.

Has anyone come across a fix that removes the need for the flag?

@younkey has your asus board been stable?

Same here: Threadripper on a Gigabyte Aorus with an NVIDIA 1080 Ti. Logs completely flooded with PCIe errors.

Adding pcie_aspm=off to GRUB results in a blank screen (but no more errors).

For now I’m just trying to get vanilla Ubuntu 17.10 working.

Does anyone know how to get around the problem in GRUB?

That’s a problem with Nouveau and Wayland. Switch to X11. Nvidia and Wayland aren’t friends yet.

Thanks mate!

I edited
/etc/gdm3/custom.conf

uncommenting the line
#WaylandEnable=false by removing the # in front
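
For anyone scripting the same change, here is a small Python sketch of that edit. It only rewrites text and assumes the file contains the commented line exactly as #WaylandEnable=false (optionally with spaces around the #). The function name force_xorg and the sample content are made up for illustration.

```python
import re


def force_xorg(custom_conf_text):
    """Uncomment a '#WaylandEnable=false' line; other lines are untouched."""
    return re.sub(
        r"^[ \t]*#[ \t]*(WaylandEnable=false)[ \t]*$",
        r"\1",
        custom_conf_text,
        flags=re.MULTILINE,
    )


# Sample content (an assumption about how the stock file looks).
sample = (
    "[daemon]\n"
    "# Uncomment the line below to force the login screen to use Xorg\n"
    "#WaylandEnable=false\n"
)
print(force_xorg(sample))
```

Writing the result back to /etc/gdm3/custom.conf needs root, and the change takes effect the next time GDM starts.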

Booting fine in X11 and so far my logs stay clean of PCIe errors.

So I guess we have to hope Gigabyte/AMD will do something about ASPM soon (and Nvidia will do something about their drivers for Wayland).

I agree with you that Nvidia should update their Linux drivers, but as everyone knows, unless AMD/ATI becomes a huge threat, Nvidia isn’t going to change.

Just an update to my issue

This problem was happening with my 550 Tis on both the nouveau and nvidia drivers; the system was not stable. When I switched to a 970 or 1080 Ti, my issues went away. The only thing I know of is that the 550s do not have a UEFI BIOS.

Thanks for the help, I still have to test the GPU passthrough, more to come.

Just updating this issue – a lot of this is now largely resolved (or on track to resolution) https://patchwork.kernel.org/patch/10181903/

This immediately resolves PCIe reset passthrough issues (and lets one use non-vega graphics cards for passthrough, for example) but may also resolve some other issues as well.

I am testing now.

Hi @wendell, did the patch fix the PCIe Bus Errors issue for you? I’ve tried the tr.patch over here, but it seems I’m still getting the PCIe Bus Errors when booting.