Threadripper & PCIe Bus Errors

Can confirm that it’s not just Nvidia cards that cause the issue. My Magewell Pro Capture AIO also appears to trigger the errors with ASPM.

dmesg

[   19.746245] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   19.746257] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
[   19.746260] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
[   19.746263] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
[   19.746266] pcieport 0000:00:03.1:    [ 6] Bad TLP

lspci -tv

-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.1-[41]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO GL [FirePro W7100]
 |           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.1-[42]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           \-08.1-[43]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
             +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
             +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-01.1-[01-07]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-07]--+-00.0-[03]----00.0  Device 1d6a:d107
             |                               +-04.0-[04]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0  Intel Corporation Device 24fb
             |                               +-06.0-[06]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-01.2-[08]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
             +-01.3-[09-0a]----00.0-[0a]----00.0  Creative Labs CA0108/CA10300 [Sound Blaster Audigy Series]
             +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.1-[0b]----00.0  Nanjing Magewell Electronics Co., Ltd. Device 0002
             +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.1-[0c]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
             +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-08.1-[0d]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Device 1457
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-18.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-18.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-18.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-18.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-18.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-18.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             +-18.7  Advanced Micro Devices, Inc. [AMD] Device 1467
             +-19.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-19.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-19.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-19.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-19.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-19.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-19.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             \-19.7  Advanced Micro Devices, Inc. [AMD] Device 1467

I did add a GeForce 6600 to the system and it didn’t make the problem any worse. The error follows the slot that the capture card is installed in, though the card seems to work fine regardless. I’m using an ASRock X399 Professional Gaming.
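
One way to check a claim like "the error follows the slot" is to tally the AER summary lines per root port before and after moving the card. This is only a rough sketch: it assumes the `pcieport ...: PCIe Bus Error:` line format seen in the dmesg excerpts in this thread, and `tally_aer_errors` is a made-up helper name, not part of any existing tool.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line printed for each AER event.
# The format is assumed from the dmesg excerpts posted in this thread.
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def tally_aer_errors(dmesg_text):
    """Count AER summary lines per (port, severity, layer)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            key = (match.group("port"), match.group("severity"), match.group("layer"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Two summary lines copied from posts in this thread.
    sample = (
        "[   19.746260] pcieport 0000:00:03.1: PCIe Bus Error: "
        "severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)\n"
        "[ 5572.240540] pcieport 0000:00:01.1: PCIe Bus Error: "
        "severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)\n"
    )
    for (port, severity, layer), count in sorted(tally_aer_errors(sample).items()):
        print(f"{port}  {severity}  {layer}  x{count}")
```

Fed a saved copy of `dmesg`, this gives one count per port, which can then be matched against the `lspci -tv` tree (in the tree above, `0000:00:03.1` is the bridge in front of the Magewell card).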

Hi there.

I replaced my Gigabyte Designare EX motherboard with an Asus Zenith Extreme last Friday, and today I restored my Gentoo Linux install on the new motherboard. I also updated to the BIOS that Asus released on Dec 7.

It seems the PCIe bus error is gone now. I’m not sure why; maybe the new BIOS update includes a fix for it.

I may do some testing in the next few days to confirm this, but if someone has time, please test it too. The latest BIOS for the Zenith Extreme is 0804.

Can you make sure ASPM is not disabled on the new board?

How can I check whether the BIOS enables or disables it? I compiled ASPM support into the kernel, not as a module. I haven’t seen the annoying message in dmesg for the past 2 days.
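
For checking this from a running system, here is a hedged sketch. Linux exposes the active ASPM policy in `/sys/module/pcie_aspm/parameters/policy` (the active entry is the one in square brackets), and `/proc/cmdline` shows whether `pcie_aspm=off` was passed at boot. The helper names below are made up for illustration, and this only reports the kernel's view; it does not read the BIOS setting itself.

```python
from pathlib import Path

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")
CMDLINE_PATH = Path("/proc/cmdline")


def active_aspm_policy(policy_text):
    """Return the bracketed entry, e.g. 'default' from '[default] performance powersave'."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def aspm_disabled_on_cmdline(cmdline_text):
    """True if pcie_aspm=off appears as its own kernel parameter."""
    return "pcie_aspm=off" in cmdline_text.split()


def read_text_or_none(path):
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    policy = read_text_or_none(POLICY_PATH)
    cmdline = read_text_or_none(CMDLINE_PATH)
    if policy is None:
        print("ASPM policy file not readable on this system")
    else:
        print("Active ASPM policy:", active_aspm_policy(policy))
    if cmdline is not None:
        print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))
```

For the per-link state, `lspci -vv` run as root prints an `ASPM ...` entry on each device's `LnkCtl:` line, which is probably the quickest way to see whether ASPM is actually enabled on the slot in question.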

I can also confirm it’s not just Nvidia cards. I am currently building a Threadripper system with the ASUS ROG Zenith Extreme X399 motherboard. My test bench uses an AMD HD7950 card and a Samsung 960 Pro NVMe, and I am also getting the “Bad DLLP” and “Bad TLP” PCIe errors.

As others have said, using pcie_aspm=off does help tremendously. So far I have not seen any errors with pcie_aspm=off, though I’ve only really run Prime95 on this bench so far.

I also have the ASRock Fatal1ty with an Nvidia card. I applied pcie_aspm=off and the PCIe errors disappeared, but now I get a blanking screen and messages like the following:

Jan 09 21:32:30 ripper kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Jan 09 21:32:46 ripper kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000857d:0:0

I am just trying to run a stable build at this point, not even trying virtualization with GPU passthrough.
The latest BIOS is installed and reset to defaults.

Antergos running the latest kernel, 4.14.12-1-ARCH, with the nvidia drivers.

Has anyone come across a fix that doesn’t require the flag?

@younkey, has your Asus board been stable?

Same here: Threadripper on a Gigabyte Aorus, with an NVIDIA 1080 Ti. Logs completely flooded with PCIe errors.

Adding pcie_aspm=off to GRUB results in a blank screen (but no more errors).

For now I’m just trying to get vanilla Ubuntu 17.10 working.

Does anyone know how to get around the problem in GRUB?

That’s a problem with Nouveau and Wayland. Switch to X11. Nvidia and Wayland aren’t friends yet.

Thanks mate!

I edited
/etc/gdm3/custom.conf

uncommenting the line
#WaylandEnable=false by removing the # in front
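
For anyone scripting it, the same edit can be sketched as a small text transformation. This assumes the stock file contains the commented line as `#WaylandEnable=false`; `force_xorg_login` is a made-up name, and the result would still need to be written back to `/etc/gdm3/custom.conf` as root, followed by a GDM restart or reboot.

```python
import re

# Matches the commented-out setting on its own line, allowing stray whitespace.
COMMENTED_SETTING = re.compile(
    r"^[ \t]*#[ \t]*(WaylandEnable=false)[ \t]*$", re.MULTILINE
)


def force_xorg_login(conf_text):
    """Uncomment 'WaylandEnable=false'; all other lines are left untouched."""
    return COMMENTED_SETTING.sub(r"\1", conf_text)


if __name__ == "__main__":
    # Illustrative file contents, not a verbatim copy of any distro's default.
    sample = "[daemon]\n# some other comment\n#WaylandEnable=false\n\n[security]\n"
    print(force_xorg_login(sample))
```

Running it a second time on its own output changes nothing, so it is safe to apply repeatedly.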

Booting fine in X11 and so far my logs stay clean of PCIE errors.

So I guess we have to hope Gigabyte/AMD will do something about ASPM soon (and Nvidia will do something about their drivers for Wayland).

I agree with you that Nvidia should update their Linux drivers, but as everyone knows, unless AMD/ATI becomes a huge threat, Nvidia isn’t going to change.

Just an update on my issue.

This problem was happening with my 550 Ti cards on both the nouveau and nvidia drivers; the system was not stable. When I switched to a 970 or a 1080 Ti, my issues went away. The only difference I know of is that the 550s do not have a UEFI BIOS.

Thanks for the help, I still have to test the GPU passthrough, more to come.

Just updating this issue: much of this is now resolved (or on track to resolution): https://patchwork.kernel.org/patch/10181903/

This immediately resolves PCIe reset passthrough issues (and lets one use non-vega graphics cards for passthrough, for example) but may also resolve some other issues as well.

I am testing now.

Hi @wendell, did the patch fix the PCIe Bus Errors issue for you? I’ve tried the tr.patch over here, but it seems I’m still getting the PCIe Bus Errors when booting.

Which errors? The ASPM errors from GeForce cards still occur but can be suppressed by turning ASPM off, with no ill effects. Other errors haven’t popped up yet.

Errors like

[ 5572.240540] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[ 5572.240543] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
[ 5572.240545] pcieport 0000:00:01.1:    [12] Replay Timer Timeout
[ 5573.827516] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 5573.827533] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 5573.827537] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[ 5573.827540] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
[ 5573.827542] pcieport 0000:00:01.1:    [12] Replay Timer Timeout
[ 5574.224257] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 5574.224271] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 5574.224275] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 5574.224278] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[ 5574.224280] pcieport 0000:00:01.1:    [ 7] Bad DLLP

I guess I’ll try adding pcie_aspm=off to grub for now.
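
For anyone unsure how to add the flag: on distributions that use `/etc/default/grub`, the parameter goes into `GRUB_CMDLINE_LINUX_DEFAULT`, and the config is then regenerated (`update-grub` on Ubuntu/Debian, `grub-mkconfig -o /boot/grub/grub.cfg` elsewhere). Below is a minimal sketch of that edit as a text transformation. `add_kernel_param` is a hypothetical helper, and it assumes the value is double-quoted on a single line.

```python
import re

# Captures the opening of the assignment, the current value, and the closing quote.
CMDLINE_DEFAULT = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
)


def add_kernel_param(grub_text, param="pcie_aspm=off"):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there."""

    def rewrite(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return CMDLINE_DEFAULT.sub(rewrite, grub_text)


if __name__ == "__main__":
    # Illustrative contents of /etc/default/grub; real files vary by distro.
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample))
```

The flag only takes effect after the config is regenerated and the machine is rebooted; `cat /proc/cmdline` afterwards shows whether it was picked up.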

This patch doesn’t try to address the ASPM issue; that is an entirely different problem.

Well, not so fast… yes, this is true, but for some types of PCIe errors the kernel will automatically do a bus reset, it seems, and this patch actually does fix that.

Any news on whether this patch is going to be in 4.16? :)

Hi there, I’m grateful for this thread. Thanks gnif and Wendell!
I’ve been pulling my hair out with a x1950 and a 1080 Ti trying to get KVM to pass it through.

Apparently, according to some posts on Reddit, they are still working on it.

I’m close to throwing in the towel, so fingers crossed it is in 4.16. As it stands, I have a stupidly expensive target to replicate my ZFS to from my existing Xeon.

This error has started appearing over and over for me right after I updated my BIOS. X370 Taichi and Ryzen 1700.

[87500.309822] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
[87500.309828] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001140/00006000
[87500.309832] pcieport 0000:00:03.1:    [ 6] Bad TLP
[87500.309835] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
[87500.309838] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
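
For reading the `error status/mask` field in logs like this one, here is a small sketch that decodes the bits. The bit names follow the AER Correctable Error Status register layout in the PCIe specification as I understand it; the function names are made up for illustration. The log's "RELAY_NUM" spelling appears to come from the kernel's own message string, and the spec calls that bit REPLAY_NUM Rollover.

```python
# Bit positions in the AER Correctable Error Status / Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def parse_status_mask(field):
    """Split a 'status/mask' hex pair such as '00001140/00006000' into two ints."""
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


def decode_correctable(value):
    """List the names of all bits set in a 32-bit register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"bit {bit} (not named here)"))
    return names


if __name__ == "__main__":
    status, mask = parse_status_mask("00001140/00006000")
    print("reported:", decode_correctable(status))
    print("masked:  ", decode_correctable(mask))
```

The three names decoded from `00001140` line up with the `[ 6]`, `[ 8]` and `[12]` lines in the log above. The `00006000` mask covers bits 13 and 14, so none of the reported errors are masked.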

lspci

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-01.3-[03-2f]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43b9
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b5
           |               \-00.2-[1d-2f]--+-00.0-[1e]--
           |                               +-02.0-[20]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               +-03.0-[21-2b]----00.0-[26-2b]--+-01.0-[27]----00.0  Intel Corporation Device 24fb
           |                               |                               +-03.0-[28]--
           |                               |                               +-05.0-[2a]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               |                               \-07.0-[2b]--
           |                               \-04.0-[2c-2f]----00.0-[2d-2f]--+-02.0-[2e]--+-00.0  Intel Corporation 82575GB Gigabit Network Connection
           |                                                               |            \-00.1  Intel Corporation 82575GB Gigabit Network Connection
           |                                                               \-04.0-[2f]--+-00.0  Intel Corporation 82575GB Gigabit Network Connection
           |                                                                            \-00.1  Intel Corporation 82575GB Gigabit Network Connection
           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-03.1-[30]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Polaris11]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
           +-03.2-[31]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-07.1-[32]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-08.1-[33]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Device 1457
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Device 1460
           +-18.1  Advanced Micro Devices, Inc. [AMD] Device 1461
           +-18.2  Advanced Micro Devices, Inc. [AMD] Device 1462
           +-18.3  Advanced Micro Devices, Inc. [AMD] Device 1463
           +-18.4  Advanced Micro Devices, Inc. [AMD] Device 1464
           +-18.5  Advanced Micro Devices, Inc. [AMD] Device 1465
           +-18.6  Advanced Micro Devices, Inc. [AMD] Device 1466
           \-18.7  Advanced Micro Devices, Inc. [AMD] Device 1467