Threadripper & PCIe Bus Errors

I agree with you: Nvidia should update their Linux drivers, but as everyone knows, unless AMD/ATI becomes a huge threat, Nvidia isn’t going to change.

Just an update to my issue

This problem was happening with my 550 Tis on both the nouveau and nvidia drivers; the system was not stable. When I switched to a 970 or 1080 Ti, my issues went away. The only difference I know of is that the 550s do not have a UEFI BIOS.

Thanks for the help, I still have to test the GPU passthrough, more to come.

Just updating this issue – a lot of this is now largely resolved (or on track to resolution) https://patchwork.kernel.org/patch/10181903/

This immediately resolves PCIe reset passthrough issues (and lets one use non-vega graphics cards for passthrough, for example) but may also resolve some other issues as well.

I am testing now.

Hi @wendell, did the patch fix the PCIe Bus Errors issue for you? I’ve tried the tr.patch over here, but it seems I’m still getting the PCIe Bus Errors when booting.

Which errors? The ASPM errors from GeForce cards still occur but can be disabled by turning ASPM off. No ill effects. Other errors haven’t popped up yet.

Errors like

[ 5572.240540] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[ 5572.240543] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
[ 5572.240545] pcieport 0000:00:01.1:    [12] Replay Timer Timeout
[ 5573.827516] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 5573.827533] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 5573.827537] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[ 5573.827540] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
[ 5573.827542] pcieport 0000:00:01.1:    [12] Replay Timer Timeout
[ 5574.224257] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 5574.224271] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 5574.224275] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 5574.224278] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[ 5574.224280] pcieport 0000:00:01.1:    [ 7] Bad DLLP

I guess I’ll try adding pcie_aspm=off to grub for now.
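For anyone trying to read these lines: the "error status/mask" pair is two hex bitfields, and the bracketed numbers the kernel prints ([12], [ 7]) are the bit positions that are set in the status. Below is a minimal Python sketch that decodes such a pair using the standard bit layout of the AER Corrected Error Status register. The function name decode_corrected is made up for illustration and is not part of any tool.

```python
# Bit positions in the PCIe AER Corrected Error Status / Mask registers.
CORRECTED_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def _bit_names(value: int) -> list:
    """Return the names of all set bits, lowest bit first."""
    names = []
    for bit in range(32):
        if (value >> bit) & 1:
            names.append(CORRECTED_ERROR_BITS.get(bit, f"bit {bit} (unknown)"))
    return names


def decode_corrected(field: str) -> dict:
    """Decode a 'status/mask' hex pair such as '00001000/00006000'.

    decode_corrected is a hypothetical helper name used only in this sketch.
    """
    status_hex, mask_hex = field.split("/")
    return {
        "reported": _bit_names(int(status_hex, 16)),
        "masked": _bit_names(int(mask_hex, 16)),
    }


if __name__ == "__main__":
    # Values copied from the dmesg output quoted above.
    for field in ("00001000/00006000", "00000080/00006000"):
        print(field, decode_corrected(field))
```

So 00001000 is just bit 12 (Replay Timer Timeout) and 00000080 is bit 7 (Bad DLLP), which matches what the kernel prints on the following line. The mask 00006000 covers bits 13 and 14, meaning those two error types are masked on this port.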

This patch doesn’t try to address the ASPM issue; that is an entirely different problem.

Well, not so fast… yes, this is true, but for some types of PCIe errors the kernel will automatically do a bus reset, it seems, and this patch actually does fix that.

Any news on whether this patch is going to be in 4.16? :slight_smile:

Hi there, I’m grateful for this thread. Thanks gnif and Wendell!
Been pulling my hair out with a x1950 and a 1080ti trying to get KVM to pass it through.

Apparently, according to some posts on Reddit from


namely

and

They are still working on it.

I’m close to throwing in the towel, so fingers crossed it is in 4.16. As it stands, I have a stupidly expensive target to replicate my ZFS to from my existing Xeon.

This error has started appearing over and over and over for me right after I updated my BIOS. The system is an X370 Taichi with a Ryzen 1700.

[87500.309822] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
[87500.309828] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001140/00006000
[87500.309832] pcieport 0000:00:03.1:    [ 6] Bad TLP               
[87500.309835] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover    
[87500.309838] pcieport 0000:00:03.1:    [12] Replay Timer Timeout

lspci

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-01.3-[03-2f]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43b9
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b5
           |               \-00.2-[1d-2f]--+-00.0-[1e]--
           |                               +-02.0-[20]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               +-03.0-[21-2b]----00.0-[26-2b]--+-01.0-[27]----00.0  Intel Corporation Device 24fb
           |                               |                               +-03.0-[28]--
           |                               |                               +-05.0-[2a]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               |                               \-07.0-[2b]--
           |                               \-04.0-[2c-2f]----00.0-[2d-2f]--+-02.0-[2e]--+-00.0  Intel Corporation 82575GB Gigabit Network Connection
           |                                                               |            \-00.1  Intel Corporation 82575GB Gigabit Network Connection
           |                                                               \-04.0-[2f]--+-00.0  Intel Corporation 82575GB Gigabit Network Connection
           |                                                                            \-00.1  Intel Corporation 82575GB Gigabit Network Connection
           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-03.1-[30]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Polaris11]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
           +-03.2-[31]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-07.1-[32]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
           +-08.1-[33]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Device 1457
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Device 1460
           +-18.1  Advanced Micro Devices, Inc. [AMD] Device 1461
           +-18.2  Advanced Micro Devices, Inc. [AMD] Device 1462
           +-18.3  Advanced Micro Devices, Inc. [AMD] Device 1463
           +-18.4  Advanced Micro Devices, Inc. [AMD] Device 1464
           +-18.5  Advanced Micro Devices, Inc. [AMD] Device 1465
           +-18.6  Advanced Micro Devices, Inc. [AMD] Device 1466
           \-18.7  Advanced Micro Devices, Inc. [AMD] Device 1467

Hi, new to the forum. I can confirm that passing pcie_aspm=off via grub has eliminated the following errors on my system:

Mar  1 19:52:25 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
Mar  1 19:52:25 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
Mar  1 19:52:25 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
Mar  1 19:52:25 threadripper kernel: pcieport 0000:00:01.1:    [ 7] Bad DLLP

I was getting this error sporadically, sometimes as often as every 30 seconds and sometimes only every 5 minutes. Prior to trying this, I also tried building the kernel with pcie_aspm set to performance (as opposed to the BIOS default), but still had the problem.

My system is an overclocked 1950X on a Gigabyte X399 Designare with two AMD Radeon Pro WX 3100 cards and a Samsung NVMe drive.

Appreciate the hell out of this thread, so much so that I made an account to share some data:

Fresh out of the box, copypasta from the sales receipt:

AMD Threadripper Configurator (NO MONITOR)

*BASE_PRICE: [+1355]
BLKFRISALE1: ASUS ROG Strix Evolve Aura RGB 7200 DPI USB Wired Optical Ergonomic Ambidextrous Gaming Mouse [+0]
BLKFRISALE2: CYBERPOWERPC Skorpion K2 RGB Mechanical Gaming Keyboard [+5]
BLKFRISALE3: CyberPowerPC AULA Explosive 50mm Drive Analog Gaming Headset [White & Orange] [+5]
BLUETOOTH: None
CABLE: None
CAS: IN WIN 101 Mid Tower High Air Flow Gaming Case w/ Tempered Glass Full Size Window (Black)
CASUPGRADE: None
CC: None
CD: None
CD2: None
CPU: AMD Ryzen Threadripper 1950X 3.4GHz [4.0GHz Turbo] Sixteen-Core 32MB L3 Cache 180W Processor [+480]
CS_FAN: Default case fans
ENGRAVING: None
EVGA_POWER: None
FA_HDD: None
FAN: Asetek 570LC 120mm (Fatboy) Liquid CPU Cooling System w/ Copper Cold Plate (Single Standard 120MM Fan)
FLASHMEDIA: None
HD_M2SSD: None
HD_PCIE1X_SSD: None
HDD: 1TB SATA-III 6.0Gb/s 32MB Cache 7200RPM HDD (Single Drive)
HDD2: None
HEADSET: None
IUSB: Built-in USB 2.0 Ports
KEYBOARD: CyberpowerPC Multimedia USB Gaming Keyboard
MEMORY: 16GB (4GBx4) DDR4/3000MHz Quad Channel Memory (ADATA XPG Z1)
MONITOR: None
MOPAD: None
MOTHERBOARD: GIGABYTE AORUS X399 Gaming 7 ATX w/ RGB, Digital LED Support, 802.11ac, USB 3.1, 5 PCIe x16, 8 SATA3, 3 M.2 SATA/PCIe
MOUSE: CyberpowerPC Standard 4000 DPI with Weight System Optical Gaming Mouse
NETWORK: Onboard Gigabit LAN Network
OS: None - FORMAT HARD DRIVE ONLY [-60]
OVERCLOCK: No Overclocking
POWERSUPPLY: 600 Watts – Enermax Revo DUO series 600Watts 80 Plus Gold high-efficient airflow w/ Dual Fans Power Supply [+34]
PRO_WIRING: None
RUSH: Standard processing time: ship within 12 to 15 Business Days
SERVICE: 3 Years FREE Service Plan (INCLUDES LABOR AND LIFETIME TECHNICAL SUPPORT)
SLI_BRIDGE: None
SOUND: HIGH DEFINITION ON-BOARD 7.1 AUDIO
SPEAKERS: None
USBHD: None
USBX: None
VIDEO: AMD Radeon RX 580 4GB GDDR5 Video Card [VR Ready] (Single Card)
WARRANTY: STANDARD WARRANTY: 1 Year Parts WARRANTY

Constant corrected AER errors were dumped to dmesg during the CentOS install (so kernel 3.10): mostly Bad TLP, some Bad DLLP, and eventually Replay Timer Timeout for [1022:1453], plus an occasional Replay Timer Timeout for [1022:43b1] (also an AMD PCI bridge).

BIOS > PCIe Slot Configuration > changed from “auto” to “Gen2” and no more issues.

Not ideal, but w/e… hope this information is useful.
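If you want to confirm from Linux that the Gen2 setting actually took effect, recent kernels expose the negotiated and maximum link speed per device in sysfs (the current_link_speed and max_link_speed files). These attributes may be missing on old kernels such as 3.10, so treat the following Python as a rough sketch rather than a guaranteed recipe. The helper names parse_speed and link_report are invented for this example, and the port address in the demo is simply the one from the logs earlier in the thread.

```python
import re
from pathlib import Path
from typing import Optional


def parse_speed(text: str) -> Optional[float]:
    """Extract the GT/s figure from strings like '8 GT/s' or '5.0 GT/s PCIe'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", text)
    return float(match.group(1)) if match else None


def link_report(bdf: str, root: Path = Path("/sys/bus/pci/devices")) -> dict:
    """Read link speed attributes for one device; None where unavailable.

    'bdf' is a PCI address such as '0000:00:01.1'.
    """
    report = {}
    for attr in ("current_link_speed", "max_link_speed"):
        path = root / bdf / attr
        try:
            report[attr] = parse_speed(path.read_text())
        except OSError:
            # Attribute missing (older kernel) or device not present.
            report[attr] = None
    return report


if __name__ == "__main__":
    # Port address taken from the logs in this thread; adjust for your system.
    print(link_report("0000:00:01.1"))
```

A reading of 5.0 corresponds to Gen2 and 8.0 to Gen3. If the files are absent, lspci -vv run as root shows the same information in its LnkCap and LnkSta lines.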

I’ve been experiencing many of the errors listed in this thread while working on my new homelab build. I’ve got a 1950X on an X399 Taichi with 2x LSI 9211-8i cards flashed to P20 IT mode (tried P19 as well) on Debian. I’ve flashed the BIOS on the Taichi to versions from before and after the RAID update. I’ve rebuilt the kernel at 4.16.1 with the patches listed previously in the reddit threads, and tried the various kernel options mentioned in here: nommconf, aspm, etc. I’ve flipped through what I hope is every combination of BIOS settings relating to SATA (I did this after I gave up on using logic to pick). I have used different PCIe slots, I have used a different GPU for display, I have removed the cards to get the system to install and added them back later, I have tried different flavors, I have swapped my data cables, I have reseated my CPU, and I have removed the BIOS from the RAID cards.

Is there anything much to do other than wait for AMD-Linux here? Get a newer HBA? Trade for an i9-7900X? Stop trying to hack together consumer hardware in my basement? Take ownership of my own issues as a developer and learn the relevant pieces of the kernel to fix it myself?

Thanks

Edit: The best results I’ve achieved are with all patches and kernel options applied, plus all of the BIOS options described by an AMD rep in one of the reddit threads. At that point I can boot the system with 14/14 drives detected, sometimes for about 3 minutes, before it crashes. The crash doesn’t seem to be related to anything other than this issue: the cards show lots of drive activity until the LEDs instantly go solid and the system shows no sign of functionality.

Well this is just great… while my original bus errors were solved (in Ubuntu 17.10) by adding “pcie_aspm=off” to grub, I seem to be back where I started with the Threadripper after upgrading to 18.04 today!

  • The issue now occurs in both the old 4.13 kernel and the new 4.15 kernel
  • “pcie_aspm=off” or “pcie_aspm=force” in grub does not have an effect
  • Booting normally takes me to a blank screen after login (but that’s probably a driver issue with the 1080Ti)

Does anyone have any idea what to try next?

Additional info: going back to PCIe 2.0 in BIOS seems to solve the issue (but of course that comes at the cost of a performance loss that leaves me with an under-used 1080Ti :frowning: )
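To put a rough number on the Gen2 vs Gen3 difference: PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, while PCIe 3.0 signals at 8 GT/s with 128b/130b encoding. The short Python calculation below works out the theoretical per-direction link bandwidth from those figures. It covers only the link itself and ignores protocol overhead above the line encoding, so it is not a measurement of how much a given GPU workload actually slows down. The function name lane_bandwidth_mb_s is just an illustrative choice.

```python
def lane_bandwidth_mb_s(gt_per_s: float, payload_bits: int, total_bits: int) -> float:
    """Theoretical one-direction bandwidth of a single lane in MB/s.

    gt_per_s     : signalling rate in gigatransfers per second (1 bit each)
    payload_bits : data bits per encoded symbol group (8 for 8b/10b)
    total_bits   : bits on the wire per group (10 for 8b/10b)
    """
    bits_per_second = gt_per_s * 1e9 * payload_bits / total_bits
    return bits_per_second / 8 / 1e6


GEN2_LANE = lane_bandwidth_mb_s(5.0, 8, 10)      # 8b/10b encoding
GEN3_LANE = lane_bandwidth_mb_s(8.0, 128, 130)   # 128b/130b encoding

if __name__ == "__main__":
    for name, lane in (("Gen2", GEN2_LANE), ("Gen3", GEN3_LANE)):
        print(f"{name}: {lane:.1f} MB/s per lane, {16 * lane / 1000:.2f} GB/s at x16")
```

That comes out to 500 MB/s per lane for Gen2 and roughly 985 MB/s per lane for Gen3, so an x16 slot drops from about 15.75 GB/s to 8 GB/s of theoretical bandwidth when forced to Gen2.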

Have you also tried to limit PCIe to 2.0 instead of 3.0 in BIOS?

The only reason this works is because PCIe 2.0 doesn’t support ASPM; the better option is to provide the pcie_aspm=off flag. I am not sure why it’s not working for you with Ubuntu though. I am running a custom 4.15 kernel in Debian and this is working fine for me.

Well, I suspected something like that (that it works only because ASPM is not an issue in PCIe 2.0). However, that doesn’t explain why I’m still having the problem with PCIe 3.0 and “pcie_aspm=off” in grub. Unless there is some difference in the way grub is parsed between Ubuntu 17.10 and 18.04… or I’m missing something else…

“pcie_aspm=off” works fine for me in Ubuntu 18.04. Running a 1950X and Zenith Extreme. Maybe a dumb question, but you did rebuild your grub config after making the changes, right? (update-grub).
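A quick way to rule out the grub side is to check what the running kernel was actually booted with, since /proc/cmdline shows the real command line regardless of what is in the grub config files. Here is a small Python sketch along those lines. The helper names cmdline_param and read_text_or_none are made up for this example, and the /sys/module/pcie_aspm/parameters/policy file is only present when the kernel was built with ASPM support, so a missing file is handled quietly.

```python
from pathlib import Path
from typing import Optional


def cmdline_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of a 'name=value' kernel parameter, or None if absent.

    A bare flag with no '=' yields an empty string. If the parameter appears
    more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_text_or_none(path: str) -> Optional[str]:
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("pcie_aspm on kernel command line:", cmdline_param(cmdline, "pcie_aspm"))
    # The active policy is shown in [brackets] when this file exists.
    print("ASPM policy:", read_text_or_none("/sys/module/pcie_aspm/parameters/policy"))
```

If the first line prints None, the option never reached the kernel. On Ubuntu that usually means it was not added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, or update-grub was not run afterwards.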