Threadripper & PCIe Bus Errors

Having lots of fun with Threadripper the last few weeks. I haven’t seen much traffic on this, so I’m making a post here in hopes that people searching for it will find this thread on the Level1 forum. I intend to link to other places on the internet where people might be discussing and/or resolving this issue.

This seems really similar to a problem that Intel’s X99 had on launch that was fixed with a later software update.

PCIe Bus errors occur in certain circumstances (“by default”) on popular distros such as Fedora and Ubuntu.

I have tested as far as 4.13-git and the errors still occur. As I mentioned on my VFIO live stream the other day, you can mitigate the issue by disabling message signaled interrupts and/or memory mapped IO. I’ve had some conflicting reports that disabling mmio also disables msi, but I’m not too sure about that because of what I’m seeing in load testing.

Here are the errors Threadripper users are seeing:

[  641.339624] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  641.339641] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  641.339646] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  641.339649] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[  641.339652] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  643.385764] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  643.385783] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  643.385787] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  643.385791] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[  643.385794] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  648.193351] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  648.193372] pcieport 0000:00:01.1: AER: Multiple Corrected error received: id=0000
[  648.193408] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[  648.193412] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001080/00006000
[  648.193414] pcieport 0000:00:01.1:    [ 7] Bad DLLP              
[  648.193416] pcieport 0000:00:01.1:    [12] Replay Timer Timeout  
[  648.193421] pcieport 0000:01:00.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0102(Transmitter ID)
[  648.193424] pcieport 0000:01:00.2:   device [1022:43b1] error status/mask=00003000/00002000
[  648.193426] pcieport 0000:01:00.2:    [12] Replay Timer Timeout  
[  649.183105] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  649.183128] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  649.183133] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  649.183137] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[  649.183140] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  649.381124] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  649.381142] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  649.381147] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  649.381151] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[  649.381153] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  649.678112] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
...snip...
[45827.569777] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[45827.569792] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[45827.569795] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[45827.569798] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[45827.569799] pcieport 0000:00:01.1:    [ 7] Bad DLLP              
[46245.264835] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[46245.264848] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[46245.264851] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[46245.264854] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[46245.264855] pcieport 0000:00:01.1:    [ 7] Bad DLLP              
[46657.063543] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[46657.063556] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[46657.063559] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[46657.063562] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[46657.063563] pcieport 0000:00:01.1:    [ 7] Bad DLLP              
[47075.088619] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[47075.088630] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[47075.088633] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[47075.088636] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[47075.088637] pcieport 0000:00:01.1:    [ 7] Bad DLLP              
[47529.569908] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[47529.569925] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[47529.569930] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[47529.569934] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[47529.569936] pcieport 0000:00:01.1:    [ 7] Bad DLLP 

The main errors seem to be Bad DLLP and Bad TLP – in all cases the errors are reported as AER: Corrected error received, so at least it seems benign.
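
For anyone who wants to read the `error status/mask` fields directly, here is a minimal sketch that decodes them. The bit positions follow the AER Corrected Error Status register layout in the PCIe specification; the function name and the returned dictionary shape are just illustrative choices.

```python
# Bit positions in the PCIe AER Corrected Error Status / Mask registers.
CORRECTED_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_corrected_status(status_mask: str) -> dict:
    """Decode a 'status/mask' pair such as '00001080/00006000'."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value: int) -> list:
        # Collect the name of every known bit that is set in value.
        return [name for bit, name in sorted(CORRECTED_ERROR_BITS.items())
                if value & (1 << bit)]

    return {"reported": names(status), "masked": names(mask)}


if __name__ == "__main__":
    # Values copied from the log lines above.
    for field in ("00000040/00006000", "00001080/00006000", "00003000/00002000"):
        print(field, decode_corrected_status(field))
```

Note that the `00003000/00002000` entry has the Advisory Non-Fatal bit both set and masked, which would explain why the kernel only prints Replay Timer Timeout for that device.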

It is worth noting it seems to happen whether I’m using Nvidia (Asus Strix 1080) or AMD (Vega 64) Graphics cards.

I haven’t had any word out of AMD yet; not sure if anyone else has seen anything. AMD has added a ton of code to the -git version of the kernel, things are looking great, but still getting the PCIe bus errors.

Watch this space for future updates. If anyone finds discussion threads about Threadripper elsewhere on the internet, please post them. (We’ll skip linking to the old X99 threads; they’re nigh useless.)

It’s also been reported that, if your motherboard supports it, setting the PCIe link speed to the Promontory chipset from Gen3 to Gen2 will fix the issue. That may be a better fix than pci=nommconf, but anything running through the chipset will be limited to at most 2 gigabytes/sec.
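
The 2 gigabytes/sec figure can be checked with a bit of arithmetic. This is a rough sketch that assumes the chipset uplink is four lanes wide (an assumption, though it is the width that matches the 2 gigabytes/sec number) and that ignores packet and protocol overhead, so real throughput will be lower.

```python
# Per-generation signalling: (transfer rate in GT/s, payload bits, total bits)
PCIE_GENERATIONS = {
    2: (5.0, 8, 10),     # Gen2: 5 GT/s with 8b/10b encoding
    3: (8.0, 128, 130),  # Gen3: 8 GT/s with 128b/130b encoding
}


def link_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Raw one-direction link bandwidth in gigabytes/sec, before protocol overhead."""
    rate_gt, payload_bits, total_bits = PCIE_GENERATIONS[gen]
    per_lane_gbits = rate_gt * payload_bits / total_bits
    return per_lane_gbits * lanes / 8


if __name__ == "__main__":
    LANES = 4  # assumed width of the chipset uplink
    for gen in (2, 3):
        print(f"Gen{gen} x{LANES}: {link_bandwidth_gbytes(gen, LANES):.2f} GB/s")
```

So dropping the link to Gen2 roughly halves the raw bandwidth available to everything hanging off the chipset.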

I have also been experimenting with a very slight FSB underclock (98 MHz) on TR boards that support it. That may also “correct” the issue, but I’m not really sure yet.

Hi.

Following this with interest from home as I wait for my TR4 system to get built and arrive…

From reading about similar issues e.g. here and here I’m curious if you tried the “pcie_aspm=off” kernel setting mentioned at those links, and if so what the result was?

This stackexchange answer tries to ELI5 what pci=nommconf does - “disables Memory Mapped PCI Configuration Space” - which sounds a bit less performance-trashing than disabling memory mapped IO - is that correct, do you think?
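
Since several different boot options are being thrown around in this thread, here is a small sketch for checking which of them a machine actually booted with. It looks for `pci=nommconf`, `pci=nomsi`, `pci=noaer` and `pcie_aspm=off`; the descriptions are short paraphrases, and `pci=noaer` is included on the assumption that it is the option people mean by "ignoring" AER errors.

```python
from pathlib import Path

# (parameter, option) -> short description of what the option does
WORKAROUNDS = {
    ("pci", "nommconf"): "memory-mapped PCI configuration space disabled",
    ("pci", "nomsi"): "message signaled interrupts disabled",
    ("pci", "noaer"): "AER reporting disabled (errors hidden, not fixed)",
    ("pcie_aspm", "off"): "PCIe Active State Power Management disabled",
}


def active_workarounds(cmdline: str) -> list:
    """Return a description for each known workaround found on a kernel command line."""
    found = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue
        # pci= accepts several comma-separated options, e.g. pci=nommconf,noaer
        for option in value.split(","):
            note = WORKAROUNDS.get((key, option))
            if note:
                found.append(f"{key}={option}: {note}")
    return found


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    # Fall back to a made-up sample line when not running on Linux.
    cmdline = proc.read_text() if proc.exists() else "quiet pci=nommconf pcie_aspm=off"
    for line in active_workarounds(cmdline):
        print(line)
```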

Thanks for your output on this, btw, you seem to be the only easily findable resource on TR4/Linux M/B problems and solutions - good job!

This is what I meant, sorry, but it can still trash performance. Interesting. I wonder if disabling ASPM in UEFI will fix it. Perhaps the Promontory chipset does not actually support PCIe power saving modes? If so, that’d be a trivial fix for board makers in UEFI…

Will have to try that.

So far disabling ASPM has fixed it. Left it compiling, so we shall see.

Edit: nope, it generated one error while compiling. DLLP error.

Hmm. Maybe I’ll try disabling ASPM together with a slight underclock.

Ah well, shame… thanks for trying!

The potential performance impact is quite concerning. I wonder why this didn’t show up in the Phoronix benchmark efforts?

Update:

tl;dr: MSI (the motherboard vendor) is helping now, big thanks to MSI! I called for help outside our community because nothing I’ve tried so far has worked.

RX 550 passthrough on Threadripper is pretty much a no-go. I got it to almost work once. It hard locks the machine.

Turns out my SATA SSD also goes into read-only mode due to errors. Yay more screenshots.

Is there any verification that this is a serious error rather than needless system message spam?

I have applied the kernel option to ignore the AER errors and it seemed to work, but it still bugs me that these errors occur, especially at the rate that they occur. My discussions with MSI tech support about these were fruitless.

Hopefully you will get further along with them.

Just for reference, the errors occur on my 1950X / MSI Pro Carbon with:

  • Kingston Predator memory (QVL)
  • PNY Anarchy memory (non-QVL)
  • Team Xtreem 4133 (non-QVL)
  • GTX 1080
  • ATI HD 6570
  • GT 610
  • GT 240
  • USB boot, SATA boot, regular HDD boot

Any version of Unix I have thrown at it shows these errors (Proxmox, Arch, Solus, Fedora, all the way up to kernel 4.13)… makes me wonder if they are occurring in Windows and are just hidden.

It also makes me wonder if this is why Windows 7 crashes on install with IRQL_NOT_LESS_OR_EQUAL regarding pci.sys.

I would love to have my mind put at ease as to whether this is a serious issue or not.
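
To put a number on how often these corrected errors occur, something like the following sketch could be run over saved `dmesg` output. It assumes the bracketed-timestamp format shown in the first post (journal-style lines with a date prefix would need a different pattern), and the function name and return shape are only illustrative.

```python
import re
from collections import Counter

# Matches the indented AER detail lines, e.g.
# "[  641.339652] pcieport 0000:00:01.1:    [ 6] Bad TLP"
DETAIL_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+(?P<dev>\S+):\s+"
    r"\[\s*\d+\]\s+(?P<name>.+?)\s*$"
)


def summarize(log_text: str):
    """Count AER detail lines per (device, error name) and estimate errors per minute."""
    counts = Counter()
    timestamps = []
    for line in log_text.splitlines():
        match = DETAIL_RE.match(line)
        if match:
            counts[(match["dev"], match["name"])] += 1
            timestamps.append(float(match["ts"]))
    # Rough rate: events divided by the time between the first and last event.
    span = max(timestamps) - min(timestamps) if len(timestamps) > 1 else 0.0
    per_minute = len(timestamps) / span * 60 if span > 0 else None
    return counts, per_minute


if __name__ == "__main__":
    import sys
    counts, per_minute = summarize(sys.stdin.read())
    for (dev, name), n in counts.most_common():
        print(f"{dev}  {name}: {n}")
    if per_minute is not None:
        print(f"roughly {per_minute:.1f} errors/minute")
```

Saved as, say, `aer_rate.py` (a made-up name), it could be fed with `dmesg | python3 aer_rate.py`. A count alone does not say whether the errors are harmful, but it makes before/after comparisons of the workarounds less subjective.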

Seeing the same issue on my Gigabyte X399 rig.

@wendell please feel free to delete my thread here and migrate my post to this thread - sorry!

Meanwhile, /u/AMD_Robert says everything is fine. :confused:

Reconfirmed that Vega 10 does indeed work fine, but my RX 550 does not. Nor does Nvidia. AMD-Vi timeout, DPC error, disconnected from host.

This is on the otherwise rock-stable ASRock X399 Fatal1ty.

Will keep testing. Next up to test is the R9 390.

I’m not too familiar with the low-level stuff here, but since Vega 10 works, I’m assuming it’s a software issue and can be fixed?

crosses fingers hopefully

Some of the pcie bus errors from heavy reads/writes on sata devices:

On my Gigabyte X399 - I’m still seeing

	Sep 29 22:24:10 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:25 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:44 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:48 threadripper ntpd[35088]: Soliciting pool server 2409:11:53c0:200::2:123

As for the device in question…

	[root@threadripper opt]# ./ls-iommu.sh |grep 1453
	IOMMU Group 12 40:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
	IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]

My full IOMMU mapping is listed here. From the mapping, it looks like 40:01.3 is the culprit, namely the GTX 1070.

	-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
	 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
	 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
	 |           +-01.3-[41]--+-00.0  NVIDIA Corporation GP104 [GeForce GTX 1070]
	 |           |            \-00.1  NVIDIA Corporation GP104 High Definition Audio Controller
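
For those without a script like `ls-iommu.sh` to hand, the same kind of lookup can be sketched in Python by walking sysfs. This is not the poster's script (its contents aren't shown here), just an approximation that assumes the usual `/sys/kernel/iommu_groups/<group>/devices/<address>` layout; the function names are made up.

```python
from pathlib import Path


def _read_id(path: Path) -> str:
    """Read a sysfs vendor/device ID file and strip the leading 0x."""
    try:
        text = path.read_text().strip()
    except OSError:
        return "????"
    return text[2:] if text.startswith("0x") else text


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Map IOMMU group number -> list of (PCI address, vendor ID, device ID)."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return groups  # IOMMU disabled, or not running on Linux
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = [
            (dev.name, _read_id(dev / "vendor"), _read_id(dev / "device"))
            for dev in sorted(devices_dir.iterdir())
        ]
    return groups


def find_by_id(groups: dict, vendor: str, device: str) -> list:
    """Return (group, PCI address) pairs for devices matching vendor:device."""
    return [(group, address)
            for group in sorted(groups)
            for address, ven, dev in groups[group]
            if (ven, dev) == (vendor, device)]


if __name__ == "__main__":
    # 1022:1453 is the bridge ID that appears in the error logs.
    for group, address in find_by_id(iommu_groups(), "1022", "1453"):
        print(f"IOMMU Group {group} {address}")
```

When matching results against the logs, compare the full PCI address: the log lines above name `0000:00:01.1`, and more than one bridge shares the `1022:1453` ID.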

CC @wendell Would you know if Gigabyte are patching this?

@wendell Have you tried finding speakers from the annual KVM Forum event?
I mean looking up the speakers, visiting their sites/blogs and asking there.
It could be that someone has encountered, or maybe even resolved, the issues at hand. Or they could at least give more insight.

https://www.youtube.com/channel/UCRCSQmAOh7yzgheq-emy1xA

https://www.linux-kvm.org/page/KVM_Forum

Have there been any fixes for the PCIe bus errors in recent UEFI updates (from Gigabyte/Asus)? CC @wendell @ryan @kreestuh

Hi, All,

Just a quick update, I updated to 4.14.0, but the error still exists with ASPM enabled.


Hi, I built a Gentoo system on a 1950X with a Gigabyte Designare EX motherboard, and the error is the same as you mentioned.
After booting into the system, the errors flood my console. The only option for now is to remove ASPM support from the kernel, but I know that is not the right solution. Please do keep us updated on the progress of the fix.

Thanks

Same issue on CentOS 7.4, specs:

  • AMD TR 1950X
  • Asus Zenith Extreme
  • Nvidia GT 710

So, is there any news yet?

No, although I haven’t had any side effects from disabling ASPM at boot.