Having lots of fun with Threadripper the last few weeks. I haven’t seen much traffic on the PCIe bus errors described below, so I’m making a post here in hopes that people searching for them will find this thread on the Level1 forum. I intend to link to other places on the internet where people are discussing and/or resolving this issue.
This seems really similar to a problem that Intel’s X99 had on launch that was fixed with a later software update.
PCIe Bus errors occur in certain circumstances (“by default”) on popular distros such as Fedora and Ubuntu.
I have tested kernels as far as 4.13-git and the errors still occur. As I mentioned on my VFIO live stream the other day, you can mitigate the issue by disabling message signaled interrupts (MSI) and/or memory mapped IO for PCI configuration space (as far as I know, that is pci=nomsi and pci=nommconf on the kernel command line). I’ve had some conflicting reports that disabling mmio also disables MSI, but I’m not too sure about that because of what I’m seeing in load testing.
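For anyone who wants to confirm which of those mitigations a running system actually booted with, here is a minimal sketch. It assumes the two mitigations correspond to the pci=nomsi and pci=nommconf kernel parameters (that mapping is my reading, not something confirmed elsewhere in this thread), and the helper names pci_options and mitigation_status are made up for illustration.

```python
def pci_options(cmdline: str) -> set[str]:
    """Collect the comma-separated options passed via pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def mitigation_status(cmdline: str) -> dict[str, bool]:
    """Report whether each assumed mitigation parameter is present."""
    opts = pci_options(cmdline)
    return {"nomsi": "nomsi" in opts, "nommconf": "nommconf" in opts}


if __name__ == "__main__":
    # /proc/cmdline holds the boot parameters of the running Linux kernel.
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print(mitigation_status(current))
```

Run on the affected machine, it prints a small dictionary showing which of the two options were on the boot command line, which is handy when comparing reports from different people.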
Here are the errors Threadripper users are seeing:
[ 641.339624] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 641.339641] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 641.339646] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 641.339649] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 641.339652] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 643.385764] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 643.385783] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 643.385787] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 643.385791] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 643.385794] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 648.193351] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 648.193372] pcieport 0000:00:01.1: AER: Multiple Corrected error received: id=0000
[ 648.193408] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[ 648.193412] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001080/00006000
[ 648.193414] pcieport 0000:00:01.1: [ 7] Bad DLLP
[ 648.193416] pcieport 0000:00:01.1: [12] Replay Timer Timeout
[ 648.193421] pcieport 0000:01:00.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0102(Transmitter ID)
[ 648.193424] pcieport 0000:01:00.2: device [1022:43b1] error status/mask=00003000/00002000
[ 648.193426] pcieport 0000:01:00.2: [12] Replay Timer Timeout
[ 649.183105] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 649.183128] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 649.183133] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 649.183137] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 649.183140] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 649.381124] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 649.381142] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 649.381147] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 649.381151] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 649.381153] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 649.678112] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
...snip...
[45827.569777] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[45827.569792] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[45827.569795] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[45827.569798] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[45827.569799] pcieport 0000:00:01.1: [ 7] Bad DLLP
[46245.264835] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[46245.264848] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[46245.264851] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[46245.264854] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[46245.264855] pcieport 0000:00:01.1: [ 7] Bad DLLP
[46657.063543] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[46657.063556] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[46657.063559] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[46657.063562] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[46657.063563] pcieport 0000:00:01.1: [ 7] Bad DLLP
[47075.088619] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[47075.088630] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[47075.088633] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[47075.088636] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[47075.088637] pcieport 0000:00:01.1: [ 7] Bad DLLP
[47529.569908] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[47529.569925] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[47529.569930] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[47529.569934] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[47529.569936] pcieport 0000:00:01.1: [ 7] Bad DLLP
The main errors seem to be Bad TLP and Bad DLLP, with an occasional Replay Timer Timeout. In all cases they are reported with severity=Corrected (AER: Corrected error received), so at least they seem benign.
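To compare logs between machines, it can help to tally the error types instead of eyeballing them. The sketch below decodes the "error status/mask" values from the log above and counts the errors that are set and not masked. The bit names are limited to the three that appear in this log (bits 6, 7 and 12); anything else is printed as a raw bit number. Treating masked bits as unreported is an assumption, though it is consistent with the 01:00.2 entry above, where status 00003000 with mask 00002000 prints only bit 12. The function names are hypothetical.

```python
import re
from collections import Counter

# Bit names exactly as they appear in the log above; other bits fall back to "bit N".
CORRECTABLE_BITS = {6: "Bad TLP", 7: "Bad DLLP", 12: "Replay Timer Timeout"}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode(status: int, mask: int) -> list[str]:
    """Name every status bit that is set and not masked (assumed reporting rule)."""
    reported = status & ~mask
    return [
        CORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (reported >> bit) & 1
    ]


def tally(log_text: str) -> Counter:
    """Count decoded error names across all status/mask lines in a dmesg dump."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status, mask = (int(group, 16) for group in match.groups())
        counts.update(decode(status, mask))
    return counts


if __name__ == "__main__":
    sample = """
    pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
    pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001080/00006000
    pcieport 0000:01:00.2: device [1022:43b1] error status/mask=00003000/00002000
    """
    for name, count in tally(sample).most_common():
        print(f"{count:4d}  {name}")
```

Feeding it the output of dmesg (for example, saved to a file and read in as a string) gives a quick per-type count that is easier to post and compare than the raw log.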
It is worth noting it seems to happen whether I’m using Nvidia (Asus Strix 1080) or AMD (Vega 64) Graphics cards.
I haven’t had any word out of AMD yet; not sure if anyone else has seen anything. AMD has added a ton of code to the -git version of the kernel and things are looking great, but I’m still getting the PCIe bus errors.
Watch this space for future updates. If anyone finds discussion threads /for Threadripper/ in other places around the internet, please post them. (We’ll skip linking to the old X99 threads; they’re nigh useless.)