Threadripper & PCIe Bus Errors

Update:

tldr: MSI is helping now, big thanks to MSI! I called for help outside our community because nothing I’ve tried so far has worked.

RX 550 passthrough on Threadripper is pretty much a no-go. I got it to almost work once, but it hard locks the machine.

Turns out my SATA SSD also goes into read-only mode due to errors. Yay, more screenshots.

Is there any verification of whether this is a serious error or just needless system message spam?

I have applied the kernel option to ignore the AER errors and it seemed to work, but it still bugs me that these errors occur, especially at the rate they occur. My discussions with MSI tech support about this were fruitless.

Hopefully you will get further along with them.

Just for reference, the errors occur on my 1950X with an MSI Pro Carbon, with all of the following:

  • Memory: Kingston Predator (QVL), PNY Anarchy (non-QVL), Team Xtreem 4133 (non-QVL)
  • GPUs: GTX 1080, ATI HD 6570, GT 610, GT 240
  • Boot media: USB, SATA, regular HDD

Every Linux distribution I have thrown at it shows these errors (Proxmox, Arch, Solus, Fedora, all the way up to kernel 4.13), which makes me wonder if they are occurring in Windows too and are just hidden.

It also makes me wonder if this is why Windows 7 crashes on install with IRQL_NOT_LESS_OR_EQUAL in pci.sys.

I would love to have my mind put at ease as to whether this is a serious issue or not.

Seeing the same issue on my Gigabyte X399 rig.

@wendell please feel free to delete my thread here and migrate my post to this thread - sorry!

Meanwhile, /u/AMD_Robert says everything is fine. :confused:

Reconfirmed that Vega 10 does indeed work fine, but my RX 550 does not, nor does Nvidia. AMD-Vi timeout, DPC error, disconnected from host.

This is on the otherwise rock-stable ASRock X399 Fatal1ty.

Will keep testing. Next up to test is the R9 390.

I’m not too familiar with the low-level stuff here, but since Vega 10 works, I’m assuming it’s a software issue and can be fixed?

crosses fingers hopefully

Some of the pcie bus errors from heavy reads/writes on sata devices:

On my Gigabyte X399 - I’m still seeing

	Sep 29 22:24:10 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:10 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:25 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:25 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:44 threadripper kernel: dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
	Sep 29 22:24:44 threadripper kernel: pcieport 0000:00:01.1:    [12] Replay Timer Timeout
	Sep 29 22:24:48 threadripper ntpd[35088]: Soliciting pool server 2409:11:53c0:200::2:123
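
To see how often each port reports these, the log can be condensed into counts. This is only a minimal sketch, assuming the message format quoted in this thread (other kernel versions may word the line differently). The function name aer_summary is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count "PCIe Bus Error" lines per port, severity and layer.
# Reads kernel log text on stdin; assumes the message format quoted in this thread.
aer_summary() {
    # Keep only the port address, severity and error layer from matching lines,
    # then count identical combinations and strip the padding uniq adds.
    sed -n 's/.*pcieport \([0-9a-f:.]*\): PCIe Bus Error: severity=\([^,]*\), type=\([^,]*\),.*/\1 \2 \3/p' |
        sort | uniq -c | sed 's/^ *//'
}
```

Feeding it the kernel log, for example with dmesg | aer_summary or journalctl -k | aer_summary, should give one line per port and error type, with the count first.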

As for the device in question…

	[root@threadripper opt]# ./ls-iommu.sh |grep 1453
	IOMMU Group 12 40:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
	IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
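
The ls-iommu.sh script itself isn't shown here. A common way to produce this kind of listing is to walk /sys/kernel/iommu_groups and hand each device address to lspci. The sketch below works under that assumption and is not the actual script; list_iommu_groups is a made-up name.

```shell
#!/bin/sh
# Hypothetical sketch: list PCI devices by IOMMU group.
# $1: sysfs root (optional; only there so the function can be pointed at a test tree)
list_iommu_groups() (
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # nothing matched: IOMMU off or unsupported
        group=${dev%/devices/*}        # strip "/devices/<address>"
        group=${group##*/}             # keep only the group number
        addr=${dev##*/}                # e.g. 0000:00:01.1
        # lspci -nn prints [class] and [vendor:device] IDs; -s selects one device
        desc=$(lspci -nns "$addr" 2>/dev/null) || desc=
        printf 'IOMMU Group %s %s\n' "$group" "${desc:-$addr}"
    done
)
```

Piping the result through grep 1453, as above, then narrows it to the bridges with that device ID.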

My full IOMMU mapping is listed here. From the mapping, it looks like 40:01.3 is the culprit, namely the GTX 1070 (although the log lines above name the other 1453 bridge, 00:01.1).

	-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
	 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
	 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
	 |           +-01.3-[41]--+-00.0  NVIDIA Corporation GP104 [GeForce GTX 1070]
	 |           |            \-00.1  NVIDIA Corporation GP104 High Definition Audio Controller

CC @wendell Would you know if Gigabyte are patching this?

@wendell Have you tried tracking down speakers from the annual KVM Forum event?
I mean looking up the speakers, visiting their sites/blogs and asking there.
It could be that someone has encountered, or maybe even resolved, the issues at hand, or can at least give more insight.

https://www.youtube.com/channel/UCRCSQmAOh7yzgheq-emy1xA

https://www.linux-kvm.org/page/KVM_Forum

Have there been any fixes for the PCIe bus errors in recent UEFI releases (from Gigabyte/Asus)? CC @wendell @ryan @kreestuh

Hi, All,

Just a quick update: I updated to 4.14.0, but the error still exists with ASPM enabled.

Hi, I built a Gentoo system on a 1950X with a Gigabyte Designare EX motherboard, and I get the same error you mentioned.
After booting into the system, the error floods my console. The only option for now is to remove ASPM support from the kernel, but I know that is not the right solution. Please do keep us updated on the progress of the fix.

Thanks

Same issue on CentOS 7.4, specs:

  • AMD TR 1950X
  • Asus Zenith Extreme
  • Nvidia GT 710

So is there any news yet?

No, although I haven’t had any side effects from disabling ASPM at boot.

Thanks, adding pcie_aspm=off to grub mitigated the issue for me, no more kernel messages since.

So in theory if I understand it correctly, there is no negative impact to setting this, apart from “higher” power consumption at idle. So not a big deal on a GT 710.
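
For anyone wanting to confirm the option took effect: on GRUB-based distributions it usually goes on the GRUB_CMDLINE_LINUX line in /etc/default/grub, and the GRUB config is then regenerated with the distribution's tool (update-grub on Debian and Ubuntu, grub2-mkconfig on CentOS and Fedora). After a reboot, the running kernel's command line can be checked. A small sketch, with has_kernel_param being a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a parameter is on the kernel command line.
# $1: parameter to look for, e.g. pcie_aspm=off
# $2: command-line text (optional; defaults to the running kernel's /proc/cmdline)
has_kernel_param() (
    set -f                                  # no globbing while word-splitting
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$1" ] && exit 0        # exact match on a whole word
    done
    exit 1
)
```

Usage would be along the lines of: has_kernel_param pcie_aspm=off && echo "ASPM disabled at boot".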

Still hoping to see AMD and the Linux guys get together to fix this flaw. Granted, the intersection of people buying Threadripper and people running Linux is quite small, but still, come on, implementing the PCIe spec correctly can’t be that hard, @AMD. smh

It may actually be Nvidia at fault with the PCIe spec. ASPM can be left on with the Polaris and Vega cards, and with PCIe 10 gig cards. Not sure.

Can confirm that it’s not just Nvidia cards that cause the issue. My Magewell Pro Capture AIO also appears to trigger the errors with ASPM.

dmesg

[   19.746245] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   19.746257] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
[   19.746260] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
[   19.746263] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
[   19.746266] pcieport 0000:00:03.1:    [ 6] Bad TLP
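
The status/mask values in these messages are bit fields. The status here is 00000040 (Bad TLP), while the Gigabyte log earlier in the thread shows 00001000 (Replay Timer Timeout). A rough decoder is sketched below. The bit names are the ones used in the PCIe spec's Correctable Error Status register as far as I know, and aer_decode_correctable is just an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: decode an AER correctable status (or mask) value.
# $1: hex value as printed after "error status/mask=", with or without 0x
aer_decode_correctable() (
    status=$((0x${1#0x}))
    for bit in 0 6 7 8 12 13 14 15; do
        case $bit in
            0)  name="Receiver Error" ;;
            6)  name="Bad TLP" ;;
            7)  name="Bad DLLP" ;;
            8)  name="REPLAY_NUM Rollover" ;;
            12) name="Replay Timer Timeout" ;;
            13) name="Advisory Non-Fatal Error" ;;
            14) name="Corrected Internal Error" ;;
            15) name="Header Log Overflow" ;;
        esac
        # Print the bit in the same "[NN] name" style the kernel uses
        if [ $(( status & (1 << $bit) )) -ne 0 ]; then
            printf '[%2d] %s\n' "$bit" "$name"
        fi
    done
)
```

Running it on the mask value 00006000 lists bits 13 and 14, which would mean those two error types are masked on this port rather than reported.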

lspci -tv

-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.1-[41]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO GL [FirePro W7100]
 |           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.1-[42]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           \-08.1-[43]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
             +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
             +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-01.1-[01-07]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-07]--+-00.0-[03]----00.0  Device 1d6a:d107
             |                               +-04.0-[04]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               +-05.0-[05]----00.0  Intel Corporation Device 24fb
             |                               +-06.0-[06]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[07]--
             +-01.2-[08]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
             +-01.3-[09-0a]----00.0-[0a]----00.0  Creative Labs CA0108/CA10300 [Sound Blaster Audigy Series]
             +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.1-[0b]----00.0  Nanjing Magewell Electronics Co., Ltd. Device 0002
             +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-07.1-[0c]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
             +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-08.1-[0d]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Device 1457
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
             +-18.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-18.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-18.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-18.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-18.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-18.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-18.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             +-18.7  Advanced Micro Devices, Inc. [AMD] Device 1467
             +-19.0  Advanced Micro Devices, Inc. [AMD] Device 1460
             +-19.1  Advanced Micro Devices, Inc. [AMD] Device 1461
             +-19.2  Advanced Micro Devices, Inc. [AMD] Device 1462
             +-19.3  Advanced Micro Devices, Inc. [AMD] Device 1463
             +-19.4  Advanced Micro Devices, Inc. [AMD] Device 1464
             +-19.5  Advanced Micro Devices, Inc. [AMD] Device 1465
             +-19.6  Advanced Micro Devices, Inc. [AMD] Device 1466
             \-19.7  Advanced Micro Devices, Inc. [AMD] Device 1467

I did add a GeForce 6600 to the system and it didn’t make the problem any worse. The error follows the slot that the capture card is installed in. The card seems to work fine regardless though. Using ASRock X399 Professional Gaming.

Hi, There.

I replaced my Gigabyte Designare EX motherboard with an Asus Zenith Extreme last Friday, and today I restored my Gentoo Linux install on the new motherboard. I also updated to the BIOS that Asus released on Dec 7.

It seems the PCIe bus error is gone now. I’m not sure what the reason is; maybe the new BIOS update has a fix for this.

I may do some tests in the next few days to confirm this, but if someone else has time, they could test it too. The latest BIOS for the Zenith Extreme is 0804.

Make sure aspm is not disabled on the new board?

How can I check whether the BIOS enables or disables it? I compiled ASPM into the kernel, not as a module. I haven’t seen the annoying messages in dmesg for these 2 days now.
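
One possible way to check from the running system, offered as a sketch rather than a definitive answer: the kernel exposes its ASPM policy in /sys/module/pcie_aspm/parameters/policy, with the active policy in square brackets. The helper name aspm_active_policy is made up.

```shell
#!/bin/sh
# Hypothetical helper: print the active ASPM policy (the bracketed entry).
# $1: policy file (optional; defaults to the pcie_aspm parameter in sysfs)
aspm_active_policy() (
    file=${1:-/sys/module/pcie_aspm/parameters/policy}
    [ -r "$file" ] || { echo "unavailable"; exit 1; }
    # The file looks like: default performance [powersave] powersupersave
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
)
```

On a live system it is called with no argument. For the per-link state, lspci -vv run as root prints a LnkCtl line for each PCIe device that reads either "ASPM Disabled" or names the enabled states (L0s, L1), so the port and the card behind it can be compared directly.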