Threadripper & PCIe Bus Errors

Super late to the party, not sure if this warranted a new thread or not.

Just set up a new Ryzen 3700X build on an ASRock X470 Taichi Ultimate, and I’m seeing errors identical to the OP’s on a fresh Proxmox 6.0.1 install, which is Debian Buster with the Proxmox custom kernel 5.0.15-1. For now I’ve set pcie_aspm=off in GRUB and the errors seem to have gone away. Is this still considered the recommended fix?

kernel: [  557.900765] pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:01.3
kernel: [  557.900769] pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: [  557.900780] pcieport 0000:00:01.3:   device [1022:1483] error status/mask=00000040/00006000
kernel: [  557.900784] pcieport 0000:00:01.3:    [ 6] BadTLP
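
For reference, this is roughly how I set it (a sketch for a GRUB-based Debian/Proxmox install; paths and defaults may differ on other setups):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

# regenerate the GRUB config and reboot
update-grub
reboot

# afterwards, confirm the option actually made it onto the kernel command line
cat /proc/cmdline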

Yep, I still use this.

Glad I’m not the only one super late to the party. I just built my first AMD system with an ASRock B450 mobo, Ryzen 5 3600, RX 5700 XT and a 500 GB NVMe SSD, and I’m getting these same errors, but for my networking. I’m up to date on Ubuntu 18.04. I tried installing one of my spare four-port Intel NICs, but all that did was move the error to the active port on that NIC. I tried pcie_aspm=off in GRUB, as well as manually setting the PCIe and Promontory links to Gen 2, to no avail. Anybody have any ideas?

Hi @Adam_Wilber and welcome to the L1Tech forum.

Can you please show the dmesg output, your kernel version, and your VM configuration?

Mixing older chipsets with Ryzen 3000 seems to require manually setting PCIe Gen 3, and not all boards expose that setting yet. Soon, though.


Thank you both for your quick responses, but I think I figured out what my issue was. After scouring the net for things to try, I realized that when I set up this new hardware I had set my memory speed manually, but I hadn’t realized the board would default to 1.2 V for DDR4 regardless of the speed entered. Once I set it to the proper 1.35 V, the rest of the issues I was having went away.

The last time I built a desktop was when Sandy Bridge came out, and I had more money to throw at a toy back then, so I could afford a better motherboard that did most of the OC work for me, I guess.


pci=nommconf was just what I needed to quiet down my 4.19 kernel (Debian Buster).

You saved the SSD where the log files live :wink:


FYI, on my X399 + 1900X, Debian Stretch with kernel 4.9.0 works totally fine, without modification.
dist-upgrade gave me a good scare :smiley:
[Update]
4.19 may be quiet with that option, but it is not stable :frowning:
It just got stuck, with no errors to look at…
4.9.0 has thrown two “AER … Corrected” errors but is stable with Buster for now. Just don’t “apt-get autoremove”.
[Update 2]
4.9.0 was not stable with Buster after all. For whatever reason, however,
4.9.0 is rock solid with Stretch. It had been for weeks before the upgrade…
What major PCIe-related changes were made between kernel 4.9 and 4.19?

Hi. I’m not on Threadripper, but I was getting the same BadTLP, BadDLLP, etc. errors on a Ryzen 2700X build, with an Asus Prime B450-Plus motherboard and Nvidia GTX 1660 graphics.

I tried setting pcie_aspm=off in GRUB, and while this did get rid of these errors, it caused a much worse problem: my M.2 NVMe drive’s controller would get stuck in a low-power state or something, at which point X would quit and my root partition would get remounted read-only, causing a whole lot of headaches.

I found this problem was triggered mostly when running mprime “P-1” workloads with a large percentage of memory allocated (8 of 16 GB). These use a lot of memory bandwidth, and for whatever reason that seemed to interfere with the NVMe drive like this.

Here is some dmesg output of what it looks like when the NVMe controller went down for me:

[  989.409598] perf: interrupt took too long (4979 > 4912), lowering kernel.perf_event_max_sample_rate to 40000   
[ 1195.031765] fuse: init (API version 7.31)                                                  
[ 1327.328770] perf: interrupt took too long (6268 > 6223), lowering kernel.perf_event_max_sample_rate to 31750
[ 2238.284260] perf: interrupt took too long (7846 > 7835), lowering kernel.perf_event_max_sample_rate to 25250
[ 9117.462381] perf: interrupt took too long (9831 > 9807), lowering kernel.perf_event_max_sample_rate to 20250
[ 9261.476036] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 9261.603999] pci_raw_set_power_state: 19 callbacks suppressed
[ 9261.604009] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 9261.604430] nvme nvme0: Removing after probe failure status: -19
[ 9261.632241] print_req_error: I/O error, dev nvme0n1, sector 15247304 flags 100001
[ 9261.632255] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 9261.729511] nvme nvme0: failed to set APST feature (-19)
[ 9261.739582] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739591] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739595] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 9261.756670] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
[ 9261.756951] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 2, flush 0, corrupt 0, gen 0
[ 9261.758061] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
[ 9261.758368] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 4, flush 0, corrupt 0, gen 0
[ 9261.759112] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 5, flush 0, corrupt 0, gen 0
[ 9261.759138] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 6, flush 0, corrupt 0, gen 0
[ 9262.276359] Core dump to |/bin/false pipe failed
[ 9262.336595] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 9262.336817] caller _nv000939rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 9262.975980] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 62
[ 9263.012987] Core dump to |/bin/false pipe failed
[ 9263.015801] Core dump to |/bin/false pipe failed
[ 9263.035986] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 1
[ 9263.134288] Core dump to |/bin/false pipe failed
[ 9265.580609] BTRFS: error (device nvme0n1p2) in btrfs_commit_transaction:2234: errno=-5 IO failure (Error while writing out transaction)
[ 9265.580610] BTRFS info (device nvme0n1p2): forced readonly
[ 9265.580611] BTRFS warning (device nvme0n1p2): Skipping commit of aborted transaction.
[ 9265.580612] BTRFS: error (device nvme0n1p2) in cleanup_transaction:1794: errno=-5 IO failure
[ 9265.580613] BTRFS info (device nvme0n1p2): delayed_refs has NO entry
[ 9292.708719] btrfs_dev_stat_print_on_error: 320 callbacks suppressed
[ 9292.708723] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 123, rd 208, flush 0, corrupt 0, gen 0
[ 9368.485780] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 208, flush 0, corrupt 0, gen 0
[ 9577.728458] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 209, flush 0, corrupt 0, gen 0
[ 9577.728508] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 210, flush 0, corrupt 0, gen 0
[ 9577.728715] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 211, flush 0, corrupt 0, gen 0
[ 9577.728768] Core dump to |/bin/false pipe failed
[ 9578.059425] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 212, flush 0, corrupt 0, gen 0
[ 9578.059466] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 213, flush 0, corrupt 0, gen 0
[ 9578.059531] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 214, flush 0, corrupt 0, gen 0
[ 9578.059555] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 215, flush 0, corrupt 0, gen 0
[ 9578.059574] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 216, flush 0, corrupt 0, gen 0
[ 9578.059590] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 217, flush 0, corrupt 0, gen 0
[ 9578.059604] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 218, flush 0, corrupt 0, gen 0
[ 9608.872774] btrfs_dev_stat_print_on_error: 1 callbacks suppressed
[ 9608.872777] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 125, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872797] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 126, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872805] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 219, flush 0, corrupt 0, gen 0
[11308.648706] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 220, flush 0, corrupt 0, gen 0
[11308.648753] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 221, flush 0, corrupt 0, gen 0

This was on openSUSE Tumbleweed, which I was trying out, but I just switched over to Linux Mint 19.2 today since I’m more familiar with that. (It turns out the BadTLP etc. errors show up on both distros, but I figured I would try Mint and see if the results were any different.)

So yeah, I’m definitely not going to try turning off ASPM again. I think I’ll just ignore these errors, as they don’t seem to be causing any real problems as far as I can tell. I might eventually try setting “pci=noaer” to hide them, but for now I’m fine just not thinking about them as long as my system isn’t crashing.
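
(If I do end up trying it, my understanding is that it’s just another kernel command-line option, something like this in GRUB — untested on my machine:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"

followed by sudo update-grub and a reboot. As far as I can tell it only stops the kernel from reporting AER events; the link errors themselves would still be happening.)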

edit: BTW, my SSD is: Crucial P1 500GB 3D NAND NVMe PCIe M.2 SSD - CT500P1SSD8
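
One workaround I keep seeing suggested for NVMe controllers dropping off the bus like this (see the “failed to set APST feature” line in the log above) is limiting how deep the drive’s autonomous power states can go via a kernel parameter. I haven’t tried it myself yet, so treat this strictly as a sketch:

# kernel command line (e.g. appended to GRUB_CMDLINE_LINUX_DEFAULT)
nvme_core.default_ps_max_latency_us=0
# 0 disables APST entirely; a value like 5500 reportedly only blocks the deepest power states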

Hello, I am getting the same errors on my build:
Ryzen 7 1700
Asus Prime X370 Pro
Asus ROG Strix RX Vega 64
I’m using Debian Unstable with a 5.2.7 kernel at the moment.
The thing is, I tested everything a while ago with another kernel (probably between 4.17 and 4.19) but lost the file in which I recorded everything. So I can’t be sure it’s exactly the same now, but it looks quite similar.
The errors now are:

[   58.175644] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[   58.175650] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   58.175656] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000080/00006000
[   58.175659] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[   69.769504] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[   69.769511] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   69.769517] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[   69.769521] pcieport 0000:00:03.1: AER:    [ 6] BadTLP

After a few minutes it began to throw more and more “Timeout” errors:

[ 1497.907503] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
[ 1497.908661] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1497.908664] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=000010c0/00006000
[ 1497.908666] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1497.908668] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1497.908670] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1497.984646] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1497.984650] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1497.984653] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1497.984655] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1497.995671] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1497.995675] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1497.995677] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 1497.995679] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.050775] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.050779] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1498.050783] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 1498.050785] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.094854] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.094858] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.094861] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1498.094863] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1498.116898] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.116902] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1498.116906] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000080/00006000
[ 1498.116908] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1498.172002] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.172007] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.172010] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1498.172012] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1498.381399] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
[ 1498.384093] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.384097] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=000010c0/00006000
[ 1498.384100] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.384101] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1498.384103] pcieport 0000:00:03.1: AER:    [12] Timeout               

(That is the last thing I managed to copy from the log.)
From around that moment, performance started dropping visibly until the GUI became completely unusable. The screen still kept slowly updating, so I could see each line of pixels being drawn. I managed to switch to tty1 and observed the “Timeout” errors with almost no “BadTLP” and “BadDLLP”; the time was around [1600]. The system didn’t respond to any input and I had to reboot it with Alt+PrintScreen+B.
The device 0000:00:03.1 that throws the errors is:
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 29
Bus: primary=00, secondary=09, subordinate=0b, sec-latency=0
I/O behind bridge: 0000d000-0000dfff [size=4K]
Memory behind bridge: fe600000-fe7fffff [size=2M]
Prefetchable memory behind bridge: 00000000e0000000-00000000f01fffff [size=258M]
Capabilities:
Kernel driver in use: pcieport
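
(In case anyone wants to pull the same details on their own system, that output came from roughly:

lspci -nn | grep -i bridge          # find the bridge's address and its [vendor:device] IDs
sudo lspci -nn -vv -s 00:03.1       # full dump of that bridge, including capabilities
)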
Back then, with the old kernel, I also tested different boot options like “nommconf” (I can’t remember the exact results, but it didn’t seem to solve the problem), tried the system with almost everything pulled out (just the bare minimum: CPU + 1 stick of RAM + this GPU + the HDD with the system on it), switched some BIOS options, and updated the BIOS to 4207:
2018/12/14, 8.36 MBytes
PRIME X370-PRO BIOS 4207
1. Update AGESA 1006
2. Improve compatibility and performance for Athlon™ with Radeon™ Vega Graphics Processors
The next version was 4406:
2019/03/11, 10.24 MBytes
PRIME X370-PRO BIOS 4406
Update AGESA 0070 for the upcoming processors and improve some CPU compatibility.
ASUS strongly recommends that you update AMD chipset driver 18.50.16 or later before updating BIOS.
But the last chipset driver ASUS ships for Windows 7 is 17.40.2815.1010, so I didn’t upgrade the BIOS further.
I also tried Windows 7, which threw a BSoD with some error in pci.sys during the boot process or within 1-2 minutes after startup.

A few times the system didn’t boot and showed nothing on the screen. And if there was another GPU installed, I could enter the motherboard setup and see that the GPU wasn’t even detected by the motherboard.

Sometimes the boot process stopped right after the kernel loaded, and again I could do nothing. When the Vega was detected as the secondary GPU, I tried to pass it through to a guest system (using QEMU with IOMMU passthrough; the guest was Windows 8.1), and it worked better than the ‘real’ Windows 7 but still crashed after a while. The thing I noticed there is that the crashes mostly happened when the GPU switched from 2D to 3D clocks, or back from 3D to 2D (when I opened some 3D app for testing, or closed it). After that, the log was flooded with errors and I couldn’t start the VM again. Sometimes the host system also couldn’t shut down correctly and showed a system call traceback and the CPU register values.

Unfortunately my motherboard’s firmware doesn’t have a switch for PCIe 1/2/3 mode, but if I put the Vega into the last PCIe slot, which always runs in x4 2.0 mode, it worked fine with Debian for a few days, and for a few hours with Windows 7 (but still threw a BSoD in the end). The problem is, I had to pull everything else out of the PC case, which was very inconvenient. The GPU also seems to be fine in an old motherboard whose only PCIe slot is 1.0.

Maybe it’s worth noting that the Asus Prime motherboard works without any problem with an old Radeon HD 5750 (which should be PCIe 2.0, and which I’ve been using constantly instead of the faulty (?) Vega this whole time). I also got an RX 580 (PCIe 3.0) for a short test and it worked without any problem.

TL;DR
In the end I assumed this Vega 64 has some problem with PCIe 3.0 mode. I was going to try to return and replace the GPU (or get a refund), but I still have some doubts. It works with the same motherboard in PCIe 2.0 mode, works with another motherboard in PCIe 1.0 mode, and the same motherboard works with other PCIe 2.0 and 3.0 GPUs. So it works in some circumstances, and I am not sure it can be considered faulty.

The ‘soft’ solutions I found earlier didn’t work for me, but my kernel version was above 4.15, so maybe I should roll back and try them again? Anyway, they don’t fully solve the problem, as the GPU sometimes does not POST and isn’t detected by the motherboard at all. The only thing that might help is the motherboard firmware update, but in that case my Windows 7 may become unusable (as I mentioned above, ASUS recommends updating the chipset driver to a version unavailable for Windows 7), and it also seems that the update doesn’t change anything relevant, just adds support for the new Ryzen 3xxx CPUs.

What should I do?

Get the chipset driver from AMD? PCIe ASPM can be disabled at boot time on Linux, as can AER. There are some threads here with various boot parameters that will likely help.
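
The usual suspects from those threads all go on the kernel command line via GRUB; roughly (mix and match, nothing here is guaranteed for your particular board):

pcie_aspm=off     # turn off Active State Power Management on the PCIe links
pci=noaer         # stop the kernel from handling/logging AER events
pci=nommconf      # fall back from memory-mapped PCI config space access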

Finally, the long rendering job finished yesterday, and I updated the motherboard firmware to the newest available version. It also reset the settings to defaults (or whatever), and I had to set everything back the way it was (not sure I didn’t miss something). I still have not changed anything in the kernel boot options, but, surprisingly, it works.

So now the Vega stays silent in the logs and does not crash or anything; I can use it in Linux as the primary GPU. However, I can’t pass it into a VM: no errors, no image on the screen, just nothing. I noticed one thing that changed after the update: the GPU and its audio device now get different IOMMU groups (each containing only that single device), and this does not change if I put the card in another slot.
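
For reference, this is roughly how I’m checking the grouping (the usual shell loop over sysfs):

for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d%/devices/*}; g=${g##*/}          # IOMMU group number
    printf 'IOMMU group %s: ' "$g"
    lspci -nns "${d##*/}"                  # PCI address -> device name and IDs
done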

Today I got the RX 580 for a test again. Last time, I could successfully pass it into the VM or just use it in Linux without any errors. Now I am still able to use it in Linux, and it throws BadTLP and BadDLLP errors like the Vega did before (and unlike itself before), but it does not crash the way the Vega did. Also, the errors come quite rarely, whereas the Vega was flooding the whole log with them.

[  738.375206] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  738.375213] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  738.375218] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[  738.375222] pcieport 0000:00:03.1: AER:    [ 6] BadTLP

It still has its audio device in the same IOMMU group, unlike the Vega now, but the VM does not start. The errors change slightly:

[ 1502.655116] pcieport 0000:00:03.2: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 1502.655124] pcieport 0000:00:03.2: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 1502.655130] pcieport 0000:00:03.2: AER:   device [1022:1453] error status/mask=00200000/04400000
[ 1502.655133] pcieport 0000:00:03.2: AER:    [21] ACSViol                (First)
[ 1502.655211] pcieport 0000:00:03.2: AER: Device recovery successful

The VM gets automatically paused and can’t be started again. Another VM not using any PCIe passthrough works without any problems. So the firmware update fixed the Vega but broke PCI passthrough and the RX 580. Ridiculous.
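
In case it helps anyone comparing notes, this is roughly what I’m using to check whether the bridge reports ACS at all (adjust the address for your slot):

sudo lspci -vvv -s 00:03.2 | grep -i -A2 'Access Control'
# the ACSCap / ACSCtl lines show what the port supports and what is actually enabled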

I googled a little but couldn’t find anything about a GPU and its audio device getting separated into different IOMMU groups. Maybe I should ask about it in a different topic, because the original problem is gone now and this one looks slightly different from the one discussed here. I still have dual-booting as an option if nothing works, though I’d prefer using a VM.

Anyway, thanks for your advice; I’ve got my Vega working now.

Just a quick follow-up to my previous post. I was having stability issues with ASPM disabled: my NVMe drive would not wake from its low-power state, get remounted in read-only mode, and wreak havoc.

In my post I had attributed this instability to the grub pcie_aspm=off setting, but shortly after that post I realized I was still encountering the same issue.

What I forgot was that in addition to the grub setting, I had also disabled ASPM via BIOS. I’ve now changed that BIOS ASPM setting back to “Auto”, and my system has been totally stable ever since.

(I’m still getting all the PCIe bus errors, but I’m ok with ignoring them now shrug)

The Corrected errors are mostly not a problem. They are the PCI Express equivalent of Ethernet link transmit errors. If they didn’t show up in the kernel log, no one would notice them.

My Dell laptop has these all the time, mostly when the NVMe drive is waking up from low power.

I mean, they obviously can indicate a problem, just like a 10 Gbps Ethernet link with 50% transmit errors. But a few here and there are nothing.
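
If you want to put a number on “a few here and there”, counting them in the kernel log is enough; a quick sketch:

journalctl -k | grep -c 'PCIe Bus Error'     # corrected + uncorrected since this boot
dmesg | grep -c BadTLP                       # or just the BadTLP ones
# newer kernels also expose per-device counters such as
# /sys/bus/pci/devices/<address>/aer_dev_correctable, if your kernel has them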

Sorry to bring up an old thread again… but I’m seeing these errors as well:
[ 6] Bad TLP
[ 7] Bad DLLP
[12] Replay Timer Timeout

on our 100G InfiniBand cards under load, on our 172-node EPYC 7301 (Naples) based HPC cluster using the GIGABYTE MZ61-HD0-00 mainboard. You can imagine the amount of head-node log rotation we have as a result.

We are on the F06 BIOS and running RHEL 7.4 with a slightly older kernel, for “reasons.” uname -a:
Linux ******* 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Our system integrator suggests simply filtering them out in syslog, though I doubt this is the best course of action. Considering this is our primary high-speed MPI interconnect, I’d prefer to resolve the underlying condition, even if it’s mostly “informational” and not considered a real “problem”.
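
For reference, I assume what they have in mind is something along these lines on the head node (untested sketch; the match strings would need tuning against our actual messages, and the drop-in has to be included before the main logging rules, as it usually is):

# /etc/rsyslog.d/30-pcie-aer-noise.conf
:msg, contains, "PCIe Bus Error: severity=Corrected" stop
:msg, contains, "AER: Corrected error received" stop

# then
systemctl restart rsyslog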

Is disabling ASPM at boot in GRUB still the preferred course of action if upgrading the BIOS and/or kernel is comparatively painful to achieve? The report from hansl regarding their M.2 drive refusing to leave low-power states has us concerned. Alternatively, we can stop AER, or filter the logs to ignore it, but we’d prefer the most stable/performant course of action.

Thanks,

I would just disable ASPM. It’s how I run my system.


TR still has ASPM issues, and it doesn’t seem to be a priority for AMD to fix them. I am with @SgtAwesomesauce: just disable ASPM. It really isn’t that useful anyway, and there are no real negative side effects to disabling it.
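
If you do turn it off, it’s worth double-checking after a reboot that the change actually took effect; a quick sketch:

cat /proc/cmdline                # pcie_aspm=off should be present
sudo lspci -vvv | grep -i aspm   # the LnkCtl lines should read "ASPM Disabled" per link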


I found this thread in a quest to put an end to occasional crashes of my ASRock Taichi X399 / TR 1920X build under load. I applied the advice here and “believe” I saw improvements in stability. Whenever that happened, I bumped up the load. I was running a non-compressed VNC session at high resolution, Folding@home on an Nvidia GeForce 1080 Ti card, and mprime on the CPU cores.

That really got things cooking until it crashed hard.

I started to take the machine apart, planning on putting the old guts back in, had second thoughts, and then noticed an unoccupied 4-pin header in the corner of the motherboard. You with me?

My old guts came from an i3 build (I admit it), and there just wasn’t an extra 12-volt EPS connector on my PSU, or a place to send it to on the motherboard.

Upshot is, thanks to this thread, I knew exactly what to do. I ordered an upgrade to my power supply (EVGA 650 -> Seasonic 750) and hope to have it in a couple days.

I am pretty confident, based on all that I read here, that this will bring my woes to an end. Good news is I learned a lot about setting up a stable TR system.

Mine will have

pcie_aspm -> off
Power Supply Idle Control -> Typical Current Idle
Spread spectrum -> disabled

On the fence about these:

Core Performance Boost (CPB) -> disabled
ACPI HPET table

In the frenzy to get back up and running, I switched from Ubuntu 19.10 to Pop!_OS. So far that is looking like a power move. They claim to sort out the unending hassle of keeping Nvidia’s graphics drivers, CUDA and cuDNN up to date and in sync with Ubuntu. It’s a torture I would be happy to eliminate from my life.

I digress. What else should I be thinking of for stability with respect to system config? Thanks for this cache of advice and the charmingly civil discussion. Wish me luck!

John

I’m also having this issue now in 2020, after adding a few PCIe devices. In my previous configuration (ASRock X399M Taichi with a 1920X), I only had a single NVMe SSD and an x1 PCIe DVB card. I’ve since added another NVMe SSD and a second-hand RX 570 (which obviously was used for mining).

Ever since, I’ve been seeing those bus errors, which eventually lock up the whole system. Disabling ASPM drops me into the same situation as @Jimeb: the older NVMe SSD starts to give errors.

I’ve tried switching to PCIe 2 on both switches, but that triggered the error immediately! I can’t find a setting to downclock the PCIe bus.

I’ve now pulled the GPU out (without any luck) and moved the new SSD to another slot (crossing my fingers now). The errors trigger mostly under high workloads, e.g., Folding@home. Strangely, I use the machine to compile a lot of Rust code, which doesn’t seem to trigger anything.

Is there a Windows-equivalent of this failure? I cannot imagine this only happens on Linux…

Are you overclocking your RAM, like most every AMD owner seems to do?

I had serious problems with my 3900X build running DDR4-3600 memory, including weird graphics errors with my Vega 56. My problems also showed up under heavy load. Single-threaded memtest runs always completed without error.

It is my understanding that the PCIe and RAM controllers are tightly integrated and affect each other.

Try running your RAM at 2,400 or whatever its stock speed is, and see if things change.
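
And if you want to hammer the memory from inside Linux rather than with a single-threaded memtest pass, something like this (a rough sketch, assuming stress-ng is installed) runs many workers at once and verifies what it writes:

stress-ng --vm 8 --vm-bytes 1G --vm-method all --verify --timeout 30m
# 8 workers x 1 GB each; scale --vm and --vm-bytes to cover most of your installed RAM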