Threadripper & PCIe Bus Errors

I'm also having this issue now in 2020, after having added a few PCIe devices. In my previous configuration (ASRock X399M Taichi with a 1920X), I only had a single NVMe SSD and an x1 PCIe DVB card. I've since added another NVMe SSD and a second-hand RX 570 (which obviously was used for mining).

Ever since, I've seen those bus errors, which eventually lock up the whole system. Disabling ASPM puts me in the same situation as @Jimeb: the oldest NVMe SSD starts giving errors.
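
For reference, the ASPM state the kernel is actually using can be checked from a running system (standard sysfs path and lspci output on mainline Linux; nothing here is specific to this board):

# Active ASPM policy: the bracketed entry is the one in use
cat /sys/module/pcie_aspm/parameters/policy

# Per-link ASPM state for each PCIe device
sudo lspci -vv | grep -E "^[0-9a-f]|ASPM"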

I've tried dropping to PCIe 2.0 on both switches, but that triggered the error immediately! I can't find a setting to downclock the PCIe bus.

I've now pulled out the GPU (without any luck) and moved the new SSD to another slot (crossing my fingers now). It triggers mostly under heavy workloads, e.g., Folding@Home. Strangely, I use the machine to compile a lot of Rust code, which doesn't seem to trigger anything.
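
If it helps to catch it in the act, a generic way to watch for these errors live while the box is under load (plain dmesg/journalctl, nothing distro-specific):

# Follow new kernel messages and show only AER / PCIe bus error lines
sudo dmesg -w | grep -iE "aer|pcie bus error"

# Or the same thing via the systemd journal
sudo journalctl -kf | grep -iE "aer|pcie bus error"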

Is there a Windows-equivalent of this failure? I cannot imagine this only happens on Linux…

Are you overclocking your RAM, like most every AMD owner seems to do?

I had serious problems with my 3900X build running the memory at 3,600, including weird graphics errors with my Vega 56. My problems also showed up under heavy load. Single-threaded Memtest runs always completed without error.

It is my understanding that the PCIe and RAM controllers are tightly integrated and affect each other.

Try running your RAM at 2,400 or whatever its stock speed is, and see if things change.

I'm running at 3,200, the default of the modules (I just loaded XMP; I'm not at home in these matters). If it persists, I'll drop down to JEDEC spec just for fun and giggles.

Yeah anything with XMP or DOCP is technically overclocking the RAM.

While I get that, it should be supported by the memory vendor, and AMD says 3,200 shouldn't be a problem. But if it doesn't stay stable tonight under Folding@Home with the GPU re-added, I'll downclock it to JEDEC!

The irony is that it ran stable for a year when it had only a single SSD and no GPU, and that was under the XMP profile.

Could F@H’s use of AVX instructions factor into this? I read somewhere that they induce voltage spikes on Intel Broadwell CPUs when using adaptive vcore voltage. I think my next gambit would be to try fixed voltage.

AVX having something to do with it sounds plausible; I'll try setting a fixed voltage next.

It seems I already had my RAM at JEDEC; applying the latest EFI update must have reset it. I'm booting with pci=nommconf now, which I hadn't tried yet. It seems like downclocking the FSB is not an option in the ASRock X399M Taichi UEFI, or I must be overlooking it.

With my setup, I could run Folding at Home full-throttle (all cores and GPU engaged) for anywhere from minutes to hours before crashing. I re-installed the motherboard and CPU I used before the upgrade to Threadripper, reverting from the ASRock Taichi X399 / 1920 combination to the prior ASRock Taichi Z370 / i3 8100. The RAM and the power supply were the same in both configurations. I got the same behavior. I replaced an EVGA 650 watt PS with a Seasonic 750 watt PS. Same behavior. I swapped out the RAM. Right now I am running a pair of G.Skill Flare X sticks on the i3 8100.

It has been running F@H full-bore since yesterday evening, and it has gone past the duration of any previous run I had with the old RAM. I would have it in the Threadripper setup right now, except that the Flare X sticks look like they're too tall to fit under a Dark Rock Pro air cooler.

Long story short: even though the Seasonic is a tasty and essential PS upgrade for the TR MB, it looks like it was the memory (DDR HyperX Fury) all along.

Reinstalled the ASRock Taichi X399 motherboard with the G.Skill Flare X memory sticks (2 x 16 GB) and a new Seasonic 750 PS. I have since run the system with Folding at Home going full bore on 24 threads, all 12 cores on the CPU, and full-throttle on my Nvidia GeForce GTX 1080 Ti graphics card. At the same time, I built Android, including the Linux kernel, from scratch (make -j20). The only changes to BIOS settings from default were to disable spread spectrum, use the RAM profile for the G.Skills, and turn on hypervisor support.

With the motherboard out of the case, I was able to easily slot the G.Skills under the Dark Rock Pro CPU cooler. System runs cool with the case fans cranked up. The hottest temperature I saw in the box was 76 degrees C. I am delighted that I get to avoid the expense of an AIO setup.

I am having similar issues: AMD Ryzen Threadripper 1950X, Ubuntu 16.04 LTS. No problem whatsoever until recently, when the computer started crashing. I diagnosed it as a bad CPU water cooler and replaced that, but now the system goes into emergency mode when booting up. From there I am continuously getting:
[ 3276.313138] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 3276.313768] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
[ 3276.314401] pcieport 0000:00:01.1: [ 7] Bad DLLP

I know others have found solutions, but help would be very much appreciated. Thank you in advance.

Try adding pcie_aspm=off to your kernel command line.
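
For anyone who hasn't touched this before, a minimal sketch on a GRUB-based install (the option names and paths are the common Debian/Ubuntu ones; adjust for your distro):

# In /etc/default/grub, append the option to the existing default command line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"

# Then regenerate the GRUB config and reboot
sudo update-grub    # some distros use: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot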


Wow, you are the fastest turnaround for an answer ever. Thanks.
The problem turned out to be something different. I have 4 x 1 TB drives in RAID 10, and one of them remained unplugged after putting my box back together. I found the problem by comparing blkid vs. fstab: the UUID of a disk that fstab wanted to mount was not in the blkid output, and only 3 of the 4 disks showed up in blkid.
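
(In case it helps anyone doing the same comparison, this is roughly what that check looks like; the UUIDs are of course system-specific.)

# Block devices the kernel can actually see, with their UUIDs
sudo blkid

# What the system expects to mount at boot
cat /etc/fstab

# A UUID listed in fstab but absent from blkid points at a drive the
# kernel never detected: dead, unplugged, or a loose cable.
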
I poked around the cables until I found the loose one. Now I’m good. Working again.
Strange symptom for an even stranger problem.
Maybe someone can use this lesson.
Thanks again.
Joe

Hi, I bought some new parts to put together a home Proxmox install and create some VMs for the usual suspects: LAN cache, NAS, surveillance from 2 cameras, smart home routines, Nextcloud, etc.

Parts:

  • TR 1920X
  • ASRock Rack X399D8A-2T
  • Seasonic 800 W Focus Plus
  • 8x SAS drives
  • 2x Samsung 970 Evo Plus NVMe M.2
  • 1x Samsung 860 Evo SATA
  • LSI SAS 9210-8i

I installed Proxmox on a ZFS mirror pool across both 970 Evo NVMe SSDs. So far the install works fine: I can see the web GUI and create VMs no problem, but the console is getting barraged by corrected PCIe bus errors, and my logs have already hit one GB in the span of a week. My GRUB config has these modifications:

  • GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=noaer,nomsi,nommconf pcie_aspm=off"

And in my BIOS, I have changed the following settings:

  • Spread Spectrum → Disabled
  • SR-IOV → Enabled
  • IOMMU → Enabled
  • Core Performance Boost → Disabled
  • Power Supply Idle Control → Typical Current Idle

Also, the error reports a device ID of [144d:a808], which I believe is Samsung (144d) and the NVMe SSD Controller SM981/PM981/PM983 (a808).
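
For reference, that ID can be confirmed with lspci (generic usage; the -d filter just takes the vendor:device pair from the log):

# Resolve 144d:a808 to names and show which driver is bound to it
sudo lspci -nnk -d 144d:a808
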
I also have one more issue where 2 of the SAS drives do not show up. I mention it in case it affects anything; otherwise, just ignore it.

You probably do want MSI; I'd get rid of the nomsi part of your kernel command line.
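
If you want to confirm what the NVMe device ends up using once nomsi is gone, lspci shows it (same -d filter as above; "Enable+" means the capability is active):

# Look for "MSI: Enable+" or "MSI-X: Enable+" in the capability list
sudo lspci -vv -d 144d:a808 | grep -i msi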

Also, use Magician from a USB stick to see if there are firmware updates for your drives. You may be surprised.

Let me try Magician. Even though I did all this, the error barrage is still there.

This should probably also be enabled, normally. You can disable it to troubleshoot, but if disabling it doesn't fix the problem, there's no need to live with it disabled.

Look for an option like Platform First Error Handling and make sure that's set to Enabled.

The complete error says:
pcieport 0000:0c:01.4: AER: Corrected Error Received

nvme 0000:0c:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)

nvme 0000:0c:00.0: AER: device [144d:a808] error status/mask=00000001/0000e000

nvme 0000:0c:00.0: AER: [ 0] RxErr

I tried quite a bit to get the Magician software to run off the USB, but had no luck mounting the USB. Any hints on how to do this better? I downloaded Magician DC 64-bit from Samsung's website.

Download the enterprise Magician and use Rufus to burn the ISO to USB.

AER errors mean AER is still on. Double-check your BIOS settings, and cat /proc/cmdline, because something is up with your command line and AER is in fact still on.
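
That check is just the following, and the options you set should appear verbatim in the output (if pci=noaer / pcie_aspm=off aren't there, the bootloader never picked up the change):

# Kernel command line the system actually booted with
cat /proc/cmdline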

Well, it turns out that since I installed Proxmox on a ZFS mirror pool, Proxmox used systemd-boot as the boot loader instead of GRUB, so all my changes to /etc/default/grub were doing nothing. I added pci_aer=off to the line in "/etc/kernel/cmdline" and ran "proxmox-boot-tool refresh", and that worked: it stopped the NVMe corrected errors.
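
For anyone else on a ZFS-root Proxmox install, a rough sketch of that workflow (the root= part below is just the typical Proxmox default, and the extra options are the ones already discussed in this thread; treat it as an example, not a recommendation):

# /etc/kernel/cmdline is a single line; append the options to it, e.g.:
#   root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet pci=noaer pcie_aspm=off

# Sync the change into the systemd-boot entries, then reboot
proxmox-boot-tool refresh
reboot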


I would like to point out that this does not, in fact, stop the errors. It stops reporting the errors. Your hardware is still processing every single one and losing performance due to retransmits. But you won’t run out of log space.
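
If you're curious, you can see that for yourself: the device's own AER status registers keep recording the errors even when the kernel isn't logging them (substitute the bus address of your own device, e.g. the 0c:00.0 from the logs above):

# "CorrErr+" in DevSta and the flags in CESta (RxErr, BadTLP, BadDLLP, ...)
# show that corrected errors have occurred since the bits were last cleared
sudo lspci -vvv -s 0c:00.0 | grep -E "DevSta|CESta"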
