PSA: ASPM causes PCI-E bus errors under heavy SATA load on AM4

Methylzero · February 8, 2019, 3:11pm

If you have an AM4 system and you are seeing errors like this in your kernel logs after an extended period of heavy SATA load:

kernel: [323826.023666] pcieport 0000:00:03.2: AER: Multiple Corrected error received: id=0000
kernel: [323826.023680] pcieport 0000:00:03.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=001a(Transmitter ID)
kernel: [323826.023687] pcieport 0000:00:03.2:   device [1022:1453] error status/mask=00001100/00006000
kernel: [323826.023690] pcieport 0000:00:03.2:    [ 8] RELAY_NUM Rollover
kernel: [323826.023692] pcieport 0000:00:03.2:    [12] Replay Timer Timeout

then you are being affected by some sort of PCI-E bug.
I have seen this on two systems, both equipped with a 1700X, an X370 Taichi, 32GB of ECC RAM and two WD Black drives in RAID 0 (mdraid), running up-to-date Lubuntu 18.04 LTS.
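
If you want to check how often this is happening, here is a minimal Python sketch that counts the "PCIe Bus Error" summary lines per device. The regex is only based on the log format pasted above, and the function name `count_aer_errors` is my own invention, so treat it as a starting point rather than a proper AER parser.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line as it appears in the log above.
# Assumption: other kernel versions may format this line differently.
AER_LINE = re.compile(
    r"pcieport (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def count_aer_errors(log_text):
    """Count bus errors per (device, severity, layer) in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            key = (match["dev"], match["severity"], match["layer"])
            counts[key] += 1
    return counts


# One of the lines from the log above, used as sample input.
SAMPLE = (
    "kernel: [323826.023680] pcieport 0000:00:03.2: PCIe Bus Error: "
    "severity=Corrected, type=Data Link Layer, id=001a(Transmitter ID)"
)

for (dev, severity, layer), n in count_aer_errors(SAMPLE).items():
    print(f"{dev}: {n} x {severity} ({layer})")
```

To run it on a real system, feed it the text of `dmesg` or `journalctl -k` instead of `SAMPLE`.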
Others have seen similar errors on Threadripper and APU systems.

In my case, the errors all seem to have been corrected, and I have not observed any adverse effects.
The cause of these issues seems to be ASPM (Active State Power Management), a power-saving feature that reduces power consumption a little bit by putting idle PCI-E links into low-power states. AFAIK, this has been part of the PCI-E standard since well before 3.0, so it should work, but in practice it often does not, due to buggy implementations.

The problem manifests when you have heavy SATA activity going on for extended periods of time. I believe it is caused by the PCI-E link between the CPU and the chipset having a flaky ASPM implementation.
Disabling ASPM using the pcie_aspm=off kernel boot option appears to solve this issue completely.
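
On Ubuntu-type systems the usual way to make the option permanent is to add `pcie_aspm=off` to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, run `sudo update-grub` and reboot. To confirm that it took effect, here is a rough Python sketch. It assumes the standard Linux paths `/proc/cmdline` and `/sys/module/pcie_aspm/parameters/policy`; the helper names are made up for illustration.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline):
    """Return True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def read_text_or_none(path):
    """Read a small text file, or return None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report():
    """Print what the running kernel says about ASPM."""
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is None:
        print("Could not read /proc/cmdline (not a Linux system?)")
    else:
        print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))

    # When the file exists, the active policy is the entry in square brackets.
    policy = read_text_or_none("/sys/module/pcie_aspm/parameters/policy")
    print("ASPM policy:", policy if policy is not None else "not available")
```

Calling `report()` after the reboot should print `True` for the command line check if the option was picked up.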