Bursts of PCIe AER Log Entries Approximately Every 10 Seconds

I didn’t think to peek at syslog until the system caught a T10 protection information error. That main issue has since been cleared up, but there’s a residual sluggishness with the VMs that I haven’t shaken off. And now that I’ve seen these entries, I can’t unsee them:

The picture doesn’t capture the full extent of my situation, but it does show that this has been going on for a while. These entries have been filling the logs in bursts roughly every 10 seconds.
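
To put a number on that cadence, something like this rough Python sketch can tally the gaps between bursts. It assumes journalctl -k -o short-iso output and simply treats any kernel line containing "AER" as an error entry, so adjust the filter to match your own log format:

```python
#!/usr/bin/env python3
# Rough sketch: measure the spacing between PCIe AER bursts in the kernel log.
# Assumes `journalctl -k -o short-iso` is available and that the relevant
# lines contain the string "AER"; tweak the filter for your own log format.
import subprocess
from datetime import datetime

log = subprocess.run(
    ["journalctl", "-k", "-o", "short-iso", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

# Collect the timestamps of all AER lines.
times = []
for line in log.splitlines():
    if "AER" not in line:
        continue
    stamp = line.split(" ", 1)[0]  # e.g. 2024-06-01T12:34:56+0200
    try:
        times.append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z"))
    except ValueError:
        pass  # skip boot markers and anything else without a timestamp

# Treat entries less than 2 seconds apart as one burst, then print the gap
# between the starts of consecutive bursts.
bursts, prev = [], None
for t in times:
    if prev is None or (t - prev).total_seconds() > 2:
        bursts.append(t)
    prev = t

for a, b in zip(bursts, bursts[1:]):
    print(f"{b.isoformat()}  +{(b - a).total_seconds():.1f}s since previous burst")
```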

  • 01:00.0/00:01.0/00:01.1 can be attributed to the graphics card slot (see the lspci sketch after this list). It held a GeForce GT 1030 until midnight of 2024-13-14 and an RTX 4000 Ada after that. Since the behavior persists across video card changes, the problem is almost certainly not card-related.
  • The remaining errors, under 00:02.1, come from chipset-connected devices. 5c:00.0 corresponded to an Intel Optane P5800X which, until today, was connected via an M.2-to-OCuLink-to-ICY DOCK path that includes a redriver. The OCuLink cable is 65 cm and was purchased from Supermicro. I had also inserted it into a different slot, connected via a shorter 25 cm cable that was working perfectly with the same setup on a different motherboard brand. Yet the PCIe AER entries for the P5800X did not abate (they now appear as AER entries for 55:00.0 after I moved it to a different slot in the ICY DOCK enclosure). The Intel Optane P4800X works fine, though.
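
For completeness, this is roughly how the bus addresses can be mapped back to device names with lspci; the addresses below are just the ones mentioned above, so substitute the ones from your own log:

```python
# Map the bus addresses from the AER lines back to device names via lspci.
import subprocess

for bdf in ["01:00.0", "00:01.0", "00:01.1", "00:02.1", "5c:00.0", "55:00.0"]:
    name = subprocess.run(["lspci", "-s", bdf],
                          capture_output=True, text=True).stdout.strip()
    print(name if name else f"{bdf}: not present")
```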

I’m beginning to suspect either the motherboard or EMI taking a toll on signal integrity. I’ve also had brief episodes of Linux complaining about possible EMI while initializing USB ports and SATA drives, until I moved some cables around. But why would the primary PCIe slot generate so many AER entries? The Broadcom P411W-32P gave no such trouble in the secondary slot, and that’s the one closest to where all the cables and wires run. :thinking:

Should I panic or is this what a contemporary system is supposed to tolerate?

  • Motherboard: ASRock X670E Taichi
  • CPU: AMD Ryzen 7 7700X
  • PSU: XPG Fusion 1600 Titanium

EDIT: The entries coming from the GPU went away after adding the kernel command line option pcie_aspm=off.

The PCIe 4.0 errors over OCuLink, however, are unresolved.
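
In the meantime, here’s a rough sketch of one way to keep an eye on that link through sysfs. 0000:55:00.0 is just the address the drive currently shows up at on my system, so adjust the path for yours; if the link has trained down from 16 GT/s, or the correctable-error counters keep climbing, that points at signal integrity rather than the drive:

```python
#!/usr/bin/env python3
# Print the negotiated link speed/width and the per-device AER counters
# for one PCI device. 0000:55:00.0 is a placeholder for the drive's
# current address; substitute the BDF from your own AER log lines.
from pathlib import Path

dev = Path("/sys/bus/pci/devices/0000:55:00.0")

for attr in ("current_link_speed", "current_link_width",
             "max_link_speed", "max_link_width"):
    node = dev / attr
    print(f"{attr}: {node.read_text().strip() if node.exists() else 'n/a'}")

# Per-device AER statistics (exposed by the kernel when AER handles the
# device); correctable errors here should line up with the log bursts.
for attr in ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal"):
    node = dev / attr
    if node.exists():
        print(f"--- {attr} ---")
        print(node.read_text().strip())
```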

Were you able to fix this issue?

Well, the PCIe errors from the VM went away. I haven’t figured out the PCIe 4.0 instability, but surely a better cable (from a reputable vendor) should do the trick.

On the other hand, I’ve run into a new issue where the system randomly resets, sometimes a day or two apart and sometimes merely an hour apart. A RAM stress test passed the 40-hour mark, so it’s probably not that. I’ve now been running the system for over a day without starting the Windows VM (the host is desktop Ubuntu) and it seems stable. System load seems to be a non-factor, since the resets happened with only Microsoft Edge running in the guest.

I enabled 1 GiB huge pages, but that was at least a week before the first random reset. So I suspect the VM config change that enabled nested virtualization, which I made a few days after enabling huge pages.
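
For what it’s worth, here’s a quick way to sanity-check both suspect settings on the host (a rough Python sketch; the nested flag path assumes an AMD host with kvm_amd, and it would be kvm_intel on an Intel one):

```python
#!/usr/bin/env python3
# Sanity-check the two settings changed before the resets started:
# the 1 GiB huge page pool and the nested-virtualization flag.
from pathlib import Path

def read(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

# Nested virtualization flag exposed by kvm_amd ("1" or "Y" means enabled);
# on an Intel host this lives under /sys/module/kvm_intel/parameters/nested.
print("kvm_amd nested:  ", read("/sys/module/kvm_amd/parameters/nested"))

# 1 GiB huge page pool as the kernel currently sees it.
pool = "/sys/kernel/mm/hugepages/hugepages-1048576kB"
print("1 GiB pages total:", read(f"{pool}/nr_hugepages"))
print("1 GiB pages free: ", read(f"{pool}/free_hugepages"))
```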

12 days of uptime and counting… safe to say that the host has no issues. Time to find out where the problem is in the VM configuration.

Is this plugged into a UPS or power filter?

Neither. Everything is plugged directly into the wall.