I didn’t think to peek at syslog until the system caught a T10 protection information error. While my main issue was cleared up, there’s been a residual sluggishness with the VMs that I haven’t shaken off. And now that I’ve seen these log entries, I can’t unsee them:
The picture doesn’t tell the whole story of my situation, but it does show that this has been going on for a while. These entries have been filling the logs in bursts roughly every 10 seconds.
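For anyone curious about the cadence without the screenshot, something like this shows it on a systemd-based host (the match string may need adjusting to whatever your AER lines actually say):

    # kernel messages from the current boot, filtered for AER reports
    journalctl -k | grep -i "AER:" | tail -n 20
    # rough count, to get a feel for how fast they pile up
    journalctl -k | grep -ci "AER:"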
01:00.0/00:01.0/00:01.1 can be attributed to the graphics card slot. It was a GeForce GT 1030 until midnight of 2024-13-14; then it was an RTX 4000 Ada. That the behavior persists across video card changes can only mean that the problem is not card-related.
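For anyone following along, the addresses map straight to devices with lspci; for example:

    # in my case 00:01.0/00:01.1 are the root port the x16 slot hangs off, 01:00.0 is the card itself
    lspci -s 00:01.1
    lspci -s 01:00.0
    # or print the whole tree to see which root port each endpoint sits behind
    lspci -tv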
The remainder of the errors under 00:02.1 are chipset-connected devices. 5c:00.0 corresponded to an Intel Optane P5800X which, until today, was connected via an M.2-to-OCuLink-to-ICY DOCK path that includes a redriver. The OCuLink cable is 65 cm, which I purchased from Supermicro. However, I had also tried it in a different slot, connected via a shorter 25 cm cable that was working perfectly on a different motherboard brand with the same setup. Yet the PCIe AER entries did not abate for the P5800X (seen as AER entries for 55:00.0 after I moved it to a different slot in the ICY DOCK enclosure). The Intel Optane P4800X works fine, though.
I’m beginning to suspect either the motherboard or EMI taking a toll on signal integrity. I’ve also had brief episodes of Linux complaining about possible EMI while initializing USB ports and SATA drives, until I moved some cables around. But why would the primary PCIe slot generate so many AER entries? The Broadcom P411W-32P gave no such issues in the secondary slot, and that’s the one closest to where all the cables and wires run.
Should I panic or is this what a contemporary system is supposed to tolerate?
Motherboard: ASRock X670E Taichi
CPU: AMD Ryzen 7 7700X
PSU: XPG Fusion 1600 Titanium
EDIT: The entries coming from the GPU went away after adding the kernel command line option pcie_aspm=off.
The PCIe 4.0 errors over OCuLink, however, are unresolved.
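For completeness, the option itself is just a kernel command line change; a minimal sketch assuming GRUB on Ubuntu (other distros and bootloaders differ):

    # append pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
    sudo update-grub
    sudo reboot
    # verify after the reboot that the option is actually active
    grep -o 'pcie_aspm=[^ ]*' /proc/cmdline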
Well, the PCIe errors from the VM went away. I haven’t figured out the PCIe 4.0 instability, but surely a better cable (from a reputable vendor) should do the trick.
On the other hand, I’ve run into a new issue where the system randomly resets from time to time, sometimes a day or two apart and sometimes merely an hour apart. A RAM stress test passed the 40-hour mark, so it’s probably not the memory. I’ve been running the system for over a day now without starting the Windows VM (the host is desktop Ubuntu) and it seems stable. System load appears to be a non-factor, as resets have happened with only Microsoft Edge running in the guest.
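For timing the resets, the journal’s boot history is one easy way to check (assuming persistent journaling, which Ubuntu enables by default):

    # each unexpected reset shows up as a new boot entry with no clean shutdown before it
    journalctl --list-boots
    # the tail of the previous boot sometimes shows what was logged right before the reset
    journalctl -b -1 -e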
I enabled 1 GiB huge pages, but that was at least a week before the first random reset. So my suspicion falls on the VM config change that enabled nested virtualization, which I made a few days after enabling huge pages.
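For reference, both settings can be confirmed from the host; a quick sketch assuming an AMD host running libvirt, with "win11" standing in for whatever the guest is actually called:

    # 1 GiB huge pages reserved/used on the host
    grep -i huge /proc/meminfo
    # nested virtualization is a kvm_amd module parameter on this platform
    cat /sys/module/kvm_amd/parameters/nested
    # the guest only benefits if its vCPU definition exposes SVM
    virsh dumpxml win11 | grep -iE '<cpu|svm'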
Interesting… I got rid of huge pages in my VM configuration (1 GiB pages) and switched the CPU to host model. The VM has been up for more than 2 days so far. By comparison, it had crashed three times in a single night as recently as Monday.
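Concretely, both changes live in the libvirt domain XML; roughly, sketched below ("win11" again a stand-in for the guest name):

    # edit the domain definition; a full guest shutdown and start is needed for these to apply
    virsh edit win11
    # inside the XML I removed:
    #   <memoryBacking><hugepages/></memoryBacking>
    # and changed the <cpu> element to:
    #   <cpu mode='host-model'/>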
I’d experiment further to find out which trick did it, but I need my system for more useful things at the moment than acting as a crash test dummy.
EDIT: Well, I spoke too soon. I left the Windows VM idle and was using the Ubuntu desktop host when a random reset struck. Five days is much better uptime than several hours, but still far less than what I need the machine to be capable of.
I’ve now ruled out power quality issues. Interestingly, though, whether huge pages are enabled or disabled makes a difference in the rate of resets: with huge pages enabled, the system could crash within minutes of starting the Windows guest, while with them disabled it might remain stable for a few days.
EDIT: Turning ASPM back on in the host fixed the problem. In the guest, I believe the conflicting Link State Power Management (LSPM) setting is what’s causing the issue.
EDIT 2: Spoke too soon about the ASPM/LSPM. I still have huge pages turned on, and turning ASPM back on made no difference.
I removed pcie_aspm=off and left the PCIe power settings in the BIOS at their defaults after an update last week. So far, the problems have not come back:
Those entries complaining about PCIe AER have been mostly absent for the past month. I might try re-enabling huge pages and see if that rocks the boat, but it’s ironic that my problems over the past few months stemmed from trying to mitigate the PCIe AER log spam in the first place.
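For anyone who wants to check the same things on their own box, these are the usual places to look (01:00.0 here is just my GPU’s address from earlier):

    # ASPM policy with pcie_aspm=off removed (the active one is shown in brackets)
    cat /sys/module/pcie_aspm/parameters/policy
    # per-device link power management state for the GPU slot
    sudo lspci -vv -s 01:00.0 | grep -i aspm
    # and a rough count of AER chatter over the last month
    journalctl --since "1 month ago" | grep -ci "AER:"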