I didn’t think to peek at syslog until the system caught a T10 protection information error. While my main issue was cleared up, there’s been a residual sluggishness with the VMs that I haven’t shaken off. And now that I’ve seen the PCIe AER entries in the log, I cannot unsee them:
The screenshot doesn’t capture my whole situation, but it does show that this has been going on for a while. These entries have been filling the logs in bursts roughly every 10 seconds.
- 01.00.0 /00:01.0 /00:01.1 can be attributed to the graphics card slot. It was a GeForce GT 1030 until midnight of 2024-13-14. Then it was an RTX 4000 ADA. That the behavior persists across video card changes can only mean that the problem is not card-related.
- The remainder of the errors under 00:02.1 are chipset-connected devices.
- 5c:00.0 corresponded to an Intel Optane P5800X which, until today, remained connected via an M.2-to-OCuLink-to-ICY DOCK path including a redriver. The OCuLink cable is 65 cm, and I purchased it from Supermicro. However, I had also tried the drive in a different slot connected via a shorter 25 cm cable, which was working perfectly on a different motherboard brand with the same setup. Yet the PCIe AER entries did not abate for the P5800X (seen as AER entries for 55:00.0 after I moved it to a different slot in the ICY DOCK enclosure). The Intel Optane P4800X works fine though.
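For anyone wanting to compare which addresses produce the most noise, a tally per device can be more telling than scrolling through bursts. Below is a minimal sketch in Python. It assumes the kernel prints AER reports in the usual `<address>: PCIe Bus Error: severity=..., type=...` shape (optionally with an `AER:` prefix); the exact wording may differ between kernel versions, so treat the regex as an assumption to adjust. The names `tally_aer` and `aer_tally.py` are made up for this example.

```python
import re
import sys
from collections import Counter

# Assumed message shape (may vary by kernel version):
#   "<driver> <domain:bus:dev.fn>: [AER: ]PCIe Bus Error: severity=<s>, type=<t>, ..."
AER_LINE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?:AER: )?PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def tally_aer(lines):
    """Count AER reports per (device address, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            key = (
                match["bdf"],
                match["severity"].strip(),
                match["layer"].strip(),
            )
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file names):
    #   journalctl -k --no-pager > kernel.log
    #   python3 aer_tally.py kernel.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (bdf, severity, layer), n in tally_aer(log).most_common():
            print(f"{n:8d}  {bdf}  {severity}  {layer}")
```

Running it before and after a change (a different cable, slot, or kernel option) gives a rough per-device count to compare instead of an impression from the scrollback.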
I’m beginning to suspect that either the motherboard or EMI is taking a toll on signal integrity. I’ve also had brief episodes of Linux complaining about possible EMI while initializing USB ports and SATA drives, until I moved some cables around. But why would the primary PCIe slot generate so many AER entries? The Broadcom P411W-32P gave no such issues on the secondary slot, and that’s closest to where all the cables and wires run.
Should I panic or is this what a contemporary system is supposed to tolerate?
- Motherboard: ASRock X670E Taichi
- CPU: AMD Ryzen 7 7700X
- PSU: XPG Fusion 1600 Titanium
EDIT: The ones coming from the GPU went away after adding the kernel command line option pcie_aspm=off.
The PCIe 4.0 errors over OCuLink, however, are unresolved.
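To double-check that the option is actually in effect after a reboot, something like the following rough Python sketch could be used. It assumes a Linux system where `/proc/cmdline` is readable and where the kernel exposes the ASPM policy at `/sys/module/pcie_aspm/parameters/policy` (that file may be absent depending on kernel configuration); the helper names are made up for this example.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if pcie_aspm=off appears as its own boot parameter."""
    return "pcie_aspm=off" in cmdline.split()


def read_text_or_none(path: str):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is None:
        print("could not read /proc/cmdline")
    else:
        print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))

    # Assumed location of the ASPM policy; not present on every kernel build.
    policy = read_text_or_none("/sys/module/pcie_aspm/parameters/policy")
    print("ASPM policy:", policy if policy is not None else "not exposed")
```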