Asus WRX80 constant AER/Bad TLP errors in logs from NVMe drives?

There appears to be a problem with the Asus WRX80 SAGE motherboard when gen4 NVMe drives are used in the Asus Hyper M.2 gen4 add-in card: the motherboard reports constant AER/Bad TLP errors in the Linux logs. Eventually the filesystem goes read-only and the system needs a hard restart.

My system specs are:
Asus WRX80 Sage gen1 motherboard, most recent BIOS version 1106
Unraid OS
TR Pro 3975wx 32-core CPU
4x 64 GB DDR4 3200 MHz RDIMM ECC RAM
1x Hyper M.2 gen4 card with 1x 500 GB Samsung 970 NVMe, 1x 1 TB Samsung 970 NVMe, and 1x 1 TB WD SN850X NVMe
1x Hyper M.2 gen4 card with 4x 2 TB Samsung 970 NVMe drives
2x LSI 9300-8i HBAs
1x Intel x710-QDA2 40 G NIC
1x Nvidia 1660 Super GPU

My Hyper M.2 cards are in PCIe slots 1 and 2.

This issue has been documented in multiple locations such as:
https://forum.level1techs.com/t/pro-ws-wrx80e-sage-se-wifi-aer-error/192187
https://forum.level1techs.com/t/asus-pro-ws-wrx80e-sage-dmesg-is-full-of-corrected-pcie-and-or-aer-errors/178004/8
this build log https://blog.tmm.cx/2023/01/18/build-log-threadripper-pro-5975wx-linux-workstation-on-the-asus-pro-ws-wrx80e-sage-se-wifi/
and this github page which I do not understand https://gist.github.com/zekome/35db528b33206e68f18439ad7fabfcd5

The error in the logs is:

Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: It has been corrected by h/w and requires no further action
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: event severity: corrected
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:  Error 0, type: corrected
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   section_type: PCIe error
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   port_type: 4, root port
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   version: 0.2
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   command: 0x0407, status: 0x0010
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   device_id: 0000:00:01.3
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   slot: 0
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   secondary_bus: 0x03
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   class_code: 060400
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0012

Device ID 00:01.3 traces back to [1022:1483] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
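
For anyone who wants to tally which bus address these errors come from across a long syslog, here is a minimal Python sketch. It assumes the log lines have the same shape as the excerpt above; the function name count_aer_sources is hypothetical and this is not an official tool.

```python
import re
from collections import Counter

# Matches the PCI bus address on lines such as "device_id: 0000:00:01.3".
# Lines like "vendor_id: 0x1022, device_id: 0x1483" do not match, because
# "0x1483" is not in domain:bus:device.function form.
ADDR_RE = re.compile(
    r"\bdevice_id:\s*([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)

def count_aer_sources(log_lines):
    """Count how often each PCI address appears in [Hardware Error] lines."""
    counts = Counter()
    for line in log_lines:
        if "[Hardware Error]" not in line:
            continue
        match = ADDR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the syslog gives a count per address, which can then be looked up with lspci to see which bridge or drive is involved.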

There is a purported fix, adding the pci=nommconf kernel boot parameter, documented here: https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp . The GitHub page above also reports a fix, but I do not understand it or how to implement it.
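
If you try the boot parameter, it is worth confirming after a reboot that the kernel actually received it. A minimal Python sketch, assuming the kernel command line is available as a string (on Linux it can be read from /proc/cmdline); the helper name has_pci_option is hypothetical.

```python
def has_pci_option(cmdline, option):
    """Return True if `option` is listed in any pci=... argument.

    The pci= parameter accepts a comma-separated list of options,
    for example "pci=nommconf,noaer", so each argument is split on commas.
    """
    for arg in cmdline.split():
        if arg.startswith("pci="):
            if option in arg[len("pci="):].split(","):
                return True
    return False
```

On a running system you would pass it the contents of /proc/cmdline together with "nommconf".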

I wanted to report that the pci=nommconf flag did not work for me on its own, but it did change the logs, which now indicate that the error originates from a device ID matching my WD SN850X NVMe drive. The fix that worked for me was manually setting both of the PCIe slots holding the Asus Hyper M.2 add-in cards to gen3 speed.
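
To check that the gen3 setting really applied, the negotiated link speed can be read from sysfs. A minimal Python sketch, assuming a Linux kernel that exposes current_link_speed and max_link_speed under /sys/bus/pci/devices/<address>/ with values such as "8.0 GT/s PCIe" (the exact string varies between kernel versions, so only the leading number is parsed). The function names are hypothetical.

```python
import re
from pathlib import Path

# Transfer rate in GT/s mapped to PCIe generation.
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}

def parse_gts(text):
    """Extract the numeric GT/s value from a sysfs link speed string."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", text)
    return float(match.group(1)) if match else None

def link_generation(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the current and maximum PCIe generation for one device.

    A value is None when the attribute is missing or cannot be parsed.
    """
    device_dir = Path(sysfs_root) / address
    result = {}
    for name in ("current_link_speed", "max_link_speed"):
        try:
            gts = parse_gts((device_dir / name).read_text())
        except OSError:
            gts = None
        result[name] = GEN_BY_GTS.get(gts)
    return result
```

Calling link_generation("0000:00:01.3") on the root port from the log above should report generation 3 for the current link speed once the slot is forced to gen3 in the BIOS.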

I have read that this could be due to PCIe redrivers with incorrect settings on the bottom three PCIe slots, causing AER errors for devices in those slots, but my NVMe drives are in slots 1 and 2.

Has anyone else experienced this issue when using gen4 NVMe drives, or a mix of gen3 and gen4 NVMe drives, on this motherboard?

Edit: I was still running into system crashes. I removed pci=nommconf and set all PCIe slots to gen3 speed. This seems to have fixed the issue completely. I’m going to look into getting a different brand’s board in the near future.
