There is a problem with the Asus WRX80 SAGE motherboard when gen4 NVME drives are used in the Asus Hyper M.2 gen4 add-in card: the motherboard reports constant AER/Bad TLP errors in the Linux logs. Eventually the filesystem goes read-only and the system needs a hard restart.
My system specs are:
Asus WRX80 Sage gen1 motherboard, most recent BIOS version 1106
Unraid OS
TR Pro 3975wx 32-core CPU
4x 64 GB DDR4 3200 MHz RDIMM ECC ram
1x Hyper M.2 gen4 card with 1x 500 GB Samsung 970 NVME, 1x 1 TB Samsung 970 NVME, and 1x 1 TB WD SN850X NVME
1x Hyper M.2 gen4 card with 4x 2 TB Samsung 970 NVME drives
2x LSI 9300-8i HBAs
1x Intel x710-QDA2 40 G NIC
1x Nvidia 1660 Super GPU
My Hyper M.2 cards are in PCIe slots 1 and 2.
This issue has been documented in multiple locations such as:
https://forum.level1techs.com/t/pro-ws-wrx80e-sage-se-wifi-aer-error/192187
https://forum.level1techs.com/t/asus-pro-ws-wrx80e-sage-dmesg-is-full-of-corrected-pcie-and-or-aer-errors/178004/8
this build log https://blog.tmm.cx/2023/01/18/build-log-threadripper-pro-5975wx-linux-workstation-on-the-asus-pro-ws-wrx80e-sage-se-wifi/
and this github page which I do not understand https://gist.github.com/zekome/35db528b33206e68f18439ad7fabfcd5
The error in the logs is:
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: It has been corrected by h/w and requires no further action
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: event severity: corrected
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: Error 0, type: corrected
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: section_type: PCIe error
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: port_type: 4, root port
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: version: 0.2
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: command: 0x0407, status: 0x0010
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: device_id: 0000:00:01.3
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: slot: 0
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: secondary_bus: 0x03
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: class_code: 060400
Apr 20 00:40:00 MSaladUnraid kernel: {52}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
Device ID 00:01.3 traces back to [1022:1483] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
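For anyone trying to work out which device their own errors point at, here is a rough Python sketch that tallies the device_id addresses from log lines in the format shown above. The function name is my own invention, and it assumes the records look exactly like the excerpt in this post; other kernels may format them differently.

```python
import re
from collections import Counter

# Matches the "device_id: 0000:00:01.3" line of an APEI PCIe error record.
# The "vendor_id: 0x1022, device_id: 0x1483" line is not matched, because the
# pattern requires device_id to follow "[Hardware Error]:" directly and to be
# a domain:bus:device.function address rather than a hex ID.
ADDRESS_RE = re.compile(
    r"\[Hardware Error\]:\s+device_id:\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_errors_by_device(lines):
    """Count how many error records name each PCI address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ADDRESS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, pass an open log file, for example `count_errors_by_device(open("/var/log/syslog", errors="replace"))`. The log path is an assumption and may differ on your system. Each address in the result can then be looked up with lspci, as I did for 00:01.3.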
There is a purported fix of enabling the pci=nommconf flag, documented here: https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp
The github page above also reports a fix, but I do not understand it nor how to implement it.
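If you try the flag, it is worth confirming that the kernel actually booted with it. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper name is made up for illustration:

```python
def pci_options(cmdline):
    """Return the set of options given via pci=... in a kernel command line.

    Hypothetical helper: several options can share one parameter,
    e.g. "pci=nommconf,noaer", so each parameter is split on commas.
    """
    options = set()
    for param in cmdline.split():
        if param.startswith("pci="):
            options.update(param[len("pci="):].split(","))
    return options
```

For example, `"nommconf" in pci_options(open("/proc/cmdline").read())` should be true after a reboot with the flag set.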
I wanted to report that the pci=nommconf flag did not work for me on its own, but it did cause the logs to change: they now indicated that the error was originating from a device ID that matched my WD SN850X NVME drive. The fix that worked for me was manually setting both of the PCIe slots with the Asus Hyper M.2 add-in cards to gen3 speed.
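To check that a slot really negotiated gen3 after changing the BIOS setting, the kernel exposes the link speed of each PCIe device in sysfs. Here is a sketch that reads the current_link_speed and max_link_speed attributes and converts them to a generation number. The function names are my own, and the exact text format of these files varies between kernel versions, so treat it as a starting point rather than a tested tool.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s mapped to PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(speed_text):
    """Map a sysfs speed string such as '8.0 GT/s PCIe' to a generation, or None."""
    try:
        rate = float(speed_text.split()[0])
    except (IndexError, ValueError):
        return None  # empty or non-numeric text, e.g. "Unknown"
    return GEN_BY_RATE.get(rate)


def link_report(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: (current_gen, max_gen)} for devices exposing link speeds."""
    report = {}
    for dev in sorted(Path(sysfs_root).iterdir()):
        try:
            current = (dev / "current_link_speed").read_text().strip()
            maximum = (dev / "max_link_speed").read_text().strip()
        except OSError:
            continue  # device has no link speed attributes
        report[dev.name] = (pcie_generation(current), pcie_generation(maximum))
    return report
```

With the slots forced to gen3, I would expect the current generation for the drives on the Hyper M.2 cards to read 3. Note that max_link_speed reflects what the device itself supports, so it may still show 4 for a gen4 drive.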
I have read that this could be due to the use of PCIe redrivers with incorrect settings for the bottom 3 PCIe slots causing AER errors for devices in those slots, but my NVME drives are in slots 1 and 2.
Has anyone else experienced this issue when using gen4 NVME drives, or a mix of gen3 and gen4 NVME drives on this motherboard?
Edit: I was still running into the system crashes. I removed the pci=nommconf flag and set all PCIe slots to gen3 speed. This seems to have fixed the issue completely. I’m going to look into getting a different brand’s board in the near future.