ASUS WRX90, unRAID, AER Corrected Error nightmare on dual 4090 setup

I’m reaching out in hopes that someone can help me figure out what I’m doing wrong here. I get these errors:

Jul 23 22:53:02 w kernel: pcieport 0000:20:01.1: AER: Corrected error message received from 0000:21:00.0
Jul 23 22:53:02 w kernel: pci 0000:21:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jul 23 22:53:02 w kernel: pci 0000:21:00.0:   device [10de:2684] error status/mask=00000001/00000000

on my two GPUs, connected through two 90 cm riser cables in slots 4 and 5. I have tried adjusting most of the NBIO options that I know a little about, but cannot seem to get anywhere. My passthrough attempts cause an ACPI BIOS Error when I try to boot Windows 11 on unRAID stable.
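(For anyone decoding the log above: status bit 0 of the correctable error status register means a Receiver Error, which matches the “Physical Layer” type. If I’ve read the docs right, you can dump the raw AER registers as root with:

lspci -s 21:00.0 -vv | grep -A7 'Advanced Error Reporting'

where the CESta line shows which correctable errors are currently latched.)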

My setup is as follows:

Motherboard: WRX90 Sage
CPU: 7955WX
RAM: 4x 64GB Kingston Enterprise 4800 MT/s
Cooler: Noctua SP6 w/ standard dual 140mm fans
PSU: Seasonic 1600W Titanium
GPU: Dual RTX 4090 Suprim (the 2-slot water-cooled ones)

  • I’ve connected them to two 90cm riser cables
  • I also have an Intel Arc A750 as the main display head (slot 7)

Add in cards:

  • 2x ASUS Hyper M.2 x16 Gen5 cards (slots 1 and 2) (8x FireCuda 540 2TB sticks)
  • RTX 4090 (slot 3)
  • Broadcom MegaRAID SAS 9361-24i (slot 4)
    • 12x HGST 12TB drives
    • 12x Samsung 990 Evo 8TB drives
  • RTX 4090 (slot 5)
  • Intel X520 converged NIC (slot 6)
  • Intel Arc A750 (slot 7)
  • 4x Corsair MP700 1TB (in motherboard slots)
  • Intel WiFi 6 + Bluetooth 5.2 Desktop kit

I suspect it has something to do with the riser cables, but because of space limitations I can’t mount the cards directly in the slots.

Is there something I can do here?

Try setting pcie_aspm=off on the kernel command line and see if that helps. I had to do that to silence similar errors on Ubuntu.
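I’m not on unRAID myself, but I believe the kernel command line there lives on the flash drive in /boot/syslinux/syslinux.cfg (also editable in the web UI under Main → Flash). The stanza should end up looking something like:

label unRAID OS
  menu default
  kernel /bzimage
  append pcie_aspm=off initrd=/bzroot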

Thanks! After some reading I can see that this simply disables ASPM, but as far as I know, ASPM is supported by my motherboard. The Windows VMs are choppy at best, even with this set.
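(For the record, you can check which ASPM policy the kernel is actually running; the active one is shown in brackets:

cat /sys/module/pcie_aspm/parameters/policy

which prints something like [default] performance powersave powersupersave.)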

Is there anyone here familiar with this mobo who has made it work well with unRAID and Windows guests? What BIOS settings did you use, and is there anything I should consider with my current setup?

1 Like

This raises my eyebrow (left).

The RTX 4090 is a PCIe Gen 4 device, and Gen 4 is not known to be cable-friendly without intervention (e.g., a retimer), especially if you’re hanging the cables off slots that far down.

If your cables must be that long, try taking the slot down a notch to Gen 3 speeds.
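You can verify what the link actually trained at from Linux, using the bus address from your log (16 GT/s is Gen 4, 8 GT/s is Gen 3):

lspci -s 21:00.0 -vv | grep -E 'LnkCap:|LnkSta:'

If LnkSta already shows a downgraded speed or width, that’s another hint the cables are marginal.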

If the performance penalty is too great, consider a retimer, Slim SAS cable, and PCIe slot adapter chain.

There’s also this massive thread on general problems, including ones like yours: https://forum.level1techs.com/t/a-neverending-story-pcie-3-0-4-0-5-0-bifurcation-adapters-switches-hbas-cables-nvme-backplanes-risers-extensions-the-good-the-bad-the-ugly

2 Likes

@LiKenun, I was going to add the same information you did. @swoy, when your issue is fixed, you might want to consider making your Windows VM use only specific cores. I forget what the technical term is.
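I think the term is core isolation. On unRAID, if I remember right, there’s a CPU Pinning page under Settings, or you can keep the host off those cores entirely with the isolcpus kernel parameter; the core list below is only an example, so match it to your topology:

append isolcpus=8-15 initrd=/bzroot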

Yes, core isolation, already done this :slight_smile:

1 Like

Yeah, that is definitely my next step. I’m also contemplating life choices, and I’m looking at C-Payne’s bifurcation adapters and the like. But I feel I need to read up on that a bit more first.

I know there are redriver options in the BIOS for slots 5, 6, and 7, but I have no clue how to work those options.

1 Like

What size power supply feeds the two GPUs? The first thing I thought of was a power issue. It could also be a bad riser cable. Try swapping them and see if the error changes.

1 Like

The power supply is the one I mentioned in my initial post, a Seasonic 1600W; it has two native 12VHPWR connectors, which I use.

Both GPUs show the same errors, even when I move them to the top two slots.

1 Like

Just to clarify, I have the exact same motherboard, but it looks like my errors are different from yours (I rebooted without that setting to check, and to see if the new 0502 BIOS fixed it; it didn’t). Yours are at the physical layer, and mine were at the data link layer:

2024-07-23T21:32:13.067691-04:00 mighty kernel: pcieport 0000:00:01.1: AER: Correctable error message received from 0000:01:00.0
2024-07-23T21:32:13.067707-04:00 mighty kernel: nvidia 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
2024-07-23T21:32:13.067708-04:00 mighty kernel: nvidia 0000:01:00.0:   device [10de:1b80] error status/mask=00001000/00000000
2024-07-23T21:32:13.067710-04:00 mighty kernel: nvidia 0000:01:00.0:    [12] Timeout

Hi @swoy, have you tried checking for firmware/driver updates? Make sure that all components, including the motherboard, GPUs, and other PCIe devices, have the latest firmware and drivers. That can sometimes resolve compatibility and communication issues.

I have updated to 0502. I also booted Windows on bare metal (which runs fine, btw) and updated to the latest GPU drivers, to make sure the cards have the latest firmware.

I still suspect I need to adjust something in the BIOS, because I get weird issues when I pass PCIe devices through to guests or when I shut the guests down. Guest start-up takes a very long time, and I see CPU spikes that coincide with stutters on the screen.

The Intel Arc card fails PCI reset after a guest reboot, even with the efifb driver disabled and intel_iommu=on set. That said, I haven’t tried any other ways of unbinding the card.
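(If I do try manual unbinding, my understanding is that the sysfs route looks roughly like this; the PCI address is a placeholder for wherever the Arc card enumerates:

echo 0000:07:00.0 > /sys/bus/pci/devices/0000:07:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:07:00.0/driver_override
echo 0000:07:00.0 > /sys/bus/pci/drivers_probe

i.e., detach it from its current driver, then hand it to vfio-pci before the guest starts.)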

I am curious to know your BIOS settings. One thing that really bugs me is the seemingly sparser configuration options and documentation for the WRX90 board; on my WRX80, the BIOS provides a lot more.

Another gripe I have is the Pre-Boot DMA Protection, which takes 15 minutes on the WRX90 but seconds on the WRX80.

2 Likes

Did you ever figure this out?
I have a Pro WS WRX90E-SAGE SE / 7965WX and I’m getting these issues on slot 4 with a NIC, so I don’t think it’s a power issue…

Latest BIOS, good power supply, and I’ve tested multiple NICs and a video card; they all throw this error. I’ve also dropped the link speed from “Auto” to match each card’s PCIe generation.

No, I never figured this out. ASUS support is not very helpful either; they don’t escalate these issues, and first line just drowns the requests in basic troubleshooting steps. Unless you have 2M+ subscribers on YouTube, they don’t seem willing to help.

I ended up booting Fedora, which seems stable enough. unRAID and Proxmox don’t work properly on this motherboard.

I ordered a couple of C-Payne Gen 5 switches and retimers. Once I get the proper configs for my setup, I will report back on whether this is solved (and whether I can finally run Proxmox).

1 Like

Yeah,

So I got the PCIe switches installed: no change on slots 3 and 4. It looks like those slots are somehow the culprits themselves. The retimers should work well, but I still get these errors… I’m starting to think that this is a design issue with the traces on these boards, or some power management setting I don’t know about. Anyone here with experience of these issues?

I have tried everything I could think of. All power-saving features are switched off and I’m on the latest BIOS; I’m at a loss here. I can’t have AER errors like these in a production environment :confused:
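(I’m aware I could at least silence the log spam with pci=noaer on the kernel command line, e.g.:

append pci=noaer initrd=/bzroot

but that just stops the kernel from reporting and handling AER; it doesn’t fix anything.)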

Are you getting multiple errors, or just one per card/slot on boot?
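If your kernel exposes them, the per-device AER counters are an easy way to tell without scraping dmesg (substitute your device’s address):

cat /sys/bus/pci/devices/0000:21:00.0/aer_dev_correctable

It prints a running total per error type (RxErr, BadTLP, BadDLLP, Timeout, …).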

I recall having a correctable AER event on my WRX90 as well (for an RTX 3090), but only once at boot-up, not under load or at any other time afterwards. And the link speed is as expected.

I was unable to reproduce it and everything was working well, so I wrote it off as a harmless quirk.

And one last question: do you get these events when doing a cold boot AND when doing a warm reboot?

Asking because I vaguely recall something like this (a single AER event at boot) being caused by a race condition in powering up the card.

That is a very good point. I actually get more from a cold boot. I assume there is some tuning or warming up to do, then. With all the bells and whistles turned off or set to the max, I get 5-10 of them early on; after that they usually only happen once in a while. If I turn the screens off while the graphics cards are attached to those slots, then turn them back on, the errors show up again and can sometimes crash Fedora completely.

A disclaimer here: I have no idea what I’m talking about; this is just sketchy educated guessing on my part, as I know nothing about trace quality or PCIe connections.

I’m waiting for a reply from ASUS on these issues, but I’m afraid this is something I cannot fix with software.