WRX90E won’t boot with 6 GPUs

Hello,

I’m building a pretty monstrous system around my WRX90E and a Threadripper 7985WX, but I’ve hit a snag: it boots fine with 5 GPUs, but adding a 6th GPU keeps it from POSTing. It gets to Q-code 94, then jumps back to 00 and tries again. Endless loop.

I’ve tried removing a card from slot 2, and it boots fine. I tried removing a card from slot 3 instead, and it also boots fine. So it seems to be related to having that many GPUs rather than a problem with any particular slot.

I’ve tried disabling Fast Boot, turning on Resizable BAR support, turning off VGA support on the board, disabling onboard audio and Wi-Fi, setting lanes to Gen 4 for each PCIe slot, and toggling the IPMI switch on the motherboard, but nothing seems to be working.

The first 4 slots hold NVIDIA RTX 6000 Ada GPUs. The 5th and 6th slots hold RTX 4090s.

I’m on the latest BIOS (currently 0404).

Any ideas?

3 Likes

Have you tested forcing PCIe Gen 3 or Gen 4 on all slots yet?

Also, did you do a round-robin test with the GPUs to see if maybe one is marginal? I.e., test each card in each slot?

1 Like

Hello @Devinkb

That really is a pretty monstrous workstation.
Just out of curiosity, what RDIMMs are you using?

I forced PCIe Gen 4 on all the slots, which made no change. Haven’t tried Gen 3 yet.

I have tested each of the cards individually in another system and they all work, but I haven’t tried swapping them between slots. That’s my next test: specifically, I think I’ll move the card from slot 2 or 3 to slot 7. I’ll also try removing the card in slot 5 or 6 to see if it has something to do with having 4 of the same card. I’m pretty convinced this has something to do with addressing or mapping, either of the number of cards or of the number of display outputs. I have even tried connecting a cable to each of the 24 display outputs and letting it cycle (hoping it would eventually catch a display output), but to no avail.

Thanks for the suggestions!

As for the RAM, I’m using 8 x 64GB sticks of:
Samsung M321R8GA0BB0-CQKMG 64GB (1x64GB) DDR5 4800MHz PC5-38400
Dual-Rank ECC Registered 1.1V CL40 DDR5 SDRAM 288-pin RDIMM memory

All sticks show up fine in Windows, and I had no problems getting the machine to POST or getting it to recognize the RAM.

1 Like

ONLY ADJUST 1 SETTING AT A TIME! Otherwise you will be in hell trying to figure out which of the 5 changes made the difference.

Try SR-IOV on/off as well.

Make sure the slots are correctly configured based on the number of required lanes.


You can also try adjusting the PCIe gain on the bottom slots to see if you have a signal integrity issue (part of why I suggested testing Gen 3).


I would suggest turning off PCIe hot-plugging, at least as a testing step.


These settings should adjust addressing if that is actually the issue.


In the NBIO/IOMMU section there are some addressing settings you can test as well.

4 Likes

Thanks for the new suggestions! Been working through some of these for the last 20 minutes. And yes, just adjusting one thing at a time.

Here’s what I’ve tried:

  • Moved the card from slot 3 to slot 7 - no change. I put it back in slot 3.
  • Removed the card in slot 6 so we’re back down to 5 GPUs (just to see whether it mattered which GPU was removed). As suspected, the machine boots properly again now that there are only 5 GPUs. It appears it doesn’t matter which GPU is removed, or which slot it is removed from; it always works as long as there are only 5 GPUs.
  • Enabled Ten Bit Tag Support in the BIOS - no change.

Will work through some of these other suggestions too. Thanks!

GL!

I don’t have this board or a comparable config, so I’m just working off the manual and way too much experience with server motherboards, but I’m happy to help as much as I can.

The only comment I can add on multi-GPU is to enable Above 4G Decoding; that fixed it on my WRX80.

I’ve seen this help on systems before; going to Gen 3 allowed it to boot.

Alright, just checking back in here - still no luck.

  • I’ve tried going to Gen 3 on all slots - no luck.
  • I’ve toggled the three settings PCIe ARI Support, PCIe ARI Enumeration, and PCIe Ten Bit Tag Support (enabled them one at a time, and all together) - no luck.
  • I don’t see any PCIe slot hot-plug settings - maybe another setting somewhere is hiding that menu option from me?
  • I also don’t see Above 4G Decoding as an option anywhere (not even in the BIOS manual).

I can tell when it’s going to fail because it gets into a Q-code 94 PCI enumeration loop (it shows 94, then 00, and the process starts all over again). When I remove a GPU (down to 5 GPUs), it shows 94, then 96, and then the boot logo shows up on screen. So I think something is really failing while enumerating or “assigning resources”. I’ve looked at some of the other threads on PCI enumeration, and it always seems to be something small, like CSM being enabled or not enough power. I have three 1600W power supplies hooked up to this thing! :smiley:
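
For anyone following along, here’s the rough back-of-envelope behind my addressing theory. All of the numbers below are assumptions (typical NVIDIA BAR defaults with ReBAR off, and a rough guess at the usable 32-bit MMIO window), not values measured on this board:

```python
# Hypothetical back-of-envelope: GPU MMIO demand vs. the 32-bit window.
# All sizes here are assumptions (typical NVIDIA defaults), not measured.
MIB = 2**20
per_gpu = 256 * MIB + 32 * MIB + 16 * MIB  # assumed BAR1 + BAR3 + BAR0, ReBAR off
window = 1536 * MIB                        # assumed usable MMIO space below 4 GiB

for n in (5, 6):
    need = n * per_gpu
    verdict = "fits" if need <= window else "does NOT fit"
    print(f"{n} GPUs: need {need // MIB} MiB of {window // MIB} MiB -> {verdict}")
```

With those assumed sizes, five cards just squeeze under the window (1520 of 1536 MiB) and six overflow it (1824 MiB), which would match the 5-works/6-fails pattern exactly; Above 4G Decoding (which ReBAR is supposed to imply) is what lets the firmware place the big BARs above the 4 GiB line instead.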

If somebody knows how or where to turn on Above 4G Decoding or turn off PCIe hot-plugging, I’m all ears.

And thanks again to everybody that has chimed in. I appreciate it!

1 Like

How are you physically fitting those GPUs on the board? Adapters? Extenders?

I’m using single-slot waterblocks for all of them, rack style with the ports all at the end: Heatkiller waterblocks for the RTX 6000 Adas, and EKWB rack-style blocks for the Zotac AMP 4090s.

1 Like

CSM is under Boot > CSM

PCIe Hotplug is Advanced > AMD PBS > PCIe Hotplug

REBAR is Advanced > PCI Subsystem > Resize BAR support

Try your power settings too: Advanced > APM Configuration > Power On By PCI-E (set it to Disabled)

Try adjusting the PCIe error handling too: Advanced > AMD PBS > RAS > PCIe GHES Notify Type

Under AMD CBS > NBIO Common Settings there are several PCIe-related toggles you can try flipping

Also, are the boot errors showing up in the BIOS/BMC logs?
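
If the host never POSTs, you can still pull the SEL out-of-band over the network. A minimal sketch (assuming ipmitool is installed and the BMC is reachable; the IP and credentials below are placeholders):

```python
#!/usr/bin/env python3
# Sketch: read the BMC's System Event Log out-of-band with ipmitool,
# to check for PCIe/enumeration events even when the host never POSTs.
# BMC_IP, USER, and PASSWORD are placeholders for your own values.
import subprocess

BMC_IP = "192.168.1.100"  # placeholder: your BMC's address
USER = "admin"            # placeholder credentials
PASSWORD = "password"

result = subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", BMC_IP, "-U", USER, "-P", PASSWORD,
     "sel", "list"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```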

Have you opened a support ticket with Asus?

1 Like

I am curious how you connected three PSUs to your system. This board supports only dual identical PSUs.

Thanks infinitevalence.

  • CSM is disabled.
  • Resizable BAR is enabled (but there’s no Above 4G Decoding setting in the BIOS that I can tell - I’m assuming Resizable BAR effectively achieves the same thing, or is turned on with it; see the sketch after this list for one way to check the actual BAR sizes).
  • The APM Configuration menu does show the Power On By PCI-E option, and it is disabled. The PCIe Hotplug option you mentioned under AMD PBS does not show up in my AMD PBS menu (only USB settings show up there). I’m assuming all that stuff is just disabled.
  • I will take a look at the error handling and see if I can get something to show up!
  • The boot errors have not been showing up in the BMC logs.
  • I probably have more learning to do in the RAS and NBIO settings areas. Apart from the ones you’ve already mentioned, I don’t know what a lot of the settings do.
  • I have talked with an ASUS support tech with no luck and no conclusion, but I will end up there again once I give up on trying things. (The support here has been way more helpful - thank you!)
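
As mentioned in the list above, here’s a minimal sketch for checking the actual BAR sizes rather than guessing, assuming you can boot a Linux live USB with the five working cards; it sums the memory BARs of every NVIDIA device from sysfs:

```python
#!/usr/bin/env python3
# Minimal sketch: from a Linux live USB, sum the memory BARs of every
# NVIDIA PCI device via sysfs to see the real per-card MMIO demand.
# The sysfs paths and the IORESOURCE_MEM flag (0x200) are standard Linux.
import glob
import os

NVIDIA = "0x10de"       # PCI vendor ID for NVIDIA
IORESOURCE_MEM = 0x200  # flag bit marking a memory-mapped resource

total = 0
for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    with open(os.path.join(dev, "vendor")) as f:
        if f.read().strip() != NVIDIA:
            continue
    dev_total = 0
    with open(os.path.join(dev, "resource")) as f:
        for line in f:
            start, end, flags = (int(x, 16) for x in line.split())
            if end > start and flags & IORESOURCE_MEM:
                dev_total += end - start + 1
    total += dev_total
    print(f"{os.path.basename(dev)}: {dev_total / 2**20:.0f} MiB of BARs")

print(f"Total NVIDIA MMIO demand: {total / 2**30:.2f} GiB")
```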

Gluon-free: you can basically just daisy-chain PSUs together and achieve a multi-PSU system; I’ve been doing this for years. The board’s dual-PSU feature seems to just be an extra perk that allows the board itself to pull power from two supplies (AFAIK, in a typical non-server multi-PSU system, the motherboard itself only pulls power from a single supply). I typically use a dual-PSU adapter connected via a Molex cable, and I plugged the dual-PSU adapter that came with the motherboard into that to attach the 2nd and 3rd power supplies.

On a final and more fun note, I finally succumbed to just booting into Windows last night with 5 of the 6 cards working, and I benchmarked the GPUs in Cinebench 2024. Compared to the highest score on hwbot.org, I’m only about a thousand points below the number-1 spot holder, at 126k points (they were running six 4090s and achieved 127k points). I’m hoping this last GPU pushes me over the top! (I’m not building this system for benchmarks, of course, but it’s still fun!)

5 Likes

I assume it’s for rendering, AI/ML, or research, because no one is gaming on 6 GPUs :stuck_out_tongue:

1 Like

The most frustrating thing - Huang gimped even the RTX 6000 Ada cards by disabling NVLink. So this is not the best even for researchers. And the H100 costs so much that no institution (outside big IT and the world’s top laboratories) can afford to build servers with it…

What about ROCm and CUDA translation on AMD?

1 Like

Are 2 of the 3 PSUs identical and connected properly to the motherboard via the included cable? I feel that this could be the problem; this board may have some kind of protection.

Haven’t tried it; I have only one 4090, and outside of gaming and rendering I’m doing small ODE systems that can fit in 24 GB of VRAM. But yes, if that succeeds, my future cards will be AMD for sure.