Threadripper Pro MB: Thoughts on Supermicro, Gigabyte & Asus WRX80

Hi,

I currently need to purchase a Threadripper Pro motherboard from Newegg. My application involves running 6x 3090s. I already have a 5955WX CPU that I can't return.

I originally got the Asus Pro WS WRX80E-SAGE WIFI-SI R, which died (no boot, nothing) after a BIOS update. After chatting with Asus Support, they recommended I return it, and now I have a credit to use toward a TR Pro motherboard.

My thoughts so far:

  • The Asus WRX80 Sage was my first choice-- it has all the bells and whistles, but it's the first motherboard in years that has completely died on me and had to be returned.
  • The Supermicro MBD-M12SWA-TF-O seems good, but it costs close to the Asus, lacks the second 10G NIC, and has one fewer PCIe expansion slot-- which does not matter that much.
  • The GIGABYTE MC62-G40 seems OK, but it may not be able to run six GPUs because it lacks sufficient on-board supplemental power. Does anyone have experience running 6x GPUs on the MC62-G40?

I am planning to use PCIe 4.0 x16 riser cables for the GPUs. I have an EPYC system built on the ASRock ROMED8-2T/BCM motherboard with an AMD EPYC 7532, also using PCIe 4.0 x16 riser cables. I can get Gen 4 speeds if I use 2-3 GPUs. Running all 6x 3090 FEs is stable when the BIOS is set to PCIe Gen 3 at x8-- though links sometimes drop down to x1. Trying to run all six GPUs at PCIe 4.0 x16 results in a very long boot time and PCIe AER warnings (not errors), which leads to a very slow, non-performing machine and VMs not working in Proxmox.
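
For reference, this is how I verify what link each GPU actually negotiated after boot on Linux (the bus address below is just an example-- pull yours from lspci):

# report current vs. maximum link generation and width for every NVIDIA GPU
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv

# or inspect a single device directly; LnkCap is what the link can do,
# LnkSta is what was actually negotiated
sudo lspci -s 21:00.0 -vv | grep -E 'LnkCap|LnkSta'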

I am hoping that, since the Threadripper Pro boards are meant for workstations, they will be better able to handle the GPU configuration I am working with.

For the OS, I'm using Proxmox 7.3. PCI passthrough of the NVIDIA cards to Ubuntu works surprisingly well.
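
For anyone curious, the passthrough setup is just the standard vfio recipe from the Proxmox docs; roughly like this (the VM ID and device address are placeholders for illustration):

# /etc/default/grub on the host: enable the IOMMU (AMD), then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

# /etc/modules: load the vfio modules at boot
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

# attach a GPU to a VM (VM 100 and 0000:21:00.0 are examples; pcie=1 needs a q35 machine type)
qm set 100 -hostpci0 0000:21:00.0,pcie=1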

Which Threadripper Pro boards would you recommend for running six GPUs connected with PCIe Gen 4.0 x16 riser cables?

thanks,


As far as I'm concerned, the ASRock Creator is the best one out there, as it's currently the only one that supports PBO and Curve Optimizer on 5000-series chips.

It looks amazing, but unfortunately Newegg Canada does not have it in stock.

There are about a million threads in hardware forums around the world that tell you never to buy a certain product, or even brand, again because someone had a bad experience. Production and quality control are never a hundred percent, no matter what you buy, and there is always a small chance you receive a faulty unit. If you like the ASUS board the most, then I would recommend you get another one. I can tell you that I have heard of nothing that would indicate a recurring problem with these boards. The ASUS Pro WS WRX80E-Sage SE WIFI (and the II) are mighty fine boards to have. I think you are aware that it has two additional eight-pin PCIe power connectors on the bottom of the board specifically for the scenario where someone attaches multiple high-power devices.


According to Asus support, there is a refurbished version that comes in a white box, which is what I unknowingly bought-- Newegg's description said nothing about it being refurbished. After I complained to Newegg about this, they changed the listing, removing the Asus NVMe card and the photos and adding "R" to the name, but they still do not spell out that it's refurbished.

I ended up returning the board a few times-- the first time for missing parts, since I had purchased it thinking it was new; the second time because they resent the same refurbished board, still missing the card. At that point, Newegg told me they had made an error in their description and that a new board would cost more. So I decided to try the "R" refurbished board, and it died right after a BIOS update. This saga started in December, and I just want to get the machine built.

I completely agree that one bad board does not make the entire lot suspect, but when I looked online to resolve the issues I was having, I found lots of reports of this board dying, which leads me to the opposite conclusion-- there seems to be a quality issue. I am tempted to fork out the difference and get the board new, or spend even more for the version II, but I would prefer to get something that works and not deal with issues going forward.

The entire refurbished-board saga reminds me that Newegg has gotten kind of bad (see GN's recent experience…)

I just ordered the second revision of the Asus board, so no hands-on experience yet, but it looks to be the same… just improved. And, as @GigaBusterEXE mentioned, PBO is nice to have, and at least according to the BIOS manual, the new Asus Rev 2 will support it, too.

I should have it in my hands on Monday and will report back, but if you can find it, I’d strongly consider it, too.

As for reliability, I think most high-end motherboards don't have much differentiation there.
On power to the PCIe slots, I think the Asus is a good choice, as it has a lot of supplemental power going to the board, which should help your use case. I picked it for a similar reason, as I will stick some more GPUs in mine later as well.

Well, Asus supports PBO, just not on 5000-series chips. Did the manual mention Curve Optimizer? That would be a giveaway that it will support PBO on 5000.

And I wonder if they'll bring it to the rev 1 boards with a BIOS update. I don't see why not, since the Asus VRMs are actually better, and better cooled, than the ASRock's.

Yes, page 48. I am hopeful about TR5000 overclocking.

I am curious whether the BIOS update will come to revision 1, but I am not too hopeful, since I noticed the V2 also got a price bump (here in Germany it is about 200 EUR more, if you can find the V2 at all-- it isn't even listed on the Asus Germany website yet).

Are you planning on using PCIe Gen 4 x16 riser cables?

The ASRock ROMED8-2T/BCM works great with 6x A4000s installed directly in the PCIe slots, but craps out with the riser cables.

BTW, the ASRock is the first server motherboard I've worked with, and I have to say, at this point, it's absolutely amazing-- there is no going back from IPMI.

I am not entirely sure yet which GPU config I will go with. I may put in a couple of watercooled RTX 4090s, but not with risers; I could also end up just going with two RTX 6000 Ada workstation cards.

The main reason I think the Asus board is better here is that it has two 6-pin and one 8-pin PCIe power connectors for the motherboard to supply the expansion slots, while the ASRock Creator and the ROMED8 each have only a single 6-pin for additional slot power. I am guessing that the added impedance of the riser cables took it over the limit in your case.


I got my CPU today, and yes, Curve Optimiser is here and works. I have to figure out how to best use it, though.


This is fantastic!

So it boils down to two methods:

  • Single core perf
  • All core perf

So basically, if you want single-core perf, you start with a +200 MHz boost override, then start small on Curve Optimizer, like -5, and test with single- and few-core loads. The CPU Profile test in 3DMark is the type of test you want, since it runs the workload at every thread count, but it's not perfect.

Work your way up; if you have golden silicon you can hit both a -30 curve and +200 MHz.


For all-core perf, you want to aim for a -30 curve; you might have to start with a -500 MHz override and work the frequency upward if you have bad silicon.


Why a negative frequency? Because the override primarily affects the single-core boost bins, as all-core is power limited; the higher the frequency, the larger the voltage increase needed for each step.

This is what diminishing returns refers to-- if we cut out the inefficient bins, we can get away with less voltage.

You still need to test with different threaded loads, but you should notice that your all-core and AVX turbo will be higher.
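
If you're tuning under Linux, a per-core loop like this is one way to smoke-test a curve offset before trusting it (the core range and timeout are just examples):

# pin a single-threaded stress run to each core in turn; an unstable negative
# offset usually shows up as a crash, a reboot, or machine-check events
for c in $(seq 0 15); do           # adjust 0-15 to your core count
  echo "testing core $c"
  taskset -c "$c" stress-ng --cpu 1 --cpu-method matrixprod --timeout 60s
done

# afterwards, check the kernel log for machine-check events
sudo dmesg | grep -iE 'mce|machine check'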

@QuietDevil We have purchased 6 WRX80 motherboards at my company in order to build out some engineering tooling.

We got 2 each of:

  • ASUS PRO WRX80E-SAGE WIFI
  • ASUS PRO WRX80E-SAGE WIFI II
  • ASROCK WRX80 Creator R2.0

Both versions of the ASUS motherboard have issues with Ubuntu (or Linux in general): lots of PCIe AER messages, PCIe devices not being detected, etc. Some of these issues are discussed in various threads on this site (sorry, I can't include links in this post).

Even though some of these issues are marked as "Solved"-- and I even commented on a few of them with potential solutions-- we are still seeing PCIe errors reported in our dmesg logs. I think the jury is out on whether these actually result in performance degradation, but the fact that they are present bothers me.

We built out identical units with the ASRock Creator R2.0 and had absolutely zero issues with our Ubuntu install - no errors in the dmesg log.

I’m still working with ASUS to see if they can identify the issue, but judging by the fact that the ASRock works perfectly, I think it is safe to say that this is a BIOS/firmware issue.
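
For anyone who wants to compare boards, this is roughly how we tally which devices are throwing AER messages (standard tools only; the address in the last line is an example):

# count AER-related kernel messages per PCI address since boot
sudo dmesg | grep AER | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | sort | uniq -c | sort -rn

# then map the noisy addresses back to device names
lspci -s 60:03.1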

The only cons we have with the ASRock board:

  • Only five of the seven PCIe 4.0 slots support the full x16 lanes-- the other two only support x8.
  • If you have a GPU in the 7th slot, chances are it will block the front-panel USB 2.0 header, the audio header, and potentially some of the front-panel power, reset, and LED pins. ASUS thought this through and made all of the headers at the bottom of its board 90-degree connectors, so you can still use them with a GPU there.
  • I think ASRock created the R2.0 because they couldn't get the Intel LAN controllers used on the initial version of the board, so the R2.0 uses the Aquantia (Marvell) AQC113CS. It does have a Linux driver that "works", but it is not full-featured, and depending on your use case you might want to build Aquantia's "development" driver from their GitHub. Overall, probably not as reliable under Linux as the tried-and-true Intel NICs…

So in terms of stability/reliability with Ubuntu, I'd say go with the ASRock, with the one caveat that I would really prefer Intel NICs over the Aquantia NICs.


Yes, I'm seeing those too (WRX80 Wifi II). I haven't tried the new BIOS that was released yesterday yet-- have you?

I wish there was a "perfect" board, but it seems they all have compromises. At least the Asus' problems can (I hope) be fixed in software, and since I am building this as my next 3-5 year workstation, I wanted room for many GPUs, 25G or 100G Ethernet cards, and more, so I ultimately settled on the Asus. But if someone were building a system without the need for as much PCIe, I would always get the ASRock instead. I run some ASRock Rack gear in the data centre, and it has blessed me with perfect uptime and not a single issue so far.

Hi,
I was about to buy the ASUS WRX80E II mobo, but this topic got me thinking about ASRock. I still prefer the ASUS due to its feature set, but at the same time I'd rather not deal with these PCIe AER errors. I started looking around for info, and this is what I found.

AER: maybe these errors are related to long traces. Do they come from PCIe slots 5, 6, and 7? I found some options in the BIOS for PCIe slot redriving-- did you experiment with them? Unfortunately, these options are only available for slots 5, 6, and 7 (located farthest from the CPU) and for a single M.2 port. The redrivers are there for a reason-- I think ASUS knew there could be problems with long traces and perhaps assumed redrivers were not needed for slots 1-4. Maybe that causes problems with some electrically weaker devices. Hard to say.

Another thing: the ASRock board has its slots mounted using SMT-- is it possible that this improves signal quality? It seems the new ASUS W790 mobo for Xeons also uses SMT-mounted slots.

I can suggest an experiment: put a card that shows AER errors in slot 5, 6, or 7 and try adjusting the various redriver options, watching the kernel log between boots as sketched below.
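
Something like this will show AER lines as they arrive, so you can tell whether a given redriver setting quieted things down:

# follow the kernel log, showing only AER-related lines as they appear
sudo dmesg --follow | grep -i --line-buffered aer

# or, with systemd:
journalctl -k -f | grep -i --line-buffered aer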

Or maybe all the info I wrote above is complete BS and the new BIOS fixes all the issues. BTW, do these errors occur on Windows?


I will verify with the new BIOS tomorrow (on the version two board); sadly, I didn't have the chance today. But if it is a hardware issue, that might indeed not fix it. I am seeing the errors too; I currently have my GPU in a lower slot, since it is easier to remove from there while the build isn't sitting in its final case yet.


Have a look at the info posted here (especially the last two posts-- this looks very good indeed):

This is actually what I meant: the motherboard is big, the traces are long, and the CPU's signals don't have enough drive strength to reach distant devices.


No, haven’t tried yet, but I’ll update the BIOS today and see if anything changes.

That's an interesting thought. I can do some testing to see if it makes a difference. Though, anecdotally, I can say that the sources of the AER messages have not been consistent across our units. On one unit, it looks like something in the chipset is throwing errors, along with our boot drive (1TB Samsung 980 Pro). On the other unit, it's a HighPoint RAID controller that is throwing errors. We have the 980 Pro slotted into M.2_1, which is the closest M.2 slot to the CPU and the only one that is not shared with the U.2 ports.

[   50.668030] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[   50.668036] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[   50.668038] {6}[Hardware Error]: event severity: corrected
[   50.668040] {6}[Hardware Error]:  Error 0, type: corrected
[   50.668041] {6}[Hardware Error]:   section_type: PCIe error
[   50.668042] {6}[Hardware Error]:   port_type: 4, root port
[   50.668044] {6}[Hardware Error]:   version: 0.2
[   50.668045] {6}[Hardware Error]:   command: 0x0407, status: 0x0010
[   50.668046] {6}[Hardware Error]:   device_id: 0000:60:03.1
[   50.668048] {6}[Hardware Error]:   slot: 0
[   50.668049] {6}[Hardware Error]:   secondary_bus: 0x61
[   50.668050] {6}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[   50.668051] {6}[Hardware Error]:   class_code: 060400
[   50.668053] {6}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0012
[   50.668093] pcieport 0000:60:03.1: AER: aer_status: 0x00000080, aer_mask: 0x00000000
[   50.668100] pcieport 0000:60:03.1:    [ 7] BadDLLP               
[   50.668103] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[   55.759246] pcieport 0000:60:03.1: AER: aer_status: 0x00000080, aer_mask: 0x00000000
[   55.759255] pcieport 0000:60:03.1:    [ 7] BadDLLP               
[   55.759259] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID

[  160.428124] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[  160.428131] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[  160.428133] {13}[Hardware Error]: event severity: corrected
[  160.428134] {13}[Hardware Error]:  Error 0, type: corrected
[  160.428136] {13}[Hardware Error]:   section_type: PCIe error
[  160.428137] {13}[Hardware Error]:   port_type: 0, PCIe end point
[  160.428138] {13}[Hardware Error]:   version: 0.2
[  160.428139] {13}[Hardware Error]:   command: 0x0406, status: 0x0010
[  160.428141] {13}[Hardware Error]:   device_id: 0000:29:00.0
[  160.428143] {13}[Hardware Error]:   slot: 0
[  160.428144] {13}[Hardware Error]:   secondary_bus: 0x00
[  160.428145] {13}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa80a
[  160.428147] {13}[Hardware Error]:   class_code: 010802
[  160.428148] {13}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
[  160.428217] nvme 0000:29:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[  160.428223] nvme 0000:29:00.0:    [ 0] RxErr                  (First)
[  160.428226] nvme 0000:29:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

According to lspci these two devices are:

29:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
60:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

I'm not sure what the GPP bridge is, and any information I can find is pretty cryptic, so if anyone has information about it, I would love to know.

I just read through this thread and WOW! I can't believe I did not come across it before. This indeed looks very promising.

@marcin512 Thank you for pointing this out!

I will say that some of the issues/resolutions in that thread parallel what we have experienced. Mainly, we had our NVMe RAID controller, which is a PCIe 4.0 x16 device, throwing lots of errors and demonstrating some instability. When we forced the PCIe slot it was plugged into down to PCIe 3.0, the issues went away.

So now for my speculation: it does make sense to me that a lot of these AER messages are related to PCIe signal integrity, and my guess is that there are transmission errors that get corrected via PCIe's link-level CRC and replay mechanism, which is why the errors are then reported as "corrected" (because they are). Again, for the NVMe we're seeing Rx (receiver) errors, and for the GPP bridge we're seeing BadDLLP errors, and the Data Link Layer is exactly where such packet errors would be detected and corrected.
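
For what it's worth, you can also read the latched correctable-error bits straight out of the AER capability registers (using the two device addresses from our logs above):

# dump the AER capability; the CESta line shows which correctable error
# bits (RxErr, BadTLP, BadDLLP, ...) have been latched since boot
sudo lspci -s 29:00.0 -vvv | grep -A6 'Advanced Error Reporting'
sudo lspci -s 60:03.1 -vvv | grep -A6 'Advanced Error Reporting'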

I just updated the BIOS on mine and all the AER errors are still there, sadly.