[SOLVED] 5995wx on ASUS WRX80E-SAGE with 1TB memory installed goes into an infinite Q-Code loop

Given that the 3000 series does work with 1 TB, perhaps the problem is in the microcode of the CPU itself. If that is the case, only a new BIOS can solve this (the OS can also update the microcode, but obviously only after a successful boot).
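For what it’s worth, once a system does boot, you can check whether an OS-side microcode update actually took. A minimal sketch, assuming an x86 Linux box with Python (on Windows, tools like HWiNFO report the revision instead):

```python
# Sketch: read the microcode revision the kernel reports for each core
# (the "microcode" field in /proc/cpuinfo on x86 Linux). If an OS-side
# update (e.g. the amd64-microcode package) loaded, this value changes.

def microcode_revisions(path="/proc/cpuinfo"):
    revs = set()
    with open(path) as f:
        for line in f:
            if line.startswith("microcode"):
                # lines look like: "microcode\t: 0xa008205"
                revs.add(line.split(":", 1)[1].strip())
    return revs

if __name__ == "__main__":
    print("loaded microcode revision(s):", microcode_revisions())
```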

I know from a relatively reliable source that there is a beta BIOS from ASUS which definitely solves the issue, but for some reason they don’t think it is ready for release yet. Hopefully they will release it soon (at least in “BETA” state), because the issue is obviously not isolated to a single type of DIMM.

2 Likes

Agreed. Furthermore, I believe that some of the underlying details of the 5000wx’s memory controller differ from those of the 3000wx. In addition to the benchmark results mentioned in my previous comment, I obtained the following:

Test 2 (memory occupation: ~200 GB)

  1. 5995wx + 8 x 32 GB = 39 min
  2. 5995wx + 6 x 128 GB = 10 h 50 min
  3. 5995wx + 7 x 128 GB = several hours (test canceled)
  4. 3995wx + 8 x 128 GB = 41 min
  5. 3995wx + 7 x 128 GB = 52 min

For some reason, the 3000wx seems to have much more stable performance scaling across memory channels. This is very interesting.

1 Like

I’ve been in a living hell dealing with ASUS support. I’ve submitted Word doc feedback forms, done “live tech chat”, and spoken with a “manager” about what is going on. At the end of the day, I have nothing to report other than that 1TB simply doesn’t work on this motherboard, despite it being advertised as working and despite using memory listed on their QVL.

Has anyone else found a solution?

1 Like

Is there another motherboard I should consider using?

1 Like

You are the only other Comsol Multiphysics user I know of on the forums!
I’ve been solving a huge sweep of ~50M DoF problems in the mef module (the most expensive of the AC/DC series) with about the same memory footprint as your Test 2 case, and it’s taking between 3 and 8 hours per case depending on how many steps it takes to converge (sometimes the boundary conditions are on the ragged edge of convergence).

Those results definitely make it sound like something is going wrong with the 5000 series when using 128GB RDIMMs, even when fewer than all 8 slots are populated.

2 Likes

I can try this on a buuuuunch of mobos, but I don’t have the software. Where did you get your memory? I’m tempted to order a bunch to try.

It sucks that the new workstations are also going to be DDR5… I’ll be spending a lot on memory :-/

3 Likes

I am using this precise memory: Samsung M393AAG40M3B-CYF 128GB DDR4-2933 ECC RDIMM 2SRx4 - Newegg.com

I am also tempted… to order a Gigabyte MC62-G40 and see if it can run 128GB sticks.

2 Likes

I could shoot you a standalone COMSOL application that’ll run a self-contained simulation to benchmark the systems in a similar manner.

I just realized the strangeness in the 5995wx results only occurs when consuming larger amounts of memory. The 128GB DIMMs seemed to perform fine when using little memory.

@dahlia123
Did the solution converge in around the same number of solver steps in both the very slow runs and the fast runs?

2 Likes

Great to see you using COMSOL! I’m researching plasmonics using the RF module.

Yes, the calculations ended with the same number of steps and exactly the same outputs. A very strange thing happened during the calculation (Test 2) on the 5995wx with 6 or 7 DIMMs: I frequently experienced random, severe slowdowns in the system’s responsiveness. For example, switching the active window from COMSOL to Windows Calculator took about 30 seconds. This was not like the typical slowdown that occurs when the CPU is fully occupied. I’ve never experienced such an abnormal slowdown while using COMSOL before.
Anyway, I didn’t know that COMSOL has a function for creating a standalone app. I’ll also think about sharing a standalone app that replicates my Test 2.
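If it helps to rule COMSOL itself out in the meantime, here’s the kind of standalone reproducer I have in mind; just a sketch assuming Python + numpy, with FOOTPRINT_GB as a made-up knob sized to your Test 2 footprint:

```python
# Sketch: allocate a Test-2-sized working set and time random gathers
# through it. If the bizarre slowdown is a platform/memory issue rather
# than a COMSOL issue, it should show up here too.
import time
import numpy as np

FOOTPRINT_GB = 200                        # hypothetical knob; start small (e.g. 8) to sanity-check
N = FOOTPRINT_GB * 1024**3 // 8           # number of float64 elements

data = np.ones(N, dtype=np.float64)       # writing 1.0 everywhere touches every page
idx = np.random.randint(0, N, size=50_000_000, dtype=np.int64)  # random gather pattern

for p in range(5):
    t0 = time.perf_counter()
    total = data[idx].sum()               # random reads spread across the whole footprint
    dt = time.perf_counter() - t0
    print(f"pass {p}: {dt:7.2f} s  (~{idx.size * 8 / dt / 1e9:.2f} GB/s effective)")
```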

1 Like

Me throwing spaghetti against the wall:
I wonder if this is somehow a Windows problem; if COMSOL were running the same simulation on Linux, would the same thing happen?

Yeah, it’s a fairly new feature called COMSOL Compiler, and it’s pretty slick: it’ll package up the simulation with a UI that you can define and spit out a file that can be run on an end user’s computer without them needing deep FEA knowledge or even a COMSOL license.
It does require that the end user have the COMSOL Runtime installed, though. The runtime will self-extract and install from the simulation file; that install step does kind of erect a trust barrier for randos from the internet running it.

Anyways, here’s an example I made that I think stresses the memory the exact same way your problem does. It’s a dead-simple example: the user clicks the “Compute” button after launching it, and it’ll run and then display the solve time at the bottom when finished. It takes about 44 minutes on a dual-socket Haswell system.
The actual .mph file is in there if you want to see how I constructed the methods/variables/UI.

Disclaimer: I butchered one of the existing example files to create this, so it’s a tiny bit hacky, but it works.

https://drive.google.com/drive/folders/1YX0rqS85H-Z1rzjLTw_k6776FSBppKEB?usp=share_link

2 Likes

I believe this is not a Windows issue, because it only occurred with the combination of 5995wx + 6 or 7 x 128GB DIMMs + heavy CPU and memory load: when I tested a system with 3995wx + 7 x 128GB sticks, the slowdown did not occur. So I suppose the same thing could happen on Linux, but of course I’m not entirely sure.

Great! Thank you for sharing the file. I should have realized COMSOL had such a cool feature!
I just tested my system with your file, and the results are as follows:
5995wx + 8 x 32 GB UDIMM (3200 MHz, 256 GB in total) = 32 min 21 s
3995wx + 8 x 128 GB RDIMM (3200 MHz, 1 TB in total) = 31 min 40 s

Interesting that, for an unknown reason, the 3995wx system was marginally (~2%) faster. When I get back to my office next week, I’ll test 5995wx + 7 RDIMMs to see how long it takes, and get my standalone app ready as well.

2 Likes

Hello fellow WRX80E-SAGE owners, we ran into the same problem here in Singapore, and we would like to update everyone that there is a new BIOS for this board, version “9901”. We have successfully used 8 x MTA72ASS16G72LZ-3G2R on this board with all the memory modules detected.

We are testing the system as we speak.

Hope the information helps!

6 Likes

This is hopeful news! How did you get access to BIOS 9901?

1 Like

Something tells me you’re part of the ASUS tech team. Anyway, thank you for responding with such exciting news! Can’t wait to see the new BIOS.

Are you experiencing “slow” memory performance when accessing large (>200GB) amounts of memory with 128GB DIMMs on the new BIOS?


This seems somewhat reasonable (or at least not crazy): both configurations have the same memory speed, which is likely the primary bottleneck on large problems (as opposed to core performance).
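As a sanity check on the memory-bound reading, a crude STREAM-copy-style estimate (just a sketch assuming Python + numpy, not a calibrated benchmark) should post similar numbers on both configs if sustained bandwidth really is the limiter:

```python
# Sketch: time a large array copy, far bigger than any cache, and report
# sustained bandwidth. Two configs with the same memory speed should land
# in the same ballpark here, matching their similar solve times.
import time
import numpy as np

N = 200_000_000                  # ~1.6 GB per array
a = np.empty(N)
b = np.ones(N)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    np.copyto(a, b)              # one full read of b plus one full write of a
    best = min(best, time.perf_counter() - t0)

# a copy moves 2 arrays of 8-byte floats per pass (1 read + 1 write)
print(f"~{2 * N * 8 / best / 1e9:.1f} GB/s sustained copy bandwidth")
```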

…Speaking of core performance: I had originally thought the new M1/M2 Macs, with their high memory bandwidth numbers, would perform well in these large sparse matrix solvers, but that is not the case. Apple/ARM’s core performance is so poor compared to x86 that core performance, rather than memory bandwidth, is the bottleneck by a wide margin.
AArch64 is still stuck using NEON for SIMD, while x86 has blazed ahead with AVX 1-3.

1 Like

I have the same issue, and it seems I also need the new BIOS.
The system will NOT boot if an AMD 5965wx processor is installed with 8 x 128GB memory (Samsung M393AAG40M3B-CYFCQ).

If any one memory module is removed (7 x 128GB installed), the system will boot.
If 8 x 64GB memory is installed, the system will boot.

If the processor is replaced with an AMD 3955wx using 8 x 128GB memory, the system will boot.

I also tried different memory (Micron MTA72ASS16G72LZ) with the same results.

ASUS support has been beyond frustrating to deal with…

1 Like

Here is an update with 7 DIMMs:

5995wx + 8 x 32 GB UDIMM (3200 MHz, 256 GB in total) = 32 min 21 s
3995wx + 8 x 128 GB RDIMM (3200 MHz, 1 TB in total) = 31 min 40 s

5995wx + 7 x 32 GB UDIMM = 40 min 26 s (no noticeable slowdown)
5995wx + 7 x 128 GB RDIMM = 18 h 38 min 25 s (severe slowdown occurred)
(in both cases, slots 0-6 were occupied)

Your app also successfully reproduced the bizarre slowdown with 7 x 128 GB RDIMMs, taking a whopping 18.5 hours, which is almost 30 times slower than the 7 x 32 GB config.

I should have tested 7 x 32 GB previously to get a better picture of the problem. My conclusion now is that this issue is not actually related to the memory channels, because 7 x 32 GB works just fine. The severe slowdown on 6 or 7 x 128 GB RDIMMs and the failure to POST with 8 x 128 GB both stem from the 5995wx’s inability to handle 128 GB RDIMMs under the current BIOS.

I haven’t retested 7 DIMMs on the 3995wx system because it is currently busy, but I’m pretty sure 3995wx + 7 DIMMs will work just fine.

Agreed. I also feel that the ~2% difference is due to the calculation being memory-bound. Hoping that next-gen mid-range workstations such as Zen 4 Threadripper Pro and Sapphire Rapids will give a massive performance leap with DDR5 and additional memory channels.

Please refer to this performance issue with 128 GB sticks as well.

2 Likes

Hello all. I just came across this forum while pulling my hair out over a similar situation with the 5995WX, but on an ASRock WRX80 motherboard. I cannot get the machine to boot with 8 x 128GB sticks. If I start removing sticks, the system boots, but it starts to slow down once I begin memory-intensive work like CFD. I have run 8 x 32GB sticks just fine. I currently have 4 sticks of Samsung 128GB 2933 installed and the system reliably boots, but it still fails once I hit it with any CFD work. I will try 8 x 64GB DIMMs tomorrow and see how that works. Curious that this doesn’t seem specific to ASUS boards…

1 Like

What BIOS version are you running?
The ASUS WRX80 boards don’t yet have a BIOS that updates to the most recent version of AMD’s AGESA microcode, which we think is the cause of the problem, or at least somehow related.
ASRock actually offers AGESA 1.0.0.4, and AGESA 1.0.0.5 as a beta, for their boards, which might fix the issue?

1 Like

It looks like I’m on the latest BIOS for the WRX80 Creator, version 6.06. The microcode update says A00F82/A008205. I’m starting to think it may be a thermal issue. The machine came with no active cooling for the memory and is in a desktop tower. We have Threadripper towers from other vendors that have custom active-cooling shrouds designed for their DIMM slots. I placed one of those over the DIMMs on my 5995wx machine, and it seemed to make it through multiple rounds of the PassMark threaded memory benchmark, where it normally would not finish more than one pass. This doesn’t seem to be an issue with the 64GB DIMMs; they run beautifully in this machine with no cooling.
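If you want to put numbers on the thermal theory, here’s a quick pass-timing loop (just a sketch assuming Python + numpy; SIZE_GB and PASSES are made-up knobs). If the DIMMs are heat-soaking, pass times should creep upward without the shroud and stay flat with it:

```python
# Sketch: run identical memory-heavy passes back to back and log each
# pass time. Steadily growing pass times suggest thermal throttling;
# flat pass times point away from heat.
import time
import numpy as np

SIZE_GB = 32                     # hypothetical working set; size it to stress your DIMMs
PASSES = 20
N = SIZE_GB * 1024**3 // 8       # number of float64 elements

a = np.ones(N)
b = np.empty_like(a)

for p in range(PASSES):
    t0 = time.perf_counter()
    np.copyto(b, a)              # stream the whole working set through the DIMMs
    b *= 1.000001                # read-modify-write pass to keep writes flowing
    print(f"pass {p:2d}: {time.perf_counter() - t0:6.2f} s")
```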

2 Likes