Dual-socket EPYC 9654 + Windows workstation benchmarks on COMSOL, etc

Happy New Year!

I recently built a dual-socket workstation based on the 96-core EPYC 9654 for my research. This post shares some benchmark results from the system for those interested in how many-core CPUs, like the Threadripper 7995wx, perform on Windows.

Most of the benchmarks were done in COMSOL, and for easy comparison with other systems, I used the benchmark files shared by twin_savage. For the files, you can check out the following post: https://forum.level1techs.com/t/workstation-for-monte-carlo-simulations/203113/10. Thank you, twin_savage, for allowing me to use the files for the benchmarks and to share the results in this post.

Here are the specs of the system:

  • Dual-socket EPYC 9654 system
    CPU: AMD EPYC 9654 (96 cores) 2S - HT off, all core turbo: 3.55 GHz
    M/B: Gigabyte MZ73-LM0 rev 2.0
    Memory: Samsung M321R8GA0BB0-CQKZJ (DDR5 64GB PC5-38400 ECC-REG) 24EA (2 x 12ch, 1.5TB in total)
    GPU: Nvidia GeForce GT 710 :smile:
    PSU: Micronics ASTRO II GD 1650W 80Plus Gold
    OS: Windows 10 Pro for Workstations (fTPM is not available on this motherboard, so Windows 11 is not an option yet)
    SSD: Samsung 990pro 2TB + 860evo 4TB 2EA

The following are quick specs of the three systems used for comparison:

  • AMD R9 5950x system
    CPU: AMD Ryzen R9 5950x (16 cores) - HT on, all core turbo: ~4.3 GHz
    Memory: DDR4 32GB PC4-25600 4EA (128 GB)
    OS: Windows 10 Pro for Workstations

  • Intel i9 7920x system
    CPU: Intel Core-i9 7920x (12 cores) - HT on, all core turbo: 3.7 GHz
    Memory: DDR4 32GB PC4-25600 8EA (256 GB)
    OS: Windows 11 Pro for Workstations

  • AMD Threadripper 5995wx system
    CPU: AMD Ryzen Threadripper 5995wx (64 cores) - HT off, all core turbo: ~3.8 GHz
    Memory: DDR4 128GB PC4-25600 ECC-REG 8EA (1 TB)
    OS: Windows 11 Pro for Workstations

++++++++++++++++++++++++++++++++++++++++
0. NUMA in Windows 10 Pro for Workstations

By default, Windows splits the 192 cores into 4 NUMA nodes: 64 + 32 + 64 + 32. The motherboard provides a NUMA-per-Socket (NPS) option in the BIOS to control this, but the 4-node split overrides NPS=1 set in the BIOS. Presumably, this is due to the 64-core limit of a Windows processor group; NPS=1 behaves as expected only when the CPU has 64 cores or fewer per socket.
With the default setting, 3 processor groups are formed: 64 + 64 + (32+32), as in the following figure.

If NPS=2 is set, the 192 cores are split into 48+48+48+48. The number of cores per node becomes symmetric, but the number of processor groups increases to 4. More importantly, the directly accessible memory channels become asymmetric, which can significantly impact overall performance. I don’t think NPS=2 is useful for a 96-core CPU.

NPS=4 splits the 192 cores into 8 x 24-core nodes: a symmetric number of cores per node, asymmetric memory channels within a node, and 4 processor groups arranged as 4 x (24+24).

As pointed out by wendell in the following post, https://forum.level1techs.com/t/windows-doing-dumb-stuff-on-big-systems/204408, NPS=3 would be the best option for a 96-core-per-socket (192-core total) system because it gives a symmetric configuration of cores and memory channels with only 3 processor groups. Unfortunately, this option is not available on this system.
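
If you want to check how Windows has actually grouped the cores on your own machine, something like the following minimal Win32 sketch (nothing COMSOL-specific; it only calls the documented processor-group and NUMA APIs) prints the layout described above:

    // numa_layout.cpp - minimal sketch; build with e.g. "cl /O2 numa_layout.cpp"
    // Prints the processor groups and NUMA nodes Windows exposes,
    // e.g. the 64 + 64 + (32+32) grouping described above.
    #include <windows.h>
    #include <cstdio>

    int main() {
        WORD groups = GetActiveProcessorGroupCount();
        std::printf("processor groups: %u\n", (unsigned)groups);
        for (WORD g = 0; g < groups; ++g)
            std::printf("  group %u: %lu logical processors\n",
                        (unsigned)g, GetActiveProcessorCount(g));

        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
        std::printf("NUMA nodes: %lu\n", highest + 1);
        for (USHORT node = 0; node <= (USHORT)highest; ++node) {
            GROUP_AFFINITY aff = {};
            if (!GetNumaNodeProcessorMaskEx(node, &aff))
                continue;                              // node without assigned processors
            int count = 0;                             // popcount of the node's affinity mask
            for (KAFFINITY m = aff.Mask; m != 0; m >>= 1)
                count += (int)(m & 1);
            std::printf("  node %u -> group %u, %d logical processors\n",
                        (unsigned)node, (unsigned)aff.Group, count);
        }
        return 0;
    }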

1. COMSOL: default run (NPS=auto)

Just ran the benchmark with no additional options. Here we go:

10 hours for the 50GB benchmark!! Comparisons with other systems are here.

The reference Xeon W5-3435x result is from twin_savage, as given in the post above.

The 200GB bench is a memory-bandwidth-thirsty one, while the 50GB bench stresses CPU more than the memory bandwidth.

The default run uses all available physical CPU cores: all 192 cores are used during the benchmark. But even so, for the 50GB benchmark, the EPYC 9654 2S is the slowest system :smile:.

2. COMSOL: CCD control

Maybe the bizarre results are related to the Windows scheduler, which is not that smart, and/or to the interconnect between the CCDs in AMD CPUs, which adds high latency to CCD-to-CCD communication. Because both issues can be relieved by reducing the number of cores, I disabled several CCDs in the BIOS and ran the benchmark again. The results are:

Yeah! It works. Enabling only 2 of the 12 CCDs gave the best result for the 50GB benchmark, even though the directly accessible memory channels were reduced to 6 per socket. The impact is more pronounced in the 50GB benchmark, while the 200GB benchmark also got a mild speedup. But overall, the results are not impressive, especially compared with the reference Xeon results above.

3. COMSOL: -np option

COMSOL provides the -np N option, which specifies the number of cores (N) to be used in the calculation. For instance, comsol61_benchmark_50GB_windows.exe -np 16 uses only 16 cores.

With all 12 CCDs turned on, the -np option was used here. The results are:

Unfortunately, two results are missing (Update: the result with -np 64 for the 50GB bench was added on Jan. 3rd, 2024). It is hard to describe the trend in detail, but the -np option is clearly very helpful. The lower performance at -np 16 seems to be due to the limited direct access to memory.
Because -np 32 uses only half the cores of a 64-core NUMA node, I ran two simulations simultaneously to see the impact of multitasking and the potential productivity of the system. Representative results are as follows:

As expected, the two simultaneous 50GB benchmarks finished in about 4h 46m, meaning that the productivity is almost doubled.

But one thing I would like to stress here is that the Windows scheduler is not smart about spreading multiple jobs across NUMA nodes. For instance, if the first job with -np 32 is running on NUMA node 0 (64c) and I add a second job, the scheduler places the second one on the same node 0. This is not the best arrangement in terms of memory-bandwidth utilization.
Still, there were 32 available seats, so it is okay. Now there are no empty seats on node 0 but lots of empty seats on nodes 1, 2, and 3. And adding a third job? Guess what? The scheduler adds the third one to the SAME NUMA node 0, refusing to see the empty seats on nodes 1, 2, and 3. I don’t know why, and to tell you the truth, I’m not sure whether this bizarre allocation is the fault of the Windows scheduler or whether the “NUMA-aware” COMSOL is responsible.
In short, as far as I know, there is no way to utilize all 192 cores on Windows by using the -np N option with N less than 192.
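
A possible workaround that I have not verified on this setup (so treat it as a sketch only): cmd’s start command accepts a /NODE switch on NUMA systems, so each job could be pinned to a different node explicitly instead of leaving the placement to the scheduler, e.g.:

    rem hypothetical manual pinning, one job per NUMA node (not verified here)
    start /NODE 0 comsol61_benchmark_50GB_windows.exe -np 32
    start /NODE 1 comsol61_benchmark_50GB_windows.exe -np 32

Whether COMSOL’s own thread placement respects this preference is a separate question.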

4. COMSOL: NPS=N option

This benchmark set is also incomplete. I ran it only for the 200GB benchmark, and here is a summary.

Presumably, NPS=4 reduces unnecessary communication among all 12 CCDs by keeping a node’s boundary within 3 CCDs. My theory is that this helps reduce the cores’ waiting time caused by the long latency of the Infinity Fabric.
The best result for the 200GB bench so far was obtained by using NPS=4 together with -np 96.

5. FDTD: default run (NPS=auto)

The finite-difference time-domain (FDTD) method is a popular algorithm for simulating electrodynamics. It solves Maxwell’s two curl equations numerically. At every time step, all the spatial electromagnetic field data held in memory are read and updated, which makes FDTD memory-bandwidth-intensive, while its computational complexity is relatively low compared to other algorithms like FEM.
There are commercial FDTD programs, such as Lumerical FDTD, but I am currently using my homemade one, written in C++ and utilizing OpenMP and SIMD.
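
To illustrate why FDTD ends up bandwidth-bound rather than compute-bound, here is a minimal 1D sketch of the kind of update loop involved (illustration only, not my actual solver):

    // fdtd_1d.cpp - minimal 1D Yee-grid sketch; build with e.g. "cl /O2 /openmp fdtd_1d.cpp"
    // Each time step streams through the whole E and H arrays exactly once,
    // so very little arithmetic is done per byte loaded: memory bandwidth dominates.
    #include <vector>
    #include <cstddef>

    static void step(std::vector<double>& Ez, std::vector<double>& Hy)
    {
        const std::ptrdiff_t n = (std::ptrdiff_t)Ez.size();
        const double ce = 0.5, ch = 0.5;               // normalized update coefficients

        #pragma omp parallel for schedule(static)      // update H from the curl of E
        for (std::ptrdiff_t i = 0; i < n - 1; ++i)
            Hy[i] += ch * (Ez[i + 1] - Ez[i]);

        #pragma omp parallel for schedule(static)      // update E from the curl of H
        for (std::ptrdiff_t i = 1; i < n; ++i)
            Ez[i] += ce * (Hy[i] - Hy[i - 1]);
    }

    int main()
    {
        std::vector<double> Ez(100'000'000, 0.0), Hy(100'000'000, 0.0);  // ~1.6 GB of fields
        for (int t = 0; t < 100; ++t) {
            Ez[Ez.size() / 2] += 1.0;                  // hard point source at the grid center
            step(Ez, Hy);
        }
        return 0;
    }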

The following is a summary of the FDTD benchmark.

Nothing to say but the 2S 9654 is just INSANE on FDTD.

Conclusion
Running COMSOL on a big system on Windows is very tricky. The abnormal performance might be due to Windows’ immature scheduler and/or the unique topology of AMD CPUs, which involves long latency for communication between CCDs.

If results from Linux were available, it would be easier to say which of the two is responsible for the low performance. Running on Windows Server 2022 might also give different results, but unfortunately, I don’t have a license for it.

In terms of the 12 x 8c topology of the CPU and its impact on performance, a comparison with a monolithic chip like Intel’s MCC-based Sapphire Rapids, or the upcoming 2 x MCM Emerald Rapids such as the Xeon 8592+ (64c, 2 x 32c), would be helpful.

Overall, I feel that Windows is not yet capable of handling such a big system. A 64c CPU would be a better choice than a 96c one for running on Windows, in terms of NUMA, Windows processor groups, and the resulting computational efficiency.

++++++
Small updates on Jan. 3rd, 2024:

  1. A figure for NUMA nodes and Windows processor groups has been added
  2. A result with -np 64 for the 50 GB bench has been added
6 Likes

I am working on getting answers to the NPS=3 question.

Comsol on Linux also seems to benefit from zram; Amber posted some preliminary Linux results.

This writeup is amazing; I will use it in the video I’m working on.

5 Likes

I am super interested in all of the numerical computation benchmarks we are seeing here… especially because you folks seem to have the latest hardware and a need for speed.

I don’t know the Comsol benchmark you are using, but I am assuming it uses implicit time stepping? Explicit time stepping is much more friendly to massive parallelism and does not have memory requirements that balloon out of control during linear solves.

Comsol is also proprietary and may not share which algorithms are being used for its physics modules. In the long run, a better choice might be something like OpenFOAM with known benchmark problems to keep this sort of effort going.

1 Like

I’d be interested in the performance of this system under Linux. That W5-3435x system posts roughly 30% higher numbers (i.e., worse performance) when running on Windows compared to Linux, and it is a relatively simple 16-core architecture without NUMA, yet Windows still couldn’t squeeze all the performance out of it.

The reason I’m defaulting to blaming Windows instead of Comsol is that I’m not seeing these large discrepancies in performance between Windows and Linux on older hardware platforms. It’s almost like Windows needs its scheduler hand-tuned by Microsoft for each new architecture that comes out… and it takes years for those tunings to reach us.

I’m pretty sure all of Comsol’s solvers are open source, and even an open-source BLAS can be used if desired (but open-source BLAS implementations tend to have poor performance compared to the more optimized closed-source ones).

OpenFOAM is decent with FVM, but because of that it lacks the ability to do the kind of complex multiphysics problems that are better suited to FEM, like Comsol does.
The whole FVM vs. FEM rivalry might end up like the CISC vs. RISC rivalry: a convergence between the two once more efficient method-translation schemes are developed, but we’re not there yet, and probably won’t be for several decades.

Also, Comsol is better at freely disseminating knowledge of how to set up complex boundary conditions, whereas with OpenFOAM, knowledge of how to set up complex problems is siloed away behind paid consulting services or “experts”.
As an example of how arbitrary/weird some of the problems Comsol freely shows how to simulate can be, a literal step-by-step guide to modeling the electron orbitals of an unperturbed hydrogen atom using the Schrödinger equation is up for grabs:

This is why I’ve been so impressed with Comsol: they seem to be the only ones out there freely sharing a wide breadth of knowledge, while all the other programs have it squirreled away, treating it as something to be protected, although Ansys seems to be catching up.


It’s using a stationary segregated FGMRES solver, which is much less memory-intensive than something like a stationary direct PARDISO solver. We can’t really use time stepping here because that would invoke Einstein’s general relativity due to the EM portions of the problem; it could be done, since Comsol lets you input arbitrary PDEs to govern the physics… but this is waaay outside my wheelhouse.

1 Like

There is nothing open source about Comsol. BLAS and/or MKL are just linear algebra libraries, and those are not really “solvers” for time-stepping PDEs, just parts required to help do it. The actual parts of the code that do anything other than solve linear systems you are not going to see.

OK, so I looked at the thread and tried to piece together what is happening based on the Comsol word salad. It sounds like there are two similar benchmarks that are looking for a steady-state solution to a Navier-Stokes + Maxwell equations problem. Interestingly, the reference to a “pseudo time stepping cfl-ratio” means that you are using explicit time stepping to look for a steady-state solution: just keep moving time forward until the solution stops changing.

Comsol and Ansys are exploding in price, unfortunately; I think they are looking at ABAQUS costs and salivating over what they might be able to charge next year.

About elements vs. volumes… a compressible Euler problem with shocks and complex equations of state is just as “multiphysics” as anything Comsol does, and finite-volume CFD codes have been able to do those for a very long time. Elements will take over the world once they can conserve mass locally as easily as the volume codes do.

1 Like

I don’t mean to imply Comsol itself is open source, but that all of the algorithms it implements to run a simulation are, and that the BLAS used to implement the solver on CPU can be. The only reason I bring this up is that if you wanted to audit what is being solved for and how, you could; or if you wanted to replicate a solution in another software package, you could (assuming the same solvers/BCs are available in the other package… and that the same mesh is being solved).

The pseudo time stepping is a convergence “trick” used to help stabilize the CFD solver’s (GMRES) solution. I never really thought of it as integral to the solution, since you can turn it off if you don’t want it stabilizing things, but I suppose you are technically right, since it can be used to scale the solution being derived by GMRES, just like a PID loop:

Yeah, I wish Comsol would keep to its academic roots better in regard to pricing; at least it isn’t in Ansys territory yet, which has always been the industry standard and grew accustomed to industry pricing.

I swear I don’t mean to start an FVM vs. FEM war; I just admire the simplicity of stacking physics on top of each other arbitrarily that FEM affords, without trying to come up with a bespoke method-conversion scheme dependent on the specific mixture of physics being used (because I might be trying to mix two physics that have never been mixed together, or at least that the software developers hadn’t thought of mixing).

I definitely think the future will be hybrid FEM-FVM, at least when CFD is involved. There are many problems that FVM simply cannot solve due to its lack of mixed-formulation support, while FEM can do everything because of its mixed-formulation support. The downside of FEM is that it isn’t as efficient at pure CFD (no multiphysics) as FVM, because of the lack of conservation.

1 Like

I’m doing additional tests with various NPS and -np parameters. Now I’m very inclined to believe that the low performance is due to the Windows scheduler combined with the 64-core limit of a Windows processor group, with the latter responsible for the inability to use all 192 cores with the -np N option.

2 Likes

I’m looking forward to seeing your next video!

For those interested in the performance difference between the math libraries selected by COMSOL’s -blas option, here is a quick comparison.

AOCL stands for “AMD Optimizing CPU Libraries”.
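
For reference, the library is selected per run on the command line. Assuming the same benchmark executables as above (and an example -np value that is not essential to the comparison), the two runs look something like:

    rem illustrative invocations; -np 96 is only an example value
    comsol61_benchmark_50GB_windows.exe -np 96 -blas mkl
    comsol61_benchmark_50GB_windows.exe -np 96 -blas aocl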

1 Like

Your Intel MKL vs AOCL comparison matches up with my experience. It’s crazy how Intel’s libraries work better on AMD processors than AMD’s own libraries.

2 Likes

What solvers is the benchmark using?

Welcome!

For the “50GB” benchmark, the problem is fully coupled (the physics are fully coupled, not the solver) k-ε Turbulent Flow to Magnetic and Electric Fields @ 12m DoF; the CFD solver is GMRES, the magnetic/electric field solver is FGMRES, and the linear solvers are segregated. Elements are discretized with a mix of linear and quadratic methods.

What the segregated solver’s convergence looks like:


As you can see, it was the CFD portion of the problem that took the most time to converge.

The “200GB” benchmark is only a Magnetic and Electric Fields problem @ 100m DoF; a fully coupled FGMRES solver is used.
Elements are discretized quadratically, which is interesting because you can refine solutions without refining meshes using this technique. This probably isn’t a familiar concept if you’re coming from a pure CFD background, but it is very powerful (it also raises the number of degrees of freedom solved for by 82.5 million for this model + mesh).

What the fully coupled solver’s convergence looks like:


As you can see, convergence happens fairly quickly, which explains why, even though this benchmark has a much larger mesh/memory footprint, the solve time isn’t too bad.

BTW, these were all just the default solvers Comsol applied to the problem without any user input.
The math/solvers behind this kind of multiphysics are much more complex than in CFD-only software, so there isn’t really a 1-to-1 translation of knowledge there, but usually the generated defaults are pretty good and drastically cut down the learning curve. Meshing is still where a lot of the art in FEA is.

Can you create a CFD-only benchmark, maybe an Ahmed body example?

There are some pretty bold performance claims about AOCL on the COMSOL website. I can’t post links for some reason, but you can surely find the page.

1 Like

Yeah, I can just use the same model without the electromagnetics and apply a high enough pressure gradient at the inlet that turbulence occurs in the serpentine structure; then I’ll adjust the mesh so that it hammers the system enough to be a meaningful benchmark, but not so much that it becomes unwieldy to run.

This is a very good point. All these benchmarks were compiled with Comsol 6.1, which uses AOCL 3.2.1; Comsol 6.2 uses AOCL 4.1.1, which is the most recent version of AOCL. I think I’ll recompile these existing benchmarks in Comsol 6.2 and put them up on the drive.

I saw those claims on comsol’s website, but I have a feeling they cherry picked the most drastic differences.

I’ve created the CFD-only benchmark and put it in the share. It runs in 22 minutes on a reasonably fast computer and only occupies 10GB of memory. I’m going to make a thread about it pretty soon showing how to run it. It also uses AOCL 4.1, which may finally unseat MKL on the newest AMD systems.

2 Likes

The 7960X is up and running with a first-pass tune-up. Prices in China were crazy, so I ordered on Newegg and got a friend to carry it in from Hong Kong…

DDR5-6400 4x32GB DR (dual-rank), CL32-38-32
AIDA64 bandwidth: 183 GB/s read, 69 ns latency (not sure why it’s so high…)

Manual OC: 5250, 5100, 5150, and 5200 MHz for each CCD respectively at 1.225V; passes Prime95 320K FFT, which is my personal stability threshold. Funnily enough, Prime95 320K draws about 460W, but running COMSOL only about 300W on average, 350W max.

Ran the COMSOL 6.1 50G model:

    3h 15m, not bad.

This took 4h 47m on my 7950X, so roughly 50% faster for 3x the money, but I’ll take it…

Ran the COMSOL 6.2 10G CFD-only model, with the following results:

File              | -numasets | -blas | -np | 7950X (s) | 7960X (s)
------------------|-----------|-------|-----|-----------|----------
10G CFD Only v6.2 | 2         | mkl   | all | 1850      | 1288
10G CFD Only v6.2 | 1         | mkl   | all | 1853      | 1294
10G CFD Only v6.2 | 4         | mkl   | all | –         | 1327
10G CFD Only v6.2 | 2         | aocl  | all | –         | 1391
10G CFD Only v6.2 | 1         | mkl   | 16  | –         | 1419

Note that 2 numasets looks to have the best performance, by about 0.5%, which is within the margin of error; matching numasets to the CCD count resulted in lower performance. AOCL is slower, as others have reported, so the default MKL is probably the best choice for most tasks.
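
For anyone wanting to reproduce a row of the table, the settings map directly onto the command line; the executable name below is only a placeholder, since I have not listed the actual file name of the 6.2 CFD-only benchmark:

    rem placeholder file name; -numasets, -blas, and -np set as in the table above
    comsol62_benchmark_10G_CFD_windows.exe -numasets 2 -blas mkl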

With 16 cores it is slower, so we are still getting core scaling from 16 to 24 cores; I’m starting to wonder if the 7970X would have been worth it.

Note that I did notice weird thread-scheduling behavior: it was putting multiple COMSOL threads on single cores via HT, so I disabled HT in Process Lasso for the COMSOL processes to fix that. What is odd is that I never saw this on my 7950X.

1 Like

Dang, that is matching the 8-channel workstation Xeon’s performance!… And you did this on Windows (judging from the Process Lasso mention), with all of its thread-scheduling quirks.
I bet the Linux scores would be even better.

~10% faster with 50% more cores is actually pretty impressive for such a small problem. The bigger problems will scale with core count much more nicely.

Hi everyone. Something to keep in mind is filling up all of your memory channels. It would be interesting to see someone benchmark these problems as a function of memory throughput: just pull out sticks to reduce the number of active channels.

You might be able to use the Intel Memory Latency Checker to quantify how many GB/s of memory throughput your configuration has. These sorts of problems are heavily memory-bandwidth-limited, and your performance will improve if you use all the channels.

1 Like

You’re right, filling out all the memory channels is super important. I bet the larger problem would scale almost 1:1 with memory channels: half the memory channels, twice the time. The smaller problems likely wouldn’t scale as well.

I’m real tempted to try this but all the water cooling fittings connected to the RAM on my system are a pain to move around.

This command with MLC will give you memory bandwidth numbers that compare closely with both AIDA64’s read numbers and STREAM triad’s:
./mlc --peak_injection_bandwidth -b2g -Y

(Screenshot of the MLC bandwidth results.)