If you’re here, you’re interested in Finite Element Analysis, how fast your computer is, or in the performance of different µarchs and BLAS libraries. Here’s the background you’ll need to run the benchmark(s):
The benchmarks are very easy to run and do not require any special knowledge of FEA or computer science.
There are three different benchmarks:
- A CFD-only model that uses very little RAM but lots of CPU resources. Run time on a fast computer: ~23 minutes using 10GB of memory.
- An electromagnetics-only model that uses a plethora of RAM and relatively less CPU. Run time on a fast computer: ~16 minutes using 260GB of memory.
- A multiphysics model that performs a coupled CFD-electromagnetics simulation, using a moderate amount of RAM and a ridiculous amount of CPU. Run time on a fast computer: ~190 minutes using 60GB of memory.
Each benchmark is a self-contained script/executable for Linux, Windows, or macOS; x86 and ARM µarchs are supported, but the right script/executable needs to be run for your OS and µarch.
Example Installation/Run
For Windows, you simply double-click the .exe; it will ask to install a runtime on the first start, and after that it will launch.
For Linux, the .sh file can be run with bash; a runtime will install on first launch, after which the benchmark will start.
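For example, the Linux launch might look like this (the script name here is illustrative; use the file that matches your OS and µarch):

```bash
# Illustrative file name; use the script matching your OS and µarch.
# The first launch installs the runtime, then the benchmark starts.
bash ./fea_benchmark_linux_x86.sh
```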
Once the benchmark has launched, click the ‘Compute’ button on the top ribbon and wait for it to solve; a solve time will be displayed in the bottom middle of the screen.
The “10GB” CFD benchmark takes about 23 minutes to solve on a reasonably fast computer, the “260GB” EM takes ~17 minutes, and the “60GB” CFD-EM takes about 190 minutes.
Different BLAS libraries can be selected at startup; large performance uplifts can be had by choosing the one most optimized for your specific µarch.
BLAS selection
Basic Linear Algebra Subprograms (BLAS) are the “heart” of FEA: they translate the linear algebra operations needed to solve a given problem into instructions for the CPU. There are many BLAS implementations optimized for different architectures, or even for specific CPUs. The undisputed king on x86 for the past 20 years has been Intel’s Math Kernel Library (MKL), often outperforming AMD’s in-house AOCL even on AMD processors. However, AMD just released AOCL 4.1, and it seems large improvements have been made for Zen 4 CPUs.
For ARM processors, the Arm Performance Libraries seem to be preferred for maximum performance, even on Apple silicon.
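As a sketch of what selecting a BLAS at startup could look like, assuming the launch scripts forward a COMSOL-style -blas argument (the exact accepted values depend on your version):

```bash
# Hypothetical launch lines; the script names and accepted -blas
# values are assumptions and may differ on your install.
bash ./fea_benchmark_linux_x86.sh -blas mkl    # Intel MKL on x86
bash ./fea_benchmark_linux_x86.sh -blas aocl   # AMD AOCL, e.g. Zen 4
bash ./fea_benchmark_linux_arm.sh -blas armpl  # Arm Performance Libraries
```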
Another thing I thought of to test, because of @RMM’s post on your thread, is using COMSOL’s -numasets option to enforce NUMA at the userspace level instead of using NPS in the BIOS.
Doing this might help Windows’ thread scheduling?
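A minimal sketch of what I mean, assuming the benchmark script passes the flag through (the script name and set count are illustrative):

```bash
# Hypothetical: split the CPU into 4 NUMA sets in userspace,
# mimicking NPS4 without touching the BIOS.
bash ./fea_benchmark_linux_x86.sh -numasets 4
```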
Ahhh, I should have explained memory usage better. The CFD-only bench uses about 10GB of RAM, the CFD-EM ~60GB, and the EM-only ~260GB; I say “about” because memory usage scales with the number of cores you throw at the problem, so these numbers could grow or shrink a bit depending on the system.
Another quirk I forgot to mention: the CFD-EM benchmark actually has two compute buttons, one for PARDISO and one for FGMRES. On x86 processors there shouldn’t be much performance difference between the two solvers (a slight favor to PARDISO), but on ARM the PARDISO solver isn’t implemented as efficiently as it is on x86, so I added the alternative FGMRES solver to make sure the ARM versus x86 comparison is fair, with optimal algorithms for each µarch.
I’m hoping to get someone with an M1/M2 Ultra to run the two benchmarks that will fit in memory; I’m super curious how its compute/memory performance stacks up against x86 in the real world.
Even one of the Ampere-processor Linux machines would be interesting to see, but I don’t have hopes as high for those machines as for Apple’s.
I tried the CFD-only version, and the CPU behavior was a bit strange.
My system:
Intel Xeon w5-2455X
ASUS Pro WS W790-Ace
2x Micron RDIMM 64GB, DDR5-4800, CL40-39-39
Win10 on a KC3000
LMDE6 on a Corsair MP600 via Hyper M.2 Gen5
The times:
- 48 min; Win 10, CPU stock (3.9 GHz), 65% CPU usage, 45°C max
- 42 min; Win 10, CPU at 4.6 GHz (By Core Usage), 75% CPU usage, 61°C max
- 41 min; LMDE6, CPU stock (3.9 GHz), only 12 cores were used, other threads idle
- 40 min; LMDE6, CPU at 4.6 GHz (By Core Usage), only 12 cores were used, but the load kept switching between them
Hmmm… that is very interesting behavior. I would have expected the 4.6GHz Linux run to be ~15% faster than the 3.9GHz Linux run, given the extra CPU frequency.
The CPU thread load graph of the LMDE6 @ 4.6GHz makes it look like either processes are being juggled around on different threads or, more likely, the cores are starved of memory bandwidth and throttle down. 40 min might be the limit for 2 channels of DDR5-4800; if you felt like investigating further, you could bump your memory speed up to 5600MHz and see if the weird CPU load choppiness on LMDE6 @ 4.6GHz persists.
The faster Linux performance seems to be a trend, at least on newer hardware. I’m going to run the benchmark on dual E5-2650L v4s and see if the difference between Linux and Windows is still there. My hypothesis is that the Windows scheduler doesn’t work correctly on newer hardware but does on older hardware.
One more thing I neglected to mention about CPU usage on the benchmark:
The benchmark will use exactly half of the threads available on a system with hyperthreading, or, if SMT is turned off in the BIOS, it will use all the cores/threads (it automatically detects which scenario applies and schedules accordingly).
There is a slight performance regression if you force it to use all the threads available on an SMT system, because the small overhead of creating more worker threads is not “overcome” by the meager memory bandwidth/ALU gains SMT provides.
Many HPC supercomputers intentionally disable hyperthreading at the BIOS level because of this phenomenon, but that isn’t necessary here.
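If you want to sanity-check what the benchmark auto-detects, something like this works on Linux; the -np override (which comes up again later in the thread) is shown as a hypothetical example:

```bash
# Inspect the core/thread topology the benchmark sees:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
# Hypothetical override: force the worker count to the number of
# physical cores yourself, e.g. 24 on a dual 12-core SMT system:
bash ./fea_benchmark_linux_x86.sh -np 24
```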
The juggling, as you call it, I’ve already seen on previous generations of Xeon Scalable, mostly in VMs with Linux guests. I thought that was normal behavior. I found it stranger that at 3.9 GHz the load graph is nearly a flat line.
And it gets more interesting:
I increased the RAM speed to 5200 MT/s, because I remember it being written in the W790 Memory OC thread that you need to increase the voltage for 5600 MT/s, and I’m not smart enough for overclocking.
So I went into the BIOS and turned on Sync All Cores (instead of By Core Usage) to see if the juggling went away (it doesn’t), and also switched from 4800 to 5200.
- 40 min; LMDE6, CPU stock (3.9 GHz), RAM @ 5200
- 38 min; LMDE6, CPU at 4.6 GHz (Sync All Cores), RAM @ 5200
Ran the CFD-EM benchmark on my 2-year-old M1 Max 14" MacBook Pro because I was curious. Interestingly, the benchmark identifies it as an 8-core processor, indicating it doesn’t use the efficiency cores at all; I’m guessing because they lack functionality required to make them useful for the operations the benchmark is carrying out. In any event:
Solution time: 19654 s. (5 hours, 27 minutes, 34 seconds)
Physical memory: 46.76 GB
Virtual memory: 452.02 GB
Ended at Feb 3, 2024, 11:26:52 PM.
Turns out running a simulation like this on a laptop is definitely possible, given its 64GB of RAM, but certainly not fast. Five and a half hours at 30W is still ~165Wh, which… actually, I do wonder how that works out in terms of efficiency. Interestingly, during the actual computation the 8 performance cores were visibly maxed out the whole time while the efficiency cores were left mostly unloaded, which meant the whole OS remained perfectly responsive. I closed as many other user-land apps as I could to make as much RAM available as possible, but I didn’t go nuts trying to kill unnecessary background processes.
I wonder how much of an improvement an M2 Ultra would be, and actually, given how beefy the M3 Max is compared to the M2 Ultra, I suspect the M3 Ultra will be significantly faster still.
Sadly, on my system, the benchmark crashes immediately. A logfile with the following contents is created in the working directory:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fe01dc0263d, pid=73926, tid=74205
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.20+8 (11.0.20+8) (build 11.0.20+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.20+8 (11.0.20+8, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C [libLLVM-17.so+0x120263d] llvm::SelectionDAGBuilder::visitInsertValue(llvm::InsertValueInst const&)+0x51d
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/quilt/tmp/comsol_bench/core.73926)
#
# If you would like to submit a bug report, please visit:
# https://github.com/adoptium/adoptium-support/issues
#
The OS is Fedora 39… perhaps some version conflict?
This is the correct behavior: the efficiency cores supposedly slow the overall computation down because managing them is more overhead than the work they produce (although I have strong suspicions that might not be true in all situations).
The runtime is smart enough to figure out the core topologies of all the M1, M2, and M3 Macs (maybe not the M3 Ultra, though), but when the M4 Macs come out it might guess the P/E-core topology wrong. In that case the -np # option can be used to force a specific number of cores; for example, you could force it to also use the two efficiency cores with the -np 10 argument when launching through the terminal.
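Concretely (the script name is illustrative; -np is the option described above):

```bash
# On an M1 Max (8 P-cores + 2 E-cores); script name is illustrative.
bash ./fea_benchmark_macos_arm.sh -np 8    # default: P-cores only
bash ./fea_benchmark_macos_arm.sh -np 10   # also enlist the 2 E-cores
```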
The Apple efficiency cores actually support the full NEON instruction set that is used, so they are legitimate targets, unlike Intel CPUs, where the E-cores have a curtailed instruction set.
To balance out my praise for Apple, I’d like to rag on them for not yet implementing SVE and only supporting the shallower NEON instructions.
Was the swap being filled up while this was running? It reported an awfully high virtual memory usage at the end, which I haven’t seen before (might be some quirk of the BLAS library used).
47GB sounds about the right number for solving on 8 cores, though.
Very favorably: I just ran the same model on a Dell M630 and it took 9h 6m 6s at a constant ~200 watts throughout the run. That’s roughly 1,820Wh versus the MacBook’s ~165Wh, about 11x the energy.
I have high hopes for it; the M3 Ultra is going to be the first Mac with the right amount of memory (assuming it tops out at 256GB) for the problems I commonly work with. It would be very nice to replace my 500W Sapphire Rapids space heater with a ~70 watt Mac Studio.
This is a stab in the dark, but I think it might have something to do with the output window’s rendering. You could switch to the pure software renderer rather than the OpenGL renderer it defaults to by using a command along these lines:
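```bash
# Software renderer flag (the reply below confirms it); the script
# name is illustrative.
bash ./fea_benchmark_linux_x86.sh -3drend sw
```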
I also couldn’t get the benchmark to work with OpenGL for some reason, but -3drend sw fixed it. Here’s a CFD-only 10GB run on a 3495X with 8x DDR5-4800. The CPU is slightly overclocked; power usage peaked at 375W.
Not sure if this is due to my current overclock profile being only 3.3 GHz boost (all cores) and only 4800 MT/s RAM, or if some of the VMs (a Windows machine & a K8s cluster) are running something in the background and causing the slowdown. I’ll find some time and re-run.
These results make some sense, because the benchmarks are bottlenecked by memory bandwidth (the EM especially, more than the CFD bench). I’m running about as heavy a memory overclock as I can get away with, at 5880MHz.
FWIW, I was able to set my memory to 5600MHz without any increase in voltage or tweaks to timings; you might be able to get away with an easy memory overclock too.