CFD-Multiphysics Benchmark for x86 and ARM (Windows/macOS/Linux)

Anyone willing to run this on Arrow Lake to satisfy my curiosity?

@wendell maybe? :eyes:

It is fascinating how there were huge gaps in Nvidia’s lineup across generations in FP64 performance, to the point where buying several generations behind was the best option (although this has been less true in recent years, with the newest Tesla cards finally getting good FP64 performance back… while moving into the five-figure price range).

The benchmark does use AVX-512 (because it uses MKL 2022 for almost all of the solving). Zen 2’s AVX2 support must be what lets you keep up with the newer systems here, relatively speaking.

What library did you try using? There are some prerequisites for the library to actually work properly, like conforming to the LAPACK interface spec.

There are definitely some areas where AOCL can outperform MKL on AMD hardware, but they always seem to be relatively small computations, as you noticed.

I know cuBLAS can’t currently be used with COMSOL, but I suspect it would take relatively minor work to get it to interface (the same could be said about Intel’s Data Center Max GPUs and oneAPI).
The reason I keep hearing for why no work is being put into this approach is that it would be severely bottlenecked by the PCIe bus, since so much of the solver would still need to interface with the CPU’s memory.

Ansys pulled the trigger on this hybrid approach, with mixed results; for some simulations, using a GPU for compute leads to a regression in solve times compared to using the CPU only.
Conversely, Ansys does have some pure-GPU solvers too, but they only work with simpler physics, like pure CFD simulation.

It seems to be coming. I can’t post links but search for the article “Get an Early Peek at COMSOL Multiphysics v. 6.3”.


On my PC, Win 11 needs forced even-core affinity to work properly; otherwise it puts multiple threads onto a single physical core like in your screenshot, which slows it down a lot. For 16C/32T the affinity mask would be 55555555, for example.

I am very interested in your 9950X result with proper affinity. Please also post your memory timings, as these results scale well with memory latency, so to compare you need similar timings. If you have good A-die with low timings, it should be comparable to my results above.

Note: running it with this command from the Windows Run window will force HT off, or use Process Lasso or Task Manager to force affinity before the run.

cmd.exe /c start “” /AFFINITY 55555555 “E:\comsol62_benchmark_CFD_only_10GB_Win_x86-64.exe”
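
For other core counts the mask generalizes: with the usual Windows SMT enumeration, logical CPUs alternate between thread 0 and thread 1 of each core, so setting every even bit pins one thread per physical core. A quick Python sketch (assuming that enumeration; verify yours in Task Manager first):

def one_thread_per_core_mask(cores: int) -> str:
    # Set bit 2*i to select thread 0 of physical core i.
    mask = 0
    for core in range(cores):
        mask |= 1 << (2 * core)
    return f"{mask:X}"

print(one_thread_per_core_mask(16))  # 55555555 for 16C/32T
print(one_thread_per_core_mask(8))   # 5555 for 8C/16T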


Figured I’d post my results here instead of Ryzen 9950X RAM Tuning and Benchmarks - #5 by RMM

Results

25 minutes 40 seconds

Configuration

I’m running my 9950X with 2x48GB DDR5-6400 (XMP profile) with tuned timings and clocks, plus a slight PBO boost, as described in the above thread as Tuned + PBO.

Operating system is Arch Linux running a pretty recent kernel: Linux bigfan 6.11.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 10 Oct 2024 20:11:06 +0000 x86_64 GNU/Linux

I didn’t bother disabling hyper-threading or setting process affinity, as the Linux scheduler seems to be doing the right thing: btop shows a single core’s load passing back and forth between its two threads but never going over 100% (so overall it stays at a constant 50% utilization across 32 threads).
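
If you want to double-check the placement yourself, something like this (standard ps thread columns; substitute the benchmark’s PID) lists each thread and the processor it last ran on:

$ ps -L -o tid,psr,pcpu -p <benchmark pid>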

The CPU temperature oscillated up and down with each iteration between 67 and 74 deg C. This benchmark has a lovely coil whine signature. It seems less CPU intensive than y-cruncher BKT, which pegs the CPU at 100% (all 32 threads) with temperature at 78 deg C. Idle is typically just over 40 deg C.

Thoughts

I’m running a lean setup with the dwm window manager on X11. I killed the compositor picom after running startx and have no other terminals open and minimal background tasks.

Might be able to squeeze just a touch more out of it by renicing it to, say, -3, e.g.

$ sudo nice --3 ./comsol62_benchmark_CFD_only_10GB_Linux_x86-64.sh

Might try setting affinity to just the first thread of each core, e.g.

$ taskset --cpu-list 0-15 ./comsol62_benchmark_CFD_only_10GB_Linux_x86-64.sh

The CPU clock supposedly dipped down to 600 MHz occasionally, according to btop. I could try setting a different power management profile.
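
For example, forcing the performance governor (assuming the cpupower package is installed):

$ sudo cpupower frequency-set -g performance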

Also might be fun to bench it again against my BIOS default baseline config as per the other thread.

I didn’t try the other BLAS (-blas aocl), given others above suggest that even on AMD it shows a slight regression over the default MKL, pretty sure? But could just try it…

Cheers!


Thank you for sharing that! I had not seen it before:

It’s interesting the route of “GPU compute” they took: it’s only being used for acoustic simulations and to train neural network surrogate models, rather than a wider scope of computations.

For the uninitiated, a surrogate model is basically a way to pre-compute the bulk of a problem (via traditional FEA means) and then infer results for inputs that may differ a little from what the pre-computed solution covered. Deep neural networks can be used for this and will be able to leverage graphics card compute soon.
I personally prefer the polynomial chaos expansion surrogate model over the deep neural network, because the PCE can quantify the uncertainty in its results/estimates while the DNN cannot.
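
To make that concrete, here is a minimal sketch of a 1-D PCE surrogate using only numpy; toy_model is a hypothetical stand-in for an expensive solve, not anything from COMSOL or Ansys:

import math
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite basis

def toy_model(xi):
    # Hypothetical stand-in for an expensive FEA solve, input xi ~ N(0, 1).
    return np.exp(0.3 * xi) + 0.1 * xi**2

rng = np.random.default_rng(0)
xi_train = rng.standard_normal(200)   # sampled uncertain inputs
y_train = toy_model(xi_train)         # the "pre-computed" solutions

# Least-squares fit of Hermite coefficients up to degree 4.
c = He.hermefit(xi_train, y_train, deg=4)

# The surrogate is now cheap to evaluate at slightly different inputs.
print(He.hermeval(np.array([-1.0, 0.0, 2.0]), c))

# The uncertainty quantification falls out of the coefficients: with this
# orthogonal basis, mean = c[0] and Var = sum over k>=1 of c[k]^2 * k!.
var = sum(c[k] ** 2 * math.factorial(k) for k in range(1, len(c)))
print("mean:", c[0], "variance:", var)

That closed-form mean/variance at the end is exactly the kind of uncertainty estimate a DNN surrogate does not give you for free.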

Thanks for the quick and detailed reply. I re-ran my 7950X3D, results below in Win 11 24H2 to compare.

7950X3D, CO -25/-10, DDR5-6200 CL28 59 ns latency, forced 55555555 affinity, Win 11 24H2
29m 4s (1742s) default -numasets 1
28m 20s (1700s) with -numasets 2

Yes, the BLAS switch will probably not improve things. Also, based on the core usage it is loading all physical cores, so Linux is already avoiding HT. However, -numasets 2 should give an improvement as shown above; it instructs COMSOL to reduce core-to-core communication across the CCD boundary, treating each CCD as if it were a separate NUMA node.
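
Assuming the switch is passed on the command line like the other flags discussed here, that would look something like:

$ ./comsol62_benchmark_CFD_only_10GB_Linux_x86-64.sh -numasets 2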

Also note that X3D will give a 3-5% advantage. Here’s hoping they do a dual-CCD X3D for the 9950X3D; that should be very good if they do it.


I tested my 7950X3D and saw a 12.4% advantage for X3D. Post #143.



This affinity trick made the distribution of the workload between threads uniform in Task Manager, and the PC ran much cooler and quieter (max frequency I saw was 5.4 GHz), but I don’t see any increase in performance.
I mentioned my RAM earlier; it’s two sticks of KF560C32RS-48 running at 6000 MT/s CL32, no manual tuning. After trying everything including PBO, I believe memory bandwidth is the bottleneck here.


You’re running an AMD 9950X, right? Yeah, just tuning my RAM alone got me a 30% faster score, from 36m13s down to 25m30s. A little PBO didn’t help any more though. This benchmark is pretty heavy on memory I/O.

@gluon-free , as lemma mentioned in their insightful comments in the other thread, even a little RAM tuning can provide significant uplift for some workloads without pushing speeds into potentially unstable territory.

I’m guessing you could probably get at least 10% improvement by

  • confirming 1:1 uclk=mclk if it isn’t already
  • fabric clock at 2000MHz if it isn’t already
  • snug up tertiary/secondary timings a little
  • leave primaries mostly alone at your CL32 XMP profile

I’m curious about the results if you decide to tune your RAM. Thanks and cheers!


Yes, I manually set 1:1 mode in BIOS. I wanted to try a manual memory overclock to 6200 MT/s at 1.375 V with tighter manual timings, but I didn’t find a good set of custom timings for a similar 48 GB Hynix M-die module suitable for the 9950X.


I see… My limited experience with 2x48GB Hynix M-die on the 9950X says to not worry as much about pushing past 6000 MT/s and to spend more effort on tuning timings. I didn’t run those configurations in my benchmark, but anecdotally 6000 MT/s tuned will outperform 6400 MT/s with loose timings. Even 5600 MT/s tuned would be better in some tests, I’m guessing.

Have fun and keep us posted if you find a happy config for your rig!

Well, I’m no Wendell, but I acquired a 265K and figured I’d throw this into what testing I’m doing with it.

265K with 2x16GB DDR5-6400 32-39-39-84, under an Enermax Liqtech III 360 AIO. XMP on, otherwise very literal “out of box” settings, so PL1 residency for the vast majority of the benchmark at a whole 125 W. Win11 24H2.

32m00s, backed up with a 31m17s. Temps in the mid-50s C.

Might be a little bit of truth to the whole “nobody got fired for buying Intel” thing. Will likely retest later with some more power, but I kinda like the relative sanity of PL1 as an out-of-box setting; this only wound up on a liquid cooler because it was priced competitively to big tower coolers.

Edit: Oh yeah. Gigabyte calls F8 the “launch” BIOS (and it was released to their website in September), but the board I have is on F5.

E2: Updated to F10c; the OOB default + XMP config regressed to 33m40s even though power limits climbed to 250/250 from 125/250. Ring clock also dropped by 100 MHz for unclear reasons (I’ve seen 3900 referred to quite often). Re-enabling the tick boxes for “high bandwidth” and “low latency” modes brought it generally back in line at 30m40s or so, and then just bumping the max ring clock to 4200 brought a further improvement:

Scaling isn’t amazing on that tweak, but it was noticeable. This is also still somewhat “slow” memory versus what Arrow Lake can run, but I don’t think a 50% increase in memory speed is terribly likely to make enough gain to justify the cost. As it sits, the platform combo was priced like buying a 9950X and the same kit of RAM, and the motherboard was actually the factor that made me go for it.


Did you try to run your current memory faster? If you have A-die SR it should run at 7200 or higher; it would be interesting to see scaling with higher speeds on the 265K. One interesting test would also be to run it with the -np 8 switch, so we can compare it 8 core vs 8 core and get an idea of how good the big cores in Arrow Lake are. Also try to tweak the timings if you can; that can have a big impact. Or send us your AIDA bandwidth and latency numbers and we can get an idea from those of how tuned your timings are, for a more apples-to-apples comparison with Ryzen. Tuned timings could be worth 10% or more in this test vs the same clock with just XMP values.
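
For example (assuming the benchmark accepts the switch on the command line, and that the 265K’s eight P-cores enumerate as logical CPUs 0-7; with no HT on Arrow Lake the mask is just FF):

cmd.exe /c start “” /AFFINITY FF “E:\comsol62_benchmark_CFD_only_10GB_Win_x86-64.exe” -np 8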

My current experience tuning DDR5 is a fat zero and as a result I haven’t done a whole lot of tuning. I was mostly curious initially of what sort of OOB results could be had, and it really wasn’t bad, though the level of seeming fall-off with the newer BIOS is slightly alarming. Also my first ever(!) Intel system.

Just finished up trying 7200 using timings provided by Gigabyte’s AI Snatch tool (1.45 V, something like 36-42ish?) and it spit out a 30m45s and a 30m33s with the BIOS in “Unleash” mode. However, I ran AIDA as well, and got bandwidth /and/ latency figures within 1 GB/s and 1 ns of the 6400+LL/HB config as best I remember, so it running basically the same is entirely unsurprising.

Other things I’ve seen: there seems to be some type of clock dropping past 75 C that isn’t registered as anything noteworthy in HWiNFO; performance just kinda drops off even though all the claimed clocks are still there. Also a lot of odd interaction between P/E/ST workloads: at the default 250 W power limits, increasing the allowed multiplier in 8-core-active workloads decreases MT performance, but increases ST without actually changing the individual core multiplier allowances… while not throttling on current, power, or temperature. Allegedly.



On a Mac mini M4 16GB, CFD only 10GB:
46min 18s
This may seem surprising; single-core performance is a big advantage, even if the overall specifications are not high.


Thanks for sharing this! I was curious about how the M4s performed.

The virtual memory reported has me wondering if the benchmark is being slowed down by swapfile usage, though. The M1s also reported high swapfile usage, but their scores were pretty decent as well, which made me think it was some kind of weird error in reporting; maybe it has something to do with *nix OSes reporting mmap() reservations as actual virtual memory. I’m going to look into this because it keeps popping up.
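
As a quick illustration of that hypothesis, here is a Linux-only Python sketch (macOS has no /proc, but the mechanism is the same idea): a large anonymous mmap() inflates reported virtual memory immediately, while resident memory only grows as pages are actually touched.

import mmap, os

def vm_lines():
    # Read the process's virtual (VmSize) and resident (VmRSS) figures.
    with open(f"/proc/{os.getpid()}/status") as f:
        return [line.strip() for line in f if line.startswith(("VmSize", "VmRSS"))]

print(vm_lines())
buf = mmap.mmap(-1, 4 * 1024**3)   # reserve 4 GiB anonymously; nothing resident yet
print(vm_lines())                  # VmSize jumps by ~4 GiB, VmRSS barely moves
buf[0:4096] = b"\x00" * 4096       # touching a page is what makes it resident
print(vm_lines())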

In fact, I think the virtual memory usage statistics are due to mismetering: the monitor did not observe any swap usage on this Mac, and the total disk is only 256GB, so something is obviously wrong.

Good result for 4 cores. I tried -np 4 on a 7950X with PBO and tuned DDR5-6200 CL26 memory, to compare core to core, and it was 55m 44s on 4 cores, so the M4 cores are a good 20% faster than Zen 4. It’s an interesting test of IPC, as it’s not limited by memory. It would also be interesting if anyone can run it with -np 4 on a Zen 5 CPU and/or the new Intel ones, as above ~8 cores it becomes more of a memory test. FYI, using all 16 cores on the same 7950X was 29m 49s.