Workstation for Monte Carlo Simulations

@dahlia123 was experiencing performance on dual EPYC 9654s in Windows similar to your dual 9754s. We were hypothesizing that it was a Windows scheduler problem, with the scheduler not utilizing the hardware fully.
It’d be super interesting to see some of these AMD platforms running the same benchmark on Linux; that way we’d have more solid proof that Microsoft is the source of our problems.


On Linux the “200GB” benchmark will silently crash if you run out of memory, which may happen on some of the higher core count CPUs with “only” 256GB of memory because the benchmark’s memory usage scales weakly with core count… but if the “50GB” benchmark is crashing on Linux then something else is going on. I wouldn’t mind helping troubleshoot if that is the case.
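If anyone wants to confirm whether a silent crash was the kernel’s OOM killer, a quick check on Linux (a generic sketch; the exact log wording varies between kernel versions) is:

```bash
# Look for OOM-killer activity in the kernel log since boot
sudo dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# On systemd-based distributions, the kernel journal works too
journalctl -k -b | grep -i "oom"
```

If the COMSOL process shows up there, the run died from memory exhaustion rather than a solver error.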


“New” Comsol benchmark error?

Is it from only 128GB of RAM?

I tried to run the 260GB file in the folder and received this error:

But, the 50GB benchmark ran without an issue:

Result - 7h:27m:21s

When I initially ran the 50GB benchmark on a 7995WX system with 256GB RAM, I got 9h:43m:24s. So, seemingly at random, this run is over two hours faster with half as much RAM.

System:

[System Summary]

Item	Value	
OS Name	Microsoft Windows 11 Pro	
Version	10.0.22631 Build 22631	
Other OS Description 	Not Available	
OS Manufacturer	Microsoft Corporation	
System Name	DESKTOP-HUGHT75	
System Manufacturer	ASUS	
System Model	System Product Name	
System Type	x64-based PC	
System SKU	SKU	
Processor	AMD Ryzen Threadripper PRO 7995WX 96-Cores, 2501 Mhz, 96 Core(s), 192 Logical Processor(s)	
BIOS Version/Date	American Megatrends Inc. 0404, 12/20/2023	
SMBIOS Version	3.6	
Embedded Controller Version	255.255	
BIOS Mode	UEFI	
BaseBoard Manufacturer	ASUSTeK COMPUTER INC.	
BaseBoard Product	Pro WS WRX90E-SAGE SE	
BaseBoard Version	Rev 1.xx	
Platform Role	Workstation	
Secure Boot State	On	
PCR7 Configuration	Elevation Required to View	
Windows Directory	C:\Windows	
System Directory	C:\Windows\system32	
Boot Device	\Device\HarddiskVolume1	
Locale	United States	
Hardware Abstraction Layer	Version = "10.0.22621.2506"	
User Name	DESKTOP-HUGHT75\COMSOL	
Time Zone	Pacific Standard Time	
Installed Physical Memory (RAM)	128 GB	
Total Physical Memory	127 GB	
Available Physical Memory	104 GB	
Total Virtual Memory	144 GB	
Available Virtual Memory	118 GB	
Page File Space	17.0 GB	
Page File	C:\pagefile.sys	

Yeah that’s it running out of memory and getting upset. It’d probably take slightly more than 256GB of memory to get that benchmark to run on that many cores.


This is interesting; I think this is a genuine improvement in performance rather than testing variance. Is the 128GB of memory significantly faster than the 256GB used previously?
The benchmark should scale fairly strongly with memory performance; as long as enough memory is installed to complete the test, extra capacity doesn’t improve performance.


I would suggest you look into a used workstation from HP or Dell. I use a lot of HP Z workstations for a very good reason: they last! They are designed to run rack-mounted at 100% CPU load for the entire three years of the manufacturer’s warranty, which means the power and cooling systems are not ad hoc; they are engineered solutions. I love building computers, but when I need computers to work every day because they make money for me (my company), we always go with workstations. I would also consider running your systems from 208/240V power; it’s far more efficient for the power supplies. A dual-CPU HP Z840 with E5-2687 v4 processors and 512GB of memory can be had for less than $1,300. It will have a 1200-watt power supply; add whatever drives you like. If you move up to a Z8, you can get even more cores/threads as needed.


Consolidated Comsol Benchmarks

*For some of these benchmarks I didn’t realize I should’ve been jotting down RAM channels, so I listed them as “?”.

Results are formatted hours:minutes:seconds.

Sorted by CPU core count.

The benchmark was run with no special commands.

| CPU | RAM | RAM Channels | Operating System | 50GB | 200GB |
|---|---|---|---|---|---|
| Intel Xeon W5-3435X (16-Core) | 256GB | ? | Linux | 3:08:13 | 17:25 |
| 2x Intel Xeon E5-2650L (8-Core) | 256GB DDR4-2400 | 8 | Linux | 8:21:00 | |
| 2x Intel Xeon E5-2650L (8-Core) | 256GB DDR4-2400 | 8 | Windows 10 21H2 | 8:48:00 | |
| AMD Ryzen Threadripper 7960X | 128GB DDR5-6400 | 4 | ? | 3:15:00 | |
| AMD Ryzen Threadripper 7970X (32-Core) | 256GB | 8 | Windows 11 | 3:58:43 | 36:54 |
| AMD Ryzen Threadripper 7970X (32-Core) | 256GB | 8 | Linux | 3:30:12 | 19:46 |
| 2x EPYC 9374F (32-Core) | 1.5TB DDR5 | 16 | Linux | | 13:20 |
| AMD Ryzen Threadripper 7980X (64-Core) | 128GB DDR5-6400 | 4 | Windows 11 | 1:09:57 | |
| AMD Ryzen Threadripper 7995WX (96-Core) | 256GB DDR5-4800 | 8 | Windows 11 | 7:39:42 | 19:07 |
| AMD Ryzen Threadripper 7995WX (96-Core) | 256GB DDR5-4800 | 4 | Windows 11 | 10:12:35 | 43:35 |
| AMD Ryzen Threadripper 7995WX (96-Core) | 256GB DDR5-6000 | 4 | Windows 11 | 7:39:42 | 17:32 |
| 2x EPYC 7773X (64-Core) | 256GB DDR5-4800 | 8 | Linux | 9:43:11 | 23:41 |
| EPYC 9754 (128-Core) | 384GB | ? | Windows 11 | 8:30:19 | 25:19 |
| EPYC 9754 (128-Core, Zen 4c) | 386GB | ? | Linux | | 19:51 |

If anyone can run the benchmark, feel free to add to the post! I’ll add you to the table. All data is welcome.

Thank you @dahlia123 for all your benchmarks :slight_smile:

I’m going to do more tests to confirm the -np and NPS info from their testing on Linux, so will update soon™


Hi all,
I have been reading through your forum posts and am very curious about this :slight_smile:
My colleague and I are looking into building a new PC for simulating in COMSOL!
For starters I want to be sure that I read this post correctly - looking at all the benchmarks you have been running, it seems that we want the models to finish as fast as possible, and from @Level1_Amber’s post on the 16th of Jan I can basically read that the Threadripper 7970X is trashing the 128-core EPYC processor when looking at cost and cores vs. performance? :slight_smile:

He is the user, and I will be looking into the technical part of building the system!
At the moment he is using an i9 9900X with 128GB of RAM.
When he is running one basic electroacoustic sim, he is utilizing around 60-70% CPU and around 30GB of RAM (this is optimized so we can run more sims at the same time without using too many resources, since that would increase the time used exponentially).
(We are trying to run as many simulations as possible at the same time in order to reduce time, but at the moment 2 parallel simulations scale from 40 min each to 1 h each when run at the same time :frowning: )

Sadly the model he is running is filled with a lot of company secrets, but we will look into making a model that is very alike, yet more generic!
It’s a wasteland of information in the simulation world, and as far as I can read from all of your benchmarks so far, it all comes down to which models you are running and how the PC and OS handle them.

Would it be possible for me to upload “our version of a benchmark” so some of you could run benchmarks with it, just for us to clarify whether it would make sense to build a 2x 7773X system on a server platform, or if we should “just” aim for a 7970X? :smiley:
Thank you in advance!
Looking forward to reading your answers/recommendations :slight_smile:

Welcome!

This is true, and it will be for almost all scenarios. The very high core count EPYCs are better suited to virtualization workloads where all the cores aren’t competing for relatively limited memory resources; that being said, a dual-socket F-series 9374F EPYC system is currently at the top of the benchmark performance charts in the thread linked at the bottom.

Predicting FEA performance is still more of an art than a science, although some trends do emerge: running on Linux always gives better performance than Windows, and smaller problems don’t take good advantage of larger systems with many cores and channels of memory.

No problem, I don’t mind running it on some of the machines I have.
Also, don’t count Apple out of the race; their old M1 Ultra is actually beating a Threadripper 7960X in the CFD+electromagnetic simulation (the Threadripper was running on Windows though, which is a handicap) in the following benchmark thread:

Thanks! - and thank you for the thorough response :slight_smile:

Sounds like I should take that into consideration as well :slight_smile: - although I am trying to compile some sort of performance/price comparison, since we are a semi-small company and buying a COMSOL PC for around $30,000 is considered a large investment; a $50k PC would require a serious argument and ROI case over the current 9900X system we have today :slight_smile:

Right now we are looking at a 7970X system with 256GB of RAM for ~$8k, versus a dual 7773X system with 2TB of RAM for around ~$31k.

We are open to all kinds of platforms; right now my colleague is using it on Windows because that is what he knows, but I have a sense that he would be open to change if it means more simulations and faster results :smiley:

Fantastic, thanks! - it might take a week or two for him to make a generic lookalike that comes close to the same demands as ours does now, but I will make sure that he makes a “scaled-down” version that runs well on the current system, and maybe a “full” sim that would be the dream scenario to investigate on a new, powerful system!

  • I noted down that right now the 7970X is the fastest confirmed system for the price, and I will look into a setup with dual 9374F :slight_smile:

And then I would love to see something from the M3 Macs, or, even more excitingly, wait to see what the new M4 (and maybe the M3 Max and Ultra) could bring to the table in the near future :slight_smile:


Another question that has come up is the availability of cloud-based solutions.

Do any of you have experience running COMSOL on larger cloud-based systems, with regard to price/performance/flexibility and such? :slight_smile:

That is a good question. I haven’t evaluated the cloud solutions in years, but ~2 years ago they were very, very uncompetitive with running local hardware.
The only situation where the cloud made sense was if you had a very large sweep-type workload that you needed done quickly, rather than a constant stream of simulations to run.
Because all my simulations are part of an iterative design process, I never had enough variations of a simulation to run at once for the cloud to make sense.

There’s also the IP security aspect to consider; depending on the sensitivity, it may be inappropriate to use even the GovCloud solutions.

I have gotten the same sense up until now.
Although in our case we have many different parameters and iterations we are looking into, so we thought it might be a time-saving investment rather than a cost-saving alternative :slight_smile:
As an example, we are looking at simulating around 177,000 different variations of the same device, and that would have to run serially (or at most with 2 simultaneous runs) on a local system.
But with a cloud-based option, we could “just” pay for a range of core-hours on, let’s say, 10x 32-core “setups” and pay for the time saved compared to the calendar years of research and sims it would take now (estimated 15 years :stuck_out_tongue: )

But to come back to the original subject!
We have been looking around, and COMSOL actually has a model of a loudspeaker unit with thermoviscous losses that resembles the workload we have at the moment :slight_smile:

[OW Microspeaker: Simulation and Correlation with Measurements (comsol.com)](https://www.comsol.com/model/ow-microspeaker-simulation-and-correlation-with-measurements-78121)

Sadly we don’t have the Runtime Toolbox, so we cannot make an .exe for you to just run, but right now we have a result for our own system:

CPU: i9 9900X @ HT: Enabled, Turbo Boost: Enabled
MBO: MSI X299 SLI Plus
RAM: 8x16GB - 128GB DDR4 @ CL15 2133MHz
GPU: Quadro RTX 4000
SSD: Samsung Evo 950 1TB
(We have dedicated a ~100GB pagefile as virtual RAM, which means that everything above 128GB will be read from and written to the SSD and slow the test down, but it allows the test to run despite the lack of physical RAM)

Results for everything so far:

50GB
7h 42m 15s

200GB
1h 34m 29s

“Study 3”
13m 49s

I would love it if “Study 3” could be made into some sort of easy executable so it could be tested on as many systems in here as possible!
But right now we are of course looking at the 7970X and the 2x 9374F as the primary options :smiley:

The link should contain everything needed to run the model, but it’s not as “handy” as the .exe’s :slight_smile:

Looking forward to hearing from you all, and to seeing what the tests say!

(And! If it is possible for you to try running simultaneous tests, to see how far we can push the capabilities, that would be awesome :smiley: )

This may very well be true. Among the cloud machines I was interested in trying but never got the chance to are AWS’s Graviton 3E and Graviton 4 instances, which at one point were basically being subsidized to try to gain customer traction.


On my Xeon w5-3435X machine, study 3 of ow_microspeaker takes 6m 49s when running one study at a time; if I run two instances of the study at the same time, it takes ~10m 10s to solve; if I run three instances, it takes ~19m 39s to solve.



At only 360k DoF, this particular example problem would actually run fairly well on an AMD system (in single runs and batch sweeps).
AMD systems tend to perform more poorly than expected on larger problems due to some fairly consequential trade-offs AMD made in order to scale to large core counts and to ease manufacturability.
Jason Rahman has some interesting articles on the differences in core-to-core latency between Intel and AMD; Intel is roughly an order of magnitude better than AMD when looking at single-socket servers, and when going to dual-socket servers Intel’s lead shrinks to roughly 5 times lower latency than AMD.

Nice, I’ll have to look into those! - At the moment we are looking into one of the Finnish supercomputers that offers enterprise usage!
And COMSOL also has a handful of “partners” in cloud computing that they recommend themselves, but their payment plans are not very intuitive :sweat_smile:

Nice! Thank you very much! - that is a great start! That’s already almost 3 times the speed when running 2 at the same time :smiley:

So you are saying that because it’s a relatively “small” problem, we might see an advantage on the Threadripper or EPYC system over the Xeon?

I am very much looking forward to seeing those results as well :smiley:

I was thinking: do you have the ability to boil down “Study 3” into a runtime edition like the ones in the benchmark folder on GDrive? :slight_smile:

These results were achieved without playing with the number of threads that COMSOL utilizes per study; I think tweaking that number might actually improve performance when running more than 2 instances at once.

Yes, in general AMD cores are more performant, and especially more efficient, than Intel cores when an entire problem can be solved on a single CCX (and, to a lesser degree, a single CCD if the CCD contains multiple CCXs); but when a problem needs to scale beyond the 8 cores AMD has per CCX, AMD’s performance scales poorly compared to Intel.
This phenomenon could be taken advantage of by running multiple sweep iterations on different CCXs simultaneously in a sweep study.
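As a rough sketch of that idea on Linux (the model file name and the 8-cores-per-CCX core ranges are assumptions; the -np/-inputfile/-outputfile batch options are the documented COMSOL ones, but check them against your install):

```bash
#!/usr/bin/env bash
# Sketch: pin one COMSOL batch job per 8-core CCX so each sweep iteration
# stays inside its own L3 cache domain. Assumes 8 cores per CCX, SMT off.
MODEL=ow_microspeaker_78121.mph   # placeholder file name

for ccx in 0 1 2 3; do
  first=$(( ccx * 8 ))
  last=$(( first + 7 ))
  # In practice each instance would be pointed at a different slice of the
  # parameter sweep (e.g. via COMSOL's -pname/-plist options).
  numactl --physcpubind=${first}-${last} --localalloc \
    comsol batch -np 8 \
      -inputfile  "$MODEL" \
      -outputfile "out_ccx${ccx}.mph" \
      -batchlog   "ccx${ccx}.log" &
done
wait   # block until all pinned instances have finished
```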

At only 360k DoF, the problem doesn’t parallelize as well as a larger problem would, so 8 cores sits much closer to the linear region of Amdahl’s law than, say, 32 cores would.
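For reference, Amdahl’s law puts a ceiling on this; with an assumed (purely illustrative, not measured) parallel fraction of p = 0.9 for a problem this small:

```latex
S(N) = \frac{1}{(1-p) + p/N}, \qquad
S(8) = \frac{1}{0.1 + 0.9/8} \approx 4.7, \qquad
S(32) = \frac{1}{0.1 + 0.9/32} \approx 7.8
```

Going from 8 to 32 cores would then only buy about another 1.7x, which is why 8 cores sits so much closer to the linear region.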

I’m going to attempt that when I have some free time. I wanted to see if I could expose the number of studies to run in parallel as one of the options when running the app from the top ribbon of the GUI.

Yeah, I have been reading here and there about separation into NUMA nodes, and that designating a certain core count can help in specific cases :slight_smile:

Thank you very much! - I am just grateful that there is someone able to help us narrow this question down, so that we can get the best possible solution :slight_smile:

I’m new to this and have a user request for a machine to run GATE from the international OpenGATE collaboration which is based on the Geant4 toolkit. Looks like they have example downloads, but I don’t have any comparable system to test with. I’ve read through this entire forum page and I’m still unsure whether AMD Threadripper Pro 7000 series, Intel W-series, or server CPUs would be the best choice. The budget is quite large at $30K and I would be buying new OEM, likely Dell or Lenovo. We are a large Hi-Ed institution and get significant volume discounts. The customer previously had an older Intel Xeon 32-core/64-thread (unsure which model) and 1TB RAM. Any help would be greatly appreciated. Thanks!

Welcome!

I get the impression that GATE isn’t very well optimized to run on highly parallel architectures, based on what I’ve read on GitHub (the documentation and some of the open issues requesting multithreading capability). Depending on how true this is, you may want to opt for the fastest single-core machine, which is typically a consumer-platform machine. Intel seems to be comfortably in the lead at the moment, with a couple of Apple SKUs coming close.

While not the be-all and end-all, PassMark’s single-threaded benchmark chart gives a good approximation of single-threaded performance as long as a core isn’t being memory-starved or doing heavy SIMD:


Hey,
Did you have the chance to test the benchmark on more systems?
And maybe an .exe version? :smiley:
Best regards,
Oliver


Hi,

Here are some benchmark numbers from a 2S EPYC 9654 system.

System
CPU: AMD EPYC 9654 (96 cores) 2S - HT off
M/B: Gigabyte MZ73-LM0 rev 2.0
Memory: Samsung M321R8GA0BB0-CQKZJ (DDR5 64GB PC5-38400 ECC-REG) 24EA (2 x 12ch, 1.5TB in total)
GPU: Nvidia GeForce GT 710 :smile:
OS: Windows 10 Pro for Workstations
NUMA setting: NPS = 2
COMSOL Ver = 6.1

“Study 3” benchmark with the “-np N” option

-np 192 = 55m 15s
-np 48 = 21m 23s
-np 24 = 13m 36s
-np 16 = 13m 0s
-np 8 = 13m 37s
-np 16 (8 instance studies) = about 20m (attached screenshot)

The trend indicates that as the number of processes (np) decreases, the computation time generally decreases, which is expected considering the latency between CCDs. However, past a certain point the time stops improving or even increases, suggesting that the system has reached the limit of useful parallelism for this study.

Running 8 instance studies with -np 16 each, finishing in ~20m total, or about 2m 30s per study, shows excellent parallel efficiency. This suggests that dividing the workload into multiple smaller chunks is efficient for EPYC systems.
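For anyone who wants to reproduce a sweep like this from the command line, here is a rough Linux sketch (the model file name is a placeholder; on Windows the equivalent entry point is the comsolbatch executable):

```bash
#!/usr/bin/env bash
# Sketch: time "Study 3" at several -np values on the same machine.
MODEL=ow_microspeaker_78121.mph   # placeholder file name

for np in 192 48 24 16 8; do
  echo "=== -np ${np} ==="
  /usr/bin/time -f "%E elapsed" \
    comsol batch -np "${np}" \
      -inputfile  "$MODEL" \
      -outputfile "out_np${np}.mph" \
      -batchlog   "np${np}.log"
done
```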

NPS = 0 or 4 could be considered further, but I’m currently unable to reboot the system. I will do this later on :grinning:

Note: For the 8-instance studies, Process Lasso was used to manage workload allocation in Windows 10. This program ensures that workloads are distributed evenly among NUMA nodes, avoiding the pitfall of all workloads being allocated to a SINGLE NUMA node, which can significantly degrade performance.
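For Linux users following along, a rough equivalent of that Process Lasso setup is to bind each instance to a NUMA node with numactl (assuming the 4 NUMA nodes a dual-socket NPS=2 system exposes; file names are placeholders):

```bash
#!/usr/bin/env bash
# Sketch: spread 8 COMSOL instances round-robin over 4 NUMA nodes,
# binding each instance's memory to the node its cores live on.
MODEL=ow_microspeaker_78121.mph   # placeholder file name
NODES=4                            # dual socket with NPS=2

for i in $(seq 0 7); do
  node=$(( i % NODES ))
  numactl --cpunodebind="${node}" --membind="${node}" \
    comsol batch -np 16 \
      -inputfile  "$MODEL" \
      -outputfile "out_inst${i}.mph" \
      -batchlog   "inst${i}.log" &
done
wait   # wait for all 8 instances to finish
```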
