Workstation for Monte Carlo Simulations

Threadripper has 64 cores now, wow. Maybe a machine like that could approach the power of the 64-core cluster my colleague could use before he graduated. 2 GB of RAM per core, plus a share of the 256 MB cache, should be enough for each core to load the geometry file, compute a trajectory, and save the results; the more cores, the more trajectories can be computed at once.
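The pattern I have in mind is embarrassingly parallel, roughly like this sketch (hypothetical Python; the geometry and physics calls are placeholders, not actual EGSnrc/GEANT4 calls):

```python
import multiprocessing as mp
import random

def load_geometry(path):
    # Placeholder: the real code would parse the voxel geometry file here.
    return {"path": path, "voxels": 1_000_000}

def run_trajectory(args):
    geometry_path, seed = args
    geometry = load_geometry(geometry_path)   # each worker holds its own copy
    rng = random.Random(seed)                 # independent random stream per trajectory
    # Placeholder physics: a random walk standing in for particle transport.
    deposited = sum(rng.random() for _ in range(geometry["voxels"] // 100))
    return seed, deposited

if __name__ == "__main__":
    jobs = [("suite_geometry.txt", seed) for seed in range(256)]
    with mp.Pool() as pool:                   # one worker per logical CPU by default
        for seed, deposited in pool.imap_unordered(run_trajectory, jobs):
            print(f"trajectory {seed}: deposited {deposited:.1f}")
```

More cores just means more of those independent jobs in flight at once, which is why I was thinking in terms of per-core RAM rather than single-thread speed.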

Huh, even Lenovo sells workstations like these. Wow, a $7,500 difference between the 5975WX and the 5995WX. Do you know what the ~45% discount is about or when it expires? That discount is really what makes one of those builds affordable for me. Both Lenovo and Puget list a 64-core Threadripper PRO system for about $11-12k, which is pretty close to the high end of my budget.

I assumed Intel must have something to compete with. Is that what the Xeon W is for? The Threadripper and Epyc systems had by far the most discussion, though.

An “X” suffix means it has more cache, and 3rd gen (meaning Milan or Genoa?) can be more economical per core. Thank you, I will keep those tips in mind, though I am not sure I ever saw a Milan-X mentioned anywhere. The Epyc systems seem to come in those thick pizza-box chassis rather than regular towers, however. Can we just set one of those on a desk too?

128 cores sounds amazing. We were hoping to avoid the extra variables of building something ourselves, since we are not that experienced with which hardware works well together.

twin_savage, it was cool to learn what MHD and FGMRES are. That is some drastic time reduction! Was that reduction just from changing the CPU? How kind of you to offer to run a sample! Our simulation would not be classified as FGMRES, though. I can send you some installation/compilation instructions for Windows or Linux.

Can you give me some stuff I can run on the hardware I have? Happy to run the numbers, as it were. Especially if it’s easy for me to run the numbers.


While the Xeon cores are slightly faster than the Threadripper Pro cores, they certainly aren’t ~7.5 times faster, as the run times would suggest. The main reason the Xeon system is so much faster is that it has much more memory bandwidth than the Threadripper, despite the core-count deficit.
My understanding is that at least some Monte Carlo simulations are also memory-bound workloads, which is why I bring up the example.
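If you want a rough feel for how much effective bandwidth a box has, a quick NumPy pass like the one below gets you in the right order of magnitude (a hypothetical helper, not a proper STREAM run, and single-threaded, so treat the number as a floor):

```python
# Rough effective memory-bandwidth check (needs NumPy).
# A triad-style pass (a = b + s*c) moves roughly three arrays' worth of data.
import time
import numpy as np

n = 100_000_000                      # ~0.8 GB per float64 array; shrink if RAM is tight
b = np.random.rand(n)
c = np.random.rand(n)
s = 1.5

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    a = b + s * c                    # streams b and c in, writes a out
    best = min(best, time.perf_counter() - t0)

bytes_moved = 3 * n * 8              # read b, read c, write a (8 bytes each)
print(f"~{bytes_moved / best / 1e9:.1f} GB/s effective (single-threaded)")
```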

Could you send instructions for installation/compilation on both Windows and Linux? I’ve been observing fairly big performance deltas between Windows and Linux with my specific simulation on the current Xeon, and I’m curious to see whether that manifests with other workloads as well.
Linux seems to be performing about 30% faster than Windows, and I suspect the Windows thread scheduler isn’t handling the Xeons as well as the Linux one does.

@wendell
Not trying to hijack, but… want to try my simulation workload too? I’ve got it in separate self-contained runtimes that’ll execute on Windows/Linux x86 and ARM.
The only issue with it is that its memory footprint scales with the number of cores you throw at it, so >256 GB of memory is required unless only a small number of cores are being used.

Sure? What’s my least-headache, step-by-step way to run something and get numbers?


Do you have a mini problem-set example that people can run on their own hardware? We can get you some data points from different systems so you can choose the best system for your budget.

I believe Serve The Home also offers a similar service of running confidential simulations on several hardware configurations to guide you to the optimum system for your budget.

I have a Genoa 9124.

If you get a multi-CPU system, you can usually run the CPUs at a higher clock speed.

For Linux, grab comsol61_benchmark_EM_only_200GB_linux.sh out of:
https://drive.google.com/drive/folders/1YX0rqS85H-Z1rzjLTw_k6776FSBppKEB?usp=sharing

Make sure it’s executable and run the script from a terminal, e.g. `chmod +x comsol61_benchmark_EM_only_200GB_linux.sh` and then `./comsol61_benchmark_EM_only_200GB_linux.sh` (it does not need a root account, I’m just too lazy atm to properly set up xrdp).

You’ll get the following window; all you need to do is click the “Compute” button and wait ~20 minutes for it to complete.

The solve time will be displayed here:

This is what typical resource consumption looks like; it’s purposely not using hyperthreading, to achieve the best performance:


For Windows you’ll need to grab the comsol61_benchmark_EM_only_200GB_windows.exe file from that same Google Drive folder and execute it. On Windows it will ask you to install a self-contained COMSOL runtime before the benchmark can actually be run, but other than that it is the same as Linux.

*There’s a super slight difference in the degrees of freedom solved for between Windows and Linux because the mesher is a tiny bit different between the two, but this shouldn’t be enough to drastically influence the results.
**Also, the benchmark isn’t perfectly deterministic; it’ll vary a bit run to run, but not by the margins I’m seeing between the two OSes.


================================================================================
Number of vertex elements: 390
Number of edge elements: 21726
Number of boundary elements: 627473
Number of elements: 13034930
Minimum element quality: 0.159

<---- Compile Equations: Stationary in Study 1/Solution 1 (sol1) ---------------
Started at Nov 6, 2023, 9:53:44 PM.
Geometry shape function: Quadratic Lagrange
Running on AMD64 Family 25 Model 17 Stepping 1, AuthenticAMD.
Using 1 socket with 16 cores in total on w11pnative.
Available memory: 98.07 GB.
Time: 165 s. (2 minutes, 45 seconds)
Physical memory: 13.12 GB
Virtual memory: 13.67 GB
Ended at Nov 6, 2023, 9:56:28 PM.
----- Compile Equations: Stationary in Study 1/Solution 1 (sol1) -------------->
<---- Dependent Variables 1 in Study 1/Solution 1 (sol1) -----------------------
Started at Nov 6, 2023, 9:56:28 PM.
Solution time: 13 s.
Physical memory: 15.34 GB
Virtual memory: 16.16 GB
Ended at Nov 6, 2023, 9:56:42 PM.
----- Dependent Variables 1 in Study 1/Solution 1 (sol1) ---------------------->
<---- Stationary Solver 1 in Study 1/Solution 1 (sol1) -------------------------
Started at Nov 6, 2023, 9:56:42 PM.
Linear solver
Number of degrees of freedom solved for: 100365915.
Solution time: 271 s. (4 minutes, 31 seconds)
Physical memory: 88.6 GB
Virtual memory: 98.94 GB
Ended at Nov 6, 2023, 10:01:13 PM.
----- Stationary Solver 1 in Study 1/Solution 1 (sol1) ------------------------>

Then it hit an out-of-memory error. I only have 96 GB on this box.

I am going to attempt to set 1 TB of virtual memory and then run it again, though the results will probably not be meaningful.


Thanks for trying anyways!!

I’ve actually got another benchmark that only has a 50 GB memory footprint, but it is an electromagnetics problem coupled to a CFD problem, which is super compute-intensive. I put it in that same shared drive as comsol61_benchmark_50GB if you want to run it.

Sure, I will run that tonight (maybe overnight).

It consumed 224 GB of RAM and took 5 hours 10 minutes, with the virtual memory being supplied by the Intel D7-5600 6.4 TB.


That is actually a lot better than I thought it would be.

The “50GB” benchmark took a w5-3435x 3 hours and 8 minutes to solve:

Do you know how much RAM your simulation requires? That needs to be in your budget.

Do you know if your Fortran code is compatible with the NVIDIA compiler, to run on GPUs?

For a given CPU architecture, i.e. Zen 4, you can compare systems directly to determine their value: roughly, price per (MHz × cores).
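As a concrete illustration of that metric (the prices and clocks below are placeholders, not quotes), something like:

```python
# Toy price-per-throughput comparison within one architecture.
candidates = {
    "96-core, 3.1 GHz all-core": {"price": 5300, "cores": 96, "ghz": 3.1},
    "64-core, 3.5 GHz all-core": {"price": 4200, "cores": 64, "ghz": 3.5},
}

for name, c in candidates.items():
    core_ghz = c["cores"] * c["ghz"]          # crude aggregate throughput proxy
    print(f"{name}: ${c['price'] / core_ghz:.2f} per core-GHz")
```

It only makes sense within one architecture, and it says nothing about memory bandwidth, which comes up next.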

You also mentioned that cache helps. AMD has an X line with a larger cache. They also have an F line that is frequency-optimized, i.e. more GHz.

You had mentioned cache; beyond that there is memory bandwidth. For example, the AMD Epyc 9004 line has 12 RAM channels and about 480 GB/s of bandwidth to main memory, while the AMD Epyc 7xx3 has about 80 GB/s of bandwidth to main memory.
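For a rough sense of where numbers like that come from, theoretical peak bandwidth is just channels × transfer rate × 8 bytes per transfer; a quick check for the 12-channel DDR5-4800 case (sustained numbers come in lower than this):

```python
# Back-of-the-envelope peak memory bandwidth: channels * MT/s * 8 bytes per transfer.
def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    return channels * mega_transfers_per_s * 8 / 1000   # GB/s

print(peak_bandwidth_gbs(12, 4800))   # 12 channels of DDR5-4800 -> 460.8 GB/s theoretical
```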

You should also check with other people who are running your package: are there optimizations for Intel or AMD?

The optional AI accelerators on some of the Xeon CPUs run some AI tasks 6 times faster. Though in that example they were running Stable Diffusion, and the end result was that a $13,000 Intel CPU would still be slower than a $600 NVIDIA GPU.

It is worth checking with your particular software vendor to determine whether some of the optional accelerators would be beneficial.

These are the AMD

The 50 GB benchmark took 6 h 12 min.

I will check later today that I have the performance profile set and that it is not configured to go to sleep. This one completed about half an hour after I woke up.


More memory channels would definitely bring that number down.

I just finished testing the “50GB” benchmark on an M630 with dual E5-2650L v4s and eight total channels of DDR4-2400; it took 8 h 21 min running on CentOS Stream and 8 h 48 min on Windows 10 21H2. That ~5% variance is within the normal range.


Yeah, 6 hours 13 minutes 29 seconds. At least it is consistent. When I get more RAM I will rerun it.

I know the AMD side better than the Intel side; on the AMD side, this is what I would recommend.
BTW, the only reason I am not recommending dual Epyc is that everywhere I looked, those were only available in prebuilt systems designed for GPU accelerators or with NVMe backplanes, and they started at $10k with a $1.2k CPU and no RAM or storage.

BTW, the Zen 4 cores are about 30% faster than the Zen 4c cores.

Motherboard, includes 10G Ethernet:
https://www.newegg.com/p/N82E16813183820
Cooler, per the manufacturer good for 500 W:
https://www.newegg.com/p/13C-000S-000M7
CPU, 96-core Zen 4:
https://www.serversupply.com/PROCESSORS/AMD%20EPYC%2096-Core/2.4GHz/AMD/100-000000789_368317.htm
SSD: this is what I am running. I got mine a bit lower; the division of Intel that made these was bought by Solidigm, and these drives still existed in their warehouse branded as Intel. They are dumping them now for some reason; mine had zero hours on it. They are rated to be filled three times a day for 5 years.
You may be able to get a better price elsewhere; mine was $359.

For reference, this is the price you will pay for that drive if you get it built in as part of a server:

and its cable:

The PCIe 5.0 ports on the motherboard are MCIO 8x instead of SlimSAS.

Or you can get a standard M.2 drive for one of the motherboard slots. The above drive is between 3x and 20x the speed of a typical M.2 SSD.

GPU:
You don’t NEED a GPU, but life will be more pleasant with one. The IPMI VGA port has a max resolution of 1280x1024. Probably an NVIDIA GPU, as there are more models you can accelerate with one, though you have the slots and budget to get an AMD GPU too for more flexibility, or even have one drive the monitor while the other gets used for pure compute.
I got this one because it is only 2 slots wide and air-cooled (I plan to use all of my slots):
https://www.newegg.com/pny-geforce-rtx-4070-vcg407012tfxxpb1/p/N82E16814133854

For power, the motherboard needs ATX plus 4-pin plus 8-pin plus 8-pin connectors. If you get a GPU for it, that will also need an 8-pin or better connector. Make sure your power supply can do that.

Rough power budget:
  • 25 W SSD
  • 400 W CPU
  • 150 W RAM (a complete guess; I don’t know what they need, but they get hot)
  • 200-600 W per GPU
Power supply, here is a basic one:
https://www.newegg.com/rme-corsair-rm1200e-1200-w/p/N82E16817139315

A basic PC case that is large enough to hold the motherboard:
https://www.newegg.com/p/11-129-274?Item=11-129-274&cm_sp=product-_-from-price-options

Spend the remainder of your budget on RAM, and try to fill all of the slots. You can see from the other discussion that RAM bandwidth makes a difference.

If you know that cache makes a significant difference to your speed, this CPU is available for your socket:
https://www.serversupply.com/PROCESSORS/AMD%20EPYC%2096-Core/2.55GHz/AMD/100-000001254_381212.htm

My build, with the $1,200 version of that CPU and a $600 GPU, cost about $3,500. I configured a Dell server as a duplicate of my computer and it came out to about $20k.


Purpleflame, please do take up the offers to run benchmarks. You could easily get multiples of the performance for your money by buying the right hardware.


Current-gen AMD Epyc for CPU compute tasks:
96-core Zen 4, 2.4 GHz base, 3.1 GHz all-core under load
768 GB ECC DDR5-4800 RAM, 12 channels of 64 GB each
6.4 TB U.2 enterprise SSD
12 GB current-gen NVIDIA GPU (RTX 4070)

Epyc Genoa CPU compute system, $10k budget:
$800 motherboard: Supermicro, E-ATX, with 10G and IPMI
$70 heatsink, rated up to 500 W (includes thermal paste)
$5,300 96-core Epyc Genoa, 3.1 GHz all-core
$400 6.4 TB U.2 drive, Intel D7-5600, 3x to 20x M.2 speed
$50 cable for U.2
$600 RTX 4070
$170 power supply
$70 case, silenced with sound insulation and baffles

800 + 70 + 5300 + 400 + 50 + 600 + 170 + 70 = 7,460

$7,460 + 10% tax ≈ $8,200

That leaves about $1,800 of the budget for RAM.
$190 per 64 GB stick on Server Supply; eBay is more than that now.
$2,280 for 12 sticks; $2,508 for 12 sticks with 10% tax.

$10,708 total with 10% tax.

I configured an identical system as a Dell server:
the GPU they supply is 1/4 the speed;
they require water cooling for an additional $1.2k, but then the button went away;
$57k total.
I can’t attach a PDF of the config.

Hope you all are well, Exard3k, greatnull, Lt.Broccoli, MikeGrok, twin_savage, and wendell. Apologies for my delayed reply; work and people keep me busy. On the upside, I can consider a project like this one. Let me try to respond to all the main points I see. Apologies for the long post!

I am most humbly grateful for the offers to run some tests of this software. A pleasant surprise! A little previewing/testing helps before plunging into such an investment.

I am still going to look for an EGSnrc example to run. The GEANT4 package comes with some test examples, one of which approximates one of the types of simulations I want to run, though it models an old machine and uses orders of magnitude fewer voxels and particles. While I compiled it for my desktop and ran it, the executable does not seem to be fully self-contained: it still needs the GEANT4 libraries installed on the machine. So it seems that testing will require compiling those libraries first and then compiling the examples to run them. That makes it more than a simple run-and-see situation, so I am not sure how much I can ask. This post is long enough; I will post the information for anyone still interested.

From the few compiled runs I did, I learned that I might need between 0.5 and 25 GB of output (a text file of numbers) depending on whether I want to simulate up to a whole 4-room suite. My i7-7700, GeForce 1080, 8 GB RAM desktop struggles.

I did try some online configurators; here are the results (I am not sure how to insert links to the respective webpages on this forum). There are interesting similarities and differences. It seems the main issue I need to figure out is choosing Threadripper or Xeon W; I am not aware of a similar website for Epyc, so that was not included. I have been thinking I could add 1-2 GPUs for additional value and variety of work, and I kind of like the additional possibilities: rendering the resulting images in 3D rather than just 2D, GPU-accelerated simulations via PyTorch, AI inference for Go (the game), image generation, etc. Maybe going a generation old is worth a thought, though I am not sure about future-proofing or upgrading in that case.

  • System76: $11,182
    • Threadripper Pro 5995WX, RTX A4000 16 GB, 128 GB RAM, 2x 1 TB SSD
    • I happened across this vendor. Apparently they are less than 100 minutes away from me. Who knew! Do any of you have good experience with them?
  • Puget Systems: $11,672
    • Threadripper Pro 5995WX, RTX A4000 16 GB, 128 GB RAM, 1x 1 TB SSD
  • Lenovo: $11,708
    • Threadripper Pro 5995WX, T400 8 GB, 128 GB RAM, 2x 256 GB SSD
    • The quote includes a 45% discount (via a code?). I am not sure how much I can rely on that, though my work seems to have suddenly fallen in love with Lenovo, so maybe I can get a deal through them.
  • Dell: $10,915
    • Intel w9-3475X, RTX A4000 16 GB, 128 GB RAM, 1x 512 GB SSD
    • Lenovo seems to only sell the prior generation of these CPUs.

I am not sure about the total RAM. Initially I thought 2 GB/core would be fair; given that I may ultimately have to write up to 25 GB, I am not sure.
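Since that 25 GB is output headed for disk rather than data that has to sit in RAM all at once, I could probably write results out in batches so per-core memory stays near the 2 GB estimate. A hypothetical sketch (not how EGSnrc or GEANT4 actually structure their output):

```python
# Hypothetical sketch: stream results to disk in batches instead of holding
# the full multi-GB output in memory at once.
import csv

def simulate_batch(batch_index, batch_size):
    # Placeholder for one batch of trajectory results (id, deposited dose).
    return [(batch_index * batch_size + i, 0.0) for i in range(batch_size)]

with open("trajectories.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trajectory_id", "deposited_dose"])
    for batch in range(1_000):
        writer.writerows(simulate_batch(batch, batch_size=10_000))   # written and freed batch by batch
```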

The EGSnrc Fortran code is officially intended to be compiled with GNU Fortran (gfortran). I will have to read about NVIDIA’s CUDA Fortran. There are some academic papers (2011-2013) claiming to run it on a GPU with up to 60x the performance, though without links or code included. I am hoping to get that working on my machine if I can. Maybe that means I should think about GPUs too? EGSnrc is better for my therapy-oriented simulations, while GEANT4 is more shielding-oriented and general-purpose.

I am not sure that MHz × cores is enough. It seems the number of channels, bandwidth, and bus speed(?) are important, judging from the posts here. Thanks for the pointer to the F line; I had not heard of or seen it. I will look into whether anyone mentions optimizations for Intel/AMD. There is no vendor; both are open-source libraries.

MikeGrok, I think we will have to discuss these configurations more. 96 cores is enticing, and you seem to demonstrate the economic power of building it myself.

I expect more cache means each core is better able to hold the data it needs and its computation results. As twin_savage says, a big cache may be a hindrance if data cannot be moved in and out of it well.

Which one? It would be helpful for people to know what to test. Can the test be easily modified to approximate your simulation?

There is a bit more to it. Threadripper and Threadripper Pro have different memory configurations: Threadripper Pro has double the memory channels, which can really matter. Xeon-W is power hungry (and hot), but does have some accelerators that Threadrippers don’t. In recent generations of hardware, Threadripper is better for things that aren’t accelerated, but testing is truly needed.

The new Threadripper 7000 series has been announced; it should perform similarly to the Epyc 9004 series. You may not have the time to wait, though. How soon are you looking to purchase?

I would try to get a GPU with more memory if you’ll be doing computation on it. But if you don’t know what you’ll do with it yet, you can upgrade it later, provided the system integrator has put in a powerful enough power supply and included the extra cables. This is a good argument for building your own.

Most general-purpose machines have 4 to 8 GB of memory per core these days. You are building a machine to fit a specific need, but I would still aim for 4 GB per core if you can; it will give you more flexibility for future uses. Also, make sure all memory slots are populated to give yourself maximum memory bandwidth, which is likely to be a bottleneck on high-core-count machines like the 64-core 5995WX.
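A quick sizing check for that 4 GB/core target on a 64-core part, assuming an 8-DIMM-slot board so every channel stays populated (the slot count is an assumption; check the actual board):

```python
# Sizing check: per-core RAM target spread evenly across all DIMM slots.
cores = 64
gb_per_core = 4
dimm_slots = 8                 # assumption; use the actual board's slot count

total_gb = cores * gb_per_core
print(f"{total_gb} GB total -> {total_gb // dimm_slots} GB per DIMM")   # 256 GB -> 32 GB DIMMs
```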

I would! If it runs on a GPU, try to get someone to benchmark it for you. Sometimes a GPU can be far more cost-effective, but not always! It largely depends on how parallelizable the data processing is and how few branches the code takes.

Definitely not.

Yes! I run one application that can fully saturate a memory channel with just two cores! If I were building a system for that, I’d get a 16-core with 8-channel memory. General-purpose machines tend to start hitting bottlenecks around 8 cores per channel; some applications have small working sets of memory (like Blender) and are fine with 16 cores per channel. Memory bandwidth has not kept up with CPU performance, but those ratios roughly hold across generations and vendors.
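Putting that rule of thumb into numbers (the per-core figure below is a placeholder; measure your own solver):

```python
# Rough cores-per-channel estimate with placeholder numbers.
channel_gbs = 38.4            # one DDR5-4800 channel, theoretical peak
channels = 8
per_core_demand_gbs = 5       # placeholder for what one solver thread streams

cores_before_bottleneck = channels * channel_gbs / per_core_demand_gbs
print(f"~{cores_before_bottleneck:.0f} cores before memory bandwidth becomes the limit")
```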

In general, open source will adopt any performance enhancements available from either vendor within a few months of the CPUs being out. You may need to recompile the software to enable those enhancements (for example, rebuilding with `-march=native` on GCC), as software from distributions is often compiled for generic hardware for broader compatibility.

Cache effectiveness really depends on the size of the working data. It also depends on whether the cores are working on the same data, so they aren’t fighting over the L3 cache. If the L3 isn’t big enough to hold all the data for all the cores sharing it, it doesn’t really help much, and this is where main memory bandwidth really matters. A bigger L3 cache is rarely harmful, but it may or may not be beneficial. I would only spend money on a bigger L3 cache if you know your working set across all cores fits; benchmarking will show that.
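A crude way to apply that rule (the sizes below are examples; use your CPU’s real L3 and a measured working set):

```python
# Crude check: does the combined working set of the cores sharing an L3 fit in it?
l3_mb = 32                    # e.g. the L3 shared by one Zen 4 CCD; check your CPU
cores_sharing_l3 = 8
working_set_mb_per_core = 3   # estimate or measure from the solver

needed_mb = cores_sharing_l3 * working_set_mb_per_core
print("fits in L3" if needed_mb <= l3_mb else "spills to RAM, so bandwidth dominates")
```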


64-core TR is 1 hour 6 mins with 128 GB RAM.

Thanks, y’all.
