Dual EPYC CPU workstation crashes ANSYS Fluent when run with more than 1 core

Hi all!

I’ve recently built a dual EPYC workstation for CFD. Assembly and setup went fine, but I’m running into an issue and hoping someone can nudge me in the right direction.

I mainly use ANSYS Fluent for my CFD work, and while it worked fine on my old PC, it’s exhibiting weird behavior on the new workstation: any time I try to launch it using more than one core, the program terminates with an error.

Workstation Specs:
Motherboard: MZ73-LM0 (rev. 2.0) AMD EPYC™ 9004 DP Server Board Gen5
2x CPU: AMD EPYC 9684X with 96 cores and 1152 MB L3 Cache
RAM: 24x 32GB DDR5 RDIMM 2Rx8 4800MHz
OS: Win11 Pro for Workstations
BIOS: SMT off

Error:

Connected License Server List: @localhost


Info: Your license enables 4-way parallel execution.
For faster simulations, please start the application with the appropriate parallel options.


Host spawning Node 0 on machine “Workstation” (win64).

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 11720 RUNNING AT Workstation
= EXIT STATUS: -1 (ffffffff)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 19632 RUNNING AT Workstation
= EXIT STATUS: -1073741819 (c0000005)

I’m wondering if anyone has thoughts on why the software might be having issues running on more than one core? It shouldn’t be a licensing issue, since by default Ansys allows 4 cores (and exit code c0000005 is a Windows access violation, not a license failure).

Windows fully updated.
Chipset driver installed.
BIOS on the newest version.

Any ideas what else I can try?

Thanks!

How big is the problem? The first thing that came to mind is that you don’t have enough memory for solving on more than one core, but it would have to be a pretty big problem for that to happen.
Just realized you said you’re getting that message right at launch, which makes me think licensing again; perhaps there’s some kind of NUMA licensing aspect beyond core licensing (I don’t remember this being true when I used to use Fluent, though)? If so, NPS could help.

What’s your NPS setting in BIOS? It might help to play with it.

I was able to start a previous version, Fluent 2023R1. That one loads multiple cores from both CPUs just fine, although I find the software slows to a crawl somewhere between 64 and 96 cores.

It seems to be something to do with MPI and the AMD EPYC platform.
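
In case it helps anyone else debugging this, here’s roughly how I’ve been testing whether a different MPI flavor avoids the crash (just a sketch: test.jou is a placeholder journal, and I’m going from memory on the -mpi values, so check fluent -help for what your build actually accepts):

```python
import subprocess

# Try a trivial 2-core launch with each MPI flavor to see whether one of
# them avoids the rank crash. Assumes the Fluent launcher is on PATH;
# test.jou is a placeholder journal that just starts the solver and exits.
for mpi in ("intel", "msmpi"):
    result = subprocess.run(
        ["fluent", "3ddp", "-t2", f"-mpi={mpi}", "-g", "-i", "test.jou"],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else f"exit code {result.returncode}"
    print(f"-mpi={mpi}: {status}")
```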

I’ll take a look at the NPS in BIOS and will report back. Thanks!

I have no clue yet, but did you get in touch with Ansys?
They should help you, and they can sometimes give you a specific version to best suit your use case.

Please also note that you are leaving a lot of CPU performance on the table with the W11 operating system. The usual figure is ~20% for CFD applications, but at your high core count it should be even more (see the last L1T video as an example).
That might be the reason why a CFD package (which should scale decently in cluster applications) struggles around 64 cores.

PS: I am really interested in the CPU scaling performance once your setup works with Ansys.
There is also a benchmark made by Twin_Savage if you have time to spend on this.

Hi! Thanks for your reply.

So after working with my vendor and continuing to have trouble with Win11 Workstation, I ended up following their suggestion and switching to RHEL 8.8. After some newbie Linux mistakes, I got the latest Fluent 2024R1 installed (and licensing moved, what a pain) and I am able to use all 192 cores. So yay!

To be honest, I am a bit underwhelmed; I expected faster performance from dual CPUs.

Doing some simple benchmarks on my own (600k mesh, multiphase), I’m seeing better sim times from using fewer cores than 192.

Partitioning and load balancing basically had no effect.
Using Intel MPI vs OpenMPI has no effect.

I still have to look into NUMA nodes and maybe switching my NVMe SSDs into a RAID 0 configuration.
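
For reference, here’s roughly how I’m driving the sweep and sanity-checking the NUMA layout (a sketch; bench.jou is a placeholder journal that loads the 600k case, runs a fixed number of iterations, and exits):

```python
import subprocess
import time

# Print how RHEL sees the NUMA layout before sweeping
# (numactl ships with RHEL).
print(subprocess.run(["numactl", "--hardware"],
                     capture_output=True, text=True).stdout)

# Time Fluent at several core counts; the counts are just the
# points I happened to test.
for n in (16, 32, 48, 64, 96, 128, 192):
    start = time.perf_counter()
    subprocess.run(["fluent", "3ddp", f"-t{n}", "-g", "-i", "bench.jou"],
                   check=True, capture_output=True)
    print(f"{n:>3} cores: {time.perf_counter() - start:8.1f} s wall clock")
```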

Time on the COMSOL 10GB CFD benchmark:
Server:
MZ73-LM0 (rev. 2.0) AMD EPYC™ 9004 DP Server Board Gen5
2x AMD EPYC 9684X with 96 cores and 1152 MB L3 Cache
24x 32GB DDR5 RDIMM 2Rx8 4800MHz
RHEL 8.8
SMT Off

Runtime: 18min 8s

Is there a good source for optimizing AMD EPYC server performance? Maybe even specifically for CFD?

Good news there!
Did your vendor say it was not possible at all on W11 Pro with this CPU?
Is RHEL what they advise? (I have a similar choice to make in the near future :p)

With CFD you can’t expect a small mesh to perform better on too many cores. There are simple rules to follow, as well as lots of ways to optimise your system and your run submission.
To start building some knowledge and get to know your system, benchmark it.

On most CPUs the linear scaling (for a large enough mesh) disappears near 16 to 24 cores, depending on the platform.
For 3D V-Cache CPUs it continues to scale almost linearly past 32 cores. I don’t know the exact limit (which I expect to be between 48 and 64 cores), but you can test it and tell us.
After that you will still get a gain, but less and less.

Considering the mesh size, I think 1 million cells is too low to benchmark anything professional; below that, I’m sure a consumer i7/i9 with 8 cores or so on dual-channel memory would compete just fine.
For commercial CFD software I think the limit is around 50k cells per core (it might depend on the problem), so 192 cores call for at least a 10-million-cell mesh to exploit them all. With scaling and bandwidth constraints, probably even more.
For multiphase it is probably not as easy to predict.
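
To illustrate that arithmetic (a quick sketch; the 50k cells-per-core constant is only a rule of thumb and depends on the solver and the problem):

```python
CELLS_PER_CORE = 50_000  # rough rule of thumb for commercial CFD solvers

def max_useful_cores(n_cells: int) -> int:
    """Past this core count, communication overhead likely dominates."""
    return max(1, n_cells // CELLS_PER_CORE)

def min_mesh_for(n_cores: int) -> int:
    """Smallest mesh that keeps every core usefully loaded."""
    return n_cores * CELLS_PER_CORE

print(max_useful_cores(600_000))  # 12 cores for the 600k-cell test case
print(min_mesh_for(192))          # 9,600,000 cells to feed all 192 cores
```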

Do you write a lot of data?
Because I’m not sure a RAID 0 will help you much there. You can time your simulation to see if it slows down due to too much writing.

Thank you very much for your contribution to the bench; it is close to the best score.
One thing that would be interesting: run it on only 16 cores, and/or on a few cores split across the two CPUs to get maximum bandwidth.


Hey! Thanks for your reply!

I tested the NPS settings in BIOS; by default it was set to NPS-Auto, which I think basically used NPS1.

I ran some quick benchmarking and NPS4 gave me the best results.

I tested with Ansys Fluent at 96 cores and got the following values (total wall clock time and total sim time):

| NPS setting | Wall clock time (s) | Sim time (s) |
| --- | --- | --- |
| Auto | 33.368 | 165.890 |
| NPS0 | 36.589 | 179.665 |
| NPS1 | 33.285 | 165.653 |
| NPS2 | 32.855 | 164.280 |
| NPS4 | 32.748 | 163.735 |

My motherboard sadly does not provide an NPS3 option.

There’s still the memory interleaving option that I want to test to see what impact it might have. I read somewhere here on these forums (it might be your benchmarking thread) that it can be beneficial.


Yeah I’m happy to have a working system, now just need to optimize it as much as I can.

RHEL 8.8 is the latest Linux supported by Ansys Fluent 2024R1 and is what the vendor suggested to maximize use of the hardware. Their explanation was that Win11 Pro for Workstations CAN work, but it will only ever be able to use about 80% of the hardware’s capability; to use the hardware to the fullest, Linux is a must. I definitely prefer Windows to Linux, since that’s what I’ve been using for a long time now.

I did some core testing last night on my small benchmarking test and got the following graph.

It’s interesting to see how calc times really level out around 48 cores. Total wall clock time continues to decrease as the core count goes up, but actual sim time stops decreasing somewhere between 96 and 128 cores. Not only that, it actually increases after 128 cores.

Like you’ve mentioned, this is only relevant to my small mesh test case and performance can look different on denser meshes.
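
To put numbers on that leveling-off, I compute speedup and parallel efficiency against my lowest core count (a quick sketch; the timings below are placeholders for illustration, not my measured data):

```python
# (cores, sim time in s) pairs -- placeholder values, not my real data.
timings = [(16, 600.0), (32, 320.0), (48, 230.0),
           (96, 150.0), (128, 145.0), (192, 155.0)]

base_cores, base_time = timings[0]
for cores, t in timings:
    speedup = base_time / t        # vs the 16-core baseline
    ideal = cores / base_cores     # perfect linear scaling
    efficiency = speedup / ideal
    print(f"{cores:>3} cores: speedup {speedup:5.2f}x "
          f"(ideal {ideal:5.2f}x), efficiency {efficiency:6.1%}")
```

Where efficiency drops well below 100% is where the extra cores stop paying for themselves.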

As far as IO is concerned, I typically export an animation and write some tracking data per timestep. I tested my simple case file with and without output, and it made basically zero difference to the final sim time.

For my machine and Fluent, it looks like memory interleaving ON is more beneficial, with total sim time of 169.158s vs 190.703s.


I think this particular CFD benchmark is too small to take advantage of the full parallelism that these high-core-count EPYCs bring. The CFD-only “10GB” benchmark is “only” ~1 million degrees of freedom; a more analogous measure to compare to Ansys’s cells metric is the mesh element count, which is 482,705 elements.

Usually these are kind of program-specific, but here’s one for COMSOL which has some overlap with Ansys:

As you can tell from the article, a lot of the optimization is just in changing the way you set up and solve the problem; some of the suggestions won’t be applicable to Ansys though, like changing element order (an FVM vs FEM thing).

This is consistent with what a lot of the new Threadripper users are noticing.

This is a weird one; Wendell mentioned he could get NPS=3 to sometimes assert, and that it was either the fastest or among the fastest schemes.

A few other notes:


Memory Interleaving enabled performed better than when it was off (169.158s vs 190.703s)

AVX512 enabled was slightly worse than disabled (163.77s vs 163.37s). I don’t think it matters since this isn’t an Intel system anyway.

Private memory region on auto performed best, compared to distributed or consolidated (163.37s vs 164.108s vs 163.724s).

DDR Power Down performed worse when disabled than on auto (163.37s vs 163.673s)

I looked for CMD2T to try to set it to 1, but I couldn’t find the option in BIOS.


northstrider
AVX512 enabled was slightly worse than disabled (163.77s vs 163.37s). I don’t think it matters since this isn’t an Intel system anyway.

It’s more related to library and program compatibility; you have to do work to use these instructions.
Maybe they don’t support AVX-512 on AMD yet, but ask them to be sure.
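
On Linux you can at least confirm the CPU exposes the instructions (a quick sketch reading /proc/cpuinfo; whether Fluent’s binaries were built to use them is a separate question for Ansys):

```python
# List the AVX-512 feature flags the first CPU reports (Linux only).
# Zen 4 EPYCs should expose avx512f and friends; whether the solver
# binaries were compiled to use them is up to the vendor.
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            break
print(sorted(flag for flag in flags if flag.startswith("avx512")))
```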

northstrider
RHEL 8.8 is the latest Linux supported by Ansys Fluent 2024R1 and is what the vendor suggested to maximize use of the hardware. […]

That is true; as I said, Linux is best to maximise your hardware. My question was about which Linux distro to choose.
RHEL might have the latest kernel or be better supported by Ansys.
Was everything supported out of the box, or did you have trouble installing some drivers?

Sorry for the late reply, just saw your post.

On RHEL, I had to manually install all the prereq libraries from the Ansys help page and after that Fluent installed without any issues.

Fluent’s ridiculous licensing costs are making me second-guess my commitment to them and look for a different CFD package instead. I currently do mostly mixed fluid flow, so maybe COMSOL Multiphysics with a CFD add-on is the way to go.

Don’t worry, it is fine.

I’m in the Siemens boat, and I’m not sure costs are really better for CFD applications.
I don’t know the current state, but for pump/compressor applications Ansys CFX has a strong reputation. I’m curious about the state of the art now, but I don’t think COMSOL is the way to go for this particular topic (at the same time, I’m not really informed on what COMSOL can do in CFD).

It really comes down to your cash flow. If you’re backed by a company, then yeah, Ansys is the way to go, but as a small business, Ansys prices are going to bankrupt me.