I have a problem with an enterprise build, so I figured I’d ask the experts. We’re building a workstation for solving computational fluid dynamics (CFD) engineering simulations.
Our relevant hardware config:
Supermicro H11SSL-i motherboard
512 GB registered DDR4 2666
AMD Epyc Rome 7702p processor
Seasonic 1000W Gold PS
The issue we have is that, for our application (Ansys CFX), we see excellent scaling up to 16 cores. Beyond 16 cores, performance actually drops off (e.g. 56 cores is worse than 28, which is worse than 16).
We’ve ruled out application-specific issues; in fact, we’re running the exact LeMans benchmark that you see results for halfway down the page. (Edit: it says I can’t include links in the post, so Google Image ‘amd epyc rome ansys cfx benchmark’ without the quotes and click the first image.)
We’ve tried 3 different fresh OS installs (Windows Server 2019, Windows Server 2016, and CentOS Linux 7.x). Same behavior.
We swapped the motherboard to an AsRock Rack EPYCD8-2T. Same behavior.
We tried a different 1kw power supply. Same behavior.
Tried tuning the BIOS to the settings shown in the link above. Same behavior.
Tried many other separate BIOS tweaks. Same behavior.
We’re in the process of RMAing the CPU now. I’ll keep you updated on that.
Question for the group: Are you all aware of any BIOS settings that might cause poor application scaling performance beyond 16 cores on a 4 NUMA node CPU such as this? Or any other thoughts you might have?
Maybe the CPU is defective…but it just feels like there’s something else that needs to be tweaked. It’s almost like the system is power-throttling at higher core counts…but it’s not lost on me that the problem crops up at 17 cores (16+1) and that there are 16 cores per NUMA node.
It just feels like we’re going to get the new CPU back on RMA and have the same issues. But I hope not.
Any thoughts/help would be GREATLY appreciated.
You might experiment with the BIOS setting for the “apparent” number of NUMA nodes - the CPU can present itself as having 1, 2, or 4 nodes.
I invite your attention to the AMD white paper “Workload Tuning Guide for AMD EPYC 7002 Series Processor Based Servers” , which discusses that & other BIOS settings that affect performance:
(You may want to Google for the latest revision; it is updated frequently.)
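After changing the nodes-per-socket setting, it is worth confirming what the OS actually sees. A minimal sketch for the Linux install, assuming the standard sysfs layout under /sys/devices/system/node (the function names are mine, purely for illustration):

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8,10-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_layout(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} as reported by the kernel, or {} if unavailable."""
    layout = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            layout[node_id] = parse_cpulist(f.read())
    return layout


if __name__ == "__main__":
    for node, cpus in sorted(numa_layout().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

With SMT off and the BIOS presenting 4 nodes, I would expect a 64-core part to show four nodes of 16 CPUs each; `numactl --hardware` or `lscpu` should report the same information if you prefer a ready-made tool.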
FYI, the “first” Google image for my query per your suggestion was quite irrelevant, and I couldn’t identify the image you were referring to. I imagine the order frequently changes.
We’ll be interested in anything you learn.
P.S. The scaling behavior for an “embarrassingly parallel” benchmark or program may shed light on possible CPU failure.
Thanks for your reply. I will check out your link and your suggestion on apparent number of NUMA nodes (once we get the CPU back from RMA!)
If you are still interested, you can google ‘amd epyc rome ansys cfx benchmark dell’ and click the resulting link.
FYI - most CFD software is actually ‘embarrassingly parallel’. It’s actually one software application that can make use of massive multi-CPU configurations. Currently the software vendors test parallelization out to thousands of CPUs (for a single problem case).
I appreciate your reply!
Found the Dell paper & performance data; thanks!
Even though CFD is highly parallel, the need to transfer data between adjacent mesh elements may be a factor in the scaling issues you are seeing. Running a rendering benchmark (Cinebench or …) or some code with no communication between elements could be revealing. If it also shows scaling issues, something like the power-throttling you mentioned (or overheating? or a CPU failure?) might be the problem. If the rendering scales really well up to 64/128 cores/threads, then things related to NUMA nodes or memory layout or … seem more likely.
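If you don't have a rendering benchmark handy, here is a rough sketch of the "no communication between elements" test in Python. It runs N identical, independent CPU-bound tasks at once and compares the wall time with a single task. The work size and the list of worker counts are placeholders I made up, and the timing includes process start-up, so treat the numbers as indicative only:

```python
import multiprocessing as mp
import time


def burn(n_iters):
    """CPU-bound loop with no shared data and no inter-process communication."""
    acc = 0
    for i in range(n_iters):
        acc = (acc * 31 + i) % 1000003
    return acc


def time_workers(n_workers, n_iters):
    """Run n_workers identical independent tasks at once; return wall time in seconds."""
    start = time.perf_counter()
    with mp.Pool(n_workers) as pool:
        pool.map(burn, [n_iters] * n_workers)
    return time.perf_counter() - start


def weak_scaling_efficiency(t_one, t_n):
    """1.0 means n workers finished in the same time as one worker (ideal)."""
    return t_one / t_n


if __name__ == "__main__":
    iters = 20_000_000  # hypothetical work size; adjust so one task takes several seconds
    t1 = time_workers(1, iters)
    for n in (1, 8, 16, 17, 28, 32, 56, 64):
        tn = time_workers(n, iters)
        eff = weak_scaling_efficiency(t1, tn)
        print(f"{n:3d} workers: {tn:6.2f} s  efficiency {eff:.2f}")
```

This loop barely touches memory, so if efficiency stays close to 1.0 all the way to 64 workers, throttling or a bad CPU looks less likely and memory/NUMA placement more likely. A sharp drop past 16 workers here too would point back at power, thermals, or the chip itself.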
I wish you Good Hunting.
Thanks for your thoughts. I will stop back when we get our new CPU and everything rebuilt.
And again - I really appreciate your thoughts. I think, at a minimum, we may stand to gain some performance from the NUMA node layout, going by some reading I did in the link you provided.
Thanks again! Will report back in time…
I’m an ANSYS CFX user and spent a lot of time in a multi-node Windows Server cluster trying to improve performance. I hope you are able to resolve your issues. I’m very interested to hear how CFX runs on EPYC.
While my experiences are on older Intel hardware, I’ll share them here in case they can be of use. (if you are an experienced CFX user some of these may be rudimentary)
CFX’s default solver setting is -numa none; you can set this to -numa auto to enable it. I did not see significant changes to my solver times with this setting; however, my compute nodes have 4.5 cores per RAM channel. With the 7702p having 8 cores per RAM channel and a different architecture, I wonder if this could be important.
CFX solver defaults to -affinity implicit settings. I have found that my large parallel jobs do benefit from changing this to -affinity explicit.
Simutech has an article on importance of # of RAM channels and AVX-512 for CFX (https://www.simutechgroup.com/images/easyblog_articles/11/maximizing-memory-performance-for-ansys-simulations-2018.pdf). The 8 cores/channel ratio is high and may limit scaling (although this wouldn’t account for the abrupt change in performance going from 16 to 17 cores).
CFX parallel scaling is limited based on the size of the mesh and # of partitions. I usually target ~10,000 vertices / core as my minimum…going below this and the partitioning starts to cannibalize itself. I think around 7,000 I started to see performance degradation.
Disable SMT. Since you are referring to cores and not threads I assume you’ve already done this. However when I first got started no one told me about this. ANSYS CFX (and many of its other solvers) were not intended for SMT and run significantly better without it.
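To put rough numbers on the cores-per-channel and vertices-per-core rules of thumb above, here is a small back-of-the-envelope sketch. It assumes all 8 memory channels are populated, uses theoretical peak DDR4 bandwidth (transfer rate times an 8-byte bus), and the 1,000,000-vertex mesh is a made-up example, not the size of any real benchmark case:

```python
def cores_per_channel(cores, memory_channels):
    """Number of cores sharing each memory channel."""
    return cores / memory_channels


def peak_bandwidth_per_core_gbs(cores, memory_channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth per active core, in GB/s."""
    total_gbs = memory_channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9
    return total_gbs / cores


def max_useful_cores(mesh_vertices, min_vertices_per_core=10_000):
    """Core count beyond which partitions drop under the per-core vertex target."""
    return mesh_vertices // min_vertices_per_core


if __name__ == "__main__":
    # 64 cores on 8 channels of DDR4-2666 (assumes every channel is populated)
    print(cores_per_channel(64, 8))
    for active in (16, 32, 64):
        bw = peak_bandwidth_per_core_gbs(active, 8, 2666)
        print(f"{active} cores: {bw:.2f} GB/s per core (theoretical peak)")
    # hypothetical 1,000,000-vertex mesh with my ~10,000 vertices/core target
    print(max_useful_cores(1_000_000))
```

Real sustained bandwidth will be lower than these peak figures, and none of this explains a sudden cliff at 17 cores; it only shows how the per-core share shrinks as more cores are used.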
Again, all the best diagnosing the problem.
This is an excellent suggestion. We’re in the RMA ‘paperwork’ process now, but I still have the hardware and will give this a shot.
Also an excellent suggestion. Will try
I’ll read it - again, thanks!
I think for the benchmark case I’m using (CFX’s own LeMansCar benchmark) I’ve got ~100,000 elements per CPU core (unstructured mesh), so it’s a reasonable partitioning I think.
I’ve got SMT disabled - but thanks for the reminder!
I really really appreciate your help. Let me do some further investigation and get back to you. Again - can’t emphasize how much I appreciate the help!