A workstation for scientific computing with support for 512 GiB of memory (Threadripper Pro 3995WX?)

We are considering building a workstation for scientific computing (general data science, solving PDEs, etc.). Since the workstation is intended only for CPU-intensive computations, we do not need the best GPU. Instead, we would like to get (within our budget) the most powerful computing workstation possible that supports at least 512 GiB of RAM. Due to the memory requirement, building a system around the Threadripper 3990X is out of the question. Instead, we are thinking of buying a Threadripper Pro 3995WX.

The question is, are you aware of sWRX8 motherboards that have 16 DIMM slots? If not, which motherboard and memory combination would you suggest?
If all current sWRX8 motherboards have only 8 DIMM slots, that of course implies purchasing 8x64 GiB RAM modules, which are very expensive. In that case, we might purchase half of the RAM modules now and half of them later, so if the memory exceeds the budget, it won’t be a big deal. Do you have hints, useful experience, or warnings regarding 64 GiB RAM modules?
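
The module-size arithmetic behind this question can be written out as a small sketch. The function name `module_size_gib` is hypothetical, and the sketch assumes identical modules in every slot.

```python
def module_size_gib(total_gib: int, dimm_slots: int) -> float:
    """Capacity each module needs so that `dimm_slots` identical modules reach `total_gib`."""
    if dimm_slots <= 0:
        raise ValueError("dimm_slots must be positive")
    return total_gib / dimm_slots


# Compare an 8-slot board with a 16-slot board for a 512 GiB target.
for slots in (8, 16):
    print(f"{slots} slots -> {module_size_gib(512, slots):.0f} GiB per module")
```

With 16 slots the same capacity is reachable with 32 GiB modules, which is why the slot count matters for the memory budget.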

Of course, there is also the option of considering EPYC CPUs instead of Threadripper Pro (EPYC has a better selection of motherboards than TR Pro, particularly because the boards usually have more than 8 DIMM slots). However, EPYC CPUs and motherboards are so expensive that, although 512 GiB of RAM for an EPYC setup could be more affordable than the same amount for a TR Pro setup, the overall system would likely cost more in the EPYC case. Not to mention that the TR Pro 3995WX is perhaps the most powerful CPU for scientific computing at the moment. Do you have opinions?

Budget: $11,500 USD (10,000€)
Location: Europe
Parts to purchase: a CPU, a motherboard, RAM, a power supply, a decent GPU, SSD (with minimum capacity of 1TB), a case, and fan coolers.
Overclocking: No
Custom water cooling: No
Operating System: Linux


There is no simple answer to your question.
I’ve worked on PDEs for quite a while. Do you use a library such as MKL, BLIS, or another BLAS implementation? If so, an Intel CPU might be a better option, since MKL tends to be better optimized for Intel. Will your applications use AVX-512? I’ve seen many cases where AVX-512 brought an easy 2x speedup over AVX2.

PDEs usually don’t require much inter-thread communication; we ran them with MPI on clusters of quad-core CPUs without a huge performance hit.
Therefore a dual-socket or even quad-socket system might be a good option, since you get much higher memory bandwidth (theoretically 12 or 24 channels), which can drastically improve performance on some tasks.
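
As a rough illustration of why channel count matters, theoretical peak bandwidth can be estimated as channels x transfer rate x 8 bytes per transfer (a DDR4 channel is 64 bits wide). In the sketch below the function name and the DDR4-2666 and DDR4-3200 speeds are assumptions chosen for illustration, not figures from this thread, and sustained bandwidth in practice is lower than these peaks.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal) for a given channel count."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Assumed configurations (speeds are illustrative, not taken from the thread).
configs = {
    "1 socket, 8 channels, DDR4-3200": (8, 3200),
    "2 sockets, 12 channels, DDR4-2666": (12, 2666),
    "4 sockets, 24 channels, DDR4-2666": (24, 2666),
}

for name, (channels, speed) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(channels, speed):.1f} GB/s")
```

Note that on a multi-socket system this total is only reachable when each thread mostly touches memory attached to its own socket, which is the NUMA caveat raised below.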

My recommendation:
If you don’t use MKL and NUMA might be a disadvantage: go with AMD Threadripper Pro.

If NUMA isn’t a problem:
Go with a 4-socket Intel LGA 3647 system.
I would recommend:
24 x 32 GB RAM = 768 GB | ~3600€
4x Intel Xeon Gold 6210U = 80 cores, 160 threads | 1,100€ each => 4500€. That’s less than one 3995WX and you get more cores.
That’s around 8100€ so far. I think there are good quad-socket boards available at around 1000€.

The power supply (it’s 4x 150 W CPUs), SSD, etc. should be obtainable for about 1000€.

Just a note, I think the Xeon 6210U is single-socket only.

But to your question, I think the first thing to figure out is where your particular application falls in terms of being compute bound versus memory bound. For example, finite volume methods (commonly used in CFD) are usually memory-bound - they require few floating point operations per load or store to/from memory, so the overall memory bandwidth is more important than core count and core speed. By contrast, an application that does a lot of multiplying large matrices, or uses a lot of trig functions might be more compute-bound, i.e. lots of floating point operations per load or store.
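
One common way to reason about this distinction is the roofline model: attainable performance is the smaller of the peak compute rate and the arithmetic intensity (FLOPs per byte moved) times the memory bandwidth. The sketch below uses made-up machine numbers (2000 GFLOP/s and 200 GB/s) and made-up kernel intensities; none of them are measurements of the CPUs or methods discussed here, and the function names are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float,
                    flops_per_byte: float) -> bool:
    """True when the memory roof lies below the compute roof for this kernel."""
    return flops_per_byte * bandwidth_gbs < peak_gflops


PEAK_GFLOPS = 2000.0    # assumed peak FP64 rate of a hypothetical machine
BANDWIDTH_GBS = 200.0   # assumed memory bandwidth of the same machine

# Assumed arithmetic intensities, for illustration only.
kernels = [("low-intensity stencil-like kernel", 0.25),
           ("large dense matrix multiply", 50.0)]

for name, intensity in kernels:
    perf = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    bound = "memory" if is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, intensity) else "compute"
    print(f"{name}: {perf:.0f} GFLOP/s attainable, {bound}-bound")
```

In this toy setting the low-intensity kernel gains nothing from more cores but scales directly with bandwidth, which is the point made above.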

I use a workstation with 2x EPYC 7302. They’re 16 cores each, but each has 8 memory channels. It was way, way cheaper than a top-end Threadripper Pro would be, and has twice the memory bandwidth. For my use cases, that tends to matter more.

Edit: after re-reading your initial post it sounds like you do many different kinds of work, and it might be hard to characterize whether your needs are more on the compute or memory bandwidth side. Are you writing your own software for this, or using existing software?


That’s correct, the Xeon 6210U is single-socket only. I think Intel changed their naming scheme for those Xeons again. The German Intel website literally claims that the key feature of Xeon Gold CPUs (over Xeon Silver and Bronze) is their ability to scale up to 4 sockets. What a mess…


What’s the canonical reference to determine if a specific Xeon CPU supports more than one socket? Intel ARK doesn’t even mention it for the 6210U?

Why isn’t EPYC considered? You can purchase the same 64 cores in an EPYC 7742 CPU for less than $4000 and pair it with an ASRock Rack single-socket ROMED16QM3 board.

Or go with the ASRock Rack dual-socket ROME2D16-2T and put in dual EPYC 7502 CPUs for around $5200.

Both setups would net you 16 DIMM slots for populating with cheap RAM.

EPYC Rome CPUs are cheap now since everyone wants the newest Milan CPUs.

I read your text once again and would like to adjust my recommendations a little bit.
First of all: Who writes the software that will run on this workstation/server?

If you plan to write your own software then I strongly recommend against AMD.
You can reduce this to a few questions you should answer:

  1. Will you use MKL?
    Yes? => Choose an Intel system (more on that later)
    No?
    => Will you use vectorized FP64 compute?
    Yes? => Choose Intel; AVX-512 will bring lots of performance
    No?
    => Think about something that isn’t x86.

Intel has two or rather three main selling points:
AVX-512, MKL, and cheap(ish) quad-socket systems (which give you more memory and more memory bandwidth).
If you don’t benefit from any of these, Xeon would be a bad choice.

But neither AMD EPYC nor Threadripper has any strong selling points except being x86.
ARM is a strong competitor. The Ampere Altra lineup in particular is very compelling, with its superior performance in non-vectorized workloads.
One weak point of Altra is the lack of AVX, or rather SVE in ARM terms: vectorization only goes up to 128 bits.
The second weak point is the rather small cache, which might decrease performance a bit. It mostly depends on your workload. Prices are great; you could get 128 cores instead of just 64 cores on EPYC/Threadripper Pro.
PowerPC with POWER10 might be a good option too. But as far as I know POWER10 is hardly available to normal customers.

As already mentioned, the Xeon Gold 6210U is single-socket only.
I would replace it with the Xeon Gold 5218 then. Similar price but 4 fewer cores. Total performance should be similar to one 64-core EPYC/Threadripper Pro, but you get about 3x the memory bandwidth. In my experience that’s far more important than raw compute performance in these types of applications. We even had a 4-socket system with quad-core CPUs. Having 16 very fast cores with lots of memory bandwidth can be better than a single 64-core system when working on huge data sets.

The question is, are you aware of sWRX8 motherboards that have 16 DIMMs?

I’ve heard many leaks claiming that AMD locked the WRX80 platform to one DIMM per channel to limit the memory capacity.
As far as I know AMD never officially specified that, but there aren’t any boards with 16 DIMM slots available.
I’m pretty confident that this isn’t by accident, but I can’t tell you for sure.

It is listed on Intel ARK. On ARK you will find the section “Expansion Options”, which contains the “Scalability” entry.
1S means 1 socket, 2S means 2 sockets, etc.
On some models Intel lists the scalability as S8S, which means 8 sockets. I don’t know why they put an additional ‘S’ in there.
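
If you collect these scalability strings for several candidate CPUs, a tiny helper can turn them into socket counts. This is a sketch based only on the formats described in this post (1S, 2S, S8S); the function name `max_sockets` is hypothetical, and any other format that may appear on ARK is rejected rather than guessed at.

```python
import re


def max_sockets(scalability: str) -> int:
    """Parse a scalability string such as '1S', '2S' or 'S8S' into a socket count."""
    # Optional leading 'S', then the socket count, then a trailing 'S'.
    match = re.fullmatch(r"S?(\d+)S", scalability.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized scalability string: {scalability!r}")
    return int(match.group(1))


for value in ("1S", "2S", "S8S"):
    print(f"{value} -> {max_sockets(value)} socket(s)")
```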

I can more than match and in fact exceed the all-core clocks of any Intel Xeon processor with my AMD Epyc hosts.

And the FP64 math performance of my AMD systems simply trounces the Intel Xeon math performance in my experience. I gave up on my Xeon Broadwell-E 1660 V4, overclocked to 4.0 GHz all-core, which couldn’t even reach 50% of the performance of my TR 2920X, my EPYC 7402P, or my EPYC 7443P.

As long as you don’t absolutely need AVX-512 application support, I get more bang for the buck in my applications with AMD.

Note: the Xeon 5218 only has 1 AVX512 FMA unit per core, so the floating point performance might not be significantly better using AVX512 versus just using AVX2 (not sure though).
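
The reasoning behind this note can be sketched with per-cycle peak arithmetic: each FMA unit processes one vector of 64-bit lanes per cycle, and each fused multiply-add counts as 2 FLOPs. The assumption that AVX2 code can use two 256-bit FMA units per core is mine, not from the thread, and clock-speed differences between AVX2 and AVX-512 code are ignored, so treat the numbers as an upper-bound comparison only.

```python
def fp64_flops_per_cycle(vector_bits: int, fma_units: int) -> int:
    """Peak FP64 FLOPs per core per cycle for a given vector width and FMA unit count."""
    lanes = vector_bits // 64        # number of 64-bit values per vector
    return lanes * fma_units * 2     # one FMA = one multiply + one add


# Assumed configurations for illustration.
cases = {
    "AVX2, 2 FMA units (assumed)": (256, 2),
    "AVX-512, 1 FMA unit": (512, 1),
    "AVX-512, 2 FMA units": (512, 2),
}

for name, (bits, units) in cases.items():
    print(f"{name}: {fp64_flops_per_cycle(bits, units)} FLOPs/cycle/core")
```

Under these assumptions a single AVX-512 FMA unit has the same per-cycle peak as two AVX2 units, which is consistent with the doubt expressed above.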

To summarize the main points from the discussion so far:

  • Make sure to consider the total memory bandwidth of the system, try to get 8 channels or more.
  • If NUMA is a problem, look at EPYC or TR Pro parts, but try to get a sense of how compute/memory bound your workload is. Unless it’s very compute bound, you probably won’t have the bandwidth to feed 64 cores; 32 might give similar performance at substantial cost savings.
  • If NUMA isn’t a problem, consider a multi-socket solution of individually cheaper CPUs. Intel if you want MKL and/or AVX-512, otherwise EPYC (consider Rome) or perhaps even non-x86.

Considering how insistent you are about having 512 GB of RAM, a standard workstation platform isn’t going to do the job. It’ll be some multi-CPU style mainboard [SSI CEB / SSI EEB / server]. After that, it’ll come down to the CPU ratings [__C/2xT at __GHz base x _CPUs] and the instruction sets that your workloads can make use of, if not require [the obvious one being AVX-512, which would lock you into Intel Xeons].
