Threadripper 3990x Launch | Level One Techs

wendell · February 7, 2020, 2:23pm

The 3990X Defies Benchmarking. But I tried anyway!

This is a companion discussion topic for the original entry at https://level1techs.com/article/threadripper-3990x-launch

nx2l · February 7, 2020, 2:43pm

the store footage was a bit jerky/stuttery

thevillageidiot · February 7, 2020, 3:11pm

Seems like some scaling issues on that page of benchmarks

poomer · February 7, 2020, 3:35pm

My goodness those Blender results are ridiculous! What a true monster of a CPU

alpha754293 · February 7, 2020, 7:29pm

I’m somewhat surprised with the GROMACS results where going from 32 cores to 64 cores only resulted in a 31.4% increase in performance.

Hmmm…

It’s too bad that there aren’t many publicly available MPI benchmarks that people can run.

The Motorbike OpenFOAM benchmark wouldn’t be much of a challenge for a processor like this.

Shame.

I would really like to see what this can do against a single socket 64-core AMD ‘Rome’ EPYC vs dual socket 32-core EPYC server.

nx2l · February 7, 2020, 7:40pm

considering the improvements (any even) from going to 64 cores from 32 at the same TDP… IMO impressive.

wendell · February 7, 2020, 7:43pm

I believe that there are some optimizations that can be done… but remember both cpus are 280w.

For some reason my own gromacs tests aren’t super fast, but I think with clear linux I could recompile with the latest zen2 compiler optimizations and claw back more of a perf uplift. The thing here is that the test conditions were the same, just swapping cpus with everything else kept the same.

alpha754293 · February 7, 2020, 7:53pm

Do you know if parallelization in Gromacs is written with OpenMP or is it written with MPI?

And I get that this is running the two processors with the same everything, but 31% performance increase with a 100% increase in the core count is quite underwhelming.

Still, despite that, I wish that there was a comparison between AMD EPYC 7702P and dual AMD EPYC 7542.

I can only suspect that it’s a combination of effects – the Gromacs benchmark problem is now “too small” to be able to really make use of the 3990X, and/or Gromacs (especially if it is parallelized via OpenMP) has poor parallelization scalability, and/or the issues reside with the OS and hardware.
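To put rough numbers on the scaling question, here is a minimal Python sketch. It takes the roughly 1.314x speedup quoted earlier in the thread for the jump from 32 to 64 cores and computes the parallel efficiency, plus the serial fraction that Amdahl's law would need to explain that speedup on its own. The function names are made up for illustration, and using Amdahl's law at all is an assumption: it ignores clock speed differences under load, memory bandwidth, and problem size, any of which could be the real cause.

```python
def scaling_efficiency(speedup, core_ratio):
    """Fraction of the ideal speedup actually achieved."""
    return speedup / core_ratio


def amdahl_serial_fraction(speedup, n_small, n_large):
    """Serial fraction s implied by Amdahl's law, T(n) = s + (1 - s) / n,
    given the observed speedup T(n_small) / T(n_large)."""
    a, b, k = n_small, n_large, speedup
    return (k / b - 1 / a) / ((1 - 1 / a) - k * (1 - 1 / b))


observed_speedup = 1.314  # 31.4% faster, as quoted in the thread
efficiency = scaling_efficiency(observed_speedup, 64 / 32)
serial_fraction = amdahl_serial_fraction(observed_speedup, 32, 64)

print(f"scaling efficiency: {efficiency:.1%}")
print(f"implied serial fraction (Amdahl only): {serial_fraction:.1%}")
```

If Amdahl's law were the whole story, a serial fraction of only a few percent would already be enough to cap the gain at this level, which is why a "too small" benchmark problem is a plausible suspect.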

Based on that result, and given that my workload is one that runs on clusters, this would suggest that it would be better for me to get two 3970X systems than one giant 3990X system. It would be better performance-per-dollar.

wendell · February 7, 2020, 8:00pm

31% more performance for the same power is underwhelming? Best case scenario I saw was ~60% but that was for things that tended to live more on the same cpu. When I overclocked it, I could get the cpu to use more like 500w+ (instead of 280w) and got much better scaling as long as I could keep it under 85c

alpha754293 · February 7, 2020, 8:56pm

Well…I don’t have, nor do I know, the at-load power consumption levels for the same test (or for any given set of tests), so it’s hard to say whether it’s really “the same power”. Again, there’s a lot of discussion and confusion about what “TDP” actually means vs. what people think it means.

(To me, thermal design power, or TDP, should be the amount of power (in Watts) that is dissipated by the processing unit (whether it is CPU or GPU) in the form of heat. One way to do that is to assume that the processor is 0% efficient, which means that all of the power that you are supplying to it just comes back out as heat. The other possibility is to figure out how much power is being put into the processor, how much of that power is used to do work (where FPUs typically and historically consume more power than the ALUs), and how efficient it is at performing those computational tasks/operations, where the inefficiencies are dissipated back to ambient in the form of thermal power, hence the term. Some would argue that by assuming that the processing unit has an efficiency of 0%, so that all input power comes back out as heat, you will need to design and engineer a thermal management solution that is capable of dealing with that thermal power (heat transfer of heat exchangers).)

The point is that 280 W is the amount of “thermal dissipated power” that is the CPU’s budget. Whether the two CPUs actually consume the same amount of power, given their “thermal dissipated power” “budget”, is something I don’t have the data to analyze in any reasonable way.

But the part that is missing in that statement is the cost efficiency as well.

If your task cannot be parallelized across multiple systems (e.g. clusters), then the single task is going to be the best that it can be, and you would have to spend $3990 on it to get the best that you can get for that single task (e.g. programs that are parallelized via OpenMP/threads only).

However, if your task CAN be parallelized across multiple systems (e.g. clusters, using MPI), and using the Gromacs score to illustrate the point, for the same amount of money, I can get two times 3.377 (=6.754) vs. 4.438. (with 1x64 core system)

The total power that you consume will increase, but a 6.754 Gromacs score is also 52.19% faster than the 1x64-core Gromacs score of 4.438. The single system also has 64 cores in total, but consumes less power than two separate systems.
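The arithmetic behind those figures can be checked with a short Python sketch. It uses the two Gromacs scores quoted in the post and assumes the workload scales perfectly across two separate machines, which is the optimistic assumption made in the post rather than a measured result. The variable names are illustrative only.

```python
SCORE_3970X = 3.377  # Gromacs score, one 32-core system (from the post)
SCORE_3990X = 4.438  # Gromacs score, one 64-core system (from the post)

# Assumption: two 3970X systems deliver exactly twice the single-system score.
combined_2x32 = 2 * SCORE_3970X

# Relative gains, expressed as fractions.
gain_64_over_32 = SCORE_3990X / SCORE_3970X - 1
gain_2x32_over_64 = combined_2x32 / SCORE_3990X - 1

print(f"2 x 3970X combined score: {combined_2x32:.3f}")
print(f"3990X vs one 3970X: +{gain_64_over_32:.2%}")
print(f"2 x 3970X vs 3990X: +{gain_2x32_over_64:.2%}")
```

Any network overhead between the two machines would shrink the last figure, so it is best read as an upper bound.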

So…now the question is: what’s more important? Twice (100% increase) the performance at “twice” the power consumption at approximately the same CPU cost, or a 31% increase in performance, without the 100% increase in power consumption, at the same total CPU cost?

Therefore, as mentioned, for the type of workloads that I run, these results might actually point me towards having two Threadripper 3970X systems, because I would place a greater emphasis on getting the 6.754 Gromacs score (at the cost of double the power consumption, with approximately the same total CPU cost) vs. the 31% increase in performance, at potentially half the power consumption cost, but at about the same total CPU cost.

If, with two 3970X systems, I can get 6.754 in total, combined Gromacs score, over the single 64-core system’s 4.438, 2x32 core to 1x64 core, both being 64 cores in total, I would get a 52.19% advantage for about the same total CPU cost, but would pay double in power consumption costs.

My hypothesis is that even without the power consumption data, and doing the power consumption cost on the assumption of $0.15/kWh, running the system 24/7/365, and let’s say that I “earn” $1 for every Gromacs point, I will still come out further ahead with two 3970X systems than I would with the single 3990X system.
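One hedged way to test that hypothesis without real power data is to compute a break-even value. The Python sketch below takes the $0.15/kWh rate and 24/7/365 duty cycle from the post, and assumes each system draws a constant 280 W (the CPU TDP used as a stand-in, not a measurement; whole-system draw would be higher). It then asks how much each extra Gromacs point would have to be worth per year to pay for the second system's electricity. The function and variable names are illustrative.

```python
PRICE_PER_KWH = 0.15      # USD, from the post
HOURS_PER_YEAR = 24 * 365  # running 24/7/365, from the post
ASSUMED_WATTS = 280.0      # ASSUMPTION: per-system draw equals the CPU TDP


def yearly_power_cost(watts, systems=1):
    """Electricity cost in USD for running `systems` machines all year."""
    return systems * (watts / 1000.0) * HOURS_PER_YEAR * PRICE_PER_KWH


extra_power_cost = (yearly_power_cost(ASSUMED_WATTS, systems=2)
                    - yearly_power_cost(ASSUMED_WATTS, systems=1))
extra_points = 2 * 3.377 - 4.438  # two 3970X vs one 3990X, scores from the post

# Value each extra point must earn per year to cover the extra electricity.
break_even_per_point = extra_power_cost / extra_points

print(f"extra electricity per year: ${extra_power_cost:.2f}")
print(f"break-even value per extra point per year: ${break_even_per_point:.2f}")
```

Whether two systems come out ahead then depends on how the "$1 per point" is counted: if a point is worth more than the break-even figure over a year, the two-system setup wins under these assumptions.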

TCO and SWaP.

(I was originally looking at this in order to replace my existing four half-width, dual socket Xeon blade server system (which currently has 64 cores already), but again, either the Gromacs benchmark doesn’t scale well (which is entirely possible), or the hardware doesn’t scale well, or both. Taken at face value, it would suggest that two 3970X systems would get me a higher total score for approximately the same amount of money spent on CPUs alone, but at the expense of a higher electricity bill.)

jay_c · February 7, 2020, 9:46pm

such an interesting launch but it sucks we can’t get more than 256GB of ram otherwise it would be so ideal for some of the builds we need at work

Ulfric · February 8, 2020, 5:44am

Hey @wendell I have a question to ask. Regarding the absence of RDIMM/LRDIMM support on the Threadripper platform, is it dependent on the memory controller or is it motherboard vendor related? Can AMD tweak it in the future with a firmware update?

Bagaget · February 8, 2020, 10:33pm

If you can use a cluster you could also go with 8x3800x…

alpha754293 · February 14, 2020, 8:56pm

Yeah…I thought about that as well, but that comes at the price of bandwidth.

Intrasocket and/or intra-die and/or intersocket is usually going to be faster than my 100 Gbps Infiniband, and on some of the newer applications that I use, there is some potential to actually run the software in both SMP mode and MPP mode (hybrid), where it uses OpenMP within a system, but then uses MPI between systems.

The advantage with that is that, especially with really high core count systems, rather than having thousands, if not millions, of pairwise combinations of cores (loosely speaking; MPI doesn’t exactly work this way, but this is a grossly simplified, not 100% correct explanation), between “pizza boxes” it’s MPI (e.g. a pair of links, as I understand it), and inside the “pizza box” it’s OpenMP (because intrasocket/intra-die/intersocket communication is much faster than internode).
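The pairwise-combination point can be illustrated with a small Python sketch. It only counts how many distinct pairs exist among communicating endpoints, comparing one endpoint per core against one endpoint per "pizza box". As the post itself says, real message passing does not work this way, so this is the same grossly simplified picture, and the node and core counts below are made-up examples.

```python
from math import comb


def pair_count(endpoints):
    """Number of distinct pairs among `endpoints` communicating parties."""
    return comb(endpoints, 2)


def compare(nodes, cores_per_node):
    """Pairs with one endpoint per core vs. one endpoint per node."""
    per_core = pair_count(nodes * cores_per_node)
    per_node = pair_count(nodes)
    return per_core, per_node


# Hypothetical layouts: 2 x 32 cores, 8 x 8 cores, and 32 x 64 cores.
for nodes, cores in [(2, 32), (8, 8), (32, 64)]:
    per_core, per_node = compare(nodes, cores)
    print(f"{nodes} nodes x {cores} cores: "
          f"{per_core} pairs per-core vs {per_node} pairs per-node")
```

With two 32-core boxes the per-core count is already in the thousands while the per-node count is a single pair, and at 2048 cores the per-core count passes two million.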

The other consideration is the total power requirement. The CPU prices might be cheaper, but by the time you’ve built eight computers, the total price might not be vastly different than if you built a bigger system with more expensive CPUs to begin with.