AMD: 3990X And The Unbenchmark | Level One Techs

wendell · February 17, 2020, 3:04pm

https://www.level1techs.com/article/threadripper-3990x-launch

This is a companion discussion topic for the original entry at https://level1techs.com/video/amd-3990x-and-unbenchmark

freqlabs · February 17, 2020, 3:34pm

I’m curious about how much of that performance gap with a custom compiled Clear Linux vs Ubuntu disappears if you give Ubuntu the same benefit of compiling a kernel with the native CPU target / optimizations.

crener · February 17, 2020, 8:11pm

If you want something thread heavy to benchmark I can send an 800 image panorama data set that took my i7-4790k about 3 days to compute and 2 days to render using Hugin…

wendell · February 17, 2020, 9:38pm

I tested this today for the upcoming video with the unreal compile… it helps, but it doesn’t totally narrow the gap. there seem to be optimizations in things like libc, etc.

Also should probably test gentoo. lololololol

Actually I know a talented researcher that unironically uses it. And it does work.

wendell · February 17, 2020, 9:39pm

github me my good man?

crener · February 17, 2020, 10:46pm

The image data is about 4GB and GitHub limits you to 2GB, sadly… I’ll compress it and upload it to some online storage and give a link in a few hours.

I just looked at the project again, and after the render it should spit out an image of 148858 x 22658, or about 3.37 gigapixels.

crener · February 18, 2020, 12:26am

I’ve deleted some metadata I generated to help the CPU reach full potential and got the size down to 5.7GB and packed in a smaller project in case you wanted to experiment with settings or something in a more responsive project. Just uploading the 7Zip to Drive and I’ll give the link when done.

There are some instructions on what to do and rough instructions on how to regenerate the metadata in the zip…
Will share the link when it’s up.

crener · February 18, 2020, 8:01am

@wendell have fun

alpha754293 · February 24, 2020, 9:43pm

Regarding the question about scalability: this is one of the HPC programs that I use. I tested it on up to four nodes of my cluster, where each node has dual 8-core processors (so anywhere from 16 to 64 cores):

16 cores:

 N o r m a l    t e r m i n a t i o n                          01/16/19 03:39:45

 Max. Memory reqd for implicit sol: max used               0
 Max. Memory reqd for implicit sol: incore                 0
 Max. Memory reqd for implicit sol: oocore                 0

 Memory required to complete solution (memory=    114M memory2=     20M)
          Minimum     15M on processor     2
          Maximum     20M on processor     8
          Average     18M

 Additional dynamically allocated memory
          Minimum     28M on processor     2
          Maximum     75M on processor     0
          Average     55M

 Total allocated memory
          Minimum     43M on processor     2
          Maximum     90M on processor     0
          Average     72M

 T i m i n g   i n f o r m a t i o n
                        CPU(seconds)   %CPU  Clock(seconds) %Clock
  ----------------------------------------------------------------
  Keyword Processing ... 1.3960E+01    0.03     1.4206E+01    0.03
    KW Reading ......... 5.0505E-01    0.00     5.4715E-01    0.00
  MPP Decomposition .... 2.5213E+01    0.06     2.5319E+01    0.06
    Init Proc .......... 2.1634E+01    0.05     2.1685E+01    0.05
    Decomposition ...... 2.0315E+00    0.00     2.0476E+00    0.00
    Translation ........ 1.5478E+00    0.00     1.5850E+00    0.00
  Initialization ....... 5.9473E+00    0.01     6.4751E+00    0.02
    Init Proc Phase 1 .. 2.7664E+00    0.01     2.9729E+00    0.01
    Init Proc Phase 2 .. 2.8175E+00    0.01     3.1058E+00    0.01
  Element processing ... 1.8861E+04   46.06     1.8884E+04   46.04
    Solids ............. 1.5378E+03    3.76     1.5399E+03    3.75
    Shells ............. 1.7304E+04   42.26     1.7323E+04   42.24
    Beams .............. 1.4345E+01    0.04     1.4601E+01    0.04
  Binary databases ..... 8.6215E+01    0.21     8.6502E+01    0.21
  ASCII database ....... 9.8933E+02    2.42     9.9322E+02    2.42
  Contact algorithm .... 9.6032E+03   23.45     9.6181E+03   23.45
    Interf. ID   2000001 2.8150E+02    0.69     2.8145E+02    0.69
    Interf. ID   2000002 1.2705E+01    0.03     1.2985E+01    0.03
    Interf. ID         3 8.8787E+03   21.68     8.8920E+03   21.68
  Rigid Bodies ......... 8.5038E+02    2.08     8.5300E+02    2.08
  Time step size ....... 2.5449E+03    6.22     2.5508E+03    6.22
  Rigid wall ........... 2.5349E+03    6.19     2.5366E+03    6.18
  Group force file ..... 5.2004E-01    0.00     1.2632E+00    0.00
  Others ............... 4.2369E+02    1.03     4.2800E+02    1.04
  Misc. 1 .............. 9.1964E+02    2.25     9.2477E+02    2.25
  Misc. 2 .............. 1.7300E+03    4.23     1.7319E+03    4.22
  Misc. 3 .............. 6.1449E+02    1.50     6.1846E+02    1.51
  Misc. 4 .............. 1.7422E+03    4.25     1.7437E+03    4.25
  ----------------------------------------------------------------
  T o t a l s            4.0946E+04  100.00     4.1016E+04  100.00

 Problem time       =    2.0000E-01
 Problem cycle      =    317461
 Total CPU time     =     40946 seconds (  11 hours 22 minutes 26 seconds)
 CPU time per zone cycle  =         67.287 nanoseconds
 Clock time per zone cycle=         67.396 nanoseconds

 Parallel execution with     16 MPP proc
 NLQ used/max               120/   120

 Start time   01/15/2019 16:16:09  
 End time     01/16/2019 03:39:45  
 Elapsed time   41016 seconds for  317461 cycles using    16 MPP procs
             (     11 hours 23 minutes 36 seconds)

 N o r m a l    t e r m i n a t i o n                          01/16/19 03:39:45

32 cores (2 nodes):

 *** termination time reached ***
  317461 t 2.0000E-01 dt 6.30E-07 write d3dump01 file          10/13/19 00:07:39
  317461 t 2.0000E-01 dt 6.30E-07 flush i/o buffers            10/13/19 00:07:39
  317461 t 2.0000E-01 dt 6.30E-07 write d3plot file            10/13/19 00:07:39

 N o r m a l    t e r m i n a t i o n                          10/13/19 00:07:39

 Max. Memory reqd for implicit sol: max used               0
 Max. Memory reqd for implicit sol: incore                 0
 Max. Memory reqd for implicit sol: oocore                 0

 Memory required to complete solution (memory=    114M memory2=     12M)
      Minimum   8818K on processor     5
      Maximum     12M on processor    17
      Average     11M

 Additional dynamically allocated memory
      Minimum     14M on processor     5
      Maximum     65M on processor     0
      Average     29M

 Total allocated memory
      Minimum     22M on processor     5
      Maximum     75M on processor     0
      Average     40M

 T i m i n g   i n f o r m a t i o n
                        CPU(seconds)   %CPU  Clock(seconds) %Clock
  ----------------------------------------------------------------
  Keyword Processing ... 1.1976E+01    0.06     1.1987E+01    0.06
    KW Reading ......... 2.4177E-01    0.00     2.4249E-01    0.00
  MPP Decomposition .... 2.4325E+01    0.12     1.9930E+01    0.10
    Init Proc .......... 2.0867E+01    0.10     2.0880E+01    0.10
    Decomposition ...... 2.2030E+00    0.01     2.2068E+00    0.01
    Translation ........ 1.2548E+00    0.01    -3.1570E+00   -0.02
  Initialization ....... 2.5591E+00    0.01     2.7199E+00    0.01
    Init Proc Phase 1 .. 1.2150E+00    0.01     1.2520E+00    0.01
    Init Proc Phase 2 .. 9.1202E-01    0.00     1.0146E+00    0.00
  Element processing ... 9.3352E+03   45.87     9.3412E+03   45.87
    Solids ............. 7.9076E+02    3.89     7.9090E+02    3.88
    Shells ............. 8.5314E+03   41.92     8.5357E+03   41.91
    Beams .............. 8.0992E+00    0.04     8.2428E+00    0.04
  Binary databases ..... 2.0879E+01    0.10     2.0910E+01    0.10
  ASCII database ....... 6.0295E+02    2.96     6.0677E+02    2.98
  Contact algorithm .... 4.5238E+03   22.23     4.5274E+03   22.23
    Interf. ID   2000001 1.3074E+02    0.64     1.3073E+02    0.64
    Interf. ID   2000002 7.0132E+00    0.03     7.1966E+00    0.04
    Interf. ID         3 4.1936E+03   20.60     4.1959E+03   20.60
  Rigid Bodies ......... 5.4605E+02    2.68     5.4655E+02    2.68
  Time step size ....... 1.3286E+03    6.53     1.3292E+03    6.53
  Rigid wall ........... 1.2944E+03    6.36     1.2943E+03    6.36
  Group force file ..... 1.4184E+00    0.01     2.2577E+00    0.01
  Others ............... 1.8893E+02    0.93     1.8927E+02    0.93
  Misc. 1 .............. 4.2716E+02    2.10     4.2869E+02    2.10
  Misc. 2 .............. 7.9770E+02    3.92     7.9745E+02    3.92
  Misc. 3 .............. 4.2313E+02    2.08     4.2444E+02    2.08
  Misc. 4 .............. 8.2341E+02    4.05     8.2292E+02    4.04
  ----------------------------------------------------------------
  T o t a l s            2.0353E+04  100.00     2.0366E+04  100.00

 Problem time       =    2.0000E-01
 Problem cycle      =    317461
 Total CPU time     =     20353 seconds (   5 hours 39 minutes 13 seconds)
 CPU time per zone cycle  =         33.418 nanoseconds
 Clock time per zone cycle=         33.438 nanoseconds

 Parallel execution with     32 MPP proc
 NLQ used/max               120/   120

 Start time   10/12/2019 18:28:09  
 End time     10/13/2019 00:07:40  
 Elapsed time   20371 seconds for  317461 cycles using    32 MPP procs
         (      5 hours 39 minutes 31 seconds)

 N o r m a l    t e r m i n a t i o n                          10/13/19 00:07:40

64 cores (4 nodes):

 *** termination time reached ***
  317461 t 2.0000E-01 dt 6.30E-07 write d3dump01 file          10/07/19 02:57:30
  317461 t 2.0000E-01 dt 6.30E-07 flush i/o buffers            10/07/19 02:57:30
  317461 t 2.0000E-01 dt 6.30E-07 write d3plot file            10/07/19 02:57:30

 N o r m a l    t e r m i n a t i o n                          10/07/19 02:57:30

 Max. Memory reqd for implicit sol: max used               0
 Max. Memory reqd for implicit sol: incore                 0
 Max. Memory reqd for implicit sol: oocore                 0

 Memory required to complete solution (memory=    114M memory2=   8308K)
          Minimum   6257K on processor    11
          Maximum   8308K on processor    14
          Average   7690K

 Additional dynamically allocated memory
          Minimum   4485K on processor    11
          Maximum     57M on processor     0
          Average     16M

 Total allocated memory
          Minimum     11M on processor    11
          Maximum     64M on processor     0
          Average     24M

 T i m i n g   i n f o r m a t i o n
                        CPU(seconds)   %CPU  Clock(seconds) %Clock
  ----------------------------------------------------------------
  Keyword Processing ... 1.2456E+01    0.11     1.2463E+01    0.11
    KW Reading ......... 1.2362E-01    0.00     1.2401E-01    0.00
  MPP Decomposition .... 2.5727E+01    0.23     2.2669E+01    0.20
    Init Proc .......... 2.1364E+01    0.19     2.1372E+01    0.19
    Decomposition ...... 2.7909E+00    0.03     2.7952E+00    0.03
    Translation ........ 1.5722E+00    0.01    -1.4982E+00   -0.01
  Initialization ....... 2.5269E+00    0.02     2.6746E+00    0.02
    Init Proc Phase 1 .. 1.1874E+00    0.01     1.2793E+00    0.01
    Init Proc Phase 2 .. 8.4771E-01    0.01     8.8940E-01    0.01
  Element processing ... 4.7038E+03   42.37     4.7072E+03   42.39
    Solids ............. 4.1726E+02    3.76     4.1708E+02    3.76
    Shells ............. 4.2772E+03   38.53     4.2793E+03   38.53
    Beams .............. 4.4827E+00    0.04     4.5382E+00    0.04
  Binary databases ..... 1.4090E+01    0.13     1.4098E+01    0.13
  ASCII database ....... 3.5925E+02    3.24     3.5968E+02    3.24
  Contact algorithm .... 2.4678E+03   22.23     2.4698E+03   22.24
    Interf. ID   2000001 4.3555E+01    0.39     4.3998E+01    0.40
    Interf. ID   2000002 2.1559E+00    0.02     2.1772E+00    0.02
    Interf. ID         3 2.3431E+03   21.11     2.3444E+03   21.11
  Rigid Bodies ......... 4.8302E+02    4.35     4.8330E+02    4.35
  Time step size ....... 9.8374E+02    8.86     9.8425E+02    8.86
  Rigid wall ........... 6.1652E+02    5.55     6.1627E+02    5.55
  Group force file ..... 1.1944E+00    0.01     1.9311E+00    0.02
  Others ............... 1.8957E+02    1.71     1.8964E+02    1.71
  Misc. 1 .............. 2.1911E+02    1.97     2.1954E+02    1.98
  Misc. 2 .............. 3.2041E+02    2.89     3.1996E+02    2.88
  Misc. 3 .............. 2.9662E+02    2.67     2.9734E+02    2.68
  Misc. 4 .............. 4.0518E+02    3.65     4.0454E+02    3.64
  ----------------------------------------------------------------
  T o t a l s            1.1101E+04  100.00     1.1106E+04  100.00

 Problem time       =    2.0000E-01
 Problem cycle      =    317461
 Total CPU time     =     11101 seconds (   3 hours  5 minutes  1 seconds)
 CPU time per zone cycle  =         18.196 nanoseconds
 Clock time per zone cycle=         18.201 nanoseconds

 Parallel execution with     64 MPP proc
 NLQ used/max               120/   120

 Start time   10/06/2019 23:52:22  
 End time     10/07/2019 02:57:31  
 Elapsed time   11109 seconds for  317461 cycles using    64 MPP procs
             (      3 hours  5 minutes  9 seconds)

 N o r m a l    t e r m i n a t i o n                          10/07/19 02:57:31

You will see that going from 16 to 32 cores actually more than halved the run time (41016 seconds down to 20371 seconds).

Going from 32 to 64 cores, it achieved about 91% of the ideal reduction (20371 seconds down to 11109 seconds), where 100% would mean perfectly halving the time it takes to run this model/solve this problem.

This is to give an example of a program and a problem that scale well going from 32 cores to 64 cores.
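
For anyone who wants to redo that arithmetic, here is a minimal sketch that computes speedup and parallel efficiency from the "Elapsed time" lines in the three logs above. The function and variable names are mine (not part of the solver), and "efficiency" here simply means observed speedup divided by the ideal speedup you would get if run time dropped in proportion to core count.

```python
# Elapsed wall-clock seconds copied from the "Elapsed time" lines in the logs above.
elapsed_seconds = {16: 41016, 32: 20371, 64: 11109}


def parallel_efficiency(cores_a, cores_b, elapsed):
    """Return (speedup, efficiency) when going from cores_a to cores_b.

    speedup    = time on cores_a / time on cores_b
    efficiency = speedup / ideal speedup, where ideal = cores_b / cores_a
    """
    speedup = elapsed[cores_a] / elapsed[cores_b]
    ideal_speedup = cores_b / cores_a
    return speedup, speedup / ideal_speedup


for a, b in [(16, 32), (32, 64), (16, 64)]:
    speedup, efficiency = parallel_efficiency(a, b, elapsed_seconds)
    print(f"{a:>2} -> {b:>2} cores: speedup {speedup:.3f}x, "
          f"efficiency {efficiency:.1%}")
```

With these numbers the 16 to 32 step comes out slightly above 100% efficiency and the 32 to 64 step comes out a little above 91%, which is where the figure quoted above comes from.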

re: GROMACS
Despite the processors having the same TDP envelope, and despite the difference in core counts and clock speeds, you can still compute what the “100%” scalability line is or should be (your E(X)).

Again, I don’t remember whether GROMACS uses OpenMP or MPI for its parallelization. I think that it’s OpenMP. For the cases whose timing results I show above, running the OpenMP version of the same solver on the same problem will typically take 30-40% longer.

But the 31% performance gain going from 32 cores to 64 cores in GROMACS would be cause for concern. It’s too bad that there just aren’t enough people out there who know how to run HPC workloads and can tell whether something scales well or poorly depending on the application and/or the workload.
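
To put that 31% next to a “100%” line, here is a rough sketch. It assumes, very simplistically, that ideal throughput scales with core count times clock speed. The function name and the `clock_ratio` parameter are mine, and the 0.8 clock ratio in the second call is a made-up placeholder, not a measured value for either processor.

```python
def efficiency_vs_ideal(observed_gain, core_ratio, clock_ratio=1.0):
    """Fraction of the ideal speedup that was actually observed.

    observed_gain: measured improvement as a fraction (0.31 means 31% faster)
    core_ratio:    new core count / old core count (2.0 for 32 -> 64 cores)
    clock_ratio:   new sustained clock / old sustained clock (assumption:
                   ideal throughput is proportional to cores * clock)
    """
    if core_ratio <= 0 or clock_ratio <= 0:
        raise ValueError("ratios must be positive")
    ideal_speedup = core_ratio * clock_ratio
    observed_speedup = 1.0 + observed_gain
    return observed_speedup / ideal_speedup


# 31% gain from doubling the cores, ignoring any clock difference.
print(f"{efficiency_vs_ideal(0.31, 2.0):.1%}")

# Same gain, with a hypothetical 20% lower sustained clock on the bigger chip.
print(f"{efficiency_vs_ideal(0.31, 2.0, clock_ratio=0.8):.1%}")
```

Even after giving the bigger chip credit for a lower clock, a 31% gain from twice the cores lands well short of the ideal line, which is the point being made above.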

(In my normal HPC world, some of our “normal” jobs run on 256-384 cores by default.)

If a particular workload, when split up into 64 pieces (partitions), ends up with partitions so small that each one can’t fully load up a CPU core for more than a few seconds, then we generally won’t break the problem up into that many pieces. We would much rather run two jobs simultaneously and let each job take a little bit longer. That way we finish two jobs with a small time penalty, instead of running only one job at a time and doing so inefficiently.
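
As a rough illustration of that trade-off, the sketch below reuses the 32-core and 64-core elapsed times from the logs above and compares finishing two jobs back to back on all 64 cores against running them side by side on 32 cores each. It assumes the two side-by-side jobs do not slow each other down (for example, because they sit on separate nodes). The helper name is mine.

```python
# Elapsed wall-clock seconds for one job, from the logs above.
one_job_on_64_cores = 11109
one_job_on_32_cores = 20371


def batch_seconds(num_jobs, total_cores, cores_per_job, seconds_per_job):
    """Time to finish num_jobs when jobs run in waves that fill total_cores.

    Assumes concurrent jobs do not interfere with each other.
    """
    concurrent_jobs = total_cores // cores_per_job
    waves = -(-num_jobs // concurrent_jobs)  # ceiling division
    return waves * seconds_per_job


back_to_back = batch_seconds(2, 64, 64, one_job_on_64_cores)
side_by_side = batch_seconds(2, 64, 32, one_job_on_32_cores)
saved = back_to_back - side_by_side

print(f"two jobs back to back on 64 cores: {back_to_back} s")
print(f"two jobs side by side on 32 each:  {side_by_side} s")
print(f"saved: {saved} s ({saved / back_to_back:.1%})")
```

Even for this case, which scales well, the side-by-side option finishes both jobs sooner. The gap only gets wider for workloads that scale worse.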