Threadripper 2000 series /thread

Except EPYC isn’t better in an “absolute” way either, because it’s bottlenecked in turn not by memory but by low clock speed. So we have a situation where we either get good I/O and worse compute, or better compute and worse I/O. It’s actually a similar dilemma to the 2950X vs. 2990WX: one has no I/O bottleneck and higher clocks but fewer cores, the other has potential memory bottlenecks but more raw horsepower.

The problem isn’t that such a situation exists, because that’s absolutely normal. The problem is that we need far more benchmarks for specific kinds of loads to determine which CPU is best for “consumer” virtualization (which may include gaming via VGA passthrough), which is best for advanced virtualization, which is best for video, which is best for rendering, which for web servers, etc. There are significantly more variables to take into account than there used to be.

For example, if you virtualize VDI, what do you actually need? If one virtual desktop gets, let’s say, 2 cores/4 threads, higher clocks make it feel more like an actual desktop. But if we then run 16 such VMs, each with a VDI-capable GPU passed through (ASRock’s X399 boards support SR-IOV, so cards like the Radeon S7150 x2 are a totally valid option; I actually bought two Intel X710-DA4 cards to pair with TR in order to get the full 80G of aggregated network throughput in VMs via SR-IOV), won’t memory and PCIe be starved? It’s really not obvious.
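
For reference, the VF-creation side of that is just a sysfs write on Linux; the interface name below is a placeholder, and the VFs still need to be handed to the VMs via vfio afterwards:

# create 4 virtual functions on the (hypothetical) X710 port enp65s0f0
echo 4 > /sys/class/net/enp65s0f0/device/sriov_numvfs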

Before, we had a simple slider: more cores <> higher clocks. It was a simple relation: scalable load = more cores, non-scalable load = higher clocks. Now we also have memory bandwidth issues, so it becomes a triangle or even a rectangle. With EPYC added to the equation it becomes a cube of optimization. Suddenly, placing your use case correctly becomes incredibly difficult without benchmarks testing your exact workload, or without buying N machines and testing on your own, which is obviously not a viable solution for a consumer.

The point is: I’m a consumer performing task X. How am I supposed to know whether I’m memory starved or not if there are no benchmarks for my use case? It’s a sh*tty situation for the end user. I still have no clue whether I should get a 2990WX or a 2950X. I think I’m fairly I/O-bound due to PCIe passthrough and high-speed storage and networking, but what I think may be worth nothing, because my recent tests showed I can easily get above 20G of throughput on an old 4-core i7-2600K. If I bought a 2990WX and it performed worse on I/O than a 2950X due to its odd design, that would be a complete fuckup. If I got a 2950X and turned out to be nowhere near memory bottlenecked, that would also be a bad choice.

1 Like

I have the same issue lapsio… both vendors are hard-coding artificial market segmentation into their chips, which makes the decision matrix extremely complex and presumes a single task for your hardware.

Consumers and small businesses, more so than enterprise, are generally going to need to use things for more than one purpose.

In other domains and even other aspects of computers, your decision matrix includes the ability to mix and match components to suit your needs and trade money to a vendor vs your time and risk in constructing your own solutions.

In the long run I don’t feel I need to take what I’m given and be happy it doesn’t cost more; I am going to look to providers that don’t demand this of me. Step one is objectively determining what we have today, and that requires this conversation. Step two is communicating to vendors publicly and objectively, so that they know not only what is being purchased but what might have been purchased had it only existed…

Say for example >8 core processors in the consumer space that “no one needed”.

Haters gunna hate ?

2 Likes

Where did AMD put up the hard-coded market wall? Selling 16-core CPUs, or 32-core ones? For under $2k :frowning:
What’s Intel selling? With all the vulnerabilities…

Last I heard, 12-core TR chips were on sale for $400 USD. That’s so terrible!

2 Likes

2 actually:

  1. Locking EPYC memory controller speeds
  2. Locking second-gen TR to four memory channels (four dark channels) and using ID pins to prevent use in SP3 sockets with their eight channels.

These are artificial choices designed to segment markets.

If the 2990WX performed as well “all around” as its competitor, I’d have nothing more to say… but it does not. It was artificially put into this position by the memory channel configuration, which is not a design/die/socket/package/power/cost limitation; it was purely a segmentation choice.

Servers are meant to run stable, not be water-cooled, overclocked heat generators.

4 Likes

Servers are computers… there’s no magic to them. Stability is a matrix of what voltage, cooling, and current can be provided within the atomic/molecular stability of the semiconductor in question. Specs are just “safe harbors”, not magical laws that must be followed. They are “least common denominators”.

It’s just that marketing does not know how to sell curves… they need a single number so they say “180W” to make the dumb people happy.

2 Likes

The only components where it is acceptable to publish a 40+ page spec sheet, including several complicated graphs and tables, are power electronics parts (transistors, diacs, MOSFETs, etc.).

Saying “your CPU at 20% workload on 3/4 of the cores at 1.2 V will produce 179.89 W of heat” may be more accurate; however, nobody apart from a few thermal-solutions engineers is willing to put up with that. And moreover, nobody should have to put up with that.

no they are not

2 Likes

I’m gonna try to stick to specifics (benchmarks, issues) and avoid the sacred cows… Sorry for the OT…

Summary

It’s become less necessary in recent years (because semiconductor temperature/voltage margins are large and yields are high), but in hardware design there used to be plenty of cases where you reached a point of needing to run a component outside its published specification. Presuming you weren’t asking for something ridiculous, you might contact the manufacturer and say, “we’d like to run this at X?”.

Depending on your volume and relationship with that manufacturer, they might respond:

  • don’t do that, it will explode
  • that will be fine, we will warranty that
  • that should work but you are on your own…

At that point you went to the suits and they and the lawyers told you which option you needed to go with. Including “explode” in the case of the Ford Pinto. :wink:

TDP envelopes are absolutely “safe harbors”. They are Intel and AMD communicating with systems designers to say, “if you play by these rules, we’ll warranty this performance”, with a relatively simple, extremely conservative and predictable (relative to prior generations) set of parameters, so Dell et al. can build boxes that work.

The reality is that there are plenty of segments of the market that push these boundaries. Either for performance, environmental (heat/vibration/radiation) or space/physical/thermal concerns.

Sorry, but

specs are just Intel/AMD drawing a box around the “easy button” use of their chips. They don’t actually define the entire market’s use of them, and not just by hobbyists: as I mentioned elsewhere, overclocked, water-cooled enterprise servers are absolutely a thing in specific segments.

Well, this is interesting/bad… I disabled SMT, and it seems that creates an unexpected NUMA setup for software designed to use NUMA…

This only happens when I disable SMT (the application failed while attempting to call numactl to set affinity):

libnuma: Warning: node argument 3 is out of range
libnuma: Warning: node argument 5 is out of range
libnuma: Warning: node argument 1 is out of range
libnuma: Warning: node argument 4 is out of range
libnuma: Warning: node argument 7 is out of range
libnuma: Warning: node argument 9 is out of range
libnuma: Warning: node argument 10 is out of range
libnuma: Warning: node argument 11 is out of range
libnuma: Warning: node argument 6 is out of range
libnuma: Warning: node argument 8 is out of range
libnuma: Warning: node argument 13 is out of range
libnuma: Warning: node argument 12 is out of range 
libnuma: Warning: node argument 14 is out of range
libnuma: Warning: node argument 15 is out of range

The NUMA config looks reasonable:

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32128 MB
node 0 free: 27968 MB
node 1 cpus: 16 17 18 19 20 21 22 23
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15
node 2 size: 32227 MB
node 2 free: 26395 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10
 
numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
cpubind: 0 2 
nodebind: 0 2 
membind: 0 2 
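
For what it’s worth, a workaround from the shell would be to bind explicitly to the memory-bearing nodes (0 and 2 here) rather than letting the app enumerate nodes on its own; the invocation below is just a sketch:

# nodes 1 and 3 have no local memory with SMT off, so skip them
numactl --cpunodebind=0,2 --membind=0,2 ./the_app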

This application (which I can’t name, sorry) has a pile of logic on its front end that tries to outsmart NUMA, threads, cores, etc…

Not surprisingly, it’s getting pretty confused here as it evolved in an intel world…

Details on what/why/why not, etc..

This is an OpenMPI app with relatively heavy inter-thread/process communication. It scales reasonably well to 18 cores in the Intel world, providing a 4-6x speedup vs. a single core. The use case for such relatively inefficient use of multi-core (vs. parallel single-core regression) is interactive debug of large sims (reaching the failure point in the hardware or software being simulated), or long-running simulations (think weeks and months) where network uptime becomes prohibitive as a practical matter.

I’ve been testing 16 cores on the 2990WX in various configs. What MSI calls 2+2 provided the best performance, with a 3m8s run of a simulation. That 3m result scales well to a multi-hour run too, so, percent for percent, the reductions at 3 minutes or 3 hours were linear on both the 2990WX and a dual Haswell (ditto for a Skylake).

I tried disabling SMT as above to confirm, but I strongly suspect the results I’m seeing indicate that the affinity choices are pinning to threads, not cores. I don’t control this logic (it’s automated), but I may be able to influence it.
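
One way to sanity-check which CPU numbers are physical cores vs. SMT siblings (standard Linux sysfs paths; the loop itself is just a sketch):

# print each CPU's sibling list; the first ID in each pair is the physical core
for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
  echo "$f: $(cat $f)"
done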

Some quick numbers for reference as to why this is an issue (see summary for details on “why” and linearity, etc…):

2990wx - using 16 threads (4.18 kernel)
bios defaults:

  • all cores active 5m25s

4.1GHz 3200C14:

  • all cores active: 3m49s
  • 2+2 (16 cores): 3m8s
  • 4+1 (16 cores): 3m24s

Haswell (2x2696v3 - ucode mod to allow 2x9 = 18 cores to run @3.8GHz - all-core 3.4GHz)

  • 16 threads: 1m34s (corrected for 4.18 kernel - smelt has hurt 3.10+haswell)

7980xe (4.5GHz all-core 3200C14)

  • 16 threads: 1m25s (~same time for 3.10 vs 4.18 - smelt+skylake has been less painful)

Then I guess you have a choice to make:

  • you spend some time (money) to optimise for the TR2 architecture
  • you try it on EPYC and see how it goes (and also maybe optimise there)
  • you pay more for intel and save the R&D/development cost

Option 3 may well be the best option for your particular use case, however I would speculate that threadripper/epyc are the start of a new trend in massively multi-core CPUs moving forward.

There’s no shame in buying intel if it is objectively the best bang for buck. And buying specific hardware to suit software quirks is nothing new. It’s why we’re all on x86 today in the first place to a large degree…

TR/EPYC is another option. It will work for some. It may not work for others. If the cost to modify your application is more than you’d just spend on buying more expensive intel processors to make it run fast, then you buy the intel processors…

1 Like

Hmm, can’t run KVM on MSI MEG.

My bios option shows “SVM” (which I thought was the Secure VM mode that does not work on TR). There does not appear to be another “Virtualization” option. I’ve tried with and without the “SVM” option enabled.

lsmod says there is no AMD KVM module loaded. I’ll have to explore further later. Anyone with an MSI MEG have KVM running?
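
For anyone else poking at this, the sanity checks I’d run first (standard tools, nothing MEG-specific):

grep -c -w svm /proc/cpuinfo   # non-zero means the OS can see AMD-V at all
sudo modprobe kvm_amd          # try loading the AMD KVM module explicitly
lsmod | grep kvm               # should now list kvm and kvm_amd
dmesg | grep -iE 'kvm|svm'     # look for the driver complaining about SVM being disabled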

EDIT: it seems this is THE SVM bug that’s still with us, for which the only solution is to downgrade the BIOS or patch the kernel… guess I’ll build a kernel this weekend…

1 Like

Ok, now we are cooking with gas!

Summary

2990WX, MSI MEG, 4.1GHz all-core turbo, water cooled (will experiment with PBO at some point, but so far this OC has been as good as or better for the high-thread-count things I’m doing)

  1. upgrade to 4.19-rc2 kernel to deal with SVM, PCI bugs
  2. create VM (KVM)
  3. pin VM to “near” cores and not threads (virsh vcpupin N 0,2,4,6,8…; a rough loop is sketched after this list)
  4. Profit.
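
Step 3 expanded a bit; this is roughly what I mean, with the guest name (“sim”) and vCPU count as placeholders:

# pin each of 16 vCPUs to an even-numbered host CPU so it lands on a
# physical core rather than an SMT sibling
for n in $(seq 0 15); do
  virsh vcpupin sim $n $((n * 2))
done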

Result (heavy IPC, large-memory-image simulation using the tool I can’t name, which scales reasonably well to ~16 threads/cores in Intel land):

This job scales linearly, so whether it runs for 1m or 1hr, improvements at this scale can be extrapolated easily.

Baseline: 7980XE - 4.5GHz:
16 cores bare-metal: 1m25s (85s)

2990WX - 4.1GHz all-core:
8 cores bare-metal: 5m14s (314s)
16 cores bare-metal: 3m50s (230s)

8 pinned VM “near” cores: 2m41s (161s)
16 pinned VM “near” cores: 2m9s (129s)
18 pinned VM “near”* cores: 2m4s (124s)
32 pinned VM cores: 2m22s (142s)
16 pinned “far” cores: 2m40s (160s)

So, in absolute terms, the 7980XE is still 30% faster than the best time the 2990WX can produce…

BUT with the 2990WX now running almost 2x faster on 16 cores, it has 16 more cores left over for “other work”, where the 7980XE has only 2… So I will need to do some experiments running 2, 3, or 4 of these jobs at once in 8-16 core arrangements.

Ultimately, I’d like to be able to run big/long jobs in the background while gaming in a GPU passthrough VM with 4-6 cores. The numbers above show a 24% slowdown when comparing a 16-core job near vs. far, but I generally run games at GPU-bottlenecked settings… So… more experiments are needed!

2 Likes

The announcement of the X499 chipset is said to be planned for CES 2019, at the AMD Tech Day event for journalists. Talks regarding the next-generation 7nm EPYC processors, which are expected to ship later next year along with new Ryzen Threadripper CPUs and Navi GPUs, are also said to be scheduled during the event.

1 Like

LXC container running well enough on all cores:
https://browser.geekbench.com/v4/cpu/9797766
single: 5065, multi: 60656

Tried to pin the cores last night and clearly there is more chanting and dancing required to make this work.

config:
lxc.cgroup.cpuset.cpus = 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30

That seemed to work for make -j16, as the host showed only the 16 even-numbered cores loaded. However, the problem is that it doesn’t change what the guest/container sees in /proc/cpuinfo.

So, if your program reads those files, tries to figure out what is a core and what is a thread, and uses numactl to pin based on Intel patterns, it still gets the wrong answer. I need to virtualize the CPU info inside LXC.

Saw some mentions of lxcfs, but now we’ve reached the end of my prior container experience.
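
If lxcfs does turn out to be the answer, my understanding is that it exposes cpuset-filtered proc files that get bind-mounted over the container’s own; something along these lines (the lxcfs path is an assumption and varies by distro):

lxc.mount.entry = /var/lib/lxcfs/proc/cpuinfo proc/cpuinfo none bind,create=file 0 0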

Some updates:

  • Configuration of the LXC container now controls where the application executes, by allocating only the “near” and “real” cores (lxc.cgroup.cpuset.cpus = 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30).
  • To get this working, I had to intercept /usr/bin/numactl and prevent the application from pinning/setting affinity, as it was confused by the TR architecture and making the wrong choices. I turned its numactl calls into no-ops (a minimal sketch of the shim is below), so now the containerized kernel schedules among the provided cores “natively”.
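
A minimal sketch of that shim, assuming the real binary was moved aside to /usr/bin/numactl.real (the option list only covers the common pinning flags):

#!/bin/sh
# no-op numactl shim: strip the pinning options and just run the wrapped command,
# so the container's cpuset controls placement instead of the app's own logic
REAL=/usr/bin/numactl.real
while [ $# -gt 0 ]; do
  case "$1" in
    --hardware|-H|--show|-s) exec "$REAL" "$@" ;;   # let pure query calls through
    --interleave=*|--cpunodebind=*|--membind=*|--physcpubind=*|--preferred=*) shift ;;
    -i|-N|-m|-C|--interleave|--cpunodebind|--membind|--physcpubind|--preferred) shift 2 ;;
    --) shift; break ;;
    -*) shift ;;
    *) break ;;
  esac
done
exec "$@"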

Result:

  1. host load distributed over desired cores…

  2. Performance improved (container vs VM):
    16 “near” cores in container: 1m47s (107s)
    8 “near” cores in container: 2m24s (144s)

Containers are wonderful if you have not played with them. Particularly on chips of this scale, I suspect there would always be some benefit to them even if there weren’t NUMA issues. They let you divide your workflows up into their required compute resources. Those containers can sit idle on your host, and you use whichever one is appropriate for the component of your flow at any given moment. Oversubscription works much better this way than with a VM.

In a perfect world you’d be able to have heterogeneous-kernel containers, to the point of Windows on Linux, but… that’s a lot of issues that would have to be resolved, and/or Linux-ization of the Windows kernel.

Hardware Unboxed just did some OC VRM thermals on the 2990WX, if anyone’s curious:

Aorus Extreme didn’t do well, apparently tripping OCP.

2 Likes

I find it interesting how the VRMs are really where the problem comes in on some of this high-end crazy gear, rather than a silicon limit. Not sure if this used to be a limit, but it seems more prevalent today (or at least more talked about).

… mumble mumble about on-chip VRMs with Intel allowing motherboard VRMs to handle less current at higher voltage, mumble mumble