Threadripper 2000 series /thread

cekim · August 31, 2018, 2:06am

To AMD? Yes… yes they are… They spec’d a socket that has them physically already. Done and done.

TR4 = SP3 - same footprint. Differentiated by ID pins and dark fingers only…

I agree with you that AMD likely had to do something like this, since they need a cash-cow enterprise product to let them sell us plebes cool stuff cheap.

However, the reality is that with MBs that are $500-600 and have 10-, 16-, or 19-phase VRMs that can deliver 750W+, the cost of 4 additional memory channels not only does not fall to AMD at all, but is easily absorbed in the price range of the consumer in question.

Intel is about to add an “A” series, per the rumors and slides, which will have it asking “enthusiasts” to buy hex-channel motherboards that also need 750-1000W VRMs later this year. Clearly there is a market for this insanity (I may or may not be part of it; my pretend lawyer has advised me to say nothing further).

Marten · August 31, 2018, 2:11am

AMD are smashing Intel in the face and bringing great CPUs out. I can totally understand that EPYC leads the way. Threadripper follows with a cut-down version and Ryzen gets the masses.

Knocking AMD for that is like saying Intel’s i7 8-core non-HT CPU is a great CPU.

wendell · August 31, 2018, 2:16am

Fedora 28 kernel 4.15. I can replicate the IO degradation in Windows, and other folks like pcper reported possibly similar issues with stuff like 4-up NVMe RAID.

cekim · August 31, 2018, 2:19am

So, you and pcper are seeing this just in dd if=x of=x oflag=direct sorts of throughput checks?

Windows appears to be gimped generally with scheduling and has always been abjectly terrible at I/O - like it was their job to slow it down. The geekbench 1:1s show that pretty well, with many scores nearly doubling.

I’ll try some disk I/O checks - I don’t have any nvme raid on this system though, so it may be I can’t see it.

Update 1:

4100:
dd if=/dev/zero of=test.img bs=4096 count=100000 oflag=direct
409600000 bytes (410 MB) copied, 2.91738 s, 140 MB/s
409600000 bytes (410 MB) copied, 2.90616 s, 141 MB/s
409600000 bytes (410 MB) copied, 2.31302 s, 177 MB/s
PBO:
dd if=/dev/zero of=test.img bs=4096 count=100000 oflag=direct
409600000 bytes (410 MB) copied, 3.20625 s, 128 MB/s
409600000 bytes (410 MB) copied, 3.75663 s, 109 MB/s
409600000 bytes (410 MB) copied, 3.3943 s, 121 MB/s

Ah hah! I think this is a performance governor latency thing - Looks to me like the performance governor is taking longer to spin up in the PBO case…

PBO:
cpupower frequency-set -g performance
dd if=/dev/zero of=test.img bs=4096 count=100000 oflag=direct
409600000 bytes (410 MB) copied, 2.3176 s, 177 MB/s
409600000 bytes (410 MB) copied, 2.64346 s, 155 MB/s
409600000 bytes (410 MB) copied, 2.28784 s, 179 MB/s

Going to grab some larger data samples - 400MB is likely too small if the performance governor is having that much of an effect.
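
Since the dd numbers above shift once the governor is changed, it is worth confirming which governor each core is actually using before and after `cpupower frequency-set`. A minimal Python sketch, assuming the standard Linux cpufreq sysfs layout under `/sys/devices/system/cpu`; the function name `governor_counts` is made up for illustration and is not part of cpupower:

```python
from collections import Counter
from pathlib import Path

# Standard Linux cpufreq sysfs location (assumption: a cpufreq driver is loaded).
# Passed as a parameter so the function can be pointed at another tree.
CPU_ROOT = Path("/sys/devices/system/cpu")


def governor_counts(cpu_root=CPU_ROOT):
    """Count how many logical CPUs are using each cpufreq governor."""
    counts = Counter()
    for gov_file in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return counts


if __name__ == "__main__":
    counts = governor_counts()
    if not counts:
        print("no cpufreq governors exposed (driver not loaded?)")
    for name, n in sorted(counts.items()):
        print(f"{name}: {n} CPUs")
```

Running it once before and once after the `cpupower` call shows whether every core really switched, which matters if a governor change is being credited for a throughput difference.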

Marten · August 31, 2018, 2:19am

Kernel 4.15 is not that long ago. I know AMD dump a lot of code into the kernel for AMDGPU. Have they been doing CPU tweaks? Well, I guess IO tweaks.

wendell · August 31, 2018, 2:21am

Even just Bonnie will show it. dd should work too. It’s very interesting. Suggests higher clocks lead to errors on IF (Infinity Fabric) that are silently corrected, but things slow down. Most of this testing was TR1, not TR2. Rebalancing IO across dies helped a lot too.

cekim · August 31, 2018, 2:52am

Well, this is interesting… When at 4100, cpupower won’t let me load a governor, but it did with PBO???

At any rate - it appears at least part of the issue is the spin-up latency of the performance governor. I would try on your end to see if you can peg the 4.15 kernel to performance mode and see how it behaves. Here’s what I see between the two with a larger file:

PBO (governor = performance):
dd if=/dev/zero of=test.img bs=4096 count=1000000 oflag=direct
4096000000 bytes (4.1 GB) copied, 26.7552 s, 153 MB/s
4096000000 bytes (4.1 GB) copied, 22.7156 s, 180 MB/s
4096000000 bytes (4.1 GB) copied, 26.7563 s, 153 MB/s
4096000000 bytes (4.1 GB) copied, 22.9166 s, 179 MB/s

4100:
4096000000 bytes (4.1 GB) copied, 23.4658 s, 175 MB/s
4096000000 bytes (4.1 GB) copied, 24.4571 s, 167 MB/s
4096000000 bytes (4.1 GB) copied, 23.3825 s, 175 MB/s
4096000000 bytes (4.1 GB) copied, 22.8003 s, 180 MB/s

Metered eye-ball says those are pretty much the same. Is it possible that you are also seeing retries (at the DDR interface)? I would assume internal IF errors would be flagged as machine checks, but marginal DDR, being pushed harder by back-to-back bursts from a faster CPU, would just silently retry on CRC error (within limits) and chew up bandwidth.
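
The eyeball comparison of the two 4.1 GB runs can be backed with a quick mean and spread. A rough Python sketch that recomputes throughput from the byte count and elapsed time on each dd summary line posted above; the helper names are illustrative, and four samples per configuration is far too few for anything beyond a sanity check:

```python
import re
from statistics import mean, stdev

# dd summary lines copied from the 4.1 GB runs in this post.
PBO_PERFORMANCE = """\
4096000000 bytes (4.1 GB) copied, 26.7552 s, 153 MB/s
4096000000 bytes (4.1 GB) copied, 22.7156 s, 180 MB/s
4096000000 bytes (4.1 GB) copied, 26.7563 s, 153 MB/s
4096000000 bytes (4.1 GB) copied, 22.9166 s, 179 MB/s
"""
FIXED_4100 = """\
4096000000 bytes (4.1 GB) copied, 23.4658 s, 175 MB/s
4096000000 bytes (4.1 GB) copied, 24.4571 s, 167 MB/s
4096000000 bytes (4.1 GB) copied, 23.3825 s, 175 MB/s
4096000000 bytes (4.1 GB) copied, 22.8003 s, 180 MB/s
"""

# Captures the byte count and the elapsed seconds from one summary line.
DD_LINE = re.compile(r"(\d+) bytes .* copied, ([\d.]+) s,")


def throughputs_mb_s(dd_output):
    """Return throughput in decimal MB/s (the unit dd prints) per summary line."""
    return [int(m.group(1)) / float(m.group(2)) / 1e6
            for m in DD_LINE.finditer(dd_output)]


def summarize(label, dd_output):
    """Print and return (mean, sample standard deviation) for one configuration."""
    rates = throughputs_mb_s(dd_output)
    avg, spread = mean(rates), stdev(rates)
    print(f"{label}: mean {avg:.1f} MB/s, stdev {spread:.1f} MB/s, n={len(rates)}")
    return avg, spread


pbo = summarize("PBO (performance governor)", PBO_PERFORMANCE)
fixed = summarize("4100", FIXED_4100)
```

On these samples the two means differ by less than the run-to-run spread of the PBO configuration, which is consistent with calling them "pretty much the same".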

I’ve (previously - in all the data you’ve seen) bumped SoC voltage to 1.025 in the BIOS. Not sure if I needed to, but I was worried about stability of the fabric at the higher speed… so I gave it a little more juice to work with.

cekim · August 31, 2018, 3:37am

Bonnie++

4100:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xxxxxxxxxxxxxx 126G  1198  99 1782750  94 936796  54  3533  99 2579093  83  2384 106
Latency              7216us   39860us     511ms    2360us    3259us    5657us
Version  1.97       ------Sequential Create------ --------Random Create--------
xxxxxxxxxxxxxxxxxxx -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20720  39 +++++ +++ +++++ +++ 22853  43 +++++ +++ +++++ +++
Latency               139us     252us     322us      89us       6us     317us
1.97,1.97,xxxxxxxxxxxxxxxxxxxxxx,1,1535687781,126G,,1198,99,1782750,94,936796,54,3533,99,2579093,83,2384,106,16,,,,,20720,39,+++++,+++,+++++,+++,22853,43,+++++,+++,+++++,+++,7216us,39860us,511ms,2360us,3259us,5657us,139us,252us,322us,89us,6us,317us

PBO + cpupower performance governor:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xxxxxxxxxxxxxx 126G  1245  99 1806634  91 923144  54  3577  99 2568785  81  2423 107
Latency              7097us   36473us     496ms    2844us    7283us     515us
Version  1.97       ------Sequential Create------ --------Random Create--------
xxxxxxxxxxxxxxxxxxx -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22707  42 +++++ +++ +++++ +++ 23312  43 +++++ +++ +++++ +++
Latency               202us     505us     292us     257us      10us    1611us
1.97,1.97,xxxxxxxxxxxxxxxxxxxxxx,1,1535681934,126G,,1245,99,1806634,91,923144,54,3577,99,2568785,81,2423,107,16,,,,,22707,42,+++++,+++,+++++,+++,23312,43,+++++,+++,+++++,+++,7097us,36473us,496ms,2844us,7283us,515us,202us,505us,292us,257us,10us,1611us
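
To compare the two bonnie++ runs without squinting at the tables, the CSV summary lines can be diffed field by field. A small Python sketch; the field positions below were inferred by matching the CSV lines against the tables in this post, so treat them as an assumption for other bonnie++ versions, and the names `FIELDS` and `percent_deltas` are made up for illustration:

```python
# CSV summary lines copied from the two bonnie++ 1.97 runs above
# (machine name already masked in the thread).
RUN_4100 = (
    "1.97,1.97,xxxxxxxxxxxxxxxxxxxxxx,1,1535687781,126G,,1198,99,1782750,94,"
    "936796,54,3533,99,2579093,83,2384,106,16,,,,,20720,39,+++++,+++,+++++,+++,"
    "22853,43,+++++,+++,+++++,+++,7216us,39860us,511ms,2360us,3259us,5657us,"
    "139us,252us,322us,89us,6us,317us"
)
RUN_PBO = (
    "1.97,1.97,xxxxxxxxxxxxxxxxxxxxxx,1,1535681934,126G,,1245,99,1806634,91,"
    "923144,54,3577,99,2568785,81,2423,107,16,,,,,22707,42,+++++,+++,+++++,+++,"
    "23312,43,+++++,+++,+++++,+++,7097us,36473us,496ms,2844us,7283us,515us,"
    "202us,505us,292us,257us,10us,1611us"
)

# Zero-based CSV positions, inferred from the tables above (assumption).
FIELDS = {
    "block write K/s": 9,
    "rewrite K/s": 11,
    "block read K/s": 15,
    "random seeks /s": 17,
}


def percent_deltas(base_csv, other_csv):
    """Percent change of each selected field in other_csv relative to base_csv."""
    base, other = base_csv.split(","), other_csv.split(",")
    return {
        name: 100.0 * (float(other[i]) - float(base[i])) / float(base[i])
        for name, i in FIELDS.items()
    }


deltas = percent_deltas(RUN_4100, RUN_PBO)
for name, pct in deltas.items():
    print(f"{name}: {pct:+.2f}% (PBO vs 4100)")
```

All four throughput fields land within a couple of percent of each other, so with the performance governor pinned these two runs do not show a PBO penalty.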

MazeFrame · August 31, 2018, 3:41am

TR is AMD’s way to sell half-broken Epyc dies.
For half the price of Epyc, you get 80-ish percent of the performance and most of the features.

And of course there is some segmentation going on! Expecting literal EPYC chips for the price of a 2200G is…

cekim · August 31, 2018, 3:47am

There is some “matroxing” going on here I think. AMD has benefited from the “underdog”, “of the people” idea of fighting the Intel beast.

The issue now is that they are doing all the things that Intel did in abusing their position in the market (sandbagging. HARD). So, I suspect that’s what is rustling jimmies.

[AMD flamesuit on]
I’d add that unfortunately the net result of their sand-bagging is that in practical terms, the 7980xe delivered a year ago what the 2990wx struggles to deliver today. In many of its target applications the 7980xe is going to deliver similar or better to significantly better performance than the 2990wx. Some of that may be OS and application tuning, but the bottom line is that over a broad sweep of “what people do with these things”, the 2990wx loses out to the 7980xe in a big way despite a very similar price tag.

I say that owning one of each and running them literally side-by-side… where it beats it really beats… then there is everything else…

thro · August 31, 2018, 5:45am

The interconnect on the package is less complicated.

Either way, marketing a full bandwidth Threadripper with 32 cores would be financially retarded.

Be happy you have a 32 core desktop CPU for $1500. Without the 2990WX, you’d be paying 13 grand for a Xeon 8180.

cekim · August 31, 2018, 5:49am

it may be identical this round given the 4-die config
it actually costs them money to make multiple packages - it is surprising they bothered, but in volume that cost is small (but larger for 2 than 1)
Is this a “just buy it?” post.
See above - in practical terms the 2990wx is not delivering more value than a 7980xe that existed in a world that only had 16 core TR as its competition.

So, it is entirely reasonable to criticize as AMD will need to do more to be more than a “value play”.

StrY · August 31, 2018, 6:06am

I don’t get where this thread is going.

Yes the threadripper is not an epyc chip - wasn’t supposed to be as that doesn’t make business sense.

“in practical terms the 2990wx is not delivering more value than a 7980xe that existed in a world that only had 16 core TR as its competition.”

I don’t understand this sentence. The 2990wx is $250 cheaper than the 7980xe and delivers dramatically more performance in most scenarios. Yes - you can get a 1950x with half the performance at half the price. But if you have to build 2 systems to get the same performance as the 2990wx, that’s going to cost you more.

At the end of the day it all depends on the workload you are running what processor is going to be the best for you.

thro · August 31, 2018, 6:07am

“Waah AMD won’t sell me 32 cores and 8 channel DDR4 for beer money”

edit:
just a few short years ago, the i7-6950X was about the same price as the 2990WX.
(actually, $1723.00 - $1743.00 according to ark.intel.com)

cekim · August 31, 2018, 6:11am

That’s just the thing though… it doesn’t deliver “more performance in most scenarios”. It delivers more in some very specific scenarios and dramatically less in many others. This is even true relative to the 2950x.

Personally, I think I can make use of its peculiarities - that’s why I have one, but when you chart up the broad spectrum of HEDT use-cases, it’s a very, very strange processor.

That was the crux of the discussion for/from me. They’ve made a painful choice for the user in their memory architecture that has dire consequences for the overwhelming majority of use-cases.

thro · August 31, 2018, 6:13am

In lightly threaded scenarios it doesn’t perform. Who knew? Buy a 2700X for those scenarios… or use ryzen master to turn your 2990WX into one, or into a 2950X.

It wasn’t a painful choice, it was a deliberate choice to not cannibalise the high end where they can make profit for the first time in a decade.

edit:
Also, much of the limitation is in the Windows scheduler. It’s not all AMD’s fault.

At the end of the day, if it doesn’t work for you, buy something that does. This level of capability at this price is a steal.

The 7980XE doesn’t have the PCIe lanes. If you want to pick and choose things where processors just don’t perform, how about NVMe RAID on Intel?

cekim · August 31, 2018, 6:14am

This is an answer from the heart, not the data… Look at the comparisons between the 2950x and the 2990wx… There is a LOT more going on here than “lightly threaded” or not. I’m actually not concerned with “lightly threaded”… I like my threads well-done.

At any rate, I’ve clearly trodden on dogma here by suggesting the chip is odd, and perhaps more so than it needed to be… so, I’ll leave it at that.

StrY · August 31, 2018, 6:33am

Saying dogma is going a bit far, I feel. I’m looking at benches here and making an educated guess, as I don’t own a Threadripper yet.

Looking at this article I can indeed see that on Windows the Threadripper’s performance does drop by a huge margin. Hopefully this can be remedied with a patch at some point.

I still think the Threadripper is going to be a better buy for most users buying this type of processor. But the i9 definitely has workloads where it is still king, just as with all the rest of the Intel/AMD comparisons.

MazeFrame · August 31, 2018, 7:45am

It is.
The raw numbers aside, look at the difference between Linux and Windows! A reviewer could make both CPUs look the same or make one roflstomp the other by switching OS…

Now this is funny:
Compress your data in Windows

And then unpack them in Linux:

I am not going to link the Blender numbers. They show that someone optimized it to eat ALL the cores and it is just “moar cores moar better”, with Linux being quicker (who knew).

thro · August 31, 2018, 8:06am

Dogma? Please…

The chip is most definitely unusual, you get no argument from me there.
You also get no argument that sure, it is compromised, and memory bandwidth starved in some scenarios (but realistically, has big enough caches for highly threaded desktop style workloads to alleviate that mostly in real world scenarios).

BUT

You have to take into account what else is on the market and what its price is. Also, that it had to fit into existing TR4 boards which are quad channel and originally made for 2 dies, not 4. If you want something that isn’t memory starved, AMD make a variant that isn’t - for this in 4 die 8 channel form, you don’t want X399, you want an EPYC board. This processor isn’t made for that. This processor in that form already exists, it just isn’t called threadripper.

If you need a lot of cores for not much money, and aren’t memory bandwidth starved, it’s a bargain. I feel there are plenty of desktop workloads in that niche.

If you are memory bandwidth starved, buy EPYC, or something else. No one is forcing you to buy TR 2990WX for cheap.