The no BS Ryzen Thread: All official information on Ryzen here

Clocks will probably be about the same as now. There appears to be a very significant process tech limitation for Ryzen on the Samsung 14nm LPP process. Add to this that the lower core count Ryzen parts will be harvested dies; there are no special dies for the six and four core parts as of yet. The CCXes can't be too imbalanced, though: you won't get 1+3, but rather 2+2. A single CCX could be interesting, but so far that is not something AMD has said they are doing. We will probably get die harvested APUs much later on (well after the APUs launch) that have a borked GPU but a working CCX. Not sure that will be better than die harvested Ryzen though.

Ryzen on Samsung 14nm LPP doesn't seem to clock high at all. The Stilt has a telling diagram in his technical writeup:

The frequency to Vcore scaling is great up to 3.3 GHz (almost linear), after which it gets worse. There is a second critical point at 3.5 GHz, and it gets much, much worse beyond that. 14nm LPP is very efficient up to 3.3 GHz, but it will simply never clock as high as Intel's improved 14nm, as the latter is a far superior process.

High clocks just aren't for Ryzen; it is more interesting to look at low frequencies. The Stilt does that, and the numbers at low clocks are very interesting: very impressive performance at very low power consumption. Mobile/laptop chips will be good, no doubt. But we won't get any high clocked desktop chips. This is also not something AMD can do much about, as it looks like a process tech limitation. Later Zen versions might bring improvements; we'll see when Zen2, Zen3 etc. come around.

Also, isn't low power performance more important than really high clocked desktop parts? Laptops, tablets etc. are probably a much bigger market than high end desktops. AMD seems to be betting on Zen for servers too, and Naples could make a serious dent in Intel's server sales. Performance per watt is important in servers, and so is scaling over many cores. It looks like Naples has all that in spades, or at least it will make things interesting again. So many years of only Intel gets boring, lol.

Thanks for linking the PCPer article @Marten, it is a very good one, and I do believe Allyn when he writes that it took all day and that the whole team pitched in. Ryan has also added an edit at the top:
https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet

Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout

PSA:

Their findings suggest there isn't any malice in the scheduling or SMT usage. It may be down to the way the architecture handles loads that cross between physical cores, and especially between the two core complexes (2 x 4). Say a certain load communicates between core 1 and core 2: that would cost some performance. But a load communicating between core 1 and core 7, two cores on either side of the core complexes, would cost even more. This kind of load would be found in today's games that more closely parallel the multithreaded workload of current gen consoles, especially the Vulkan and DX12 titles.

Furthermore, this could explain why certain games don't lose performance on Ryzen. Instead of handing out workloads that are co-dependent between cores, those games only send specific workloads to specific cores (e.g. AI on core 1, physics on core 2, etc.).
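To make the core 1 vs. core 7 example above concrete, here is a minimal sketch in Python. It assumes an 8-core part with two CCXes of four cores each, with cores 0-3 in the first CCX and 4-7 in the second. The function names are made up for illustration, and how an OS actually numbers the cores may differ.

```python
# Toy model of the two-CCX layout described above.
# Assumption: cores 0-3 belong to CCX 0 and cores 4-7 to CCX 1.
CORES_PER_CCX = 4


def ccx_of(core):
    """Return the index of the CCX that a physical core belongs to."""
    return core // CORES_PER_CCX


def crosses_ccx(core_a, core_b):
    """True if traffic between the two cores has to leave the CCX."""
    return ccx_of(core_a) != ccx_of(core_b)


if __name__ == "__main__":
    # The two cases from the post: core 1 <-> core 2 and core 1 <-> core 7.
    for a, b in [(1, 2), (1, 7)]:
        kind = "cross-CCX" if crosses_ccx(a, b) else "same CCX"
        print(f"core {a} <-> core {b}: {kind}")
```

A game that pins AI to one core and physics to another only pays the cross-CCX cost if those two jobs actually exchange data, which fits the observation that some games are unaffected.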

Turns out when AMD says it is a new architecture, it actually is a new architecture.

By the way: this explains why AMD is in contact with all those game developers.
What was that? 3000 dev kits?

If you have Ryzen, you can try the draw call benchmark (http://developer.download.nvidia.com/SDK/9.5/Samples/DEMOS/Direct3D9/HLSL_Instancing.zip) they've used in The Stilt's thread over at the AnandTech Forums.

One user there, CatMerc, did a video:

Performance drops significantly when threads are spread over the CCXes (the second video shows the expected randomness in thread allocation). Pretty sure this is the problem seen in game benches, just as PCPer thinks. It is probably also what is responsible for the weird histogram The Tech Report got in GTA5:

This is probably a direct consequence of the inter CCX latency PCPer saw in their testing:

Exactly how the Infinity Fabric works, I don't know. I've seen different numbers on how much data per cycle it can manage, plus all the speculation that it is almost like the HyperTransport of old. I don't think it shares much with the old fabric, as AMD has repeatedly said it is different. Same same but different. When Naples launches I think we'll have more on the spec, and much more hardware to get data points from too.

There is some interesting discussion in The Stilt's "Ryzen: Strictly technical" thread; the PCPer article comes up at about page 19.

I like that bit at the end, that splitting Ryzen into two NUMA groups probably isn't a good idea. Ryzen simply isn't two NUMA groups, as the CCXes share memory and the memory controller. There is also the question of cache: will your workload benefit from running on two CCXes so it can use that extra cache? Adding 8MB of L3 can be substantial depending on workload. It is not an easy fix by a long shot, so I think Ryan is right when he says there is no "silver bullet" that fixes this by patching the Windows scheduler. Ryzen is a design with compromises, and this inter CCX communication is part of that.

Makes me interested in how AMD has set that up for the server platform, Naples. 64 PCIe lanes running Infinity Fabric between sockets. I want to see how that works.

Edit: Also, threaded games are probably the worst case scenario for Ryzen. When there are several threads that need to communicate, both producers and consumers, the crosstalk over the Fabric increases. Add to this that you have a GPU that wants latency sensitive information, and that textures etc. need to be streamed in from disk and memory to the GPU. That is a lot of load on the Fabric. Compare that to multi-threaded Blender, which is embarrassingly parallel with almost no inter-thread communication and no GPU that wants CPU time RIGHT NOW.

Look at the Fire Strike scores. Ryzen does well in the Physics Score as expected, since it can run a parallel workload without much interruption. Then there is the Combined test, where the scores are not high: that is the Fabric crosstalk coming into play. The CCXes need to talk to each other and to the GPU at the same time. It is the same in many other benches that use a GPU in tandem with the CPU. Just like games.

Right on point.

Thanks for sharing kind sir :)

Once the efficient scheduler has been patched in, I don't think it should be a problem anymore in most cases. As long as, for instance, a game keeps most of the work that involves frequently shared resources inside four of the cores, it will be able to run even a little bit faster than on a "flat" eight core setup. We know that four cores/eight threads are still enough for almost any game, and a game with tons of physics work or whatever could offload that to some of the spare cores.

This kind of 2x4 core setup has already been in the major consoles for a while now, so it's not something unusual or unfamiliar to the game studios.

Let's just hope the updated scheduler gets it "right" the first time.

huh, cool.
well shit
the temp sensors are 20C off on ryzen
K
https://community.amd.com/community/gaming/blog/2017/03/13/amd-ryzen-community-update?sf62109582=1

Thanks for that link Cavemanthe0ne.

Hmm, I see the following unexpected statement in there:

Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture.

That would mean there's in fact no scheduler update in tow. Not sure what to make of that, since there have been so many indications of some kind of problem. Now I'm a bit worried...

Couldn't it be that those 128-bit FPUs play some role in this?

I have a hunch that the new 'Game Mode' for Windows 10 that is coming this year might solve this problem. I've not looked into it much, but it sounds like it sets processor affinity and priority for games, so in theory it could be configured to recognise 8C/16T Ryzen and lock the game to 4 threads of the last 8.

Actually not unexpected at all. From the PCPer article:

In fact, though we are waiting for official comments we can attribute from AMD on the matter, I have been told from high knowledge individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance.

And the official comment was as expected. Nothing wrong with the Windows scheduler. Ryzen has compromises, that's about it. You can't magically fix the inter CCX latency problem, only work around it. And that is best done on an application level, as the OS has no idea what kind of logic your app will use.

I don't think so. Games in particular don't use AVX2; no game does, to my knowledge. Do please correct me if I'm wrong. Ryzen can do 256-bit operations (AVX2) but takes a hefty perf hit, as each such operation has to be split into two 128-bit halves, effectively cutting the throughput in half compared to Haswell and later.

That sounds like a really bad idea, doesn't it? You waste a lot of your CPU resources. Also, modern games like many cores, think DX12/Vulkan. Again, this isn't something you can solve on the OS level, it needs to be done on the application level. Every game needs to optimize for Ryzen.

Best solution I've found so far is to lock core affinity on a per process basis to one CCX. I will post benchmarks next week once I have the time to do some thorough testing.

Are you using Process Lasso for that?

Nope, just plain Windows 10 Task Manager as of right now. It works.
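For anyone who wants to script the one-CCX affinity lock instead of clicking through Task Manager, the mask can be worked out like this. It is only a sketch: it assumes an 8C/16T chip where logical processors are numbered with SMT siblings next to each other, so that logical CPUs 0-7 are the first CCX and 8-15 the second. Check the numbering on your own system before relying on it. The helper names are hypothetical.

```python
def ccx_logical_cpus(ccx, cores_per_ccx=4, threads_per_core=2):
    """List the logical CPU numbers assumed to belong to one CCX.

    Assumption: SMT siblings are numbered consecutively and CCX 0 comes first.
    """
    width = cores_per_ccx * threads_per_core
    first = ccx * width
    return list(range(first, first + width))


def affinity_mask(cpus):
    """Build an affinity bitmask with one bit set per logical CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    for ccx in (0, 1):
        cpus = ccx_logical_cpus(ccx)
        print(f"CCX {ccx}: logical CPUs {cpus}, mask 0x{affinity_mask(cpus):X}")
```

The hex mask is the form that `start /affinity` takes in the Windows command prompt, for example `start /affinity FF game.exe`, where `game.exe` is a placeholder.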

Not really, gaming is just a side show distraction for me. The reason I loved the FX 8320 was that it gave me the ability to game whilst my machine did other useful tasks without either hindering the other if I set processor affinity correctly.

To me, wasteful would be buying an R7 just for a gaming machine that does nothing else. I could see streamers making good use of them, but just for gaming? Nah, I think the 4C/8T R3s will have that covered nicely. But each to their own :-)

You're right, I should have read their article right away instead of watching their video and getting bored by how ill prepared they were to present their findings. If the scheduler indeed refrains from bouncing threads between the CCXes when it doesn't have to, then I'm good. It's just that before this article, with their one test on the subject, I'd read several accounts claiming the opposite. But at least I have a bit more hope now, and I'm eager to get my own Ryzen system up and running in a day or two.

I also plan to play around with Process Lasso a bit. After all, most games still can't use more than 4c/8t and wouldn't necessarily lose much from being limited to that. Or, if given access to five or six cores rather than eight, there just might be a chance to lower the number of threads bouncing between the CCXes at least a bit. It'll be fun testing at any rate, and I'm already familiar with the software anyway.
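As a rough feel for what limiting a game to fewer cores could buy, one can count how many of the possible core pairs straddle the two CCXes. This is a toy model, not a measurement: it assumes cores 0-3 are one CCX and 4-7 the other, and that every pair of cores is equally likely to talk to each other, which real games will not do.

```python
from itertools import combinations

# Assumption: cores 0-3 form CCX 0 and cores 4-7 form CCX 1.
CORES_PER_CCX = 4


def cross_ccx_fraction(cores):
    """Fraction of all core pairs in `cores` that sit on different CCXes."""
    pairs = list(combinations(cores, 2))
    if not pairs:
        return 0.0
    crossing = sum(
        1 for a, b in pairs if a // CORES_PER_CCX != b // CORES_PER_CCX
    )
    return crossing / len(pairs)


if __name__ == "__main__":
    # Allow the game the first n cores and see how many pairs cross over.
    for n in (4, 5, 6, 8):
        share = cross_ccx_fraction(range(n))
        print(f"{n} cores: {share:.0%} of core pairs cross the CCX boundary")
```

Under these assumptions the share goes from 16 of 28 pairs with all eight cores to 8 of 15 with six and 4 of 10 with five, and drops to zero only when everything stays on one CCX. So five or six cores helps "at least a bit", as hoped, but not by a lot.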