The no BS Ryzen Thread: All official information on Ryzen here

Last PCPer podcast had some interesting Ryzen discussion:

(Pay no attention to the man with the console)

Ryan talks a bit about his thing on 1700 overclocking at https://youtu.be/4aEw3e-je9w?t=6m22s

Then at https://youtu.be/4aEw3e-je9w?t=35m23s they get on with more Ryzen stuff. They also talk about XFR (an often misunderstood feature), that Biostar mini-ITX board and a bit about Naples.

Some interesting bits: Josh Walrath notes the difference between production and gaming benches. I think he is on to something about it being a limitation in the Infinity Fabric connection between the CCX:es. Others have written about it:

Explains the fabric the CCX:es live on.

The Data Fabric is responsible for the core’s communication with the memory controller, and more importantly, inter-CCX communication. As previously explained, AMD’s Ryzen is built in modular blocks called CCX’s, each containing four cores and its own bank of L3 cache. An 8 core chip like Ryzen contains two of these. In order for CCX to CCX communication to take place, such as when a core from CCX 0 attempts to access data in the L3 cache of CCX 1, it has to do so through the Data Fabric. Assuming a standard 2667MT/s DDR4 kit, the Data Fabric has a bandwidth of 41.6GB/s in a single direction, or 83.2GB/s when transferring in both directions. This bandwidth has to be shared between both inter-CCX communication, and DRAM access, quickly creating data contention whenever a lot of data is being transferred from CCX to CCX at the same time as reading or writing to and from memory.

Edit: I'm not sure his numbers are right. AMD has specified that the Fabric runs at half the memory speed. I guess it is possible to run RAM at 3200, though.
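
To put rough numbers on that, here is a minimal Python sketch. It assumes the Fabric clock is half the DDR4 transfer rate (as AMD has specified) and that the Fabric moves 32 bytes per clock in each direction. The 32-byte width is an assumption on my part, not a figure confirmed in this thread, and the function names are made up for illustration.

```python
def fabric_clock_mhz(ddr_mt_s: float) -> float:
    """Fabric clock in MHz, assuming it runs at half the DDR4 transfer rate."""
    return ddr_mt_s / 2.0


def fabric_bandwidth_gbs(ddr_mt_s: float, bytes_per_clock: int = 32) -> float:
    """One-direction Fabric bandwidth in GB/s.

    bytes_per_clock=32 is an ASSUMED width, not an official number.
    """
    return fabric_clock_mhz(ddr_mt_s) * 1e6 * bytes_per_clock / 1e9


if __name__ == "__main__":
    for speed in (2133, 2667, 3200):
        one_way = fabric_bandwidth_gbs(speed)
        print(f"DDR4-{speed}: {one_way:.1f} GB/s one way, {2 * one_way:.1f} GB/s both ways")
```

Under these assumptions a 2667MT/s kit works out to roughly 42.7GB/s in one direction, close to but not exactly the 41.6GB/s quoted above, and a 3200MT/s kit to 51.2GB/s. Either way, the Fabric bandwidth scales linearly with memory speed.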

Which takes us to the point:

Memory Scaling

To put things into perspective, communication to and from cores to L3 cache inside the same CCX happens at around 200GB/s. Combine this with the massive difference in latency you would expect from having to go through extra hoops to reach the other CCX, and the communication speed between CCX’s is simply not anywhere near intra-CCX communication speeds. This is why changing the Windows scheduler to keep data from a single thread in the same CCX is so important, as otherwise you incur a performance penalty, as observed in gaming performance tests.

As observed in multiple tests, Ryzen appears to scale quite well with memory speeds, and this knowledge sheds light on why. It’s not necessarily the increased speeds with the DDR4 kits themselves, though it certainly helps. Rather, it’s the increased internal bandwidth for inter-CCX communication, which alleviates some of the performance issues when threads have to communicate between CCX’s.

Due to this, if you’re picking up a Ryzen system, it’s highly recommended to get a decently fast memory kit, as it will help performance more than you would otherwise expect.

So I certainly think Josh Walrath is on the right track. Question is what can be done about it. We're still working out exactly how it works.

Allyn also touches on some oddities he saw when doing storage benches on NVMe drives. With a single thread and high queue depths, Ryzen only reaches about half the maximum IOPS of Intel. This is a bit odd, but not something a desktop user has to care about, as you never hit those maximum queue depths (16-32) on desktop [1]. Still, it is an oddity that might shine some light on specific performance; maybe it scales with memory speed too?

When talking about Naples, Ryan knew that the CPU sockets will talk over 64 lanes of PCIe using Infinity Fabric for the protocol. I hadn't heard that yet, interesting. Charlie over at Semiaccurate has written a short article on it:

I find this interesting, Ryzen might be a more complicated animal than we initially thought. Maybe the big thing will be to get memory as high as possible to get the infinity fabric running as fast as possible?


[1] Allyn has talked about this before: for desktop it is important to test at queue depths of 1-2. Those really high IOPS numbers stated in the specs are almost always only at a QD of 32, i.e. only for server loads; you'll never see this on a desktop. Typical BS marketing numbers.
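
A rough way to see why QD32 spec numbers look so much bigger than what a desktop sees: by Little's law, sustained IOPS is approximately the number of outstanding requests divided by the per-request latency. The sketch below uses a made-up 100 µs latency purely for illustration; it is not a measurement from any drive or platform, and the function name is hypothetical. Real drives also see latency rise with queue depth and hit controller limits, so this is only an upper-bound style estimate.

```python
def iops_from_queue_depth(queue_depth: int, latency_s: float) -> float:
    """Little's law estimate: throughput = requests in flight / latency."""
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return queue_depth / latency_s


if __name__ == "__main__":
    HYPOTHETICAL_LATENCY_S = 100e-6  # illustrative value only, not measured
    for qd in (1, 2, 32):
        estimate = iops_from_queue_depth(qd, HYPOTHETICAL_LATENCY_S)
        print(f"QD{qd}: ~{estimate:,.0f} IOPS")
```

With the same latency, the QD32 figure is simply 32 times the QD1 figure, which is why the spec-sheet number says little about desktop use.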

I found Linus's 1080 Ti review interesting because he put an 1800X platform up against a 7700K at 4K with it.

Ryzen held up ok if you ask me.

The Fury held up ok if you ask me. :P

Oh, btw: Doom benched only on OpenGL .... totally legit.

I stopped watching Linus clips, just because of the click-bait titles. :P

2018 MUST WATCH! I stopped watching Linus clips AMAZING just because of the YOU WON'T BELIEVE IT click-bait titles. :P

OH yeah... 1080Ti... Amazing... performs right where we expected it to and costs more than it's worth. But it's amazing because Nvidya

To be fair, what do you expect?
Honestly, I was surprised they lowered the prices with no competition. HOWEVER... In the last year AMD gained half the market share they already had. A year ago they had 20%, now they have 30%. With no high-end competition to Nvidia. All they have is the 470 and 480, and they still gained market share.
So Nvidia can release a 1090 if they want. The 100 people that will buy it will be grateful. I am grateful that I can find an RX 480 at 210,- Euro...

I've stopped watching most tech channels on reviews but I am very interested in seeing Ryzen tossed into the mix. Linus was the only channel I watched that bothered to. Everyone else is still 7700K only.

Honestly, I would say that is the better approach ATM, especially with the initial bug-fixing period that any new platform has. I would say Ryzen can be put into all kinds of crap in about a couple of months. Probably the R5 release will be the real Ryzen release, since most issues will be fixed and the boards will be stable by that time... It will be just plug and play.

Things are getting interesting.

Same, but whenever something piques my interest I watch multiple ones. Level1Techs is the only one I'm subscribed to, but I follow many tech tubers through Twitter and/or Instagram.

The subscription feed is super messy: I can't get rid of the videos without just removing that subscription. Basically it's easier to follow everything elsewhere and try to keep YouTube tidy by being very selective about what to subscribe to.

~my feelings

Very true. When I plan to buy I bellyache over anything I can find. I decided to go for the RX 480 over the 1060 and now I will have that card for years. I'm deciding on Ryzen or Intel. I'm pretty sure I'll go Ryzen, but I have to wait and see what R5 brings because I will most likely stick with that for years too.

When I'm not buying I am usually just interested in how it works and what companies are doing that makes sense.

Well, at least you can read the Eteknix link I posted ~15 posts above.

Thanks for that. I did miss it.

More info about Ryzen Windows scheduling from PCPer, which via the comments led to another, even more interesting video:

https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet#comment-318806

Ryzen is faster than Intel at thread switching within one CCX and, as the video explains at the end, a 4-core Ryzen R3 will I guess be faster than Intel, but what will the clock speed be? Well, a 7700K equivalent would be one CCX. Interesting times.

Clocks will probably be about the same as now. It looks like there is a very significant process tech limitation for Ryzen on the Samsung 14nm LPP process. Add to this that the lower core count Ryzen parts will be harvested dies. There are no special dies for the six and four core parts as of yet. Also, the CCX:es can't be too imbalanced: you won't get 1+3, rather 2+2. A single CCX could be interesting, but so far it is not something AMD has said they are doing. We will probably get die-harvested APUs much later on (well after the APUs launch) that have a borked GPU but an OK CCX. Not sure that will be better than die-harvested Ryzen though.

Ryzen on Samsung 14nm LPP doesn't seem to clock high at all. The Stilt has a telling diagram in his technical writeup:

The Freq to Vcore ratio is great up to 3.3 GHz (almost linear), after which it gets worse. Then you have a second critical point at 3.5 GHz, and it gets much, much worse after that. 14nm LPP is very efficient up to 3.3 GHz, but it will simply never clock as high as Intel's improved 14nm, as the latter is a far superior process.

High clocks just aren't for Ryzen; it is more interesting to look at low frequencies. The Stilt does that, with very interesting numbers at low clocks: very impressive performance at very low power consumption. Mobile/laptop chips will be good, no doubt. But we won't get any high-clocked desktop chips. This is also not something AMD can do much about, as it looks like it is a process tech limitation. Later Zen versions might have improvements; we'll see when Zen2, Zen3 etc. come around.

Also, isn't low-power performance more important than really high-clocked desktop parts? Laptops, tablets etc. are probably a much bigger market than high-end desktops. AMD seems to be betting on Zen for servers too, and Naples could make a serious dent in Intel's server sales. Performance per watt is important in servers, and so is scaling over many cores. It looks like Naples has all that in spades, or at least it will be interesting again. So many years of only Intel gets boring, lol.

Thanks for linking the PCPer article @Marten , it is a very good one and I do believe Allyn when he writes that it took all day and that the whole team pitched in. Ryan also has added an edit at the top:
https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet

Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout

PSA:

Their findings suggest there isn't any malice in the scheduling or SMT usage. It may be down to the way the architecture handles loads that cross different physical cores and, especially, loads that cross the two core complexes (4 + 4). So let's say a certain load communicates between core 1 and core 2: that would hurt performance. But a load communicating between core 1 and core 7, two cores on either side of the core complexes, would hurt performance further. This kind of load would be found in games today that more closely follow the multithreaded workload of current-gen consoles, especially the Vulkan and DX12 titles.

Furthermore, this could point to why certain games don't lose performance on Ryzen. Instead of giving out workloads that are co-dependent between cores, those games only send specific workloads to specific cores (e.g. AI on core 1, physics on core 2, etc.).
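
As a minimal sketch of the core 1 vs core 7 point, the Python below maps logical CPUs to a CCX and checks whether two threads would have to talk across the Fabric. It assumes an 8-core/16-thread part where logical CPUs are numbered from 0 with SMT siblings adjacent, so logical CPUs 0-7 sit on CCX 0 and 8-15 on CCX 1. That numbering is an assumption and should be checked on the actual system; all names here are hypothetical.

```python
CORES_PER_CCX = 4      # from the article quoted earlier: four cores per CCX
THREADS_PER_CORE = 2   # SMT enabled (assumption)
NUM_CCX = 2            # 8-core Ryzen 7

LOGICAL_PER_CCX = CORES_PER_CCX * THREADS_PER_CORE


def ccx_of_logical_cpu(cpu: int) -> int:
    """Return the CCX index for a 0-based logical CPU under the assumed numbering."""
    if not 0 <= cpu < NUM_CCX * LOGICAL_PER_CCX:
        raise ValueError(f"logical CPU {cpu} out of range")
    return cpu // LOGICAL_PER_CCX


def crosses_ccx(cpu_a: int, cpu_b: int) -> bool:
    """True if two logical CPUs live on different CCXs (traffic goes over the Fabric)."""
    return ccx_of_logical_cpu(cpu_a) != ccx_of_logical_cpu(cpu_b)


def logical_cpus_in_ccx(ccx: int) -> list[int]:
    """All logical CPUs of one CCX, e.g. as a candidate affinity set."""
    if not 0 <= ccx < NUM_CCX:
        raise ValueError(f"CCX {ccx} out of range")
    start = ccx * LOGICAL_PER_CCX
    return list(range(start, start + LOGICAL_PER_CCX))


if __name__ == "__main__":
    print(crosses_ccx(0, 2))    # same CCX under the assumed numbering
    print(crosses_ccx(0, 12))   # different CCXs
    print(logical_cpus_in_ccx(0))
```

A list like `logical_cpus_in_ccx(0)` is the kind of set you would pin a latency-sensitive group of threads to if you wanted to keep them on one CCX, for example via `os.sched_setaffinity` on Linux. Whether that actually helps depends on the workload, since you also give up the other CCX's L3.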

Turns out when AMD says it is a new architecture, it actually is a new architecture.

By the way: this explains why AMD is in contact with all those game developers.
What was that? 3000 dev kits?

If you have Ryzen, you can try that http://developer.download.nvidia.com/SDK/9.5/Samples/DEMOS/Direct3D9/HLSL_Instancing.zip Draw Call benchmark thing they've used in The Stilt's thread over at Anandtech Forums.

One user there, CatMerc, did a video:

Performance drops significantly when threads spread over the CCX:es (the second video shows the expected randomness with thread allocation). Pretty sure this is the problem seen in game benches, just as PCPer thinks. It is probably what is responsible for the weird histogram The TechReport got in GTA5:

This is probably a direct consequence of the inter CCX latency PCPer saw in their testing:

Exactly how the Infinity Fabric works, I don't know. I've seen different numbers on how much data per cycle it can manage, plus all the speculation that it is almost like the HyperTransport of old. I don't think it shares much with the old fabric, as AMD has repeatedly said it is different. Same same but different. When Naples launches I think we'll have more on the spec, and much more hardware to get data points from too.

There is some interesting discussion in Stilt's Ryzen: Strictly technical thread, at about page 19 the PCPer article comes up.

I like that bit at the end, that splitting Ryzen into two NUMA groups probably isn't a good idea. Ryzen simply isn't two NUMA groups as the CCX:es share memory and mem controller. Also there is the question of cache, will your workload benefit from running on two CCX:es to be able to use that extra cache? Adding 8MB of L3 can be substantial depending on workload. It is not an easy fix by a long shot, so I think Ryan is right when he says there is no "silver bullet" fixing this with patching the Windows Scheduler. Ryzen is a design with compromises, this inter CCX communication is part of that.

Makes me interested in how AMD has set up that for the server platform, Naples. 64 PCIe lanes running Infinity Fabric between sockets. I want to see how that works.

Edit: Also, the threaded games are probably the worst-case scenario for Ryzen. When there are several threads that need to communicate, both producers and consumers, the crosstalk over the Fabric increases. Add to this that you have a GPU that wants latency-sensitive information, and there is also a need to stream in textures etc. from disk and memory to the GPU. Lots of load on that Fabric. Compare that to multi-threaded Blender, which is embarrassingly parallel with almost no inter-thread communication and no GPU that wants CPU time RIGHT NOW.

Look at the Firestrike scores: Ryzen does well in the Physics Score as expected, as it can do a parallel workload without much interruption. Then there is the combined test. Those scores are not high, and that is where the Fabric crosstalk comes into play: the CCX:es need to talk to each other and to the GPU at the same time. Same in many other benches that use a GPU in tandem with the CPU. Just like games.