AMD Ryzen: Part 1 The Chip, The Myth, The Legend | Level One Techs

The thing I find a bit disingenuous is that every review I've seen states that if you want gaming, the 7700K is better, and if you want productivity, Ryzen is better. That seems to be the mantra everywhere, even here at Level1Techs.

My gripe with that is that it's not even that clear cut. There are quite a lot of productivity tasks that don't use more cores particularly well, even genuinely high-end professional programs like Lightroom and Photoshop. (And Lightroom, at least, should be able to use multiple cores fairly easily.)

I do hope this will mark a turn and that we will see more effort put into multi-core work now. It's certainly about time. But I think the "productivity uses more cores" angle is a bit over-hyped. If you really want more cores now, I think the best bet might be to get a 1600X and then possibly upgrade later when future versions of Ryzen roll out. At least if you want "bang for your buck".

EDIT: If you are doing something where you know you will have use for more cores, then do that. But if you're not sure, I would find some benchmarks before jumping in if price/performance is important to you.

I agree. Running a single series of benchmarks with the highest-end GPU and then announcing to the world that Ryzen gaming is terrible, even though it is still better than 98% of all computers on earth, doesn't do much for their credibility. Wendell has not taken that sensationalist path, though.

Reporting an observation and investigating further is not the same thing as drawing a conclusion from minimal and selective data. From all the benchmarks, Ryzen does look like the best price/performance productivity machine going at the moment, but that statement doesn't make any judgement about other uses of the PC.

In gaming, absolute frame rates are down compared to the 7700K, but in the scheme of things the 7700K is probably at about 200% of what is required and Ryzen is only at 180%. They both exceed requirements, so it is quite a ridiculous thing to be worried about.

It does have one flaw/issue that could, in the long term, disadvantage the entire family of chips if AMD can't address it without re-engineering the whole architecture: the bottleneck they have created at the memory controller, which is causing the slower gaming performance. If they have some way of increasing the Data Fabric speed on top of increased memory frequency, such as running it at 1:1.75 of the memory frequency instead of 1:2, that would alleviate the bottleneck somewhat, but I don't know whether that is a microcode/UEFI thing or a physical hardware limitation.
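To put a rough number on what that divider change would buy, here is a minimal back-of-the-envelope sketch (my own, assuming the fabric clock is derived from the memory frequency by a fixed divider and that fabric bandwidth scales linearly with that clock):

```cpp
#include <iostream>

int main() {
    // Hypothetical illustration: relative fabric bandwidth if the divider
    // moved from 1:2 to 1:1.75 of the memory frequency, all else being equal.
    const double current_divider  = 2.0;
    const double proposed_divider = 1.75;

    const double relative_gain = current_divider / proposed_divider; // ~1.14x
    std::cout << "Fabric bandwidth would rise by roughly "
              << (relative_gain - 1.0) * 100.0 << "%\n"; // ~14%
}
```

So under that assumption a 1:1.75 divider would be worth roughly 14% more fabric bandwidth at the same memory speed. Helpful, but not a doubling.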

More multi-core development is inevitable, I think. The challenge in getting there is not the tools, but finding developers who have learned to think in terms of parallelism and can write the appropriate code in the first place. Thinking that way is not something anyone really learns by accident.
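As a trivial illustration of the mindset shift, here is a minimal C++ sketch that splits an independent workload across hardware threads with std::async (the work function is made up for the example):

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

// Made-up stand-in for an independently parallelisable unit of work.
std::uint64_t sum_of_squares(std::uint64_t begin, std::uint64_t end) {
    std::uint64_t total = 0;
    for (std::uint64_t i = begin; i < end; ++i) total += i * i;
    return total;
}

int main() {
    const std::uint64_t n = 100'000'000;
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    // Hand each worker an equal slice; the slices share no state, which is
    // what makes this embarrassingly parallel, unlike most game code.
    std::vector<std::future<std::uint64_t>> parts;
    const std::uint64_t chunk = n / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::uint64_t begin = w * chunk;
        const std::uint64_t end = (w + 1 == workers) ? n : begin + chunk;
        parts.push_back(std::async(std::launch::async, sum_of_squares, begin, end));
    }

    std::uint64_t total = 0;
    for (auto& part : parts) total += part.get();
    std::cout << "workers=" << workers << " total=" << total << "\n";
}
```

The hard part in real code is rarely the syntax; it is carving the problem into pieces that genuinely don't share state.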

I certainly agree with researching before jumping in

It is still the age-old problem: software and programs have to be written to take advantage of moar cores. Still a very long road ahead. Maybe now developers will learn to start leveraging moar cores? I think of all these years of playing with moar cores and how little effort has been made in that regard. We should have addressed this a long time ago :(~

@wendell
Where can I get those Ryzen docs you mentioned. Can't find 'em anywhere except that OC doc.

There are a number of whitepapers and non-marketing things out there. I did a quick search and all I got was https://www.google.com/url?sa=t&source=web&rct=j&url=http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf&ved=0ahUKEwj-yNvEuPjSAhUM7iYKHUohAawQFggmMAE&usg=AFQjCNGk-6M7kngZFMaeqeHbj1duvuzvjA&sig2=nlA65s4mYqwcnSQjBG9xvA which is a start


For some reason AMD have never published a diagram that shows how Ryzen is actually architected as a whole. They have only ever published partial diagrams that omit relevant details, such as how the memory and PCIe connect to both core complexes over interconnects that could potentially create a bottleneck.

I decided that it would be interesting to put the jigsaw together and produce a data flow diagram that may help everyone understand how Ryzen's IO works in its entirety.

As demonstrated in a Hardware Unboxed video yesterday, the performance issue is not caused by the inter CCX thread switching as was widely believed.

This diagram does suggest to me that the gaming slowdown is caused by a bottleneck at the 32 Byte/cycle (18.75GB/s with 2666MHz memory) interconnect to the memory controllers, as it tries to service traffic from the combined 96 Byte/cycle bandwidth of both CCX modules and the PCIe controller when they are all running under load, like when you are gaming.
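To make that contention argument concrete, here is a minimal sketch of the accounting in the per-cycle units used above (the per-consumer link widths are taken from the post and treated as assumptions):

```cpp
#include <iostream>

int main() {
    // Assumed per-cycle link widths into the Data Fabric, as described above.
    const int ccx_link_bytes  = 32; // each CCX module
    const int pcie_link_bytes = 32; // PCIe/IO
    const int mem_link_bytes  = 32; // fabric to the memory controllers

    const int peak_demand = 2 * ccx_link_bytes + pcie_link_bytes;         // 96 B/cycle
    const double oversubscription = double(peak_demand) / mem_link_bytes; // 3x

    std::cout << "Peak demand on the memory-side link: " << peak_demand
              << " B/cycle against " << mem_link_bytes << " B/cycle available ("
              << oversubscription << "x oversubscribed under full load)\n";
}
```

Under those assumptions the memory-side link can be asked for three times what it can deliver, which is exactly the kind of bottleneck that only shows up when CPU and GPU traffic peak together.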


Yes, the inter-CCX latency can be a problem for latency-sensitive things, sure, but the Fabric is probably the main bottleneck of the design. Josh Walrath over at PCPer said this weeks ago on their podcast. He's probably right.

The inter-CCX latency is in the region of an extra 60 nanoseconds for 12.5% of all thread switches, which each happen many milliseconds apart. The impact on performance is less than 1%, not the 20% that the 7700K comparison benchmarks indicate. There is a 5% performance boost simply from enabling Message Signalled Interrupts and reducing DPC latency.
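Here is the back-of-the-envelope version of that claim as a minimal sketch (the 60 ns penalty and the 12.5% cross-CCX share are the figures above; the 1 ms gap between switches is my own conservative stand-in for "many milliseconds"):

```cpp
#include <iostream>

int main() {
    const double penalty_s       = 60e-9; // extra latency per cross-CCX switch
    const double cross_ccx_share = 0.125; // fraction of switches that cross CCXs
    const double switch_gap_s    = 1e-3;  // assumed time between thread switches

    // Time lost to the penalty as a fraction of wall-clock time.
    const double overhead = penalty_s * cross_ccx_share / switch_gap_s;
    std::cout << "Overhead of roughly " << overhead * 100.0
              << "% of runtime\n"; // ~0.00075%, well under the 1% ceiling
}
```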

By the way, I was the one who told Josh Walrath the week before he mentioned it on the PCPer podcast. I did notice that he didn't bother to mention where the idea came from.

Given a proper look at the benchmark data, it should never have been concluded that CCX thread switching was the primary cause of the bottleneck. That conclusion was drawn from poor initial research, a resulting lack of understanding of what was actually being tested, shoddy and incomplete test methodologies, and more assumptions than it was safe to build a conclusion on.


That people on forums aren't smarter than actual engineers isn't really shocking, is it? :)

I agree that the latency probably isn't a huge problem, and it certainly isn't well understood.

That inter-CCX latency was something Allyn noticed while testing the theory that the scheduler was somehow at fault. Then people ran with it, of course, making all kinds of claims. There are still plenty of conspiracy theories about the Windows scheduler, the Linux scheduler, etc.

I don't find it weird at all that the 7700K is ahead; it has better IPC and a higher clock, after all. Typical gaming loads don't scale much, and there are only a handful of games that scale at all beyond three or four cores. Games are not embarrassingly parallel and probably never will be. IPC is still very important. The Fabric being a bit of a bottleneck is just icing on the cake, making Ryzen scale a bit worse in games.

There are rumours of the Fabric running at higher clocks in Naples and/or Zen 2. I guess we will know when Naples has properly launched.


Very true. I'd like to know, though, what programs like AIDA64 are doing under the hood with respect to memory latency. Testing myself in C++ on Linux, a one-way trip was on the order of 30ns. Round trip I'd expect to be double that, but AIDA64 and several other programs were reporting 100ns+.
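For reference, one common way to measure this kind of latency (not necessarily what was used above, and certainly not what AIDA64 does internally) is a dependent pointer chase through a buffer much larger than the caches, so every load has to wait for the previous one to return from DRAM. A minimal sketch:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Buffer well beyond L3 so most hops miss cache (assumes an L3 of ~16 MB).
    const std::size_t n = std::size_t{1} << 25; // 32M entries * 8 B = 256 MB
    std::vector<std::size_t> next(n);

    // Build a single random cycle through the buffer so hardware prefetchers
    // cannot predict the next address.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    // Chase the chain: every load depends on the result of the previous one.
    const std::size_t hops = 20'000'000;
    std::size_t p = order[0];
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < hops; ++i) p = next[p];
    const auto t1 = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::cout << "avg dependent-load latency: " << ns << " ns (sink=" << p << ")\n";
}
```

What this reports is load-to-use latency for a cache-missing read, which is effectively a round trip, so it typically lands much nearer the 100ns+ the GUI tools print than a 30ns one-way estimate.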


AIDA released a new(?) version today:
https://www.aida64.com/news/aida64-v590-amd-ryzen-benchmarks-latency-cache-speed

Optimized 64-bit benchmarks for AMD Ryzen “Summit Ridge” processors

AIDA64 CPUID Panel, Cache & Memory Benchmark panel, GPGPU Benchmark panel, System Stability Test, and all cache, memory and processor benchmarks are fully optimized for AMD Ryzen “Summit Ridge” high-performance desktop processors, utilizing AVX2, FMA3, AES-NI and SHA instructions. Detailed chipset information for AMD Ryzen “Summit Ridge” integrated memory controller. Preliminary support for AMD Zen server and workstation processors.

I like that last bit most I think :-D


I guess I cheated. 27 years of computer system architecture, engineering, and problem-solving experience maybe gave me an unfair advantage ;-)

The CCX arrangement does show an observable performance degradation under heavy CPU+GPU load. The switching transport relies on the Data Fabric to move between modules, and if there is contention for the DF, whatever needs to use it has to wait until a slot opens up. CCX thread switches are just one of the many things using the DF that all have to wait for available slots when the Fabric is fully loaded. Even so, the degradation from delaying a thread switch is still only in the region of 60-100 nanoseconds per switch. Putting it in context, access time to an SSD is in the region of 100,000 nanoseconds and access time to a spinning-rust disk is about 8,000,000 nanoseconds.

The problem that I have with almost everything related to Ryzen coverage (Wendell aside, who has reported his observations and reserved judgement so far) is the widespread suspension of rational, logical thought and the wholesale, unthinking use of cookie-cutter process. "CCX thread switching is the cause" and "a fast GPU only stresses the CPU cores" have been applied without much thought or understanding of what is actually being tested.

Here is another example of what I am talking about: the "Linux/Windows scheduler" issue that you mentioned.

We discovered that Geekbench multicore benchmarks on Linux get better scores on Ryzen than the equivalent Windows Geekbench runs. We hear "That's evidence of a Windows scheduler problem on Ryzen!!!".
See these Ryzen results sorted by multicore scores.
https://browser.primatelabs.com/v4/cpu/search?dir=desc&q=ryzen&sort=multicore_score

But wait a bit.....

It took me five seconds to type 6900K into the search bar, hit enter, and then sort by multicore scores.
https://browser.primatelabs.com/v4/cpu/search?dir=desc&q=6900k&sort=multicore_score

oops....

How the hell did something so easy to disprove get so much traction that it became a "fact" used as unquestioned evidence for a Windows scheduler problem on Ryzen?

What is it about Ryzen that has turned almost everyone into Chicken Little? Or am I getting it wrong? My sore head and the acorn on the ground really are proof that the sky is falling.

AMD's communication on how Ryzen and the data fabric actually work, and what the bandwidth potential is, has certainly been incomplete, and even contradictory in some instances. Publishing diagrams that only show partial elements in isolation is guaranteed to cause confusion, as the vast majority of people looking at them can't piece the puzzle together for themselves and rely on the tech media "experts" to tell them. Maybe Jim Keller finished up this time and no-one bothered to ask him how the chip worked.

I suspect that the confusion about the fabric comes from this article on Fudzilla in 2015.

This diagram in particular certainly doesn't match up with Ryzen.


I am pretty certain that the AIDA tests include a large element of on-die cache in the bandwidth results.

If that benchmark is truly benchmarking the actual DDR4 RAM, I can't work out how it can report a 40GB/s memory write speed when the memory controller, with 3200MHz RAM installed, only connects to the fabric at 25GB/s, assuming the AMD Ryzen presentation materials are to be believed.
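One quick way to see how much cache can inflate a "memory" bandwidth figure is to time the same write test over a working set that fits in L3 and one that doesn't. A minimal sketch (buffer sizes and pass counts are arbitrary choices for illustration):

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

// Time repeated writes over a buffer of the given size and return GB/s.
double write_gbs(std::size_t bytes, int passes) {
    std::vector<unsigned char> buf(bytes, 1);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i)
        std::memset(buf.data(), i & 0xff, bytes);
    const auto t1 = std::chrono::steady_clock::now();
    // Touch the buffer afterwards so the stores cannot be optimised away.
    volatile unsigned char sink = buf[bytes / 2];
    (void)sink;
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    return double(bytes) * passes / secs / 1e9;
}

int main() {
    // A working set inside L3 mostly measures cache write bandwidth;
    // one far larger than L3 is forced out to DRAM. (L3 assumed ~16 MB.)
    std::cout << "  8 MB buffer: " << write_gbs(8u << 20, 200) << " GB/s\n";
    std::cout << "512 MB buffer: " << write_gbs(512u << 20, 4) << " GB/s\n";
}
```

If the small-buffer number comes out far higher than the large-buffer one, the benchmark is at least partly measuring the cache hierarchy rather than the DRAM path.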

60ns latency sounds more in line with what 3200MHz memory should be. Have you tried the SiSoftware Sandra memory benchmarks to see how they compare?

Maybe there are some patent applications from AMD on Infinity Fabric? It is proprietary, after all. I think Intel has some patents on how to run QPI over PCIe.

A data fabric is a common design element of system-on-a-chip designs. There may be patents related to elements of the overall design, but conceptually it is not that different from an office LAN where everything connects through a centralised switch. If you want to grow the network, just run another cable and plug it into the switch.

QPI, on the other hand, does a similar job but is more like connecting devices with point-to-point crossover cables instead of putting a switch in the middle, meaning that the next chip design, while it may reuse elements of the preceding design, has to be laid out again from scratch and connected up with its own QPI links.

There is nothing intrinsically wrong with either approach. QPI, because it is point to point, doesn't have any competition for the connection. As long as the pipe is big enough to support that single need, away you go.

With the AMD approach, though, you need to take component usage and traffic into account, much like a gigabit network where everyone wants to read and write large files to a storage server. If the connection to the server is also 1Gb/s, performance is impacted the moment more than one person tries to transfer files to and from the storage server simultaneously, because they each only get half the bandwidth. To address that, connect the storage server with a 10Gb Ethernet connection and then, assuming the disks in the storage server can read fast enough, the network can support 10 people simultaneously transferring files before seeing any degradation in performance.

If the AMD marketing materials are correct, it looks like the interconnect to the memory controller has been under-specified, just like the 1Gb/s-connected storage server in my example above. It certainly explains why, in some games, SMT on ends up slower than SMT off: 16 threads clog the memory interconnect more than 8 threads do, leading to more contention and slower performance, as the cores' memory accesses and the GPU's memory accesses have to queue behind the requests ahead of them. This stuff is all pretty obvious if you are dealing with it all the time. I don't understand why AMD made the design decision they did and didn't add a second interconnect to the memory.
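The storage-server analogy boils down to a fair-share model that also illustrates the SMT-on versus SMT-off point. A minimal sketch (the 25GB/s link figure is the one quoted earlier in the thread, and the requester counts are illustrative, not measured):

```cpp
#include <iostream>

int main() {
    // Toy model: a shared link's fair share per requester, ignoring queueing.
    const auto per_requester = [](double link_gbs, int requesters) {
        return link_gbs / requesters;
    };

    const double mem_link_gbs = 25.0; // assumed fabric-to-memory-controller link

    // GPU traffic plus 8 vs 16 busy threads all wanting DRAM at once.
    std::cout << "SMT off: ~" << per_requester(mem_link_gbs, 1 + 8)
              << " GB/s per requester\n";
    std::cout << "SMT on:  ~" << per_requester(mem_link_gbs, 1 + 16)
              << " GB/s per requester\n";
}
```

The absolute numbers mean little, but doubling the thread count roughly halves each requester's share of the same fixed link, which is the mechanism described above.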


mmm interesting.