PCPer talks about their investigation into odd Ryzen performance benchmarks, where they found some interesting information.

TL;DW: They checked the Windows scheduler and core numbering to make sure it wasn't mis-numbering the cores and thereby causing memory-access latency issues, or mis-handling threads and causing performance degradation.

They found it did not do that, but when the set of 4 cores on one CCX has to communicate with the 4 cores on the other CCX, an extra 100 ns of latency is added to the average of about 40 ns. Windows has no way to know that, and it's not intelligent enough to see that it is a recurring issue and work around it.
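If you want to poke at this yourself, here's a rough Python sketch of the same ping-pong idea. Python's IPC overhead means you'll see microseconds rather than the raw nanoseconds, and the CPU numbering below (logical CPUs 0-7 on one CCX, 8-15 on the other) is an assumption to verify on your own box, but the relative gap between same-CCX and cross-CCX pairs should still be visible:

```python
# Rough core-to-core "ping-pong" latency sketch (needs: pip install psutil).
import time
from multiprocessing import Process, Pipe
import psutil

def pong(conn, cpu):
    psutil.Process().cpu_affinity([cpu])  # pin the child to one logical CPU
    while True:
        msg = conn.recv()
        if msg is None:
            break
        conn.send(msg)  # bounce the message straight back

def ping(cpu_a, cpu_b, iters=10_000):
    parent, child = Pipe()
    worker = Process(target=pong, args=(child, cpu_b))
    worker.start()
    psutil.Process().cpu_affinity([cpu_a])  # pin ourselves too
    start = time.perf_counter()
    for _ in range(iters):
        parent.send(1)
        parent.recv()
    elapsed = time.perf_counter() - start
    parent.send(None)  # tell the child to exit
    worker.join()
    return elapsed / iters * 1e6  # average round trip, microseconds

if __name__ == "__main__":
    print("same CCX :", ping(0, 2), "us")  # two cores on CCX0 (assumed)
    print("cross CCX:", ping(0, 8), "us")  # CCX0 to CCX1 (assumed)
```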

Personally, I found those results weird. In most gaming benchmarks (BF1 in particular was mentioned), Ryzen tends to have higher FPS lows than the Intel equivalent. You'd think having some added memory latency between CCXes would make those worse, not better.

3 Likes

There seems to be a lot to all of this, and some chance involved as well.
Draw-call testing of the CCX communication penalty:

3 Likes

As the game-dev industry is incredibly standards-allergic, it's entirely possible that there are games (like BF1, ostensibly) that don't stress this limitation of the Windows scheduler, and others that do.

We're going to see a lot of optimization on commercial platforms and engines as time goes on; open source will probably be first to market with proper utilization, though.

4 Likes

Basically, in ten thousand words, they said something Wendell said about a week ago...
It's an issue of internal communication between the CCXes. Thread X is buffered in the cache of CCX1 while it needs to be processed by CCX2, so it needs to be moved, and there you have a slowdown, because the rest of the system needs this thread executed before it can continue with anything else. It all depends on how interconnected the threads are with each other: the higher their dependencies on each other, the larger the performance hit can be. It is also why some reviews show a performance difference of 40% while others show barely 10%...
Heavy workloads don't show that issue, because there is always an independent thread that can be executed while another one is moved from CCX cache 1 to CCX cache 2.
It is just a design flaw. Wendell also gave the solution... make Windows see every CCX as a different CPU. That way all threads will be buffered in the correct cache pool.
That is why I also believe the quad-core Ryzens will be the true gaming processors. There will be no issues like that, and the gaming performance will be a bit higher.

so basically
- memory actually matters on Ryzen
- the scheduler was not messing with things, contrary to popular idiocy
- ..........
- Win10 is strange but should be getting updates to make things better soon

Yup... Helps moving data from CCX to CCX ... for some reason...

2 Likes

PCPer is still the king when it comes to detailed testing. They're not just farting from their mouths like many others; they tried their best (and succeeded) to get to the heart of the question.

All of this data is really, really interesting. It looks like they achieved this kind of CPU the way Intel did with their first quad-core CPUs, bonding two dual-cores together. I'd guess that this kind of testing would show similar results on a Q6600.
I wish more reviewers would see this video and redo all the testing with just half the cores enabled, to see if the IPC of these CPUs is at the same level as Intel's (my bet is that it is).

not sure that you can disable cores yet in most UEFIs

So after hearing what I wanted to hear about memory, I tried to find DDR4 memory-speed results for gaming, and these numbers are a bit odd.

Two 980 Tis running games at 1440p, with memory tested from 2133 to 4000 MHz, show Witcher 3 averages rising from 83 to 102 fps.


The last page has a final Witcher 3 test with just one 980 Ti, where the improvement is just +1 fps with each memory-speed jump: 58 to 62 fps, with minimums staying at 44 fps.

Bonus single-980 Ti oddity: Witcher 3 720p results rise similarly to that 1440p dual-card test, while 4K shows no improvement.

These tests don't really show at what point fps starts to rise, other than that it happens when the game is just easier to run, which probably also depends on the game. So for now, at least, I'm guessing that, as Fallout 4 shows improvement too, any setup that runs a game past 60 fps also benefits from faster memory.

Simple enough. I wonder if DDR4 speed also kicks in when lowering game settings? For example, with Andromeda squeezing out 40 fps on Ultra, lowering it to High might let it run at 72 fps, as it's going past 60 fps. :D

Waiting on a 3466 MHz Flare X kit,
~ #yolo

1 Like

I don't own an AM4 motherboard or a new CPU, so I can't test it.

Also, now I'm wondering if the memory controller, PCIe lanes, and CCXes share the same bus; that's probably why high-frequency memory can be an issue for this CPU, and why higher-clocked memory gives it a little more oomph.

All this considered, I think AMD did a really good job with these CPUs this time.

The issue currently is really quite simple. Ryzen 7 is essentially like two CPUs, with 4 cores on each block connected to their own 8 MB of L3 cache, and these two blocks are then in turn interconnected via the 'Infinity Fabric'.

Essentially it is like a dual-socket Xeon system in one chip: a NUMA (non-uniform memory access) architecture. Windows has support for this, but because these CPUs are inside one chip, Windows treats it as an SMP (symmetric multiprocessing) architecture instead of a NUMA one. This is where the problem begins. It means that Windows does not know to treat the L3 cache as essentially two different caches, leading to cache misses.
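As an aside, you can ask Windows how many NUMA nodes it exposes; a tiny Windows-only ctypes sketch (on Ryzen today this should print 1, i.e. Windows sees a single SMP pool rather than two CCX nodes):

```python
# Query how many NUMA nodes Windows reports (Windows-only).
import ctypes

highest = ctypes.c_ulong(0)
if ctypes.windll.kernel32.GetNumaHighestNodeNumber(ctypes.byref(highest)):
    print("NUMA nodes visible to Windows:", highest.value + 1)
```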

Once Windows gets patched to use a NUMA scheduler for Ryzen, it will get a lot better. Once games get updated to take advantage of NUMA architectures and 8 cores effectively, it will be a whole different story. Currently, it is possible to get great performance by setting process affinity on Ryzen to only the 4 cores of one CCX (Core Complex), effectively cutting the cache in half but avoiding cache misses.
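A minimal sketch of that affinity workaround, assuming logical CPUs 0-7 are the first CCX (four cores plus their SMT siblings; verify the mapping on your own system):

```python
# Clamp an already-running process to one CCX (needs: pip install psutil).
import psutil

CCX0 = list(range(8))  # logical CPUs 0-7, assumed to be one CCX

def pin_to_ccx0(pid):
    p = psutil.Process(pid)
    p.cpu_affinity(CCX0)  # keep all of its threads on one L3 cache
    print(p.name(), "now limited to CPUs", p.cpu_affinity())

# e.g. pin_to_ccx0(1234), where 1234 is the game's PID from Task Manager
```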

This video brings some perspective on what is happening:

2 Likes

I don't have Ryzen so I can't test, but a simple fix for this is to just set processor affinity for certain programs.

I used to do this a lot when I first had an AMD FX processor and I was playing games whilst converting video.

I believe the same methods/commands work for Win 7/8/10.

The pain with this, if I'm understanding it right, is that you have to manually configure it every time you start an application.

Is there a way to script this so that a one-time setup will result in it always loading onto the same cores and running the same way?

If so, the way I imagine it in my head, cores 1 and 2 (4 threads collectively) could be set to the OS and low-level tasks like browsing and the general running of things in the background.

That leaves 6 cores and 12 threads to assign as you like, so another 1 or 2 cores and their respective threads could be held in reserve for heavier-duty tasks that need their own CPU time, like, say, running OBS to stream games. So now you have one full CCX doing its thing, no messing around.

So the whole other CCX and its 4 cores / 8 threads can be set to just deal with games, and that's all they will do.

This sounds like it would solve all the problems and get close to maximum usage out of the CPU, or at least make it so that at any time, firing up some extra stuff here and there has its own space and can be launched without worrying about a drop in performance.

Yes, the guide I linked to shows how you can create a shortcut to a program so it always launches with processor affinity configured. I found this very effective when I wanted to play games while my FX 8320 assigned four cores to video conversion and four to a game. The FX series also benefited because each pair of integer cores shared cache and an FPU, so it was best to keep certain processes together.
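If you'd rather script it than click through shortcuts, something like this hypothetical launcher does the one-time setup (the path and core list are placeholders; cmd's built-in `start /affinity <hexmask>` achieves the same thing):

```python
# Launch a program and immediately clamp it to chosen cores
# (needs: pip install psutil).
import subprocess
import psutil

GAME = r"C:\Games\game.exe"     # placeholder path
GAME_CPUS = list(range(8, 16))  # assumed second CCX, per the plan above

proc = subprocess.Popen([GAME])
psutil.Process(proc.pid).cpu_affinity(GAME_CPUS)
print("launched", GAME, "on CPUs", GAME_CPUS)
```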

There is software out there for Windows that does this for you. (I have never used it though):

Edit: Interesting... (It shows Process Lasso in use)

2 Likes

That was on an Intel system, not Ryzen. There has yet to be similar testing, but so far it looks like Ryzen will see a significant performance gain from faster memory.
How big that difference is... is yet to be discovered, but still.

1 Like

The clock for the Infinity Fabric, aka the Data Fabric connecting the two CCXes and everything else, runs at half the effective RAM clock. Faster RAM = faster communication between the CCXes and pretty much everything else. The problem is that even 3200 MHz RAM seems to be a bit of a challenge to get working well on Ryzen. This might have to do with clocking the IF too high, but I don't know. We'll see down the line. Maybe AMD will give us a separate multiplier for the Fabric; it could be possible, but it could also not be. AMD has been very silent on the specifics of the Data Fabric they call Infinity Fabric. It is proprietary and secret.

There was a leaked slide on hardware.fr:

The Fabric operates at half the effective memory clock (2400 RAM would mean 1200 MHz). I've seen this info in other places as well, but as I said, seekreet. Hardware.fr also states that AMD told them the bandwidth between the CCXes was 22 GB/s. Some have taken this as proof that the Data Fabric is only 22 GB/s. That is not necessarily the case, though. It could mean you get that one-way bandwidth at the lowest standard JEDEC DDR4 speed, i.e. 1600 MHz. Most analyses I've seen think the Fabric operates in full duplex, which seems plausible from that slide. We also have from that slide the figure of 32 B/cycle. Say you have 2400 MHz RAM: that would mean 32 B * 1200 MHz, or about 38 GB/s (there is probably overhead). This is then used for inter-CCX traffic, all memory traffic, and I/O.
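To make that arithmetic explicit, here's a quick sketch reproducing those figures (raw link rate only; real-world overhead would shave some off):

```python
# Back-of-envelope Infinity Fabric bandwidth from the leaked slide's figures:
# fabric clock = half the effective DRAM rate, 32 bytes moved per fabric cycle.
def fabric_bandwidth_gbs(ddr_effective_mhz, bytes_per_cycle=32):
    fabric_mhz = ddr_effective_mhz / 2               # e.g. DDR4-2400 -> 1200 MHz
    return fabric_mhz * 1e6 * bytes_per_cycle / 1e9  # one-way GB/s

for speed in (1600, 2400, 3200):
    print(f"DDR4-{speed}: {fabric_bandwidth_gbs(speed):.1f} GB/s")
# DDR4-1600: 25.6 GB/s, DDR4-2400: 38.4 GB/s, DDR4-3200: 51.2 GB/s
```

With overhead, the 25.6 GB/s at DDR4-1600 lands in the neighborhood of that quoted 22 GB/s figure.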

Charlie over at SemiAccurate had an early article on IF:

On the surface it sounds like AMD has a new fabric to replace Hypertransport but that isn’t quite accurate. Infinity Fabric is not a single thing, it is a collection of busses, protocols, controllers, and all the rest of the bits. Infinity Fabric (IF) is based on Coherent Hypertransport “plus enhancements”, at a briefing one engineer referred to it as Hypertransport+ more than once. Think of CHT+ as the protocol that IF talks as a start.

He goes on to talk about how IF is split between Control and Data: the control side handles power management, security, reset/initialization, and test functions. It is probably fairly low bandwidth but also low latency, and if it is scalable, the bandwidth can be scaled up. Latency will be important for Naples, with IF connecting the sockets for cache coherency etc. AMD has also said that IF will be in GPUs like Vega.

Also, we don't know if IF is point-to-point, a ring bus, or something else. It looks like Vega is using a mesh topology for IF, for example. AMD seems proud of the granularity and the scalability. Vega could have, say, thousands of IF controllers on chip for that granularity.

It could be the case that Ryzen has an Infinity Fabric-lite or something. The full-blown thing will be in Naples though, so I'm eagerly waiting for that to release to get more details.

1 Like

It seems Charlie posts in the RWT forums; he added this insight the other day:

Two things stand out. First is that the old Hypertransport was both hardware and protocol. The new IF is physical-layer agnostic, which should provide a lot more flexibility. It works on-die, between MCM chips, and between sockets. I don't know if it also can work for inter-system or inter-rack comms, I will ask if I get in front of the right people.

This is fairly interesting; the works-on-any-hardware part definitely feels like a good thing. Inter-system would be very interesting, like InfiniBand etc. If it works, that is.

The other thing is that IF is far more granular than HT ever was in that it isn't a chips to chip protocol or even a core to core protocol. From what I gathered from my chats with AMD personnel, there are multiple IF endpoints on every die, with multiple being a large number, not single digits. The idea is to both transport data between blocks and to have a separate control fabric as well. How this is exactly laid out and controlled, much less exact capabilities, hasn't been revealed yet.

Double-digit endpoint counts on every die. No wonder Naples looks good in AMD's own benches. This could mean that MCM solutions like a Zen CCX and a Vega GPU could be interconnected at the block level, i.e. the Vega GCN cores could be addressed directly by the CPU, etc. Getting more interesting :)

It was strongly hinted at that a block can target another block directly for a transfer. My educated guess is for HSA type workloads and pointer passing, a CPU core can pass data directly to a shader on an APU that needs it for the next instruction. I may be very wrong on this, but I suspect this is the long term goal of the system.

I am trying to find out more but getting anything more than bullet points is tough at the moment.

Interesting times indeed. Maybe the really interesting functions aren't there yet but are planned. We'll see how far down the rabbit hole AMD has gone, I guess. Charlie is one of my favorite tech journalists. Anyone who dresses up in a bunny suit to go to IDF and ask the Intel managers questions can't be bad, now can he? lol

1 Like

This sounds VERY interesting. AMD taking a completely unorthodox approach to things is not unusual for them, being the underdog.

1 Like


1700 & GTX 1080

Seems like my Shit Senses have once again pointed me to the right places.

2 Likes

So at 1080p it matters (and I would guess at lower resolutions than that as well).
But at anything more than 1080p, it doesn't matter at all.
K. Cool.

2 Likes