Why Ryzen's performance is so different: Cache and Compute Complexes

This video gives some perspective on what is happening:

TLDW

The issue currently is really quite simple. Ryzen 7 is essentially like two CPUs, each a block of 4 cores connected to its own 8 MB of L3 cache, and these two 'Compute Complexes' (CCXes) are in turn interconnected via the 'Infinity Fabric'.

The Die:

Essentially it is like a dual-socket Xeon system in one chip: a NUMA (non-uniform memory access) architecture. Windows has support for this, but because these CPUs are inside one chip, Windows is treating it as a UMA (uniform memory access) SMP (symmetric multiprocessing) architecture instead of a NUMA one. This is where the problem begins. It means that Windows does not know to treat the L3 cache as essentially two different caches, so threads get moved between CCXes and lose their cached data, leading to cache misses.

Once Windows gets patched to use a NUMA scheduler for Ryzen it will get a lot better. Once games get updated to take advantage of NUMA architectures and 8 cores effectively, it will be a whole different story. Currently it should be possible to get great performance by setting process affinity on Ryzen to only the 4 cores of one CCX, effectively cutting the cache in half but avoiding cross-CCX cache misses.
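
For anyone who wants to try the affinity trick, here is a minimal sketch of how the mask for one CCX could be worked out. It assumes (and this is only an assumption, check your own system) that logical CPUs 0-7 are the 4 cores / 8 threads of the first CCX and 8-15 are the second. The function names are made up for illustration.

```python
def ccx_logical_cpus(ccx_index, cores_per_ccx=4, threads_per_core=2):
    """Logical CPU numbers belonging to one CCX.

    ASSUMPTION: logical CPUs are numbered CCX by CCX, with SMT siblings
    adjacent (0-7 = CCX0, 8-15 = CCX1). Verify this on your own system.
    """
    per_ccx = cores_per_ccx * threads_per_core
    start = ccx_index * per_ccx
    return list(range(start, start + per_ccx))


def affinity_mask(cpus):
    """Bitmask with one bit set per logical CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    for ccx in (0, 1):
        cpus = ccx_logical_cpus(ccx)
        print(f"CCX{ccx}: logical CPUs {cpus}, mask 0x{affinity_mask(cpus):X}")
```

Under that numbering assumption, the hex mask is what you would hand to `start /affinity <mask> game.exe` in a Windows command prompt (`game.exe` is a placeholder), and on Linux the CPU list could be passed to `os.sched_setaffinity(0, cpus)`.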

Benchmarks

Thanks to roybotnik for testing this

Essentially the scheduler is assigning threads randomly, so depending on which threads/cores your application happens to get launched on, the performance may vary dramatically from run to run.
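
To put a rough number on "randomly", here is a toy calculation. It is not how the Windows scheduler actually works; it just assumes each thread of an application lands on a distinct logical CPU picked uniformly at random, with 2 CCXes of 8 logical CPUs each, and asks how often all the threads end up on the same CCX.

```python
from fractions import Fraction
from math import comb


def same_ccx_probability(threads, ccx_count=2, cpus_per_ccx=8):
    """Chance that `threads` threads all land on one CCX.

    TOY MODEL (assumption): threads are placed on distinct logical CPUs
    chosen uniformly at random; real schedulers are not this simple.
    """
    total = comb(ccx_count * cpus_per_ccx, threads)   # all placements
    same = ccx_count * comb(cpus_per_ccx, threads)    # placements inside one CCX
    return Fraction(same, total)


if __name__ == "__main__":
    for n in (2, 4):
        p = same_ccx_probability(n)
        print(f"{n} threads on one CCX: {p} (about {float(p):.1%})")
```

In this toy model two threads share a CCX a bit under half the time (7/15), and four threads all share one only 1 time in 13, which fits with results swinging from run to run.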

TLDR: This likely means that ALL benchmarks done up to now are next to useless.

What this finally means is that Windows should just prioritise keeping threads running on the same CCX as much as it can. Games also have to keep threads that communicate with each other together on one CCX as much as they can, and only split off threads that don't need to talk to each other. This will lead to a substantial increase in performance by avoiding communication across the much slower path to the other CCX's L3 over the Infinity Fabric.
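
A rough sketch of what that grouping could look like on the application side. This is just an illustration, not anything AMD or Microsoft has published: the group names are hypothetical, and the 2 CCXes x 8 logical CPUs layout is an assumption. Each group of threads that talk to each other is kept whole on a single CCX.

```python
def assign_groups_to_ccx(groups, ccx_count=2, slots_per_ccx=8):
    """Place each group of communicating threads on a single CCX.

    groups: dict mapping a (hypothetical) group name to its thread count.
    Greedy sketch: biggest groups first, each into the CCX with the most
    free logical CPUs. Groups are never split across CCXes.
    """
    free = [slots_per_ccx] * ccx_count
    placement = {}
    for name, size in sorted(groups.items(), key=lambda kv: -kv[1]):
        ccx = max(range(ccx_count), key=lambda i: free[i])
        if size > free[ccx]:
            raise ValueError(f"group {name!r} does not fit on one CCX")
        free[ccx] -= size
        placement[name] = ccx
    return placement


if __name__ == "__main__":
    demo = {"render": 4, "physics": 3, "audio": 2, "io": 2}
    print(assign_groups_to_ccx(demo))
```

The sketch simply refuses a group that would not fit on one CCX. A real game would have to decide whether splitting that group is worth the cross-CCX traffic.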

The latest related videos from NerdTechGasm make a lot of sense, and if he's right then the scheduling should be easy to improve, as Microsoft should already have the needed type of core scheduler ready-made for pre-existing system designs. Anyhow, I expect they're already testing/validating suggested code from AMD, so all we ought to do is wait a little.

I don't think it will make a world of difference overall, though, but it could be significant enough in some of the cases where Ryzen so far performs oddly.

I instantly subscribed to NerdTechGasm for these no-nonsense videos.

It will make a big enough difference.

The CPU is getting something like 80% cache misses, and fixing that would be like a speed bump of 500 MHz given the amount of cache the CPU has.

Probably worth asking one of the mods to combine threads...

I think a lot of this is also now in the long-running Ryzen no-BS thread.

Now I am wondering. This random thread assignment problem should get sorted out, which is fine.

But the R5s are 6 core 12 thread: cut-down R7s. I wonder if it will lead to the prices being all over the place depending on which cores were cut from which CCX.

For games it might suck ending up with a 3+3 cut-down chip, as then you will have a CCX with only 3 cores / 6 threads, and some games are making use of 4 cores now. It could be advantageous to get a 2+4 style R5, if such a thing is going to exist, as then you can have 2c/4t for whatever OS-level stuff and a separate 4c/8t for the games to run on.

I wonder how these are going to break down on release, and whether they cut the chips before or after going on the fabric interposer. If they cut them before, they might end up all being 3+3. But if after, they might salvage 2+4, as that is still a reasonable layout for the CPU.

If they do end up mixing them, I wonder whether they will bin the chips after cutting the dies and sell the low-end R5s as the 3+3 models, knowing it could cause problems with games that might later want a full 4 cores, because sharing across CCXes will hurt performance. Or they could be divided up at random, so a 6 core is a 6 core no matter what. That could lead to buyers finding out the layouts themselves and reselling, with better information, to people who want them for specific tasks that would benefit from different core groupings.

It also might have an impact on XFR and OCs, as Ryzen only turbos one or two cores. A 2+4 might get better XFR boosts, as one half of the chip will run cooler with only 2 cores in it and can thus push them harder, whereas a 3+3 would balance the heat and result in a possibly lower XFR push. Or they could just make these the non-X variety and avoid that altogether.

Going slightly further again, and back in the land of reselling: if the 2+4 with the potentially better XFR is arranged so that the 2-core CCX uses diagonally opposed cores, that would distribute the heat even better and push them a little harder.

So, speculation on the R5 lineup based on my theories:

Non X R5 1500 is a 3+3

1500 X R5 is a 2+4 with the 2 side being cores right next to each other thus having slightly worse thermal loading and lower XFR

1600 X R5 is a 2+4 with the 2 side being diagonally opposed cores for better thermal loading (less sinking heat directly into the core next to them as the spaces are effectively blank) so higher XFR

According to The Stilt, it is not possible to make a six-core as 4+2; only 3+3 is possible due to how the CCXes work. There needs to be symmetry or zero, so a six-core can't be 4+2 and must be 3+3.
A four-core can be 4+0 or 2+2 (not 3+1).
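
The "symmetry or zero" rule as described here can be written down in a few lines. This is only my reading of the rule as stated in this thread (two CCXes of up to 4 cores each), not anything taken from AMD documentation, and the function name is made up.

```python
def valid_ccx_splits(active_cores, cores_per_ccx=4):
    """Core splits across two CCXes allowed by the 'symmetry or zero' rule.

    ASSUMPTION: this encodes the rule as described in the thread, i.e.
    either both CCXes have the same number of active cores, or one CCX
    is fully disabled.
    """
    if active_cores < 1 or active_cores > 2 * cores_per_ccx:
        raise ValueError("core count out of range for two CCXes")
    splits = []
    if active_cores <= cores_per_ccx:
        splits.append((active_cores, 0))        # one CCX disabled
    if active_cores % 2 == 0:
        half = active_cores // 2
        splits.append((half, half))             # symmetric split
    return splits


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(f"{n} cores: {valid_ccx_splits(n)}")
```

With these rules a six-core only comes out as 3+3, and a four-core as either 4+0 or 2+2, matching what The Stilt said.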

I have not read what you have, or even know who The Stilt is. I wonder why this has to be so if Windows implements NUMA scheduling, effectively treating it as two CPUs.

I am guessing it is something something infinity fabric cannot have dead ends with no core on the far side. So a 2+4 would result in 2 of the 4 on the 4 side not being able to communicate with anything.

But then if they can make a 4+0 potentially work, it is all dead ends and there is no inter-CCX Infinity Fabric to begin with. If it is essential like that, then a 4+0 should not work at all?

The 6 core Ryzen is likely from the same process as a 4/8 core Ryzen, but one of its cores failed tests and so is disabled? (Plus they disable 1 more to make it a 6 core.) I mean, that is how you maximize yield/profits, rather than making a separate 3+3 chip, having 1 of those cores fail, and then having it binned or cut down to a 2+2 part, which makes no sense.

Yes, what info we have from AMD is that the lower core count Ryzen is the same die as the eight-core. Summit Ridge is the code name for that die. AMD will harvest the Ryzen 3 and 5 from defective Summit Ridge dies.

The Stilt is a somewhat well known figure for people interested in CPU design and AMD design in particular. He used to only post on some Finnish forums, but in later years has started to post on more English ones. He recently did a write up: Ryzen: Strictly technical.

His information is good, and he is easy to talk to. Matthias Waldhauer (Dresdenboy) has corroborated the info, as have AMD reps. The thread has plenty of good discussion.

I don't have the in-depth info on how the Core Complexes work; compared to these guys I'm just an idiot lol. I've been planning to ask about it though. But the number of cores active on each CCX can be easily confirmed by people who have a Ryzen board. You can't configure 4 + 2 cores, only 3 + 3 if you want a six-core. 4 + 0 should work, as you can do that with a Ryzen today. You lose half the L3 cache, however. AMD has stated that the Ryzen 3 will have 8 MB of L3, I believe.

Edit: Summit Ridge, not Zeppelin.

Is Linux any different from Windows in scheduling and managing cores properly?

It's a bit different, yes. And it is very complicated today. Patching the OS scheduler on either Windows or Linux is not "an easy fix" by a long shot. And the peculiarities in Ryzen can't really be "fixed" on an OS level. The scheduler in your OS has no clue what kind of logic you're running. Will your workload benefit from spreading over both CCXes due to the bigger L3, more cores etc.? Or will it be hurt by it due to massive[1] inter-thread communication? That is not an easy problem the OS can solve for you; it needs to be done on the application level. The OS can provide the apps with the right tools to do this.

About the Linux scheduler, there was an interesting thing posted last year I remember:

Edit: Microsoft has documentation on MSDN:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms685096(v=vs.85).aspx

Edit: [1] More like too much inter-thread communication leading to too much inter-CCX communication. And how much is too much? How would the OS have any clue about it?

In depth as fuck. Much obliged.

I was hoping that maybe there was a chance of Linux actually gaining an edge over Microsoft for new hardware, but it's probably too much to hope for.

Does anyone else have any HD die images of the i7 7700k? I want to compare dies.

Thanks!