Do games even use the instruction those core-to-core latency charts were testing (the “fast” compare-and-swap that is normally used for lock-free data structures and other exotic things)?
It’s kind of misleading to say that it is as slow as going to memory, because the point of the instruction is that it is atomic: it either modifies the memory location completely or it does nothing at all. It seems the regression was unintended, and that it only impacts the 9950X. But in an all-core workload you won’t notice it, because the speculative hardware will hide that latency. I doubt most games are bottlenecking in a way that makes that latency under a light load matter.
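For context, here’s a minimal sketch of the primitive in question, reached through `std::atomic` (on x86 this compiles down to a locked compare-and-swap). The names `try_acquire`/`release` are illustrative, not from any particular game engine:

```cpp
#include <atomic>

// The operation either swaps in the new value atomically or leaves memory
// untouched -- there is no partially-written state another core can observe.
std::atomic<int> lock_word{0};

bool try_acquire() {
    int expected = 0;  // we expect the lock to be free
    // Atomically: if lock_word == 0, set it to 1 and return true;
    // otherwise return false (and load the current value into `expected`).
    return lock_word.compare_exchange_strong(expected, 1);
}

void release() {
    lock_word.store(0, std::memory_order_release);
}
```

When two cores ping-pong a cache line through operations like this, the cache line bounces between them, which is what those core-to-core latency charts are measuring.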
Similarly, I don’t know why people put so much stock in Cinebench. What is it actually benchmarking? What is the correlation between scores in that benchmark and an actual workload people care about?
It seems to me that two things happened with Zen 5. First, games are memory-bandwidth intensive, and the best you could ever do is read every memory location exactly once and write it exactly once. A 7800X3D is probably already pretty close to that, so Zen 5 has less potential headroom.
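The ceiling being described is the one a STREAM-style kernel hits. A rough sketch (function name and setup are mine, not from any benchmark suite):

```cpp
#include <cstddef>
#include <vector>

// STREAM-style triad: every element of a and b is read exactly once and
// every element of c is written exactly once. For arrays much larger than
// the last-level cache, runtime is set by DRAM bandwidth, not by how wide
// or deep the core is -- a faster core just waits on memory sooner.
void triad(std::vector<double>& c,
           const std::vector<double>& a,
           const std::vector<double>& b,
           double scalar) {
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] + scalar * b[i];
}
```

If a workload looks like this, a bigger core buys little; a bigger cache (or more memory bandwidth) buys a lot.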
That’s not to say it gets nothing: it has more cache bandwidth, better branch predictors, and a deeper instruction window, all of which let it send out those memory requests sooner. And it has three full multiplier units to do address calculations fast enough to take advantage of the larger instruction window.
But even with more sophisticated prefetchers and the rest, in most cases you are ultimately limited by memory bandwidth more than by anything the CPU is doing. Maybe with the X3D chips, capturing more of the working set will be enough to get a bigger-than-proportional boost over the 7800X3D. But even then, there are limits.
And there are a few unexpected limitations too. The latency of some single-cycle vector instructions is now 2 cycles because of an unexpected pipeline hazard. This shouldn’t matter if you are saturating the vector units rather than running long chains of dependent instructions. But maybe games are sensitive to this; I don’t know how they use vector code.
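Why dependent chains are the case that hurts: a sketch with plain doubles (the same reasoning applies to vector registers; function names are mine):

```cpp
#include <cstddef>

// Each add depends on the previous one, so every iteration waits out the
// full instruction latency. Going from 1 to 2 cycles halves this loop's speed.
double sum_serial(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

// Four independent accumulator chains: adds from different chains overlap
// in the pipeline, so throughput, not latency, sets the speed, and the
// extra cycle of latency is largely hidden.
double sum_unrolled(const double* x, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]; s1 += x[i + 1]; s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];  // leftover elements
    return s0 + s1 + s2 + s3;
}
```

Well-optimized vector code usually looks like the second loop, which is why the extra cycle mostly shows up in naive or latency-sensitive code.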
Similarly, it seems impossible in practice to sustain more than 5 integer instructions per cycle despite there being 6 units, perhaps due to port or forwarding-network limitations.
And the uOp cache is slightly smaller and slightly slower to fill. So that too could be hampering the uplift.
All of this could be stuff they found out late and that pulled the performance down from what they expected when they finished the high level design ~18 months ago.
But, more fundamentally, if you aren’t using AVX-512, you are leaving 50% of the performance on the table, both computationally and in terms of cache bandwidth utilization. Zen 5’s vector performance is “use it or lose it.” And no game that I’m aware of uses it, because there has never been a good hardware implementation from both Intel and AMD at the same time. Hell, most games don’t even use 256-bit AVX2.
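This is the kind of loop where that 50% shows up. The source is ordinary C++; what changes is how it’s compiled — with `-O3 -mavx512f` (or a `-march=znver4`/`znver5` target on recent GCC/Clang), compilers auto-vectorize it to 512-bit operations, processing 16 floats per instruction instead of 8 with AVX2. Same code, built generically, runs at the narrower width:

```cpp
#include <cstddef>
#include <vector>

// Classic saxpy: y = a*x + y. Trivially auto-vectorizable, so the vector
// width the compiler is allowed to target directly sets per-instruction
// throughput -- the "use it or lose it" half of Zen 5's vector hardware.
void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```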
If you are playing Minecraft and using a JVM with support for AVX-512, you probably see an uplift. But I doubt that was a serious bottleneck for that game.
Corner cases like that aside, to really get the benefits of the chip, it needs developer support in the libraries and tools that consumer applications rely on.
So, in a sense, Zen 5’s performance is aspirational. If and when software is written or recompiled to target it, there is substantial room for uplift. You can see that now on Linux in the difference between Clear Linux and Fedora.
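Much of that distro gap comes down to build targets. A rough sketch of the difference (flags are illustrative, not either distro’s actual spec files; the `znver5` target needs a recent GCC):

```shell
# Typical distro baseline: generic x86-64, conservative optimization,
# no AVX at all in most packages
gcc -O2 -march=x86-64 -o app app.c

# A Zen 5-targeted rebuild: AVX-512 enabled, scheduling tuned for the core
gcc -O3 -march=znver5 -flto -o app app.c
```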
But AMD is not well regarded when it comes to software support, and even if they change that, by the time most consumer applications can take advantage of it, Zen 6 will be out and we’ll be having a very different discussion. (I.e., if they get AVX-512 into Unreal today, you’ll see the benefit in 2-3 years, when games built on the newer version of Unreal start hitting the market.)
If people are running something that benefits from Zen 5, like statistics work, the kinds of workloads SPECworkstation tests, or even just lots of code compilation, then it outperforms Zen 4 by enough to justify the cost. And given the bandwidth and power needs of the vector units, the HEDT Threadripper might even be a more compelling value than it was last gen.
But if you are just running normal Windows apps and gaming, I don’t see a benefit, and I’m legitimately confused as to why AMD wanted to market huge gains. Yes, they closed their performance gap with Intel on most older titles, and they are often matching 14th Gen despite having fewer speculative resources and lower clock speeds.
But there’s nothing in the hardware that fundamentally changes what you need for gaming: a very large cache.
And they should have realized this well before the various issues that new architectures always have started to surface.