Hey there! Is anyone running a 7950X3D or 7900X3D on Linux, with enough free time to run a quick benchmark measuring the latencies between CCDs?
I’ve seen lots of those for Milan(-X), Genoa(-X), regular Ryzen 7000, and Ryzen 5000(X3D), but haven’t seen any for the 7000 models with the extra cache. I’m wondering whether the cross-CCD latency gets worse due to the extra cache, or if it’s similar to the non-X3D models.
Chips and Cheese mention that the Zen 4 V-cache implementation takes 4 extra cycles to get data from L3, plus the extra latency from the lower clocks compared to the non-V-cache processor models.
Cross-CCD latency is kinda high regardless of V-cache on Zen 4 consumer platforms; it’s about the same as going to RAM.
Four extra cycles doesn’t sound like much at all, but I’m curious to see in practice whether it’s basically the same as a non-X3D Ryzen CPU.
I remember reading about that 5950X3D prototype, and one of the reasons AMD mentioned it didn’t end up as an actual product was the cross-CCD latency.
On Milan-X those latencies are not much higher than on regular Milan, and the same goes for Genoa vs Genoa-X, but I’d like to confirm that’s still the case between a CCD with extra V-cache and one without.
Can someone explain why this number matters? What is the relevant workload?
All the tests I’ve seen of this are measuring it using a compare and exchange instruction. (And it has been confirmed that the ones posted publicly are using the “fast” one and not the ones that are known to be slower on Zen 5).
But a compare and exchange does a lot more work than just going to RAM would. It can also have that latency hidden by speculative hardware under a multicore load. Plus, you probably shouldn’t be writing code that needs that instruction in a cross-CCD context.
So, it is a regression, but I’m not sure it’s one that 99% of people are going to see. You need atomic synchronization instructions between threads that can’t be on the same CCD, and that fire only under light enough load that the speculative hardware isn’t able to hide the latency.
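For anyone who wants to see it on their own machine, those tests are roughly this shape: pin two threads to cores on different CCDs and bounce a value between them with a locked compare-and-exchange. A minimal sketch; the core numbers are assumptions (cores 0 and 8 usually land on different CCDs on dual-CCD parts, but check `lscpu -e` for yours):

```cpp
// Minimal cross-core latency ping-pong sketch (Linux, x86-64).
// Two threads, each pinned to one core, alternately bump a shared
// counter with compare_exchange_strong (lock cmpxchg on x86).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    constexpr int kCoreA = 0, kCoreB = 8;  // assumed: one core per CCD
    constexpr long kIters = 1'000'000;
    std::atomic<long> flag{0};

    std::thread responder([&] {
        pin_to_core(kCoreB);
        for (long i = 1; i < 2 * kIters; i += 2) {
            long expected = i;  // wait for the other side's write, then bump
            while (!flag.compare_exchange_strong(expected, i + 1))
                expected = i;
        }
    });

    pin_to_core(kCoreA);
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 2 * kIters; i += 2) {
        long expected = i;
        while (!flag.compare_exchange_strong(expected, i + 1))
            expected = i;
    }
    auto t1 = std::chrono::steady_clock::now();
    responder.join();

    // Each main-thread iteration is one full round trip (two handoffs).
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("round trip: %.1f ns\n", ns / kIters);
    return 0;
}
```

Half the round trip is roughly the one-way figure the usual latency matrices quote.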
A place where this cross-CCD latency really shows up is FEA: over in the main thread discussing it, AMD underperforms relative to the amount of compute it should theoretically have, to the point where dual 9374Fs are only 13% faster than a single w5-3435X.
In a perfect world, code paths would respect CCD boundaries (NPS can help in this regard by giving the OS hints on how to schedule), but this isn’t possible with many compute problems.
Speculative hardware can only speculate on execution; it still needs to wait for the data to arrive. There are a bunch of memory prefetch mechanisms in modern CPUs, but they can only go so far, especially when an architecture is memory starved from the start (there’s currently a discussion elsewhere about AVX-512 on Zen 5 being severely memory starved).
Er, no. That’s basically saying to write single-CCD code rather than all-core code. I do think you have a point in that, in my experience, xadd is used an order of magnitude more than cmpxchg. It doesn’t appear xadd’s been measured yet.
Also, I noticed you’ve mentioned high cross-CCD cmpxchg latency being 9950X specific. Could you please link the 9900X measurements confirming this? I’ve searched some and, so far, have found only interpretations of circumstantial evidence pointing in that direction.
Locked cmpxchg and xadd are for interthread communication when fences aren’t needed. xadd’s the usual basis for a parallel for loop, for example.
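For example, here’s the shape of that, as a sketch: a shared counter carved up with fetch_add, which compiles to lock xadd on x86. Every chunk grab is one atomic RMW that can cross CCDs, which is exactly where this latency would bite. process() is just a stand-in for the loop body:

```cpp
// Sketch of an xadd-based parallel for: workers claim chunks of the
// index space by atomically advancing a shared counter (lock xadd).
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static void process(long i) { /* stand-in for the real loop body */ }

int main() {
    constexpr long kTotal = 1'000'000;
    constexpr long kChunk = 4096;  // bigger chunks = fewer cross-CCD xadds
    std::atomic<long> next{0};

    auto worker = [&] {
        for (;;) {
            long begin = next.fetch_add(kChunk);  // compiles to lock xadd
            if (begin >= kTotal) break;
            long end = std::min(begin + kChunk, kTotal);
            for (long i = begin; i < end; ++i) process(i);
        }
    };

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("done\n");
    return 0;
}
```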
Haven’t seen the main thread for that. Sounds like the BLAS needs updating?
Funny you should mention that: the FEA benchmark can switch between the relatively new AOCL 4.1.1 that AMD made for Zen 4 and the 2022 version of the Intel MKL, and the Intel BLAS outperforms the AMD BLAS more often than not, even on EPYC/TR.
This is the thread:
To be fair, the expected performance difference isn’t entirely because of cross-CCD communication bottlenecks; Amdahl’s law is somewhat to blame too.
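Just to put rough numbers on that: Amdahl gives speedup ≤ 1 / ((1 − p) + p/N), so even with, say, p = 95% of the work parallel, 64 cores top out around 15x before any fabric latency gets involved.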
Multiple people tweeted at AnandTech about this and it came up in multiple Reddit threads. They said it looks like that’s right at first glance, but they’re going to do a deeper dive to try to root-cause it and figure out why.
I wouldn’t say it’s “confirmed”, and it’s still unclear why there’s a regression at all, for that matter.
So I’m waiting for a more thorough examination that looks at a wider range of instructions.
Edit: to be clear about the “shouldn’t be coding like this” thing. My intention is that, instead of doing one cache-blocking pass and treating all cores as “the same”, there should probably be an extra layer of blocking and loop splitting so that you’re effectively treating each CCD like you would separate sockets in a NUMA configuration. In principle, AMD’s BLAS and other libraries should be doing this, but it sure doesn’t seem like they do.
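Something like this, as a sketch of that extra layer. The CCD core ranges here are assumptions for a hypothetical 2-CCD, 16-core part; real code would query the topology (hwloc, or the L3 entries under /sys/devices/system/cpu) instead of hardcoding it:

```cpp
// Sketch: split the outer range once per CCD and keep each CCD's
// threads inside their own block, the way you'd partition across
// NUMA sockets. kernel() stands in for the usual cache-blocked inner pass.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

struct Ccd { int first_core, num_cores; };

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void kernel(long begin, long end) { /* per-thread inner blocking */ }

int main() {
    const std::vector<Ccd> ccds = {{0, 8}, {8, 8}};  // assumed layout
    constexpr long kTotal = 1 << 24;
    const long per_ccd = kTotal / (long)ccds.size();

    std::vector<std::thread> pool;
    for (size_t c = 0; c < ccds.size(); ++c) {
        const long ccd_begin = (long)c * per_ccd;
        const long per_thread = per_ccd / ccds[c].num_cores;
        for (int t = 0; t < ccds[c].num_cores; ++t) {
            const int core = ccds[c].first_core + t;
            const long b = ccd_begin + t * per_thread;
            pool.emplace_back([core, b, per_thread] {
                pin_to_core(core);          // keep the thread on its CCD
                kernel(b, b + per_thread);  // work block is CCD-local
            });
        }
    }
    for (auto& th : pool) th.join();
    return 0;
}
```

The point is just that a thread never touches another CCD’s block, so the only cross-CCD traffic left is the final join.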
Thanks! No need, though, I think, as those are the same things I’ve been seeing.
There’s an AGESA option to make CCDs distinct NUMA nodes, which I’d be hesitant to second-guess from app or library code. If the increased cross-CCD latency is a general Zen 5 issue rather than a cmpxchg-specific one, AMD should probably default CCD-as-NUMA to on rather than off, so that code which ought to do NUMAy things can detect that it’s supposed to. I don’t have a 9900X or 9950X yet to take a look, though.
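The detection side is cheap once that option is on. A sketch with libnuma (link with -lnuma), assuming the BIOS exposes each CCD/L3 as its own node:

```cpp
// If each CCD shows up as its own NUMA node, libnuma reports more than
// one node and we can partition per node; otherwise treat the package
// as flat. Build with: g++ detect.cpp -lnuma
#include <cstdio>
#include <numa.h>

int main() {
    if (numa_available() < 0) {
        std::printf("no NUMA support; treating package as flat\n");
        return 0;
    }
    int nodes = numa_num_configured_nodes();
    std::printf("%d NUMA node(s) reported\n", nodes);
    if (nodes > 1) {
        // With CCD-as-NUMA on, numa_node_of_cpu() maps each CPU to its CCD.
        for (int cpu = 0; cpu < numa_num_configured_cpus(); ++cpu)
            std::printf("cpu %d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    }
    return 0;
}
```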
My personal expectation is that AMD launches before they’re ready, so I’m not too surprised or concerned yet. Hopefully this is more of an oops (left STAPM turned on) than a substantial microcode change or silicon stepping. I also write pretty async parallel code, so for the workloads I’m most concerned with, even a 100 ns processor-wide stall on every thread synchronization won’t show in the benchmarks.
I don’t want to excuse AMD here but, also speaking from a third decade of performance-oriented programming, if you’ve got code that falls apart because of NUMA misreporting, it’s probably too chatty between threads and needs a rethink.
Yes. @twin_savage, I’ve read bits of the CFD thread in the past but will have to catch up. As to your most recent post, I regularly push 5950Xes to DDR limits; I’ve had two with quad DIMMs overclocked to DDR4-3200 pinned on that most of the weekend, actually. If the tasks are small enough to fit, it’s not that uncommon for something like a 5900X with 2x32 GB at DDR4-3600 to be slightly faster.
Cross-CCD latency doesn’t help, but my experience with Ryzen and the pattern of Zen 3, 4, and 5 benchmarks across 8, 12, and 16 cores definitely leaves me with the feeling AMD’s pushing the useful limits of p-cores per DDR channel with dual CCDs.