A place where this cross CCD latency really shows up is FEA; over in the main thread discussing it, AMD underperforms compared to the amount of compute it should theoretically have; to the point were dual 9374F’s are only 13% faster than a single w5-3435x.
In a perfect world code paths would respect the CCD boundaries (NPS can help in this regard by giving the OS hints on how to schedule), but this isn’t possible with many compute problems.
Speculative hardware can only speculate on execution, it still needs to wait for the data to arrive. There are a bunch of memory prefetch operations happening in modern CPUs, but they can only go so far, especially when an architecture is memory starved from the start (there is currently a discussion in other places about AVX-512 on zen 5 being severely memory starved).