What is the theoretically optimal SIMD width for CPUs?
I would guess this varies based on use case, but I’d be curious what people’s thoughts on this are.
Considerations:

- Use case (duh, not everything on CPU is SIMD heavy)
- At some point, moving the data to a GPGPU or ASIC probably becomes more efficient (rough back-of-envelope sketch after this list)
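On that second point, here's the kind of back-of-envelope math I have in mind. All the numbers (1 GB working set, ~16 GB/s over PCIe 3.0 x16) are illustrative assumptions, not measurements:

```c
#include <stdio.h>

/* Very rough break-even sketch for offloading SIMD work to a GPGPU.
 * Assumes ~16 GB/s each way over PCIe 3.0 x16 and ignores latency and
 * copy/compute overlap; the point is only the order of magnitude. */
int main(void) {
    double bytes     = 1e9;                     /* hypothetical 1 GB working set   */
    double pcie_bps  = 16e9;                    /* assumed PCIe 3.0 x16 throughput */
    double roundtrip = 2.0 * bytes / pcie_bps;  /* host -> device -> host          */
    printf("round-trip transfer alone: ~%.0f ms\n", roundtrip * 1e3);
    return 0;
}
```

That comes out to roughly 125 ms just for the transfer; a streaming pass over the same data in main memory usually finishes sooner than that, so offload only pays when the data gets reused a lot on the device, or when the link is much fatter (which is where things like Infinity Fabric and NVLink come in below).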
Partly based on a February Lounge post by @wendell, in response to me asking about the Talos II board he is testing; maybe the 128-bit SIMD width on POWER9 is a limitation? But Zen's SIMD units are 128 bits wide too…
…and also on information I've been thinking about from these two threads:
In the FMA thread, I asked this question specifically about Power, but knowledge on that is a bit more scarce, so I'm curious what information/discussion a more general question will turn up.
So this uses ARM’s new Scalable Vector Extension (SVE), but the spec itself doesn’t specify a width, and SVE isn’t meant to replace NEON.
> SVE is a complementary extension that does not replace NEON, and was developed specifically for vectorization of HPC scientific workloads.
>
> Rather than specifying a specific vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register.
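The vector-length-agnostic part is what makes this relevant to the width question: SVE code is written against whatever width the hardware provides, so the same binary runs on a 128-bit and a 512-bit implementation. Roughly like this sketch using the ACLE intrinsics from `arm_sve.h` (untested on my part, compile with something like `-march=armv8-a+sve`):

```c
#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic SAXPY (y = a*x + y). svcntw() reports how many
 * 32-bit lanes the hardware actually has, and the whilelt predicate masks
 * off the loop tail, so nothing in the source assumes a fixed width. */
void saxpy_sve(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32(i, n);   /* disable lanes past n     */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_m(pg, vy, vx, a);      /* vy += vx * a, per lane   */
        svst1_f32(pg, &y[i], vy);
    }
}
```

So the spec deliberately dodges the "optimal width" question and leaves it to whoever implements the core.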
So, Intel isn’t the only one going for 512-bit SIMD, but in both cases (AVX-512 and now SVE) it has started out as a supercomputer technology.
So it sounds like AMD is putting their money on 256-bit as the optimal CPU width. Maybe 512-bit really is only worthwhile for supercomputer use… and for those uses AMD will probably rely on Radeon Instinct silicon connected via Infinity Fabric, much like how IBM offloads SIMD to GPUs via NVLink/OpenCAPI.
So in terms of supercomputing tech, the focus looks like it’s:
| | internal SIMD | external SIMD |
| --- | --- | --- |
| ARM | SVE (ex: A64fx) | |
| AMD | | Infinity Fabric |
| IBM | | NVLink/OpenCAPI |
| Intel | AVX-512 | |
Which really makes it weird that Intel is choosing now to go into full-on external GPUs, but they make chips for much more than just supercomputers, so I probably shouldn’t read too much into it.
I guess it’s also worth thinking about how Gen-Z and CCIX might fit into the picture; Intel will probably make its own protocol if it decides to focus on external SIMD, but ARM (or maybe even RISC-V eventually) might choose to use Gen-Z, CCIX, or maybe even OpenCAPI to connect to off-chip SIMD engines.
To be clear, I would treat AMD's plan to shun AVX-512 as a rumor for now. I think I read it somewhere as a statement from an AMD rep, but I cannot recall where.
It does make sense though: AVX-512 is not just a straightforward widening of the AVX/AVX2 instructions, it adds a LOT of extra complexity. IIRC it has per-lane masking on just about everything and loads of extra registers (32 ZMM registers plus 8 dedicated opmask registers).
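For a feel of what that masking looks like from the software side, here's a tiny contrived sketch with the AVX-512F intrinsics (not benchmarked, just to show the mechanism):

```c
#include <immintrin.h>

/* Add b into a, but only in lanes where a is positive; masked-off lanes
 * pass a through unchanged. The predicate lives in a dedicated k-register.
 * On AVX2 you'd emulate this with a compare plus a blend, since there are
 * no mask registers at all. */
static __m512 add_where_positive(__m512 a, __m512 b) {
    __mmask16 k = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
    return _mm512_mask_add_ps(a, k, a, b);
}
```

Nearly every AVX-512 load/store/ALU instruction comes in masked variants like that, which is a lot of extra encodings and datapath plumbing on top of plain AVX2.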
x86 instruction decoder frontends are already large, power-hungry, slow, and no doubt a nightmare to design.
I can fully understand why AMD went NOPE when faced with this and chose to spend the transistors saved on the decoders, registers, datapaths, and execution units on cramming in more cores instead.