In “Wendell Reveals HYGON AVX2 & Performance Mysteries! (Poking At Chinese Servers Pt. 2)”, at about the 5 minute mark:
AVX-512 means each register (named ZMM*) is 512 bits wide: that is 64 bytes, 16 floats, or 8 doubles.
There are 32 ZMM registers (ZMM0 to ZMM31) in amd64 (aka x86-64) mode.
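To make the arithmetic concrete, here is a small C sketch that works out how many elements fit in each register width. The names `lanes`, `XMM_BITS`, `YMM_BITS` and `ZMM_BITS` are made up for illustration, and it assumes the usual 4-byte float and 8-byte double.

```c
#include <stddef.h>

/* Register widths in bits. These names are illustrative, not from any header. */
enum { XMM_BITS = 128, YMM_BITS = 256, ZMM_BITS = 512 };

/* Number of elements of elem_size bytes that fit in a register of reg_bits bits. */
size_t lanes(size_t reg_bits, size_t elem_size)
{
    size_t reg_bytes = reg_bits / 8;   /* 512 bits -> 64 bytes */
    return reg_bytes / elem_size;      /* 64 bytes / 4-byte float -> 16 */
}
```

With those assumptions, `lanes(ZMM_BITS, sizeof(float))` gives 16 and `lanes(ZMM_BITS, sizeof(double))` gives 8, matching the numbers above.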
Looking at AVX2, it mostly widened the integer instructions to 256 bits and added more ways of shuffling data around (permutes, gathers, broadcasts), while AVX mostly just added 256-bit registers to the floating-point SSE instructions (YMM registers, extending SSE’s 128-bit XMM).
Shuffling data around can be very useful in some cases; in others it doesn’t matter.
Note:
There are 16 YMM registers in AVX on amd64, and likewise 16 XMM registers in SSE on amd64.
More than that, they are the same registers (XMM0 is the lower half of YMM0, which is the lower half of ZMM0, for example).
Why is SIMD great?
Because moving data around takes energy and time.
One XMM register can hold 4 floats, and one instruction can, for example, multiply two sets of 4 floats together.
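As a rough sketch of what that looks like from C, the function below multiplies two sets of 4 floats using the SSE intrinsics from `<xmmintrin.h>`. The function name `mul4` is made up for illustration. The scalar fallback is only there so the code still builds on a non-x86 target.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#define HAVE_SSE 1
#endif

/* out[i] = a[i] * b[i] for i = 0..3. mul4 is a hypothetical helper name. */
void mul4(const float *a, const float *b, float *out)
{
#ifdef HAVE_SSE
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats (unaligned is OK) */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_mul_ps(va, vb);   /* 4 multiplies, typically one MULPS */
    _mm_storeu_ps(out, vr);           /* store the 4 products */
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}
```

Calling `mul4` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` fills `out` with `{10, 40, 90, 160}`. The unaligned load/store variants are used here so the caller doesn’t have to worry about 16-byte alignment.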
A cache line on a modern CPU is 64 bytes. Even though a whole line probably travels all the way up to the L1d cache, the data still has to get into the registers (and then to the ALU/FPU and back). This is internal CPU stuff, so I can’t know exactly how an AMD or Intel CPU does it.
In testing I did a long time ago, I remember SSE SIMD giving 2x the performance of non-SIMD code (the non-SIMD code being scalar SSE, not x87, since SSE is the “default” way of processing floats on amd64).
If you’ve got questions, feel free to ask.
I’d also be willing to teach you some assembly, and/or set up some code to faff about (benchmark) with SIMD and such.
There was actually supposed to be text on the screen to this effect; editing fail, I guess. I actually had about 3 years of x86 assembly. At some point, when we were working on making stuff on Linux faster, I found that memcpy() was oddly slower on AMD than on Intel. Turns out that copies through 32-bit i386 registers in 32-bit programs are optimized in Intel’s microcode to use SIMD instructions even when you think you’re not… but not in AMD’s microcode. At least when I was poking at it. Thanks though.