FMA4 on Zen: Forgotten Instruction set, but not yet gone | Level One Techs

ALSO: forgot to mention this guy's blog. It's awesome. https://www.agner.org/optimize/blog/read.php?i=838


This is a companion discussion topic for the original entry at https://level1techs.com/video/fma4-zen-forgotten-instruction-set-not-yet-gone
10 Likes

Wendell, you gave me a flashback to 1988. I’m sitting in OS design and my hands are sweating {bites lip} thinks, “how am I going to remember all of this” (dread) …

This video was sooooo awesome!!!
I am so happy you share these fringe topics and findings with all of us. Also, I did not know about OpenBLAS before, this might be useful information for me in the future.

I have the broadest smile on my face right now. This is the content I love.

Thank you Wendell and team!

Pretty neat stuff!

A different guy did a talk at DEF CON last year on finding undocumented instructions in x86.

It seems Intel and VIA are pretty notorious for this sort of thing.

4 Likes

I wonder if it would be possible to look at the instructions a program sends out, then figure out a faster way to do it, and the next time the program needs those instructions, intercept them and queue up the faster ones.
Would be a pretty neat thing to speed up old games that are still stuck in dual-core days.

FMA4 is neat but technically redundant.
It’s just more complex to implement in silicon.

FMA3 can run the same operations, albeit requiring more specific instructions to describe where the result goes.

Whereas on FMA4 you can do:

A = round(B * C + D)

On FMA3 you need to do one of the following:

A = round(A * B + C)
A = round(B * A + C)
A = round(C * B + A)

So on FMA3 the destination register (the result) is shared with one of the operand registers, thus overwriting one of the input values.
While on FMA4 the destination register can be separate from the operand registers, so no input gets overwritten.

Which is why FMA3 has three times as many instructions as FMA4: the instruction itself determines the operand ordering and which operand register the result goes into.

The primary advantage of FMA4 is easier programming, since you have fewer, more generic instructions and don’t overwrite one of the registers with your result. Additionally, in some niche scenarios where you’re running tight loops, you can also save some clock cycles with FMA4.
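
To make the contrast concrete, here is a minimal C sketch using compiler intrinsics (my example, assuming GCC or Clang built with -mfma -mfma4; actually running the FMA4 path needs an FMA4-capable chip). The mnemonics in the comments are the underlying instructions:

    #include <immintrin.h>   /* FMA3 intrinsics (_mm_fmadd_ps) */
    #include <x86intrin.h>   /* FMA4 intrinsics (_mm_macc_ps) on GCC/Clang */

    /* FMA4: one generic instruction, vfmaddps dst, src1, src2, src3.
       The destination is a fourth, independent register, so all three
       inputs survive the operation. */
    __m128 fma4_style(__m128 b, __m128 c, __m128 d) {
        return _mm_macc_ps(b, c, d);      /* round(B * C + D) */
    }

    /* FMA3: vfmadd132ps / vfmadd213ps / vfmadd231ps. The three-operand
       encoding means the result must overwrite one of the input
       registers at the machine level; the 132/213/231 suffix picks the
       operand ordering (and therefore which input is clobbered), which
       is exactly why FMA3 needs three instructions where FMA4 needs one. */
    __m128 fma3_style(__m128 a, __m128 b, __m128 c) {
        return _mm_fmadd_ps(a, b, c);     /* round(A * B + C) */
    }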

I hope that made it approximately clear enough.
As to why FMA4 was ‘removed’, I’m not sure. In testing it works fine, but there likely are edge cases with unexpected precision loss, and some edge cases when hammering specific instructions in a sequence.

2 Likes

This video unfortunately got some things wrong about OpenBLAS. It does not use auto-tuning at all; it uses hand-crafted assembly for the heavy lifting. So every CPU has a separate set of source files, and the tuning is done (more or less) by hand.
There is a library called ATLAS that tries to get around that and uses auto-tuned parameters, but AFAIK it is not very effective, and significantly slower than the assembly in OpenBLAS or MKL.
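
If you want to see which kernel set your build picked, here is a tiny sketch (assuming a reasonably recent OpenBLAS that exports openblas_get_corename(); link with -lopenblas):

    #include <stdio.h>
    #include <cblas.h>   /* OpenBLAS's cblas.h */

    /* OpenBLAS chooses a hand-written kernel set per CPU when the
       library initializes; this prints which one it chose. Setting the
       OPENBLAS_CORETYPE environment variable (e.g. ZEN or HASWELL)
       overrides the detection, which is handy for A/B-testing kernels. */
    int main(void) {
        printf("OpenBLAS core: %s\n", openblas_get_corename());
        return 0;
    }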

2 Likes

Did anyone try sandsifter from that video to find undocumented instructions on their own system?

make quits on me with an error on Fedora 28.

There are two inaccuracies w.r.t. compilers:

  • Compiler writers do not assume that newer is better; and
  • Compilers won’t “just” use newer instruction sets if it is unprofitable to do so.

While it is true that compilers, unlike OpenBLAS, cannot figure out stuff at runtime (for obvious reasons), they are also not as naive as portrayed in the video.

What really happens instead is that, for each CPU family, the compiler has a table of instructions and their latencies, throughputs, dependencies, register clobbers, etc. (usually contributed by the manufacturers themselves). Then, given an operation and a set of possible translations of that operation, it will attempt to pick the best one based on the table(s)’ contents and the constraints in that particular area of the program, for the CPU family requested by the user. This involves no assumptions whatsoever.

On the other hand, it is true that compilers tend to rely on some rules of thumb as well. However, those tend not to be influenced by the specified CPU family at all. A good example of such an assumption is that the more vectorisation the compiler is able to do, the better. The compiler will gladly vectorise the code, even on targets which do not support vector instructions, assuming that a later pass responsible for the actual generation of machine instructions will have more options to choose from (it is easy to generate non-vector code for vector operations, but not vice versa).
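
As a concrete illustration of the cost-table point (my sketch, not from the video): the same portable loop gets lowered to different machine code purely from the tables of the selected -march target, with no runtime detection involved:

    /* A textbook loop most compilers auto-vectorize at -O3. Plausible
       outcomes (assumption, worth verifying on your own toolchain):
         gcc -O3 -march=bdver2   may pick FMA4 (vfmaddps)
         gcc -O3 -march=haswell  may pick FMA3 (vfmadd231ps)  */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }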

2 Likes

This is not 100% true. memcpy, for example, depending on how you statically link it, or the system library, will actually do a lot of work figuring out what instruction sets are available to do the memory copy. At run time.

So the compiler may not figure out stuff at run time, but certain concessions have been made in the compiler to help smooth over some of these edge cases, where basic ops like memory copy can make huge differences in performance. The concession being that the compiler sticks stuff in the binary where there could actually be more than one code path (in the name of “optimization”) depending on the system that you’re running on. It’s a bit more nuanced than -march, for example.
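
A small sketch of that “more than one code path in the binary” idea, using GCC’s function multi-versioning (assuming GCC 6+ on x86-64 Linux; glibc’s memcpy does a similar IFUNC dance internally):

    #include <stdio.h>

    /* target_clones makes the compiler emit one clone of dot() per
       listed target, plus an IFUNC resolver that picks the best clone
       once, at program load time. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    float dot(const float *a, const float *b, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
        printf("%f\n", dot(a, b, 4));   /* 20.000000, via the best clone */
        return 0;
    }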

To make things even weirder, microcode can sometimes rewrite things for performance reasons. Memory moves through eax/edx are, I believe, optimized by microcode on Intel CPUs into something else. That’s not the compiler, though.

The line is a bit blurred between choices the compiler designer has made and choices made by the person putting together the app (or library), though, depending on a ton of factors. There are some runtime shenanigans for such things. So it’s not as if everything is totally free of runtime checks.

Thinking about it on the Linux kernel side, a lot of stuff is done via macros too… that’s compile time, but not the compiler’s design per se… I think the line can only blur further in the future, though.

1 Like

I was going to mention the error if no one else had by the time I got back from doing non-computer stuff.

When I looked at OpenBLAS, Zen was mapped onto a different CPU’s codepath because the ‘Zen optimized’ code was slower.

I did compare ATLAS against MKL and found ATLAS to be quite a bit faster.
https://openbenchmarking.org/result/1809087-RA-2990WXBLA90

2 Likes

Intel MKL does runtime CPU detection, and likely uses a different code path when running on non-Intel CPUs, which may result in impaired performance. To the point of sometimes being accused of having a “CrippleAMD()” function.

1 Like

Really love seeing you at your desk. I think it makes a great set. I started watching the original Tek Syndicate in 2013, and seeing your “lair” brings back a lot of nostalgia.

2 Likes

I have one more thing to add here, regarding BLAS performance.
I ran the official Intel MKL Linpack benchmark on a Skylake-X 8-core, and I made a patched binary with all the CPU vendor detection subverted and ran that on a 1700X (both CPUs stock, 2400 ECC RAM on Ryzen and either 2400 or 2666 on SKL-X; cannot recall the timings, though).
While I do not have the exact numbers, the 1700X pushed ~200 GFLOPS and the 8-core Skylake-X pushed somewhere in the neighborhood of 500 GFLOPS (my memory is fuzzy here, but it was somewhere between 450 and 550).

So when it comes to linear algebra, the current Intel CPUs do have a rather massive hardware advantage, due to having full 256-bit (or even 512-bit on SKL-X) wide floating point units for AVX, while Ryzen only has 128-bit wide vector units, and all 256-bit wide vector instructions are decoded into two consecutive 128-bit wide operations internally.
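
For what it’s worth, back-of-the-envelope peak numbers line up with those measurements (my arithmetic, assuming two 128-bit FMA pipes on Zen, two 512-bit FMA units on that SKL-X part, and rough all-core clocks):

    1700X: 8 cores × 4 DP FMA/cycle (2 × 128-bit) × 2 FLOP/FMA × ~3.5 GHz ≈ 224 GFLOPS peak
    SKL-X: 8 cores × 16 DP FMA/cycle (2 × 512-bit) × 2 FLOP/FMA × ~2.0 GHz AVX-512 clock ≈ 512 GFLOPS peak

Measuring ~200 and ~450-550 would put both runs at roughly 85-90% of peak, which is about what a well-tuned Linpack achieves.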

AMD has made a very significant design choice here: they are saving a huge amount of transistors, die area and power by not having 256-bit wide execution units, internal buses, everything. There is a reason Intel CPUs come with severe negative AVX multiplier offsets; the 256-bit and 512-bit wide AVX units consume massive amounts of power when active.
So AMD has decided to sacrifice raw FLOPS to be able to cut down on power use and to pack lots of cores onto small, cheap dies. This is probably a very good idea in the general-purpose server market, but it does have a negative impact on HPC(-ish) folks like me, for example. Right now, for the “ghetto HPC” Beowulf-cluster type of usage, the choice is between good AVX performance and ECC support, both of which are rather important.

As a fan of Power Architecture stuff, I took a look at this chart in the Power/PowerPC thread and the AltiVec/VMX/VSX page on Wikipedia, and it looks like while IBM was slow to add SIMD instructions to their POWER chips, they have never had registers wider than 128 bits. The Xbox 360 chip had an extended version called VMX128, and while that refers to its 128 registers, I haven’t seen anything to suggest the registers were wider than 128 bits.

ARM’s NEON SIMD system also seems to be using 128-bit wide vector registers.

How much of an advantage is the 256-bit or 512-bit wide vector register?
I can understand if ARM has a different use-case, but with one of IBM’s target markets specifically being supercomputers, I don’t understand why Power wouldn’t have expanded the VSX width, especially if it were helpful for Linpack (TOP500).
Is a wide vector register somehow more beneficial on CISC architectures?

FMA4 error

Following the links from the Wikipedia citations, the reported error on FMA4 was with “Asteroids-FMA4-APP”, which a Reddit user says is Asteroids@home. The original post is in German on the Planet3DNow! forums:

Onkel_Dithmeyer
Die Asteroids-FMA4-APP erzeugt sofort Rechenfehler. FMA4 scheint auf Ryzen also nicht sinnvoll nutzbar.

Google Translate spits out:

The Asteroids FMA4-APP generates calculation errors immediately. FMA4 does not seem useful on Ryzen.

256-bit wide vector operations are super useful and can get you a ton of extra speed. There are diminishing returns in widening the ISA, though, especially if the CPU has to stay within a TDP budget. AVX-512 is faster than regular AVX/AVX2, but the CPU has to drop the clock speed to stay within TDP, so the speedup is often modest, and sometimes using AVX-512 can even result in a performance loss due to the clock speed drop.
As for why IBM has not done it, one can only guess.
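
To put illustrative numbers on that clock-drop trade-off (mine, not measured): if AVX-512 doubles per-cycle throughput but forces the clock from, say, 3.5 GHz down to 2.6 GHz, the net gain is 2 × 2.6 / 3.5 ≈ 1.5×, not 2×; and because the scalar parts of the program also run at the reduced clock, code that is only lightly vectorized can come out net slower.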

1 Like

Reminds me of the unsafe-math optimizations flag in GCC (-funsafe-math-optimizations). I used to enable it to slightly improve performance. It practically addresses the same thing in this case.
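
For reference, a minimal sketch of the overlap (my example; FMA contraction specifically is governed by -ffp-contract, which the fast-math family of flags turns on):

    /* With contraction allowed, a * b + c compiles to one fused
       multiply-add with no intermediate rounding of a * b, e.g.
         gcc -O2 -mfma -ffp-contract=fast demo.c
       emits a single vfmadd instruction. With -ffp-contract=off you
       get a separate multiply and add, with an extra rounding step in
       between: the same numerical difference FMA introduces. */
    double mad(double a, double b, double c) {
        return a * b + c;
    }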

A long-time-later follow-up: has anyone done testing on Ryzen 2000, and perhaps (but unlikely) Ryzen 3000?

I’m really curious to see if it’s still stuck around. @wendell maybe you can give it a go?

3 Likes