Threadripper 2 Update! ECC Support, Linux, and More! | Level One Techs

CrayZeApe · September 8, 2018, 7:55pm

I’ve had a chance to compare the results of a couple of different BLAS libraries in the Phoronix HPC Challenge test. I had planned to test more, but found a couple of rabbit holes along the way that needed exploring.

I compared Intel MKL with a Zen-optimized ATLAS, and the results are interesting.
https://openbenchmarking.org/result/1809087-RA-2990WXBLA90

What about the Zen optimized ATLAS?

ATLAS stands for “Automatically Tuned Linear Algebra Software” and it lives up to its name. At compile time, it finds and exercises all the available instruction set extensions, performing hours’ worth of precision tests to eke out every little bit of performance.

Interestingly, the ATLAS tuning routines detected FMA4, which passed the sanity check and was included as one of the codepath choices for later tests. In some of the performance tests, the FMA4 codepath won by a huge margin and was selected.

If you want to build ATLAS for yourself, I’d suggest setting the Threadripper to single die / sixteen threads to remove the timing variance introduced by NUMA when jumping from die to die. You’ll also need to disable everything related to AMD’s “catch me if you can” clock scaling mechanism; you need a fixed CPU clock speed for the ATLAS tests to make sense.
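Before kicking off a long tuning run it can help to check what the kernel thinks the clocks are doing. The following is only a rough sketch: it assumes a Linux box that exposes the usual cpufreq files under /sys/devices/system/cpu (which files exist depends on the cpufreq driver), and the "looks fixed" rule is my own heuristic, not something ATLAS requires. Boost and power-saving options set in the BIOS are not visible this way.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location. Which files exist depends on the
# cpufreq driver in use, so every read below is treated as optional.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_optional(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def clock_scaling_report(sysfs_cpu=SYSFS_CPU):
    """Collect per-CPU governors and the global boost switch, where present."""
    governors = {}
    for gov_file in sorted(sysfs_cpu.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        value = read_optional(gov_file)
        if value is not None:
            # .../cpuN/cpufreq/scaling_governor -> "cpuN"
            governors[gov_file.parent.parent.name] = value
    boost = read_optional(sysfs_cpu / "cpufreq" / "boost")
    return governors, boost


def looks_fixed(governors, boost):
    """Heuristic (an assumption): every CPU uses the 'performance' governor
    and the boost switch reads "0"."""
    return (
        bool(governors)
        and set(governors.values()) == {"performance"}
        and boost == "0"
    )


if __name__ == "__main__":
    governors, boost = clock_scaling_report()
    print("governors in use:", sorted(set(governors.values())) or "not exposed")
    print("boost:", boost if boost is not None else "not exposed")
    print("looks fixed:", looks_fixed(governors, boost))
```

If the last line says False, the timing results ATLAS collects may wobble from run to run.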

I came across FMA4 again, this time in some OpenBLAS GitHub chatter. The ZEN target in OpenBLAS has never worked; it currently cheats a bit by aliasing ZEN to HASWELL and setting NO_AVX2=1. Much of this particular rabbit hole is covered here:

One of the more interesting comments (near the bottom) on that GitHub page is:

For future reference: it turns out that Ryzen does support FMA4, it just does not document this through CPUID flags. That’s why the Excavator kernels didn’t crash with SIGILL.
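For anyone who wants to see what CPUID itself reports: FMA4 is flagged in extended leaf 0x80000001, ECX bit 16. Here is a small sketch that reads that leaf through the Linux cpuid character device. It assumes the `cpuid` kernel module is loaded and that you have permission to read /dev/cpu/0/cpuid (usually root); if not, the read helper simply returns None. The function names are my own.

```python
import os
import struct

FMA4_LEAF = 0x80000001   # AMD extended feature leaf
FMA4_ECX_BIT = 16        # FMA4 is reported in ECX bit 16 of that leaf


def ecx_has_fma4(ecx):
    """True if an ECX value from leaf 0x80000001 has the FMA4 bit set."""
    return bool((ecx >> FMA4_ECX_BIT) & 1)


def read_cpuid_leaf(leaf, cpu=0):
    """Read (eax, ebx, ecx, edx) for a leaf via /dev/cpu/N/cpuid.

    Returns None if the device is missing or unreadable.
    """
    try:
        fd = os.open(f"/dev/cpu/{cpu}/cpuid", os.O_RDONLY)
    except OSError:
        return None
    try:
        # The file offset selects the CPUID leaf; each read is 16 bytes.
        data = os.pread(fd, 16, leaf)
    except OSError:
        return None
    finally:
        os.close(fd)
    return struct.unpack("<4I", data) if len(data) == 16 else None


if __name__ == "__main__":
    regs = read_cpuid_leaf(FMA4_LEAF)
    if regs is None:
        print("could not read /dev/cpu/0/cpuid (module not loaded, or not root?)")
    else:
        print("CPUID reports FMA4:", ecx_has_fma4(regs[2]))
```

Going by the comment quoted above, this should print False on Ryzen even though the instructions execute.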

Wikipedia sheds some more light on Zen and FMA4, though the referenced AMD pages are “down for maintenance”, which was not the case earlier today. Perhaps they’re adding the newest Threadripper series, which was previously missing.

When the AMD pages were up, I saw that FMA4 is the first item listed under “Features” for the 2400G and some other Zen-based chips. It was not, however, listed on the Threadripper 1950X page (the latest listed model at the time).

So what’s the go? Political decision? Business decision? Either way, I don’t think I care for it. My fix: patch the kernel. I quickly came up with this:

--- a/arch/x86/kernel/cpu/amd.c	2018-08-28 18:07:45.148373843 +1000
+++ b/arch/x86/kernel/cpu/amd.c	2018-09-08 20:40:52.820838835 +1000
@@ -821,9 +821,13 @@
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID. It affects
 	 * all up to and including B1.
+	 *
+	 * FMA4, while present and working on Zen, is not exposed by the CPUID
+	 * instruction, so set it unconditionally below (no stepping check).
 	 */
 	if (c->x86_model <= 1 && c->x86_stepping <= 1)
 		set_cpu_cap(c, X86_FEATURE_CPB);
+	set_cpu_cap(c, X86_FEATURE_FMA4);
 }
 
 static void init_amd(struct cpuinfo_x86 *c)

The patch is attached below (you don’t need it for ATLAS, which just ignores CPUID altogether).

With that fixed:

cat /proc/cpuinfo

processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen Threadripper 2990WX 32-Core Processor
stepping : 2
microcode : 0x800820b
cpu MHz : 1878.563
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt fma4 tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5999.20
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
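With 64 logical CPUs it is easier to script the check than to eyeball it. A minimal sketch, assuming the usual x86 /proc/cpuinfo layout of `processor` and `flags` lines; the function names are made up for this post. It matches whole tokens, so `fma` is not mistaken for `fma4`. Keep in mind that /proc/cpuinfo shows the kernel's view of the features, which is what set_cpu_cap changes.

```python
def parse_cpuinfo(text):
    """Map processor number -> set of flags from /proc/cpuinfo-style text."""
    flags_by_cpu = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "processor" and value.isdigit():
            current = int(value)
        elif key == "flags" and current is not None:
            flags_by_cpu[current] = set(value.split())
    return flags_by_cpu


def cpus_missing(flags_by_cpu, flag):
    """Processor numbers whose flag list lacks the given flag."""
    return sorted(cpu for cpu, flags in flags_by_cpu.items() if flag not in flags)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            info = parse_cpuinfo(f.read())
    except OSError:
        info = {}
    missing = cpus_missing(info, "fma4")
    print(f"{len(info)} processors parsed, {len(missing)} without fma4")
```

On the patched kernel I would expect the second number to be zero.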

Now, back to the BLAS comparison.

Something interesting shows up in the results: the optimized BLAS really grows wings when the CPU is let loose, while the other seems held back somewhat and doesn’t scale as well. The optimized BLAS actually runs colder; Zen notices this and bumps up the CPU frequency to use some of the headroom, and it still manages to run cooler.

On Zen, the heat generated by the instructions being executed is directly tied to the top boost frequency the CPU will reach at that moment. Thermally optimized code is an interesting prospect. I wonder if gcc will ever get a -Ot switch.

enable_fma4_on_zen.patch (587 Bytes)

wendell · September 8, 2018, 8:01pm

You da real MVP. Holy crap. Thank you. Welp you’re getting your own video. FMA4 is kind of a big deal.

CrayZeApe · September 8, 2018, 8:15pm

Thanks Wendell

I wonder if Windows already accounts for the “CPB feature bit” problem that is fixed in Linux. If it does, then the hack is to add the FMA4 bit pattern to the existing pattern in whatever binary the system uses to interface with the CPUID instruction. Windows is not my strongest area, but I suspect many games are choosing slower code paths than they otherwise could.

edit: I was just thinking. Windows running in a VM should pick up the CPUID support from Linux once it’s enabled. If only there was someone all set up for gaming in a Windows VM, ohhh, wait, I know… Wendell

wendell · September 8, 2018, 8:35pm

I don’t think Windows is enabling FMA4. This can be overridden with a device driver, but that now requires driver signing (bleh). Not fun.

CrayZeApe · September 10, 2018, 12:03pm

The AMD pages are back up, though none of the new Threadripper models are listed yet.

This is from AMD’s 2400G page.

https://products.amd.com/en-us/search/APU/AMD-Ryzen™-Processors/AMD-Ryzen™-5-Processor-with-Radeon™-Vega-Graphics/AMD-Ryzen™-5-2400G/243

And this from the Threadripper 1950X page.

https://products.amd.com/en-us/search/CPU/AMD-Ryzen™/AMD-Ryzen™-Threadripper/AMD-Ryzen™-Threadripper-1950X/177

It’s a strange duality. Perhaps AMD’s version of Schrödinger’s cat. FMA4 is both supported and not supported at the same time.

I had a quick look at GCC and that looks like it’ll need a few small patches to connect the FMA4 intrinsics to the znver1 target.

sceps · September 10, 2018, 5:11pm

And for Ryzen 7 2700X:

Certainly it does have Virtualization, etc.

I think the key word is “Key Features”… a subjective and non-exhaustive marketing term. Rather useless table if you ask me.

StrY · September 10, 2018, 5:29pm

So AMD unveiled 2 new Zen CPUs:

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Ryzen-2300X-2500X

Think it was about time

CrayZeApe · September 10, 2018, 5:38pm

“Key Features” is exactly my point though, looking specifically at FMA4. While it’s present on Zen, AMD have officially denied support for it and not provided a CPUID bit for it. Because of this, no compiler will currently use the perfectly good FMA4 code path, and if FMA4 code is encountered with the compiler target set to znver1 or native, the compiler will stop with an error.

If AMD are now promoting FMA4 as a key feature on Zen, one might expect to at least be able to use it. At the moment you can’t make effective use of the “Key Feature”, even on the chips that list it, like the 2400G given as example.

sceps · September 10, 2018, 6:14pm

Meanwhile FMA4 is listed as a key feature on 9 CPUs and 162 APUs, including pre-Zen devices. Does it work on any of them?

CrayZeApe · September 10, 2018, 7:09pm

Yes, Bulldozer and Piledriver support FMA4, expose the CPUID bit, and have native support for it in their C compiler targets.

This is the output of a benchmark compiled for the Bulldozer target, but running on Threadripper. Bulldozer has both FMA4 and AVX, so the two are compared.

Single-Precision - 128-bit AVX - Multiply + Add
GFlops = 42.192

Single-Precision - 128-bit FMA4 - Fused Multiply Add
GFlops = 66.816

Double-Precision - 128-bit AVX - Multiply + Add
GFlops = 20.904

Double-Precision - 128-bit FMA4 - Fused Multiply Add
GFlops = 33.408

Single-Precision - 256-bit AVX - Multiply + Add
GFlops = 66.72

Single-Precision - 256-bit FMA4 - Fused Multiply Add
GFlops = 66.816

Double-Precision - 256-bit AVX - Multiply + Add
GFlops = 31.776

Double-Precision - 256-bit FMA4 - Fused Multiply Add
GFlops = 33.408

Edit: Link to benchmark source.
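To put those figures side by side, here is a quick bit of arithmetic on the numbers posted above. Nothing is measured here; the GFlops values are copied straight from the output, and the dictionary layout and function name are just for illustration.

```python
# GFlops figures copied from the benchmark output above.
RESULTS = {
    ("single", 128): {"avx_mul_add": 42.192, "fma4": 66.816},
    ("double", 128): {"avx_mul_add": 20.904, "fma4": 33.408},
    ("single", 256): {"avx_mul_add": 66.72, "fma4": 66.816},
    ("double", 256): {"avx_mul_add": 31.776, "fma4": 33.408},
}


def fma4_speedup(result):
    """Ratio of the FMA4 rate to the AVX multiply + add rate."""
    return result["fma4"] / result["avx_mul_add"]


if __name__ == "__main__":
    for (precision, width), result in RESULTS.items():
        print(
            f"{precision:>6}-precision {width}-bit: "
            f"FMA4 is {fma4_speedup(result):.2f}x the AVX multiply+add rate"
        )
```

That works out to roughly 1.58x and 1.60x for the 128-bit cases, and 1.00x and 1.05x for the 256-bit cases.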

Marten · September 11, 2018, 1:42am

Is this a bow to Intel’s market share, so that code compiled on AMD systems runs on Intel as well?

CrayZeApe · September 11, 2018, 2:12am

It’s possible, though probably not the case. It’s not something they’ve done in the past, so there’s probably no reason to do so now. If that were the reason, it’s also unlikely they’d mention FMA4 as a feature on any Zen chips at all.

It’s possible AMD wanted to phase it out, but still have code compiled for their earlier architectures remain compatible. This doesn’t sound right though, as some other instructions from older CPUs are still missing (supposedly).

It’s also possible that there is an undisclosed bug when crossing CCX boundaries, but this is based solely on FMA4 being advertised on single-CCX chips.

Without AMD coming out and clarifying the situation, the above possibilities remain well and truly in the realm of speculation.

FurryJackman · September 11, 2018, 2:48am

For OEMs only. You cannot buy these parts, unlike the Ryzen 3 1300X or the Ryzen 5 1500X. The cheapest AM4 CPU from the Ryzen 2000 series is officially the 2600.

wendell · September 11, 2018, 3:23am

I am starting to think the week 25 segfault bug and the FMA4 bug are related, possibly. I suspect 1st Gen chips produced after week 25 are OK FMA-wise. I read the German post about erroneous results, but I didn’t get errors testing on a pre-week-25 Ryzen 1800X, though I might not have understood how to do the test.

I’ve been super busy, but that’s probably the last thing to do to close the loop… put together a little validation program and see what happens on 1st vs 2nd Gen TR. If Raven Ridge supports FMA4 then surely TR2 would.

CrayZeApe · September 11, 2018, 3:48am

The bug in early Ryzen occurred while executing FMA3 code so it could be related. It’s logical that FMA4 and FMA3 processing overlaps in the hardware design with many gates shared between the two.

I’ll see if I can either find or invent a validation test, though it’ll be a bit later. Some other things to tend to at the moment.
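One possible starting point for such a test is generating reference values to compare the hardware against. The sketch below does not execute FMA4 at all; it only computes what a correctly rounded fused multiply-add should return, using exact rational arithmetic, so a separate C or assembly harness (not shown, and not written yet) could check its FMA4 results against it. The function names and the chosen inputs are my own.

```python
from fractions import Fraction


def fma_reference(a, b, c):
    """a*b + c computed exactly, then rounded once to double.

    This is the value a fused multiply-add should produce for finite
    inputs (the sign of a zero result is not modelled here).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused(a, b, c):
    """Multiply, round, then add, round: the separate multiply + add path."""
    return a * b + c


# Inputs chosen so the two paths disagree: a*b is exactly 1 - 2**-54,
# which rounds to 1.0 when the product is rounded on its own.
A = 1.0 + 2.0**-27
B = 1.0 - 2.0**-27
C = -1.0

if __name__ == "__main__":
    print("fused reference:", fma_reference(A, B, C))
    print("unfused result: ", unfused(A, B, C))
```

A hardware FMA that returns the unfused value for inputs like these, or anything other than the reference, would be worth a closer look.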

mutation666 · September 11, 2018, 11:58pm

@Wendell do you know how much performance is lost with ECC versus non-ECC RAM? I only really want to buy RAM once for my box, as later in life it will want ECC, but it’s not really required right now. Do you have any good references for ECC performance vs something like regular unbuffered 3200 RAM? I’m looking at 2666 ECC, as 2933 does not exist right now.

wendell · September 12, 2018, 12:02am

Literally doing that right now.

Dual rank can be a speed boost. As can multiple sticks per channel. Depends on what you are using the RAM for.

So slower-clocked RAM, but more of it, can be about the same “gigabytes per second”. The faster ECC is not that much of a perf loss, surprisingly.

Comparing 128GB of 2666 ECC w/custom timings to 128GB of 2933 for the vid.

wendell · September 12, 2018, 12:03am

No FMA4 support in flags for Raven Ridge / Ryzen 5 2400G

processor : 7
vendor_id : AuthenticAMD
cpu family : 23
model : 17
model name : AMD Ryzen 5 2400G with Radeon Vega Graphics
stepping : 0
microcode : 0x810100b

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
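To compare this against the patched 2990WX output earlier in the thread without reading two walls of text, a set difference does the job. This is a small sketch with made-up helper names; the two strings are shortened excerpts of the flag lines posted here, trimmed to the regions around the differences, so paste in the full lines for a real comparison.

```python
def flag_set(flags_line):
    """Turn a 'flags : a b c' line (or a bare list) into a set of tokens."""
    _, _, value = flags_line.rpartition(":")
    return set(value.split())


def flag_diff(left, right):
    """Return (flags only in left, flags only in right), each sorted."""
    a, b = flag_set(left), flag_set(right)
    return sorted(a - b), sorted(b - a)


# Shortened excerpts of the two flag lines from this thread.
TR_2990WX = (
    "flags : cpuid extd_apicid amd_dcm aperfmperf skinit wdt fma4 tce "
    "mwaitx cpb hw_pstate"
)
R5_2400G = "flags : cpuid extd_apicid aperfmperf skinit wdt tce mwaitx hw_pstate"

if __name__ == "__main__":
    only_tr, only_apu = flag_diff(TR_2990WX, R5_2400G)
    print("only on 2990WX (patched kernel):", " ".join(only_tr))
    print("only on 2400G:", " ".join(only_apu) or "(none)")
```

Run against the full lines, the only flags present in the patched 2990WX output and absent here are amd_dcm, cpb and fma4, and nothing is listed here that the 2990WX lacks.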

ajc9988 · September 12, 2018, 7:49pm

As a potential way to get around the driver signature enforcement to test this.

nikzy1 · September 23, 2018, 4:00pm

What is the name of the song playing in the background? I have gotten completely hooked on it. I have tried to scavenge https://incompetech.com for it, without luck.