
[SOLVED] AMD Threadripper 3970X under heavy AVX2 load: Defective by design? (Nope! But there was a bug)

Hi,

I’m the perplexed owner of an AMD Threadripper 3970X CPU on a GIGABYTE TRX40 Aorus Xtreme motherboard.

I’m writing this thread as a kind of followup/answer to @DerAlbi’s post here (“3970X - Prime95 stability?”) and to my own thread over at Reddit (“Prime95 errors with AMD Threadripper 3970X build”). Sorry, I’m new here and I can’t use links in my post yet…

TL;DR:

It appears that several (many?) 3970X parts, and possibly 3960X and 3990X parts too, are unable to perform correct AVX2 computations under heavy load (presumably high current, certainly not high temperature) when the data set fits in L2 cache. We need to start a conversation with actual engineers at AMD and/or GIGABYTE, and we need to get precise answers: Is there a systemic issue with the Threadripper 3000 CPU series? Or a high rate of defective products?


Update #1

Let’s ask the question: Is anyone able to run Prime95 with min and max FFT sizes set to 16K, in-place FFTs, 64 threads, with AVX2 on an AMD Threadripper 3970X? If so, what motherboard are you using? What settings (frequency and voltage)?


The Story

I ordered the 3970X a few days after it was announced back in November and I built my rig in January. Full current specs of the system below [1].

I immediately had a host of issues, including errors in Prime95 and even blue screens under Windows. I reported my experience over at Reddit [2]. Long story short: I determined by trial and error that one core seemed to be defective (!?). The system became stable again when I disabled core #32 in Windows, and I was then also able to pass Prime95. I ended up RMA’ing the CPU and, in order to avoid too long of a downtime, bought a second CPU while waiting for the first one to be reimbursed…

At the time I wrote in the Reddit thread (in a post titled “Update #5”) that with the new 3970X part, my system was fully stable and that I was able to pass Prime95…

An Unfortunate Development

I recently found out that’s not entirely true. The system is fully stable, but I still can’t pass Prime95. Here’s a quick summary of all the tests I’ve done so far:

  • MemTest86 v8.3, 4 passes (~8 hours), RAM at 3466 MT/s [PASS]
  • MemTestPro 7.0 (paid version of HCI MemTest), 700% (~5 hours), RAM at 3466 MT/s [PASS]
  • AIDA64 6.20.5300 System Stability Test (full system test except local disks) [PASS]
  • IntelBurnTest v2.54 (based on Intel Linpack), maximum stress level, all available RAM, 20 runs [PASS]
  • OCCT 5.5.3 test, large data set, 64 threads, AVX2 [PASS]
  • Google stressapptest 1.0.9, all available RAM [PASS]
  • Prime95 v29.8 build 6 torture test, 64 threads, min/max FFT = 4K, in-place FFTs, AVX2 (~1 hour) [PASS]
  • Prime95, min/max FFT = 8K, in-place FFTs, AVX2 [PASS]
  • Prime95, min/max FFT = 16K, in-place FFTs, without AVX2 [PASS]
  • Prime95, min/max FFT = 16K (in-place FFTs or not), with AVX2 [INSTANT FAIL]

I’m also using this system fairly heavily on a daily basis, and at times even punishingly so (I’m a rendering software engineer doing large C++ builds and heavy CPU, GPU or CPU+GPU renders) and as far as I can tell, it’s completely stable: I never had a blue screen or any kind of hard lockup, no weird errors in Event Viewer, no application crashes.

Yet, the system fails on Prime95 at any FFT size >= 16K if AVX2 is enabled.

What’s interesting is that Prime95 passes when the min and max FFT sizes are set to 4K or 8K, but it fails instantly (in less than a second) when they’re set to 16K.

In Prime95, an 8K FFT uses 8 * 1024 * 8 bytes (presumably sizeof(double)), i.e. 64 KB. A 16K FFT uses 128 KB. So as soon as >= 128 KB of data is accessed in place (i.e. a data set that fits comfortably in the 512 KB L2 cache) under heavy AVX2 load, the CPU produces incorrect results. Note that it passes with 8K FFTs, which also don’t fit in the 32 KB L1 data cache.
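To make the arithmetic above explicit, here is a rough sketch of the footprint calculation. It assumes 8 bytes per FFT element (my sizeof(double) guess) and uses the 32 KB L1 data cache and 512 KB L2 cache figures quoted above. The function names are made up for illustration, and Prime95's real working set (twiddle factors, scratch buffers) is almost certainly somewhat larger than this estimate.

```python
# Assumptions taken from the post, not from Prime95's source code:
# 8 bytes per FFT element, 32 KB L1 data cache, 512 KB L2 cache per core.
BYTES_PER_ELEMENT = 8
L1D_BYTES = 32 * 1024
L2_BYTES = 512 * 1024


def fft_footprint_bytes(fft_size_k: int) -> int:
    """Approximate in-place data size of an FFT of `fft_size_k` K elements."""
    return fft_size_k * 1024 * BYTES_PER_ELEMENT


def smallest_cache_that_fits(footprint: int) -> str:
    """Name the smallest cache level that can hold `footprint` bytes."""
    if footprint <= L1D_BYTES:
        return "L1D"
    if footprint <= L2_BYTES:
        return "L2"
    return "beyond L2"


if __name__ == "__main__":
    for size_k in (4, 8, 16):
        footprint = fft_footprint_bytes(size_k)
        level = smallest_cache_that_fits(footprint)
        print(f"{size_k}K FFT: {footprint // 1024} KB -> {level}")
```

By this estimate the 8K and 16K cases both land in L2, which is why the pass/fail difference between them is puzzling.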

Provisional Conclusion

At this point my conclusion is the following:

As soon as 128 KB or more of L2 cache is accessed under heavy, all-core AVX2 load, one or several of the compute units of the 3970X (AVX2? AVX? probably not ALU) produce incorrect numerical results.

It’s not a thermal issue since it fails within a second. It seems likely to be a power delivery issue. Something Vdroop something something? This is where my hardware knowledge currently stops, sadly.

Power and Temperature Considerations

A note on power consumption: My system draws around 195-240 W at idle (as measured by my PSU) with the CPU alone drawing about 70-80 W [3]. I’ve configured Windows’ power plan for maximum performance but I’ve set the minimum processor state to 70 % in an attempt to reduce power draw at idle. It’s possible that there are better settings for maximum performance under full load with best power economy at idle: I’m all ears!

Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C and tops around 72-78 °C under full load. I’m using the best air cooling setup I could think of and get my hands on, but it’s still air cooling, and my system is installed in a closed case (but with extreme attention to airflow).


Appendix

[1] Full system specs:

  • AMD Ryzen Threadripper 3970X
  • GIGABYTE TRX40 Aorus Xtreme (BIOS version F4a, latest as of Feb 2020)
  • Noctua NH-U14S TR4-SP3 + additional NH-U14S fan
  • HyperX Fury RGB 4x16 GB DDR4 3466 MT/s (Infinity Fabric at 1733 MHz)
  • NVIDIA GeForce RTX 2080 Ti Founders Edition
  • Corsair HX1000i PSU (1000 W)
  • Dark Base Pro 900 Black
  • Windows 10 Professional version 1909, build 18363.657 (fully patched)
  • Everything running at stock, no overclocking (apart from the XMP profile)

[2] Reddit: “Prime95 errors with AMD Threadripper 3970X build”

[3] HWiNFO64 at idle:


Which Prime95 version are you using?

Edit: Found it: v29.8

Assuming you are at stock clocks?

Yeah, Prime95 v29.8 build 6 (64-bit), stock clock.


I know it’s not quite the same setup as you, but I’ve been finding that my 3950X is completely stable with Precision Boost Overdrive and 3600CL16 XMP, with the sole exception of Prime95 Small FFT. If I run P95 with any more than 24 threads, the machine instantly black-screen crashes. If I run with a larger FFT size, it’ll do the full 32 threads all day.

I’m on an Asus Crosshair VIII Hero board. I too suspect some kind of power delivery problem, but I haven’t been able to narrow it down with any of the VRM settings.

I’ve had stability issues with a 3960X under similar conditions. Haven’t tried the AVX2 tests you did, but symptoms include USB devices stuttering or disconnecting, lower than expected performance, audio dropouts and on one particular benchmark an instant hard lock – the reset button didn’t work.

I also found a solution. It’s Vdroop, yep. Here are some ideas that might help you:

  • Disable all spread spectrum options in the BIOS. Disable VRM power saving. This was enough for me.

  • Reduce the memory clock to default settings; don’t use DOCP/XMP.

  • On Linux, use the schedutil CPU frequency governor, not performance or ondemand. It ramps up frequencies more slowly.

  • In extremis, slowly increase the load line calibration setting.
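For the governor suggestion, here is a minimal sketch of how you could check which governor each core is currently using on Linux. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function names are made up for illustration, and none of this applies to Windows.

```python
from pathlib import Path

# Standard cpufreq sysfs location on Linux (assumed to be present and readable).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_governors(root: Path = CPUFREQ_ROOT) -> dict[str, str]:
    """Map each cpuN directory name to its current scaling governor."""
    governors = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def cpus_not_using(expected: str, governors: dict[str, str]) -> list[str]:
    """List the CPUs whose governor differs from `expected`."""
    return [cpu for cpu, gov in governors.items() if gov != expected]


if __name__ == "__main__":
    current = read_governors()
    print(current)
    print("Not on schedutil:", cpus_not_using("schedutil", current))
```

Changing the governor means writing to the same file as root or using your distribution's tooling; the sketch only reads the current state.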

Have you already tried playing around with the XFR limit settings?
I doubt that this is a VRM issue.

I haven’t. What do you suggest I play with exactly?

nvm:

I’m just reading through a Reddit topic in which people seem to have issues similar to the ones you are encountering.
It might actually be an issue with the CPU itself, or just with Prime95.

Please post your findings in this thread at Mersenne Forums, where the author of Prime95, George, hangs out. Prime95 has had similar bugs in the past, and there’s a good chance you’ve found one.

I doubt this configuration has been tested, and it’s unlikely George has such a high core count system. Those cores are severely memory bandwidth starved for Prime95 for its intended purpose (8 to 12 cores would be a better fit for 4 channels of fast memory for Prime95).


Maybe @wendell could do a quick run to see whether he can reproduce this as well.
I mean, if more people are encountering this problem, then it might indeed be an issue with Prime95 itself.


@Lt.Broccoli

Great idea, will do now!

Just to be clear: here are the settings I chose and the sample output. Is this what you’re trying to run?

Seemingly no issues on my 3970X + ASRock TRX40 Creator. This board does have “SRC Spread Spectrum” disabled by default, claiming that enabling it reduces stability.

[image: Prime95 settings and sample output]

These are the correct settings indeed.

How long have you been running the test?

I think it’s also disabled by default on my motherboard but I’ll need to check next time I reboot.

I let it run for > 20 minutes.


Be aware that some spread spectrum settings can be hidden behind other options. For example, on the Asus motherboard I’m using, I couldn’t disable VRM spread spectrum without first disabling VRM power-saving – but disabling spread-spectrum is what stopped the instability.

Thanks for posting. Please have a look at my thread, where I posted an update: the VRM misses a switching cycle every 4 ms, actively discharging Vcore, so a voltage dip occurs. Oscilloscope screenshot here:
https://forum.level1techs.com/uploads/default/original/4X/e/9/8/e98317514c2e8f565a8c6c76ce078779581b98bb.png
My comments on it in this post:


Well, I think the fix can be done via a BIOS update. I am running a 3960X with an Aorus Master. On the F4h BIOS I could run the test for about 10 minutes before failure. When I upgraded to the F5a BIOS, the test fails immediately with spread spectrum enabled or disabled.

I am running Prime95 v29.8 on Linux (Siduction, based on Debian sid). The F4h BIOS was missing the spread spectrum settings.

Well, I remember that Gigabyte actually had a bug with, I believe, the F4 BIOS on the Master.
It caused the VRM to run abnormally warm, even though it has an active cooling fan.
The weird thing is that the Xtreme did not seem to have this issue, even though the VRMs are physically the same.
I believe Gigabyte fixed the problem with the F5 BIOS on the Master, but did not really state what the actual issue was.

Still, I kind of doubt that the VRM is the culprit when it comes to the failure under heavy AVX2 load.
I wonder whether this issue might be related to the CPU actually hitting its power limit.

Hi,

Sorry you are having issues…can you reach out to me @ [email protected] and we can get you connected with some folks over here to help.

Thanks-
Drew


Hey everyone, we (mods) are in the process of confirming @dprairie_AMD’s identity. We advise holding off on reaching out to him in the meantime.

Assuming you’re legit, welcome @dprairie_AMD. We’re happy to have an AMD rep among us.
