AMD Threadripper 3970X under heavy AVX2 load: Defective design? (No, but there is an issue)

Kinda off topic, but I’d like to see someone use the IceGiant thermosiphon on this workload vs. a 420, a 480, or heck, even a 560mm radiator. I wanna know whether the radiator size is just not good enough, or whether water simply can’t move the heat fast enough to dissipate it.

Hi, do you mind taking the “by” out of the title, so that the chip failing on one task reads as a mistake rather than something intentional?
Or is it OK if I change it?
I’m not going to argue about whether the chip is defective; it’s just the implied intention, that’s all.

I’m not opposed to removing the “by” (another option would be to put it in parentheses), but note that the title ends with a question mark, so this is really just me questioning whether AMD (or GIGABYTE, or both) decided to spec and implement razor-thin safety margins on the CPU VRM in order to favor better performance in realistic workloads at the expense of possible stability issues on exceptional and extreme workloads (pretty much only Prime95 with aggressive settings).


Yeah, that is kinda what I was referring to above.
Maybe it’s just hitting its power limit with PBO enabled?

Maybe manually playing around with PBO’s power limits (PPT, TDC and EDC) could have some effect there?

Did you also try the P95 test with PBO disabled?
Just to see if it still fails in the same circumstances.

I tried with PBO enabled and disabled (didn’t fiddle with the PPT, TDC and EDC settings). Crashes immediately with either.

Funnily enough, I tried options 16 and 2 and they worked fine. It seems like it’s an issue with this specific test.

Did a test on my 3970X with MSI Creator TRX40.
This board also uses XDPE132G5C, so I think it’s a good data point.

16K Min/Max FFT Prime 95, in place FFT=off, no errors (ran for about 20 minutes).

HWInfo details:
Core frequency: 3858-3860 MHz steady across all cores
CPU PPT: 399.98 W very steady near configured limit
CPU Core VID: 1.219 V
Vcore (SVI2 TFN): 1.087 V (-0.0375 V from offset, and -0.0945 V from Vdroop)
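
For anyone checking the arithmetic, the HWInfo readings above are self-consistent: VID plus the offset plus the Vdroop gives the reported Vcore. A minimal Python sketch, where the function name `expected_vcore` is only an illustrative helper and not anything HWInfo exposes:

```python
def expected_vcore(vid, offset, vdroop):
    """Sum the requested VID with the (negative) offset and Vdroop, in volts."""
    return vid + offset + vdroop


# Values copied from the HWInfo details above.
vid = 1.219        # CPU Core VID
offset = -0.0375   # Vcore offset set in BIOS
vdroop = -0.0945   # droop under load

vcore = expected_vcore(vid, offset, vdroop)
print(f"{vcore:.3f} V")  # prints 1.087 V
```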

I’m running a modest PBO-Advanced OC:
PPT limit raised to 400W
TDC current limit raised to 300A
EDC current limit raised to 400A
Vcore offset -0.0375 V
LLC calibration set to lowest (max vdroop / lowest voltages at load)
Spread spectrum option is missing in BIOS (although it is in the MSI printed manual).
128 GB RAM at 3333 MHz

BIOS is 7C59v13 (AGESA CastlePeakPI-SP3r3 1.0.0.3 Patch A)


@kroyl Thanks for the feedback. Out of curiosity, why are you running LLC calibration at its lowest? What’s the rationale?

16K Min/Max FFT Prime 95, in place FFT=off, no errors (ran for about 20 minutes).

You should try with in-place FFT = on for maximum load on the cores and caches.

Retested with in-place FFT on, 16K min/max FFT size, test ran for 40 minutes.

No errors, and since I’m hitting a manually set 400W limit, there is almost no difference.
The average clock speed is about 10 MHz higher, and the Vcore is 1 step higher (1.094V).

The CCD temps reached 71.0/73.0/77.0/72.0 C - this is with 2x 280mm rads and ambient temp of around 27C.
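
To compare these numbers against other cooling setups, it is probably more useful to look at the rise over ambient than at the absolute temperatures. A rough Python sketch using the figures above, treating the "around 27C" ambient as exactly 27.0 (an assumption) and using a made-up helper name:

```python
def rise_over_ambient(temps_c, ambient_c):
    """Return each temperature minus ambient, in degrees C."""
    return [t - ambient_c for t in temps_c]


ccd_temps_c = [71.0, 73.0, 77.0, 72.0]  # CCD temps reported above
ambient_c = 27.0                        # "around 27C", assumed exact here

rises = rise_over_ambient(ccd_temps_c, ambient_c)
print(rises)                    # [44.0, 46.0, 50.0, 45.0]
print(max(rises) - min(rises))  # 6.0, spread between hottest and coolest CCD
```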

CCD5 seems to consistently use about 10% more power at the same clocks, especially Core 22, which is ranked 28 in the boost preference list.
What’s interesting - when the load stops, that CCD5 quickly becomes cooler than the others - probably because the thermal grease is “burned in” better there.

As for the reasons for using the lowest LLC setting - this is a way to apply a large negative Vcore offset without impacting the CPU performance/boost capacity for low-current workloads.
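
A rough way to picture this is a simple linear load-line model, where the droop grows with current, so a low LLC setting acts like a negative offset that only bites at high load. This is just a sketch under assumptions: the model is simplified, and the 0.3 mOhm load-line and 315 A figures are hypothetical values picked so the droop matches the 0.0945 V reported earlier. Neither was measured in this thread.

```python
def load_voltage(vid, offset, loadline_ohms, current_a):
    """Simplified model: Vcore = VID + offset - R_loadline * I."""
    return vid + offset - loadline_ohms * current_a


VID = 1.219             # from the HWInfo reading earlier in the thread
OFFSET = -0.0375        # BIOS Vcore offset from the same post
LOADLINE_OHMS = 0.0003  # hypothetical 0.3 mOhm load-line, for illustration

# Light load barely droops; heavy load droops a lot.
for current_a in (20, 150, 315):  # hypothetical currents in amps
    v = load_voltage(VID, OFFSET, LOADLINE_OHMS, current_a)
    print(f"{current_a:>3} A -> {v:.4f} V")
```

In this model the voltage at light load stays close to VID plus offset, which is why boost on low-current workloads is mostly unaffected.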


Is there any sort of update as to what the problem is, or more specifically a fix or a workaround?

I’m going to be purchasing about 100 units of new development workstations. My CIO had originally spec’d out GIGABYTE + Threadripper 3960X, but I’m now leaning towards Intel (and something other than GIGABYTE) to be safe. I need to make a decision this week unless there is some sort of traction from either AMD or GIGABYTE on this defect.

I will post an update in a few days (this week).


You could go for TR but choose a different motherboard vendor, AFAIK only Gigabyte boards suffered from this defect. But yikes, with 100 units on the line, I understand if you would rather not risk it.

I cannot reproduce this behaviour at stock settings with my 3960X and ASRock TRX40 Creator. If I lower the core voltage too much, I can observe the instant fails at 16K FFT AVX2. So I guess ASRock boards are not affected?


Maybe…

Most complaints, from what I’ve read, are from Gigabyte board users,
mostly the Master and the Extreme boards in particular.


Looking forward to any updates regarding this. About to boot up my 3970X and Aorus Master soon. Was AMD able to reproduce this issue?


I’m seeing a fail within minutes on workers 19 and 20 with AVX2 on and 16K FFTs. Runs for hours with AVX2 off.

Asus Zenith II Extreme Alpha + 3970X


Hi,

A couple of us are working with AMD to investigate what exactly is going on. At this point the problem is still not fully understood and more tests need to be carried out. Hopefully we’ll have more substantial news in the following days. Please watch this thread for further updates.

Franz


You still working with @dprairie_AMD or other people from AMD?

Drew and a few engineers at AMD.


Is this something that is fixable via an AMD AGESA update? It seems to be more widespread than just Gigabyte motherboards at this point.

I think this is something Buildzoid can actually test. The “murder” FFT for Prime95 varies between CPU series and it does look like this is gonna be one of those for Threadripper. 8086K is 120K FFTs in-place.

It’s about suppressing the noise from the CPU, so those boards with more capacitors might actually do better.
