[SOLVED] 3970X - Prime95 stability?

DerAlbi · February 21, 2020, 6:06pm

Hi, i have not read the post (in detail) yet, but this is getting very interesting. Last night (night for me) I had a session with Wendell who was kind enough to spend time with me probing his board, testing Prime95 and so on. He does not have issues.

However i am willing to share another screenshot with you of my oscilloscope findings:

What you see is the Vcore voltage dip on the bottom. On top i probed 3 of the 16 phases. Here, only the waveform shape counts; the way i probed the phases does not give me correct absolute values. View it just as an indication of what happens!

THE VRM IS INDEED MISSING A CYCLE.
And it's worse: it is not only a missing cycle due to a protective feature. I can distinguish this because the low-side mosfet of the stepdown is intentionally turned ON by the multiphase controller, causing the dip. All phases at once. The CPU does not have a chance.

Now, i have had 2 boards with the exact same issue.
I have to change the CPU first to really double check that it is not the CPU, but it makes absolutely no sense that the CPU would cause that issue. PMBus communication is far too slow to misconfigure the VRM for only one cycle.

My current guess is that the XDPE132G5C has some sort of configuration issue (which can be fixed by a BIOS update) or it is actually a hardware fault inside the chips - as if there is a bad batch or something from Infineon.

PLEASE tell me that Franz is a German name? Where do you live? Send PM.

In your post people suggest it is a Vdrop issue. Trust me, the board's copper layout is solid. Those boards do NOT generate excessive drop with huge variations (meaning the load line calibration in the BIOS stock settings should be fine). If someone hints at a drop, i honestly think it is the VRM fucking up. We need more oscilloscopes among the people.

It is very unfortunate that Gigabyte is not taking me seriously. I have no way to get through to someone with whom a technical discussion about the intricate details of the VRM would actually be fruitful.

FranzB · February 21, 2020, 10:14pm

Hi,

Thanks for sharing your findings. I agree, this is getting really interesting indeed! I’m so glad that we got the conversation going.

I’m French, actually. Franz is what most people call me, in part because I lived many years in Germany (Berlin), in part because it’s much easier for everyone to spell (my actual name is François: no weird ç in Franz!)

I cannot agree more. My dream scope is the Rohde & Schwarz RTB2000. Maybe it’s time I shell out the money…

My hope is that we can start a conversation with GIGABYTE. Your scope analysis should go a long way in showing that we’re dedicated to finding out what the duck is going on here.

DerAlbi · February 21, 2020, 10:33pm

I wish that were true. I gave them the findings. I described the very first scope pictures “as if the stepdown stops switching”. That is obvious to anyone with experience in VRM design. (The second option is an intermittently short-circuited load.)
The problem is that the front-line people at Gigabyte tech support are not expert enough to recognize the seriousness, so they dismiss me as the usual idiot customer and give me advice like increasing load line calibration, which (that's obvious from the pictures) is not the issue at all.

My best guess:
Basically what we are looking at is a misconfiguration of the VRM. It seems like the Threadripper goes to idle for a very short time on all cores, and that causes a voltage spike to which the VRM overreacts by skipping a cycle. (This is not uncommon behavior for stepdowns)
Those VRM controllers are programmable. The voltage spike when you go from 100% -> 0% load is normal, and usually the capacitors on the board swallow that transient if they are sized correctly and in accordance with the VRM's control loop.
It seems like they have programmed in wrong time constants.
I think this is fixable by bios update, but we need to get the word out.

Also interesting is, why only some people have the problem.
If they recognized the problem quietly, the newest BIOS should fix it. If they changed the hardware slightly and forgot to change the settings… that's an oopsie

Again, it could still be the CPU. We have to eliminate that possibility before making really strong claims. Right now, it's guessing. I can't investigate deeper, since i need my PC and the thing is just too expensive to dissect for fun.

Check your PMs.

wendell · February 21, 2020, 11:16pm

@deralbi and I spent several hours trying to reproduce this on an Aorus Extreme and the Designare on both Windows and Linux, and we were not able to. My VRMs showed drops and peaks but nothing that looks as bad as what deralbi has, and Prime95 ran and ran and ran. The Arctic tower cooler system got really toasty but was fine.

I am investigating.

Edit: what temps do you have from your vrms and core? The designare is rather balmy and my thought-it-was-overkill cooling is not doing as well as I thought it would

DerAlbi · February 22, 2020, 12:00am

Temps are fine. They are in the 60s (°C) when i adhere to Package power limit. Have a 480mm rad…
The voltage issue is unlikely to be a temperature thing. Temperature is an extremely slow process compared to the 500kHz switching. It would never cause a single missing cycle, only hard shutdowns.

My current working theory might relate to capacitor tolerances. The short voltage spike before the dip might trigger an overreaction of the VRM controller. This spike should be swallowed by the capacitors (and likely is). But depending on the capacitor tolerance (which can be as bad as -20% to -30%), this voltage spike can be more pronounced or better handled. If i have had 2 motherboards with bad tolerances and an AMD chip that is especially susceptible to voltage, then this could explain it. The VRM may be misconfigured in those edge cases where the load decreases within a microsecond. Then the problem amplifies itself when the VRM skips a cycle, basically over-discharging the capacitors for compensation.
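
To put rough numbers on the tolerance idea, here is a minimal Python sketch of the load-release overshoot. It uses a lossless energy balance (the energy stored in the phase inductors ends up in the output capacitors) and ignores ESR and any controller reaction, so treat it as an upper-bound estimate only. Every value in it (200 A load step, 150 nH per phase, 5 mF output capacitance, 1.2 V Vcore) is an assumption for illustration, not a measurement from the board; only the 16 phases and the -20% to -30% tolerance come from this thread.

```python
import math


def load_release_overshoot(v0, i_step, l_phase, n_phases, c_out):
    """Upper-bound overshoot in volts when the load drops by i_step at once.

    Energy balance: n * 0.5 * L * (I/n)^2 stored in the inductors is
    dumped into the output capacitance, so
    0.5 * C * (v_peak^2 - v0^2) = 0.5 * (L/n) * I^2.
    """
    l_eq = l_phase / n_phases  # n equal inductors sharing the current
    v_peak = math.sqrt(v0 ** 2 + l_eq * i_step ** 2 / c_out)
    return v_peak - v0


# Hypothetical values, chosen only to show the trend.
V_CORE = 1.2        # volts (assumed)
I_STEP = 200.0      # amps released within a microsecond (assumed)
L_PHASE = 150e-9    # henry per phase (assumed)
N_PHASES = 16       # phase count mentioned in the thread
C_NOMINAL = 5e-3    # farad of total output capacitance (assumed)

for tolerance in (0.0, -0.2, -0.3):
    c_out = C_NOMINAL * (1.0 + tolerance)
    overshoot = load_release_overshoot(V_CORE, I_STEP, L_PHASE, N_PHASES, c_out)
    print(f"tolerance {tolerance:+.0%}: overshoot about {overshoot * 1e3:.1f} mV")
```

With these made-up values the spike grows noticeably as capacitance drops, which is the point: a controller threshold that is just about safe with nominal capacitors could be crossed on a board at the low end of the tolerance.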

You can try that by ripping out 20-30% of the capacitors on your board (it's enough to desolder the THT caps near the inductors). But desoldering on those multilayer PCBs may be impossible without really good tools. (i have them, but can't risk my board / warranty)

Jimster480 · February 22, 2020, 12:28am

Interesting to see the outcome of this. Subbed

happyluckbox · February 22, 2020, 7:36am

Also curious about this. Apparently an AMD rep may have responded in the other thread about this issue…

DerAlbi · February 22, 2020, 8:55am

Not sure AMD can do much about it except sending out samples to help us diagnose the issue and/or put pressure on Gigabyte to respond.
I would be happy if they lent me two known good CPUs for my known bad motherboard.

MisteryAngel · February 22, 2020, 2:04pm

Well i suppose that AMD likely has those boards in house,
so they could try to reproduce the problem.
I cannot remember having read many other complaints
from users of different motherboards.
Mainly only Gigabyte Aorus Master and Extreme users.
So this is kinda interesting.

chris719 · February 26, 2020, 10:16am

Any update on this? Trying to figure out if I should buy a Gigabyte or ASUS board for a 3960X.

DerAlbi · February 26, 2020, 1:08pm

Not much definite that I can say right now - AMD is looking into it; a big company is a bit slow. In my opinion, the issue i have is BIOS-fixable, if you even run into it.
Also, complaints are mainly about the MASTER and EXTREME - Designare seems fine (Wendell tested it) although the sample size is not large enough for any kind of prediction.

For now, i suggest you buy ASUS, if you find a board you like and it is the same price. If ASUS is a compromise for you, then buy Gigabyte. This thread is only a hint to an existing problem, by no means does it guarantee that you will suffer.
In general, you will only suffer from not having a PC if you RMA your parts. Now, since you are aware of the issue, you can test this right away and simply use your 14-day return right with a pretty good money back guarantee.

FranzB · March 3, 2020, 12:08am

FYI, someone with an ASUS Zenith II Extreme Alpha and a 3970X just reported a similar issue:

TheAlmightyBaconLord · March 3, 2020, 1:17am

This is getting more interesting with every new post! Keep all of us updated!

DerAlbi · March 10, 2020, 12:48am

Guys, an update:

Basically this whole Prime95-stability issue has been resolved on a technical level, but the fix is not out yet. So if you were uncertain whether to buy a Threadripper or a Gigabyte MoBo, don't be, it's fine.

The VRM issue was real. Gigabyte has a fix. (as expected)
The P95-stability issue was not only due to the VRM-flaw but also because of an AGESA-flaw.

BOTH issues cause P95-instability. This was confusing to diagnose since people with- and without the VRM problem may have experienced problems.
I first believed that all systems that crashed must have had the VRM issue - but that wasn't the case. When crashing systems were diagnosed with “good” VRMs, I initially thought it was a measurement mistake or just a hard-to-measure level of VRM misbehavior.

To anyone who followed the Gigabyte BIOS releases, the last few days were funny-to-watch chaos.

Basically any BIOS version F4x is flawed (old AGESA, no VRM fix; which does not mean your system is inherently unstable, but it could be).
Then Gigabyte came out with F5b, which fixed the VRM bug. But to the anger of AMD this was an uncoordinated release and did not include the new AGESA version which would have fixed all problems at once. So Gigabyte pulled F5b and pushed F5c (which is currently the most recent BIOS online). This BIOS now has the new AGESA to fix P95, but funnily enough, it does not include the VRM fix.

I am pretty sure that quite soon there will be an F5d BIOS or something that will include both fixes and “everything just works”.

Once this is out, the Vcore will look like this:

THIS IS PERFECT! (captured with the short-lived F5b bios; P95 was still crashing due to AGESA-problems)

As a reminder, a load decrease (the voltage jump you see above) caused missing switching cycles before, during which the low side mosfets were turned on, resulting in a violent discharge of Vcore. Here is some detail:

(Captured with the current F5c-bios)

I am currently staying on the F5c bios since the new AGESA version does make my PC more stable in light-load scenarios. I am currently running 60h without crash (the age of F5c) - this was unthinkable before.
However under high load the VRM bug still causes CPU-failure.

Again, i think this will be fixed very soon (it should be trivial). So everything is good.

What was it like to work with AMD?
Well, there was no working WITH or FOR them. It's a big company with big policies - it is pretty much a one way communication. No juicy details leave AMD, which at times was quite annoying; especially as an interested engineer, i would have preferred some details-but-not-details, basically a summary for idiots at least. The one way communication left me quite frustrated at times - especially when AMD hinted that they weren't completely sure if they addressed the right issue or not (they did). It sparked the urge to help or double-check, which was impossible. (Those issues are all about replicating the original problem, which can be extremely hard, especially if the problems are statistical in nature and only communicated via email in written form in English as a 2nd language)
AMD's motivation to solve the issue was/is crazy high. People really care to provide a good product and a good experience. The only thing that bothers me is how broken the customer support is - i guess it is fine when you have a normal RMA, but true technical issues don't get through.
The fact that this issue was solved so quickly seems like a big coincidence. (Thanks forum!)

It boggles the mind what AMD can do with software configuration to their processors. Those systems are incredible if you even try to understand the details that must go into them.

All in all a super cool experience

MisteryAngel · March 10, 2020, 12:52am

Did Gigabyte make any statements about the actual
VRM bug, or let’s say the BIOS bug?

Because i’m kinda curious about that.

DerAlbi · March 10, 2020, 12:56am

No, there was no communication with Gigabyte at all. AMD was my/our interface the whole time.
However you can ask me for details. I am pretty confident that i understand the VRM problem well enough to give educated answers.

Oh and in case you want to know:
Bios F5b (which fixed the VRM) was described along the lines… “memory compatibility improvement” yeah, right.

MisteryAngel · March 10, 2020, 1:02am

Well, this is not actually the first time that Gigabyte
has had VRM issues with TRX40 boards.
So i’m kinda curious what the actual bug is.

DerAlbi · March 10, 2020, 1:18am

To my best knowledge:
This is a simple configuration failure. Those VRMs alone are incredibly complex systems (from a control-theory point of view). They are very customizable, and even their regulation response time can be programmed (to some extent).
Those VRMs have so many features that it is easy to misconfigure them.

The particular effect you see here is a result of missing switching cycles (or cycles with a duty cycle of 0%). The failure always occurs after a load decrease (you can see the voltage rising slightly before the failure).
The missing Power-stage pulses can be caused by 4 things:

1) An over-voltage protection based on the voltage slope. Maybe if the voltage rises too quickly, Vcore gets discharged violently in order to protect the hardware (preemptively). I am not sure if this feature exists in the chip, since its datasheet is not open to the public.
2) The load reduction triggers a power-saving feature of the VRM. The missing pulses could be explained by a varying switching frequency (which is evidently enabled, since it can insert pulses during recovery, effectively reaching 1MHz switching frequency). So when load decreases sufficiently, the switching frequency could be decreased to 250kHz. This would explain it perfectly.
3) The VRM has an “adaptive control loop”. Such a control loop algorithm seeks to optimize load change response. This is a very complicated control-theory topic and there is much that can go wrong here. It is very well possible that this adaptable control loop just fucks up during load decrease. If the adaptability isn't contained within certain constraints, things happen.
4) There are actually more things, like power stage current sharing failures, which could cause a VRM reset/restart that could manifest this way.

I am strongly in favor of option 2)
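
For a feel of the time scales in option 2, here is a small Python sketch. It converts the switching frequencies mentioned in this thread (250kHz, 500kHz, 1MHz) into periods and estimates how far the capacitors sag if no phase delivers a pulse for one period while the CPU still draws current, using dV = I * dt / C. The 100 A load and the 5 mF output capacitance are assumptions for illustration only. The sketch also ignores the low-side mosfets actively discharging Vcore, so the real dip would be deeper.

```python
def switching_period_us(f_hz):
    """Length of one switching period in microseconds."""
    return 1e6 / f_hz


def droop_mv(i_load_a, gap_s, c_out_f):
    """Voltage sag in millivolts when the capacitors alone feed the load.

    dV = I * dt / C, with no charge delivered by the VRM during the gap.
    """
    return 1e3 * i_load_a * gap_s / c_out_f


I_LOAD = 100.0   # amps still drawn by the CPU (assumed)
C_OUT = 5e-3     # farad of output capacitance (assumed)

for f_hz in (1e6, 500e3, 250e3):
    gap_us = switching_period_us(f_hz)
    sag = droop_mv(I_LOAD, gap_us * 1e-6, C_OUT)
    print(f"{f_hz / 1e3:.0f} kHz: period {gap_us:.0f} us, "
          f"sag for one missing pulse about {sag:.0f} mV")
```

The takeaway is only the scaling: halving the frequency doubles the gap between pulses and therefore doubles the sag for the same load and capacitance.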

going to bed now. sorry

MisteryAngel · March 10, 2020, 1:47am

Yeah it’s probably a misconfiguration between the BIOS
and the actual PWM controller, the Infineon XDPE132G5C.
Option 2 sounds plausible.
I believe Gigabyte normally runs their switching frequency at about 400kHz.
But if in the case of said BIOS it’s dynamically changing,
then it could definitely be a reason why it would skip cycles when load decreases.

But of course AMD also acknowledged that there is an issue on their side.
So it’s likely a combination of both.

DerAlbi · March 10, 2020, 9:16am

Halving the switching frequency is not a safety feature; it is about VRM efficiency. Every switching cycle creates some amount of heat from the switching alone, which is why a lower switching frequency is inherently beneficial for the mosfets. However, a lower switching frequency increases the RMS current in the inductors, which causes more heat loss in them. It is a delicate balance to find the overall minimum heat loss.
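
That balance can be sketched with a very simplified per-phase loss model in Python. Switching loss is taken as a fixed energy per cycle times the frequency, and conduction loss as a series resistance times the squared RMS current of a DC level with triangular ripple. All component values below are hypothetical, core loss and gate-drive loss are left out, and a real design is also bound by ripple and transient requirements, so this only illustrates the shape of the trade-off.

```python
def ripple_pp(v_in, v_out, l_h, f_hz):
    """Peak-to-peak inductor ripple current of an ideal buck phase."""
    duty = v_out / v_in
    return v_out * (1.0 - duty) / (l_h * f_hz)


def phase_loss(f_hz, v_in, v_out, l_h, i_dc, r_series, e_sw):
    """Switching loss plus conduction loss of one phase, in watts."""
    d_i = ripple_pp(v_in, v_out, l_h, f_hz)
    i_rms_sq = i_dc ** 2 + d_i ** 2 / 12.0  # DC level plus triangular ripple
    return e_sw * f_hz + r_series * i_rms_sq


def best_frequency(v_in, v_out, l_h, r_series, e_sw):
    """Frequency where d(loss)/df = 0 for the model above."""
    k = v_out * (1.0 - v_out / v_in) / l_h   # ripple_pp = k / f
    return (r_series * k ** 2 / (6.0 * e_sw)) ** (1.0 / 3.0)


# Hypothetical values, not taken from any datasheet.
V_IN, V_OUT = 12.0, 1.2   # volts
L_H = 150e-9              # henry per phase
I_DC = 12.0               # amps per phase
R_SERIES = 1e-3           # ohm (inductor DCR plus mosfet on-resistance)
E_SW = 1e-6               # joule lost per switching cycle


def loss_at(f_hz):
    return phase_loss(f_hz, V_IN, V_OUT, L_H, I_DC, R_SERIES, E_SW)


f_opt = best_frequency(V_IN, V_OUT, L_H, R_SERIES, E_SW)
for f_hz in (0.5 * f_opt, f_opt, 2.0 * f_opt):
    print(f"{f_hz / 1e3:7.0f} kHz -> {loss_at(f_hz):.3f} W per phase")
```

In this model, going below the optimum makes the ripple term grow quickly (it scales with 1/f^2), while going above it makes the switching term grow linearly.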