[SOLVED] 3970X - Prime95 stability?

TheAlmightyBaconLord · February 19, 2020, 10:17pm

Don’t have experience with the 3970x, but that seems really high for a system on idle.

What’s your full spec list? Are HDDs plugged in? GPU?

To get the best measurement, use a current-clamp (you likely have access to one) and tell us what you get on your 12v CPU rail. Try it in the bios and booted into an O.S.

Regarding your mention of measuring the VRM output, why can’t you remove the heatsink from the VRMs? Also, have you tried measuring the voltage stability of the DIMM power rails? I guess you also tried varying the memory configuration across different slots (even if they aren’t configured optimally)? I know it seems like it may be a CPU issue, but it could also be a DIMM issue.

Also, BIOS update (didn’t see it mentioned)?

Edit: Also, maybe also try disabling CPU cores

DerAlbi · February 19, 2020, 10:51pm

Full spec is: 3970X, 5700XT, Aorus Master TRX40, 64GB 3600CL16 (4 sticks), 4x 2TB (SSD) @ SATA, 2x 1GB @ NVMe, some fans + a pump. 850W power supply, full load is 600W-700W. (So there is actually room for the graphics card.)

I unfortunately do not have a DC current clamp. However, I measured the 12V rail voltage-wise and it is rock solid, as are all the other rails. Power supply is fine. (I understand you wanted me to measure CPU power draw, but I can’t, so I answer this instead.)

Removing the heatsink of the VRMs seems sketchy. I know it’s possible, but I also do not want to void the warranty more than necessary. Also, measuring on a live, stupidly expensive system does not feel good. I don’t have the right setup here to do a professional job.
The voltage stability of the other rails (those that have measurement points) is fine. It is only Vcore.

Bios is updated.
Disabling CPU cores helps, but it only lessens the chance. I ran a while with 8 cores (via BIOS), and it still happens. Also, I can disable all cores in Linux that fail in stressapptest or Prime95, but when I run the test again, other cores just fail. I see on the oscilloscope that the depth of the glitch is load dependent. This is why disabling cores helps.
This behavior is the same on both motherboards i now own.
The Vcore glitch happens like clockwork every 4ms. It is not related to VID-changes.
I disabled C6 via zenstates.py. No change under load.
I also tried different memory speeds. The glitches also happen when memory is at stock 2133MHz.
I also removed and/or shuffled memory sticks around. No change. Memory is fine: when memory testing, it is the CPU that fails the comparison or hash/CRC computation (stressapptest retests a failed memory segment, and retesting is often successful).
I also exchanged my 5700XT for a 1080. Although the system ran more stable, the glitches on Vcore are still present.
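For anyone who wants to repeat the "disable failing cores in Linux" step, here is a minimal sketch. It assumes the standard sysfs CPU hotplug interface, where writing 0 to `/sys/devices/system/cpu/cpuN/online` takes logical CPU N offline (root required). The function names are made up for illustration and this is not necessarily how the cores were disabled in the tests above.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU hotplug control.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write 1 or 0 to <root>/cpu<N>/online (needs root on a real system)."""
    control = root / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")


def offline_cpus(cpus, root=SYSFS_CPU_ROOT):
    """Take the given logical CPUs offline and return the ones skipped.

    A CPU is skipped when it has no 'online' control file, which is
    commonly the case for cpu0.
    """
    skipped = []
    for cpu in cpus:
        control = root / f"cpu{cpu}" / "online"
        if not control.exists():
            skipped.append(cpu)
            continue
        set_cpu_online(cpu, False, root)
    return skipped
```

Usage would be something like `offline_cpus([5, 37])` as root, with the list being whatever logical CPUs failed in the last run. The `root` parameter only exists so the functions can be tried against a scratch directory first.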

The CPU is the only thing left to change.

To me, it makes no sense. at all.
It would be a tremendous help, if anyone could hook up an oscilloscope to their Threadripper3000 system. As far as i know, Gigabyte uses the same VRM on many boards. If the glitch is part of normal operation, then i have another problem. But not knowing what to expect on Vcore is what holds me back from buying another 3970X. It is no fun to spend 2000€ on a guess.
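If someone does capture Vcore and can export the trace as time/voltage samples, a rough way to check for a periodic glitch is to look for excursions below a threshold and measure their spacing. The sketch below is only an illustration: the threshold, sample rate, and synthetic trace are assumptions, not measured values.

```python
from statistics import median


def find_dips(times, volts, threshold):
    """Return the start time of every excursion below `threshold`."""
    starts = []
    below = False
    for t, v in zip(times, volts):
        if v < threshold and not below:
            starts.append(t)  # first sample of a new dip
        below = v < threshold
    return starts


def dip_interval(times, volts, threshold):
    """Median spacing between consecutive dips, or None if fewer than two."""
    starts = find_dips(times, volts, threshold)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return median(gaps) if gaps else None


if __name__ == "__main__":
    # Synthetic trace: 1.20 V nominal, sampled every 10 us, with a short
    # dip to 1.05 V every 400 samples (illustrative numbers only).
    dt = 10e-6
    times = [i * dt for i in range(2000)]
    volts = [1.05 if i % 400 in (0, 1) else 1.20 for i in range(2000)]
    print("median dip spacing [s]:", dip_interval(times, volts, 1.10))
```

A spacing that stays constant regardless of load would point at something periodic (as with the 4ms pattern described above) rather than random load transients.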

Methylzero · February 20, 2020, 11:08am

Have you tried changing the VRM settings in the UEFI? The controller can probably do things like dynamic switching frequency, phase shedding, etc. to save power. But if that’s buggy you might get more stability with all phases active all the time at a fixed switching frequency.

DerAlbi · February 20, 2020, 5:25pm

Have tried it, does not help at all.
I don’t expect that anything helps anymore either. If two boards behave the same, that would mean that there is a systematic problem with the VRM. AFAIK the VRM design is shared across multiple models. I think they would have addressed that by now. This is nothing that can be solved by BIOS settings, or all Gigabyte TRX40 boards would be unstable out of the box.

My focus is on the CPU right now.

FranzB · February 21, 2020, 1:33pm

Hi @DerAlbi,

You’re not alone! I have posted my story in this forum, it’s quite similar to yours.

Look for the topic “AMD Threadripper 3970X under heavy AVX2 load: Defective by design?”. Sorry, I’m new here and my posts can’t have links yet.

DerAlbi · February 21, 2020, 6:06pm

Hi, i have not read the post (in detail) yet, but this is getting very interesting. Last night (night for me) I had a session with Wendell who was kind enough to spend time with me probing his board, testing Prime95 and so on. He does not have issues.

However i am willing to share another screenshot with you of my oscilloscope findings:

What you see is the Vcore voltage dip on the bottom. On top I probed 3 of the 16 phases. Here, only the waveform counts; the way I probed the phases does not give me correct absolute values. View it just as an indication of what happens!

THE VRM IS INDEED MISSING A CYCLE.
And it’s worse: it is not only a missing cycle due to a protective feature. I can distinguish this because the low-side MOSFET of the stepdown is intentionally turned ON by the multiphase controller, causing the dip. All phases at once. The CPU does not have a chance.

Now, i have had 2 boards with the exact same issue.
I have to change the CPU first to really double check that it is not the CPU, but it makes absolutely no sense that the CPU would cause that issue. PMBus communication is far too slow to misconfigure the VRM for only one cycle.

My current guess is that the XDPE132G5C has some sort of configuration issue (which can be fixed by a BIOS update) or it is actually a hardware fault inside the chips, as if there is a bad batch or something from Infineon.

PLEASE tell me that Franz is a German name? Where do you live? Send PM.

In your post people suggest it is a Vdroop issue. Trust me, the board’s copper layout is solid. Those boards do NOT generate excessive droop with huge variations (meaning the load line calibration in the BIOS stock settings should be fine). If someone hints at droop, I honestly think it is the VRM fucking up. We need more oscilloscopes among the people.

It is very unfortunate that Gigabyte is not taking me seriously. I have no power to get through to someone with whom a technical discussion about the intricate details of the VRM would actually be fruitful.

FranzB · February 21, 2020, 10:14pm

Hi,

Thanks for sharing your findings. I agree, this is getting really interesting indeed! I’m so glad that we got the conversation going.

I’m French, actually. Franz is what most people call me, in part because I lived many years in Germany (Berlin), in part because it’s much easier for everyone to spell my name (my actual name is François: no weird ç in Franz!)

I cannot agree more. My dream scope is the Rohde & Schwarz RTB2000. Maybe it’s time I shell out the money…

My hope is that we can start a conversation with GIGABYTE. Your scope analysis should go a long way in showing that we’re dedicated to finding out what the duck is going on here.

DerAlbi · February 21, 2020, 10:33pm

I wish it was true. I gave them the findings. The very first scope pictures i described “as if the stepdown stops switching”. It is obvious for anyone with experience in VRM design. (second option is an intermittent short circuited load)
The problem is that the frontmen at Gigabyte tech support are not expert enough to recognize the seriousness, so they dismiss me as the usual idiot customer, giving me advice like increasing load line calibration, which (that’s obvious from the pictures) is not the issue at all.

My best guess:
Basically what we look at is the misconfiguration of the VRM. It seems like the Threadripper goes to idle for a very short time on all cores, that causes a voltage spike to which the VRM overreacts by skipping a cycle. (This is not uncommon behavior for stepdowns)
Those VRM controllers are programmable. The voltage-spike when you go from 100% -> 0% load is normal and usually the capacitors on the board swallow that transient if sized correctly and in accordance to the VRMs control loop.
It seems like they have programmed in wrong time constants.
I think this is fixable by bios update, but we need to get the word out.

Also interesting is, why only some people have the problem.
If they recognized the problem quietly, the newest BIOS should fix it. If they changed the hardware slightly and forgot to change the settings… that’s an oopsie.

Again, it could still be the CPU. We have to eliminate that possibility before making really strong claims. Right now, it’s guessing. I can’t investigate deeper, since I need my PC and the thing is just too expensive to dissect for fun.

Check your PMs.

wendell · February 21, 2020, 11:16pm

@deralbi and I spent several hours trying to reproduce this on an Aorus Extreme and the Designare on both Windows and Linux, and we were not able to. My VRMs showed drops and peaks but nothing that looks as bad as what deralbi has, and Prime95 ran and ran and ran. The Arctic tower cooler system got really toasty but was fine.

I am investigating.

Edit: what temps do you have from your vrms and core? The designare is rather balmy and my thought-it-was-overkill cooling is not doing as well as I thought it would

DerAlbi · February 22, 2020, 12:00am

Temps are fine. They are in the 60s (°C) when i adhere to Package power limit. Have a 480mm rad…
The voltage issue is unlikely to be a temperature thing. Temperature is an extremely slow process compared to the 500kHz switching. That will never cause one missing cycle but only hard shutdowns.
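To put the timescales in numbers, here is a quick back-of-the-envelope sketch using the two figures from this thread: the 500kHz switching frequency and the glitch repeating every 4ms. The function names are made up for illustration.

```python
def switching_period(f_sw_hz):
    """Duration of one switching cycle in seconds."""
    return 1.0 / f_sw_hz


def cycles_between_glitches(f_sw_hz, glitch_interval_s):
    """Number of switching cycles that fit between two glitches."""
    return round(f_sw_hz * glitch_interval_s)


if __name__ == "__main__":
    f_sw = 500e3      # switching frequency mentioned in the thread
    interval = 4e-3   # glitch spacing observed on the scope
    print("cycle length [us]:", switching_period(f_sw) * 1e6)
    print("cycles per glitch:", cycles_between_glitches(f_sw, interval))
```

So one switching cycle lasts 2 microseconds, and a glitch every 4ms corresponds to roughly one disturbed cycle out of 2000, far faster than anything thermal would change.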

My current working theory relates to capacitor tolerances. The short voltage spike before the dip might trigger an overreaction of the VRM controller. This spike should be swallowed by the capacitors (and likely is). But depending on the capacitor tolerance (which can be as bad as -20% to -30%), this voltage spike can be more pronounced or better handled. If I have had 2 motherboards with bad tolerances and an AMD chip that is especially susceptible to voltage, then this could explain it. The VRM may be misconfigured in those edge cases where the load decreases within a microsecond. Then the problem amplifies itself when the VRM skips a cycle, basically over-discharging the capacitors for compensation.
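A rough way to see how much capacitor tolerance matters is the usual load-release energy estimate: when the load suddenly vanishes, the energy stored in the phase inductors ends up in the output capacitance. The sketch below assumes an ideal transfer (no ESR, no controller reaction, load dropping to zero), and every component value in it is hypothetical, not taken from the actual board.

```python
import math


def load_release_overshoot(v0, i_load, l_phase, n_phases, c_out):
    """Estimate the Vcore overshoot in volts after a full load release.

    Energy balance: 0.5 * (L/N) * I^2 = 0.5 * C * (Vpeak^2 - V0^2),
    where each of the N phases carries I/N before the step.
    """
    l_eq = l_phase / n_phases
    v_peak = math.sqrt(v0 ** 2 + l_eq * i_load ** 2 / c_out)
    return v_peak - v0


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the scaling.
    v0, i_load = 1.2, 200.0          # volts, amps
    l_phase, n_phases = 150e-9, 16   # henry per phase, phase count
    c_nominal = 5e-3                 # farad of output capacitance
    for tol in (0.0, -0.2, -0.3):
        c = c_nominal * (1.0 + tol)
        dv = load_release_overshoot(v0, i_load, l_phase, n_phases, c)
        print(f"tolerance {tol:+.0%}: overshoot {dv * 1e3:.1f} mV")
```

Under these assumptions the spike grows as the capacitance shrinks, which is the direction the theory needs. Whether that is enough to trip the controller depends on thresholds that are not known here.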

You can try that by ripping out 20-30% of the capacitors on your board (it’s enough to desolder the THT caps near the inductors). But desoldering on those multilayer PCBs may be impossible without really good tools. (I have them, but can’t risk my board / warranty.)

Jimster480 · February 22, 2020, 12:28am

Interesting to see the outcome of this. Subbed

happyluckbox · February 22, 2020, 7:36am

Also curious about this. Apparently an AMD rep may have responded in the other thread about this issue…

DerAlbi · February 22, 2020, 8:55am

Not sure AMD can do much about it except sending out samples to help us diagnose the issue and/or put pressure on Gigabyte to respond.
I am happy if they lend me two known good CPUs for my known bad motherboard.

MisteryAngel · February 22, 2020, 2:04pm

Well i suppose that AMD likely has those boards in house,
so they could try to reproduce the problem.
I cannot remember that i have read many other complaints,
from users that use different motherboards.
Mainly only Gigabyte Aorus Master and Extreme users.
So this is kinda interesting.

chris719 · February 26, 2020, 10:16am

Any update on this? Trying to figure out if I should buy a Gigabyte or ASUS board for a 3960X.

DerAlbi · February 26, 2020, 1:08pm

Not much definite stuff that I can say right now - AMD is looking into it; a big company is a bit slow. In my opinion, the issue I have is BIOS-fixable, if you even run into it.
Also, complaints are mainly about the MASTER and EXTREME - Designare seems fine (Wendell tested it) although the sample size is not large enough for any kind of prediction.

For now, i suggest you buy ASUS, if you find a board you like and it is the same price. If ASUS is a compromise for you, then buy Gigabyte. This thread is only a hint to an existing problem, by no means does it guarantee that you will suffer.
In general, you will only suffer from not having a PC if you RMA your parts. Now, since you are aware of the issue, you can test this right away and simply use your 14-day return right with a pretty good money back guarantee.

FranzB · March 3, 2020, 12:08am

FYI, someone with an ASUS Zenith II Extreme Alpha and a 3970X just reported a similar issue:

TheAlmightyBaconLord · March 3, 2020, 1:17am

This is getting interesting with every new post! Keep all of us updated!

DerAlbi · March 10, 2020, 12:48am

Guys, an update:

Basically this whole Prime95-stability issue has been resolved on a technical level, but it is not out yet. So if you were uncertain to buy a Threadripper or Gigabyte-MoBo, dont be, its fine.

The VRM issue was real. Gigabyte has a fix. (as expected)
The P95-stability issue was not only due to the VRM-flaw but also because of an AGESA-flaw.

BOTH issues cause P95-instability. This was confusing to diagnose since people with- and without the VRM problem may have experienced problems.
I first believed that all systems that crashed must have had the VRM-issue - but that wasnt the case. I initially thought it was a measurement mistake or just a hard-to-measure level of VRM-misbehavior. (If crashing systems were diagnosed with “good” VRMs)

To anyone who observed the Gigabyte BIOS releases, the last few days were a funny chaos to watch.

Basically any BIOS version F4x is flawed (old AGESA, no VRM fix; which does not mean your system is inherently unstable, but it could be).
Then Gigabyte came out with F5b, which fixed the VRM bug. But to the anger of AMD, this was an uncoordinated release and did not include the new AGESA version, which would have fixed all problems at once. So Gigabyte pulled F5b and pushed F5c (which is currently the most recent BIOS online). This BIOS now has the new AGESA to fix P95, but funnily enough, it does not include the VRM fix.
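To check which of these cases a Linux machine falls into, the installed BIOS version can usually be read from `/sys/class/dmi/id/bios_version`. The sketch below maps it to the status described in this post. The mapping only reflects what is reported here for these Gigabyte boards, and treating every version string that starts with "F4" as an "F4x" release is an assumption.

```python
from pathlib import Path

# Standard Linux DMI attribute exposing the firmware version string.
DMI_BIOS_VERSION = Path("/sys/class/dmi/id/bios_version")


def bios_status(version):
    """Return (has_new_agesa, has_vrm_fix) per this thread, or None."""
    v = version.strip().upper()
    if v.startswith("F4"):   # assumed to cover all "F4x" releases
        return (False, False)
    if v == "F5B":           # VRM fix, old AGESA (pulled by Gigabyte)
        return (False, True)
    if v == "F5C":           # new AGESA, no VRM fix
        return (True, False)
    return None              # not covered by this thread


def read_bios_version(path=DMI_BIOS_VERSION):
    """Read the BIOS version string, or None if it is not exposed."""
    return path.read_text().strip() if path.exists() else None


if __name__ == "__main__":
    version = read_bios_version()
    status = bios_status(version) if version else None
    print("BIOS version:", version)
    print("(new AGESA, VRM fix):", status)
```

Anything newer than F5c returns `None` on purpose, since nothing in this thread says what later releases contain.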

I am pretty sure that quite soon there will be an F5d BIOS or something that will include both fixes and “everything just works”.

Once this is out, the Vcore will look like this:

THIS IS PERFECT! (captured with the short-lived F5b bios; P95 was still crashing due to AGESA-problems)

As a reminder, a load decrease (the voltage jump you see above) caused missing switching cycles before, during which the low side mosfets were turned on, resulting in a violent discharge of Vcore. Here is some detail:

(Captured with the current F5c-bios)

I am currently staying on the F5c bios since the new AGESA version does make my PC more stable in light-load scenarios. I am currently running 60h without crash (the age of F5c) - this was unthinkable before.
However under high load the VRM bug still causes CPU-failure.

Again, i think this will be fixed very soon (it should be trivial). So everything is good.

What was it like to work with AMD?
Well, there was no working WITH or FOR them. It’s a big company with big policies - it is pretty much one-way communication. No juicy details leave AMD, which at times was quite annoying; especially as an interested engineer, I would have preferred some details-but-not-details, basically a summary for idiots at least. The one-way communication left me quite frustrated at times - especially when AMD hinted that they weren’t completely sure if they had addressed the right issue or not (they did). But it sparked the urge to help or double-check, which was impossible. (Those issues are all about replicating the original problem, which can be extremely hard, especially if the problems are statistical in nature and only communicated via email in written form in English as a 2nd language.)
AMD’s motivation to solve the issue was/is crazy high. People really care to provide a good product and a good experience. The only thing that bothers me is how broken the customer support is - I guess it is fine when you have a normal RMA, but true technical issues don’t come through.
The fact that this issue was solved so quickly seems like a big coincidence. (Thanks forum!)

It boggles the mind what AMD can do with software configuration to their processors. Those systems are incredible if you even try to understand the details that must go into them.

All in all a super cool experience

MisteryAngel · March 10, 2020, 12:52am

Did Gigabyte make any statements about the actual
VRM bug, or let’s say the BIOS bug?

Because I’m kinda curious about that.