Stability Issues, 3970X + Zenith II Extreme Alpha

Build:
AMD TR 3970X
Asus Zenith II Extreme Alpha
G.Skill F4-3200C16Q2-256GTRS 8x32GB kit
Corsair HX1200i PSU
ASUS 2080 Ti (ROG-STRIX-RTX-2080TI-O11G)

I’ve had stability issues with this build that I’ve been trying to figure out for the last two and a half weeks. I’m crying uncle and looking for some help. I’m not sure if it’s the RAM, CPU, PSU, or motherboard that is causing my issues. I’m fairly confident I’ve got a couple of bad sticks of RAM, but I think there is something else going on too.

I’m game to try any test that anyone thinks might determine what is causing the issues. I don’t have compatible spare parts lying around, but I might buy some RAM or a different PSU to try.

Here is what I’ve tried so far:

  • Memtest86+ V5.01 @ 3200 MHz 16-18-18-38 1.35 V (MB setting). Resulted in ~8 errors in Test 9 over 2 passes.

  • Memtest86 V8.3 single-core mode @ 3200 MHz 16-18-18-38 1.35 V (MB setting). Resulted in ~64 errors: 1 in Test 5, 63 in Test 7. Did not complete an entire pass (report attached).

  • Re-seated the memory, blew out the sockets with air, and re-tried the above test. Got errors in Test 7 and stopped the test.

I saw that the RAM voltage was reading low compared to the set-point. A setting of 1.35 V yielded ~1.325 V on Channel AB and ~1.315 V on Channel CD according to the motherboard. The motherboard has test points for the RAM voltages, so I used a calibrated DMM to read them: ~1.32 V on each channel. I then set the BIOS to 1.38 V and re-checked with the DMM: AB ~1.355 V, CD ~1.345 V.
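For anyone who wants to redo the arithmetic on those DMM readings, here is a minimal sketch. The numbers are the approximate values quoted above; the helper name `offset_mv` is made up for illustration.

```python
# Approximate DMM readings quoted in the post (all values are "~").
# Each entry: (BIOS set-point in V, channel, DMM reading in V)
readings = [
    (1.35, "AB", 1.32),
    (1.35, "CD", 1.32),
    (1.38, "AB", 1.355),
    (1.38, "CD", 1.345),
]


def offset_mv(setpoint, measured):
    """Return how far the measured voltage sits below the set-point, in mV."""
    return round((setpoint - measured) * 1000, 1)


for setpoint, channel, measured in readings:
    print(
        f"set {setpoint:.3f} V, channel {channel}: measured {measured:.3f} V, "
        f"{offset_mv(setpoint, measured)} mV below set-point"
    )
```

On these readings the rail sits roughly 25 to 35 mV under whatever is set, which is why the 1.38 V set-point lands close to the intended 1.35 V.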

  • Memtest86 V8.3 multi-core mode @ 3200 MHz 16-18-18-38, voltage as above (checked via DMM). Resulted in errors in Test 7 and stopped the test.

  • Memtest86 V8.3 multi-core mode @ 2666 MHz 20-19-19-43 (SPD), voltage AB: 1.176 V, CD: 1.168 V (motherboard reading). First pass OK, second pass 8 errors in Test 7. Stopped the pass during Test 13. All errors were on core 26.

At this point, I had been in contact with G.Skill since it was looking like a RAM issue. They recommended the obvious: check each stick, make sure you are running the latest BIOS, etc. Apparently I needed to hear the obvious, because I hadn’t even checked if there was a newer BIOS. Armed with the new BIOS (0807) I went back to testing…

  • Memtest86 V8.3 Multi-core. RAM at DOCP settings (3200 MHz 16-18-18-38 1.35V set-point) Ran just test 7 for 19 passes, got 68 errors at 2 different memory locations.

  • Memtest86 V8.3 Multi-core. RAM at DOCP settings, testing each stick individually with 32 passes of test 7 (sticks are referred to by the last 2 digits of the serial number). Sticks 13 and 19 failed; sticks 14, 15, 16, 17, 18, and 20 passed.

  • Sticks 14, 15, 16, 17, 18, and 20 tested with Memtest86 V8.3 Multi-core. RAM at DOCP settings, set for all tests, 4 passes. Locked up after several hours of run time (I found it that way several hours after it had locked up), no errors.

  • Sticks 14, 17, 18, and 20 tested with Memtest86 V8.3 Multi-core. RAM at DOCP settings, set for all tests, 4 passes. Error in test 5 on the 2nd pass. Ran Prime95 (blend) and it dropped 2 of the workers after a few hours (don’t remember how long).

  • Sticks 17 and 20, Memtest86 V8.3 Multi-core. RAM at DOCP settings. Test 5, 32 passes, no errors. Set for all tests and it locks up after ~6 hours of run time, no errors. Prime95 (blend) passed OK running for ~20 hours. The Windows extended memory test locks up as well (sits at 21% for 10 hours??). Ran stressapptest and it locks up at the “resume work threads for power spike” step. Tried it in Ubuntu 18.04 and the latest Mint, same result.

  • Repeated tests on stick 19 to check if the errors persisted. Errors in test 7 again. Increased the RAM voltage so the DMM read 1.35 V (motherboard set-point 1.38 V). Errors in test 7 again (in about the same amount of time).

  • Sticks 14 and 18, Memtest86 V8.3 Multi-core. RAM at DOCP settings. Set for all tests and it locks up. Tried it several times, and it would lock up at different tests. Ran Prime95 with small FFTs (4k-192k) and it drops a worker in ~1 hr.

  • Somewhere in the last few tests I put an oscilloscope on the 12 V and 5 V lines to look at noise and ripple. I don’t really have the right probes for high-frequency work, but I saw a pretty good amount of noise on both lines, and it got noisier under load. I can pull some images from the scope if someone wants.
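To keep the per-stick results straight, the individual test 7 outcomes listed above can be tallied like this. It is only a bookkeeping sketch: the dictionary holds the pass/fail results reported in this post (keyed by the last 2 digits of the serial number), and `split_by_result` is a made-up helper name.

```python
# Individual 32-pass test 7 results from this post, keyed by the
# last two digits of each stick's serial number.
individual_test7 = {
    13: "fail",
    14: "pass",
    15: "pass",
    16: "pass",
    17: "pass",
    18: "pass",
    19: "fail",
    20: "pass",
}


def split_by_result(results):
    """Return (passed, failed) stick lists, each sorted by serial suffix."""
    passed = sorted(s for s, r in results.items() if r == "pass")
    failed = sorted(s for s, r in results.items() if r == "fail")
    return passed, failed


passed, failed = split_by_result(individual_test7)
print("passed individually:", passed)
print("failed individually:", failed)
```

This only covers the single-stick runs. The lockups and Prime95 worker drops seen with the sticks that passed individually are a separate symptom and are not captured here.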

So here I sit, wondering what to RMA. Should I send back the CPU? Or is it a power delivery issue (and how can you tell)? I could try re-seating the CPU, because that is quick and free. Maybe the RAM issues are the result of an underlying CPU or power delivery issue?

Thoughts? Advice?

Thanks!
Charlie

IMO you must try it. Several YT videos mention the need to re-seat the CPU, and there is a reason we get a special screwdriver with the CPU. While doing that, make sure you follow the screw order.

Yeah. I’ll try that this week.

I got memtest working without locking up. The log file had gotten pretty long from all the tests I’ve run, so I cleared it and now memtest works? Odd, but I’ll take it.

I saw this thread: AMD Threadripper 3970X under heavy AVX2 load: Defective by design?

I tried it out on my setup. Workers 19 and 20 fail within a few minutes with AVX2 on. With AVX2 off it’s good for hours.

Hi,

If you can, check the individual RAM sticks, or replace them all with known-good ones.

Reseating the CPU is definitely something to consider.

I had to RMA my first 3970X which had one defective core, leading to failures in P95 and to general instability despite RAM passing all tests (you can follow this adventure on Reddit: https://www.reddit.com/r/buildapc/comments/ekfsux/prime95_errors_with_amd_threadripper_3970x_build/).

My second 3970X also has issues (I’m the author of the thread AMD Threadripper 3970X under heavy AVX2 load: Defective by design?). You might be having the same issue.

Franz

I did check the individual sticks, that took a looong time haha! I think I’ve gotten some of my issues figured out now. All for different reasons.

Updated stressapptest to the latest on GitHub (1.0.9?) and now it runs fine. Ran 4 sticks with that for 8 hrs with no problems.

Clearing the log in Memtest86 allowed it to run without instantly locking up, but it would still lock up after a while. The log showed that it was spending time waiting for a response from some of the threads, and it recommended running single-threaded because of that (maybe this is related to the P95 issue?). I ran Memtest86 in single-thread mode for ~30 hours on the 6 sticks that tested good individually.

Prime95… I was the one with the Asus board that posted in your thread. I seem to have only one core drop (both workers on it). I need to do some more testing to confirm that is the case, run it longer with those workers stopped. I did run it for 4 or 5 hours and only had that one core drop. Maybe a bad core? I also want to try the re-seat just for the heck of it.
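As a rough sanity check on the "one core, two workers" idea, here is a small sketch that maps worker numbers to physical cores. It assumes workers are numbered from 1 and that consecutive pairs share a physical core (two SMT threads per core); that numbering is an assumption for illustration, not something confirmed from the Prime95 settings, and `worker_to_core` is a made-up helper.

```python
def worker_to_core(worker, workers_per_core=2):
    """Map a 1-based worker number to a 1-based physical core number.

    ASSUMPTION: workers are assigned to cores in order, with
    `workers_per_core` consecutive workers sharing each core.
    """
    if worker < 1:
        raise ValueError("worker numbers start at 1")
    return (worker - 1) // workers_per_core + 1


failing_workers = [19, 20]  # the two workers that dropped in my runs
failing_cores = sorted({worker_to_core(w) for w in failing_workers})
print("failing workers map to core(s):", failing_cores)
```

Under that assumption both failing workers land on the same core, which fits the single-bad-core theory. If the workers are laid out differently on a given setup, the mapping would need to be adjusted.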

-Chuck

Definitely try reseating. That said, I did have a 3970X with a single defective core that I eventually RMA’d. As for MemTest86 locking up, I had that too on another rig (MSI X99 motherboard). This is apparently due to a bug in some BIOSes. As long as it’s stable on a single thread, your sticks are fine.

FYI:

Re-seating the CPU didn’t fix my prime95 issues.

@FranzB Do you know which AGESA version got the Prime95 fix? The ASUS board shows “Update AGESA BIOS code to the latest PI 1.0.0.3 patch A” for the latest BIOS.

May have just answered my own question…

The latest BIOS for the Gigabyte Aorus Xtreme (F4d) shows “Update AMD AGESA 1.0.0.3 B”

F3 shows “Update AMD AGESA 1.0.0.3 for Threadripper 3990X support”

So I’m guessing ASUS is not on the latest AGESA that fixes the Prime95 issue. Guess I’ll try ASUS and see if they are going to push that update.

I think AGESA 1.0.0.3 B is indeed the one that fixes Prime95. However, as far as I understand (@DerAlbi please correct me if I’m wrong) that version doesn’t fix some VRM configuration issues (presumably) that affect the Aorus Master. They did at some point release a BIOS version that would fix that problem, but a BIOS version that fixes both issues has yet to be released.

Conclusion post:

Updated my BIOS to 0902.
Now I can run Prime95 stable at 16k with AVX2 on and at stock clocks. No OC needed to be stable.

Re-checked the bad sticks of RAM with the new BIOS; the problem still existed.
Bought 2 more sticks of RAM: F4-3200C16D-64GTRS

Tested the 2 new sticks alone with

  • 32 passes of Test 7 in Memtest86 - Passed.
  • 2 passes of all Memtest86 tests - Passed.

Tested the 2 new sticks with the 6 good old sticks

  • stressapptest for 8 hrs - Passed
  • 3 passes of all Memtest86 tests - 1 error on test 12, 3rd pass.
  • 24 passes of Memtest86 test 12 - Passed.

Since I couldn’t reproduce the one error I had over ~3 days of testing, I’m calling it good.

I’m RMAing the 2 bad sticks of RAM, so when they get back I guess I’ll have 2 extra sticks of RAM. :thinking: Perfect excuse to do a mITX build…
