[Solution Found] I NEED HELP! 3960X Seems to never be really stable!

Hako · August 6, 2021, 9:47pm

Hello,
I have been ripping my hair out trying to figure out why my system is unstable for months now.
I don’t know what to do anymore PLEASE, PLEASE HELP ME.
Please bear with me, I have a lot to unload and tell, hopefully I even remember everything.

Since upgrading to a 3960X on an ASUS Zenith II Extreme (non-Alpha), this system has constantly been unstable. Sometimes I get a few months where it is stable, and then suddenly it becomes unstable again for no plausible reason.
I believe I tried everything in the book to find the cause, but the BSODs just keep coming in absolutely random and non-reproducible ways. (While gaming, not while gaming, under load, under no load)

The BSODs all say NMI_HARDWARE_FAILURE. I tried looking it up on the web, but nothing clearly points to a specific hardware component.

To the story with my problems in order:
I upgraded from an X399 Gigabyte Designare with a 1900X to the above-mentioned combo, taking everything else with me: the G.Skill RAM, PSU and so on. Everything was rock stable for the whole time I had the old system.

First, the BIOS kept failing to detect one NVMe SSD, my ADATA SX8200NP, on almost every restart. That got better with a BIOS update.
After that, installing my OS on another SSD worked and I got everything in working order, it seemed fine. BUT now I got random freezes when doing things or sometimes not doing things. Turns out the G.SKILL RAM was somehow not compatible with this new CPU + MOBO (G.SKILL 3200 CL14 B-die).
After watching a ton of Buildzoid, L1T and various other techtubers, I got myself new RAM: some Crucial Ballistix Elite Micron E-die that seemed to be very compatible from all I read and would also be fine under more heat, which was maybe the issue with the other RAM, though it only got to around 55°C.
I made a separate thread for that problem here.

PROBLEM SOLVED? Seemed like it, the system seemed stable now for a few months.

Next I wanted to install a second OS (Win) on another SSD which seemed to start the trouble? The new OS seemed stable at first when about 3 or maybe 4 weeks in I got freezes again and also BSODs suddenly.
All the troubleshooting started again and I couldn’t find a hardware issue, so I thought maybe it was a software issue and tried to narrow it down by going backwards through the programs I use on a daily basis, which I had updated in the process of switching to the new OS. NOTHING. NOTHING helped.
Thinking it might be the OS itself which got corrupted I switched back to the first OS, which was now the backup OS.

PROBLEM SOLVED? Seemed like it again.

Just when I thought everything was fine and I could use my hard-earned PC again, today I suddenly got BSODs again, two this time, about an hour apart.

I recently got a new GPU (RTX 3080 Ti FE) which I was happy to have got a decent deal on.

I can run every benchmark and every stress test I could think of (Furmark, Prime95 for 24H, Memtest to 2000%) and always pass, with no freeze or BSOD. I can’t under any circumstances get the problem to occur while testing.

Temps seem in the limits of every hardware piece:

CPU: Max. ~85°C
RAM: No thermal sensor onboard, so the external sensor tells me about 51°C environment
GPU 1: Max ~85°C
GPU 2: Max ~72°C
Chipset: Normally about 78°C Max ~90°C

Things I tried, always only changing ONE thing at a time (This will be long):

C-States
Various other BIOS settings in all possible combinations
BIOS update
BIOS downgrade
Replace every hardware component I have on hand (PSU, SSDs, GPUs, Soundcard, Networkcard, RAM)
RAM timings (This took forever)
RAM speed on lower or higher speeds
Different OS (On different SSDs)
Re-seat CPU
Re-paste CPU thermal paste
Disconnect everything from the MOBO and reconnect it again
Try different drivers for everything that has alternative drivers
CPU on stock OR overclocked
CPU, RAM, IOD, SOC voltages within the safe values I know about, always checked against very many different sources for this
Everything stock in BIOS without any adjustments whatsoever.
PCIe GEN adjustment everything to GEN 3, in desperation
Software downgrade to KNOWN GOOD versions, I know they are stable.
With OS encryption and without.

I probably tried even more things, which I just can’t remember atm. I currently don’t have enough money to change the motherboard or the CPU, so I can’t replace them to rule both these things out.
I’m not rich and am currently really devastated that this really great system isn’t working properly and I don’t know what to do anymore I’m at the absolute end of my wits here.

I know I probably shouldn’t do this but I have to ask, @wendell maybe you have heard of weird things happening with this board or CPU?

My whole system (With current settings as far as I remember them):

CPU: AMD TR 3960X (Undervolted by 0.12V)
RAM: Crucial Ballistix Elite 32GB 3600 4 Sticks (3600, CL 16-18-18-38)
GPU 1: 3080 TI FE (Stock)
GPU 2: RX580 (Stock)
Every NVMe and S-ATA port populated (If you need to know every make and model I will include them)
Creative SB ZxR
Broadcom 4port GBe NIC
1200W Silverstone PSU

Dutch_Master · August 6, 2021, 10:01pm

Do you still have the old mainboard and CPU? If you do, try swapping them back in, add the 3080+RX580 and see if it remains stable.

One thing you may want to investigate a bit further is the VRMs on the Asus mainboard. They may need additional cooling, a standard 80mm fan should do, at least for testing.

Hako · August 6, 2021, 10:06pm

Thank you for answering, sadly I don’t have it anymore.
The board itself has two 40mm fans for the cooling of the VRMs, so that should be adequate, I guess?
From what I can read out in the BIOS, iirc, they aren’t that hot though. I will check though.

wendell · August 7, 2021, 1:40am

Asus has a default setting iirc that’s a slight overclock. You prolly want to turn that off.

Memory at 3200 for now might also be a thing to run with for a while

gnif · August 7, 2021, 2:07am

The NMI in NMI_HARDWARE_FAILURE stands for Non-Maskable Interrupt, that is, an interrupt that cannot be disabled. NMIs are usually triggered by non-recoverable chipset errors, RAM corruption and data corruption on the PCIe bus.

As such I would do as @wendell suggested and slow down your ram, perhaps even under-clock it to see if this resolves the issue. If this doesn’t work I would turn my attention to the PCIe devices, removing them one at a time until the problem goes away.

Next if still not working well, I would underclock the CPU to eliminate it as a possible source of the problem, or better yet if you have a spare compatible chip (not necessarily the same model), swap it in.

Finally, if none of this helped I would turn my attention to the PSU, if it’s underrated or delivering noisy power (if it’s gone faulty) it could trigger such behaviour also.
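
To keep track of the "remove them one at a time" step, a checklist can be generated up front. This is only a minimal sketch: the device names are copied from the system list in the first post, and `removal_plan` is a hypothetical helper name, not part of any tool.

```python
# Device list taken from the system description in the first post.
PCIE_DEVICES = [
    "RTX 3080 Ti FE",
    "RX580",
    "Creative SB ZxR",
    "Broadcom 4-port GbE NIC",
]


def removal_plan(devices):
    """Return one test configuration per device, each with exactly that device removed."""
    plan = []
    for removed in devices:
        installed = [d for d in devices if d != removed]
        plan.append({"removed": removed, "installed": installed})
    return plan


if __name__ == "__main__":
    for step, config in enumerate(removal_plan(PCIE_DEVICES), start=1):
        kept = ", ".join(config["installed"])
        print(f"Step {step}: remove {config['removed']}, keep {kept}")
```

Each step only tells you something if the system then runs longer than the usual gap between crashes, so it is worth writing down the date of every change and every BSOD next to the step.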

Hako · August 7, 2021, 11:58am

My guess would be that that is the “Performance Enhancer” option right?
Afaik it’s turned off but I will check this next.

I think I already tried that, but since I can’t fully remember everything what I have done over the months I will try that as soon as I get the next BSOD. Thank you very much

You wouldn’t believe how I searched for an answer like this, thank you very much for that explanation.

I will do that, maybe it’s time to invest in ECC RAM, though I really don’t have any money atm.

That’s really tricky, as I use every PCIe card all the time and the BSODs can have days or even weeks in between them, but I will check if I can swap them out with something that would still fulfill the purpose I need them for, would that be ok too?

I undervolt it already so in a sense it is underclocked? Or do you mean even under the 3.8GHz base clock?

I would love to do that but I don’t have another and also can’t buy another, the aforementioned money issue.

My PSU is not really old maybe 1 year or a bit above that, it’s a 1200W Silverstone SST-ST1200-PTS.

Thank you very much for those suggestions and explanation I will try it one-by-one and report back.

gnif · August 8, 2021, 7:21pm

Undervolting is not underclocking, it’s basically starving the CPU of power because you are running it below the specified voltages set by AMD. This could very well be the cause of your issues.

Unfortunately that doesn’t exclude you from a faulty PSU, if you know someone with an Oscilloscope it might be worth asking them if they can have a look and see if the PSU is noisy.

Hako · August 8, 2021, 7:39pm

Hrm, interesting. I would think though in this case that Prime95 could expose this? That just jumped in my mind.

I still wish I knew someone that would have that as it would have helped me on so many projects, alas I don’t

For now I disabled “Performance Enhancer” and “Core Performance Boost” in the BIOS and will wait, though how long is kinda undetermined, as it seems I can run it for a few days to weeks to even months. Argh, I so wish it would be reproducible.

Thank you for informing me further
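
A rough way to put a number on "how long to wait" is sketched below. It assumes the crashes arrive at random with a roughly constant average rate (a Poisson process), which is an assumption and not something measured in this thread. The mean interval passed in is a hypothetical value you would have to estimate from your own crash dates.

```python
import math


def prob_no_crash(days_observed, mean_days_between_crashes):
    """Chance of seeing zero crashes in the window even though nothing was fixed.

    Assumes crashes follow a Poisson process with the given mean interval.
    """
    return math.exp(-days_observed / mean_days_between_crashes)


def days_needed(mean_days_between_crashes, max_false_confidence=0.05):
    """Crash-free days needed before 'still broken' becomes unlikely.

    max_false_confidence is the accepted chance of a quiet window by pure luck.
    """
    return -mean_days_between_crashes * math.log(max_false_confidence)


if __name__ == "__main__":
    mean_interval = 7.0  # hypothetical: one BSOD per week on average
    print(f"Quiet week by luck: {prob_no_crash(7, mean_interval):.2f}")
    print(f"Days to wait for 95% confidence: {days_needed(mean_interval):.1f}")
```

The takeaway under that assumption: a crash-free stretch only means much once it is about three times longer than the usual gap between BSODs.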

gnif · August 8, 2021, 7:43pm

It could, but don’t forget that modern CPUs do far more than just crunch numbers, they are the PCIe bus masters, RAM controller and SMI all rolled into one. Prime95 can stress the CPU and RAM controller, but not the other features in the CPU, which under load in certain circumstances may be failing.

Think about this, if you turn a single tap or two on in your house, you may not notice a deficiency in water flow between the two taps… but turn on a third and there is a dramatic drop in water pressure. The same is applied to electricity, but in a CPU these “taps” are getting turned on/off to varying amounts millions of times a second. If a bunch of them are on at the same time suddenly there is not enough power for the CPU to run stable.

Remember, verifying a CPU can run 100% stable out of specification is something only AMD can do… they want to sell more efficient lower power consuming CPUs, it makes their products more attractive. They don’t set these numbers at random, they set them based on the literal minimum requirements of the CPU + a small percentage for margin of error in silicon quality.
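
The tap analogy can be put into a toy calculation. Only the 0.12 V undervolt comes from this thread. The set voltage, load-line resistance, currents and minimum stable voltage below are made-up illustrative numbers, not AMD specifications, and the model is nothing more than Ohm's law (droop = current × resistance).

```python
def voltage_at_load(v_set, undervolt, current_a, loadline_mohm):
    """Voltage left after the undervolt offset and the resistive droop (V = I * R)."""
    droop = current_a * loadline_mohm / 1000.0  # milliohm -> ohm
    return v_set - undervolt - droop


def margin(v_set, undervolt, current_a, loadline_mohm, v_min):
    """Headroom above the (hypothetical) minimum stable voltage; negative means unstable."""
    return voltage_at_load(v_set, undervolt, current_a, loadline_mohm) - v_min


if __name__ == "__main__":
    V_SET = 1.25         # hypothetical stock voltage in volts
    V_MIN = 1.05         # hypothetical minimum stable voltage in volts
    LOADLINE_MOHM = 1.0  # hypothetical load-line resistance in milliohms

    for label, undervolt, amps in [
        ("undervolted, light load", 0.12, 20),
        ("undervolted, burst load", 0.12, 120),
        ("stock, burst load", 0.0, 120),
    ]:
        m = margin(V_SET, undervolt, amps, LOADLINE_MOHM, V_MIN)
        print(f"{label}: margin {m:+.3f} V")
```

With these made-up numbers the undervolted chip has headroom at light load and runs out of it only during the burst, which is the "third tap" situation: fine most of the time, failing only when several things draw power at once.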

Hako · August 8, 2021, 7:55pm

Would the same hold true even under light load?

Yes very true, that’s kinda what I thought too hence my thought to start with only baseclock and see what happens. I hope I don’t err here, my thought was that if even baseclock doesn’t work and throws errors then I would have to check something else?

gnif · August 8, 2021, 7:58pm

Yes, perhaps your CPU and RAM controllers are happy to run at a lower voltage, but what about the PCIe bus controller when there is a burst of traffic from multiple PCIe devices at once (PCIe devices are always talking to the controller, even when the system is idle). It might be starving of power when this happens causing a bus error.

Go back to factory voltages too.

Hako · August 8, 2021, 8:07pm

Of course that’s what I’m doing

I only ever decreased the Vcore, not IOD or CCD (sadly there is no tool that can read those out), those run on Auto. I did however give SOC a little more, so it actually runs at 1.1V, measured with a multimeter at the measuring points on the board. Well, I don’t know the relation inside the CPU and SoC, maybe a lower Vcore will also starve some other component inside, like the PCIe controller or IF, like you said.

Thank you for the explanations btw. those help me understand everything more clearly it’s very interesting

gnif · August 8, 2021, 8:09pm

You’re most welcome, I hope you find a solution to your problem. I know what it’s like as I have been there too… months of running stable and suddenly a bunch of random crashes, it’s super infuriating.

Took nearly a year to discover that my RAM clocks that would test stable under memtest86+ and prime, etc… were not stable when it came to real world general usage. Slowed things down and suddenly the system is rock solid running 24x7 for months at a time.

Hako · August 8, 2021, 10:08pm

Thank you, well glad and not glad to hear that someone else has also run into problems like this.

If you don’t mind me asking what RAM did/do you use and from what speed/timings to what speed/timings did you have to drop down to? Glad to hear you could sort it out for you.

gnif · August 8, 2021, 10:49pm

Sorry, not much of what I can give you will apply, I am using a 2970WX Threadripper with quad channel configured to use NUMA. A very different kettle of fish.

Hako · August 8, 2021, 10:57pm

I just wanted a reference even if it doesn’t apply just to get an idea, to what I might have to do. If even 3200 doesn’t work, if RAM is the cause of the problems, then I might get an inkling of what to expect.

gnif · August 8, 2021, 11:00pm

I already gave you what you should try. You can’t compare anything of my setup to yours, there are many extra variables to consider that do not apply to a normal desktop PC.

If your PC won’t run stable on the default auto configured clocks & voltages, without XMP profiles (XMP is overclocking btw), then it’s not a clock/voltage issue.

MisteryAngel · August 9, 2021, 1:49am

Yeah pretty much, and it’s likely not a VRM issue.
Also I wouldn’t recommend undervolting either.

When the system doesn’t even want to run stable at stock, the first thing I would do is take one of the GPUs out and test with just a single GPU.
I wouldn’t be surprised if the BSODs turn out to be related to that.

Hako · August 9, 2021, 12:51pm

That would suck really bad, as I’ve been using two GPUs since Threadripper came out, what, 2017?
For the longest time it was a 1080 TI and the RX580. Ran always stable. Would seriously suck though…

Hako · August 14, 2021, 9:58am

Ok I just had another BSOD happen.
What did I change:

Disabled “Performance Enhancer” in UEFI
Disabled “Core Performance Boost”
Removed the undervolt

Would that remove the possibility that it’s the CPU and/or VRM? @gnif
As the CPU only ran baseclock and stock voltage.

Also a small list of what I had running:

VMWare Workstation 14 (2VM about 5-6GB RAM load and almost no CPU load)
Browser
Chromium (Watching a YT video)
Different instance of Chromium
Notepad
PuTTY
Wallpaper Engine
A game running on my second GPU (RX580)

Maybe that can help narrow it down, if even by the slightest margin.

I read something recently in a driver update from AMD, under known issues:

Driver timeouts may be experienced while playing a game & streaming a video simultaneously on some AMD Graphics products such as Radeon™ RX 500 Series Graphics.

Can this be the cause?