Workstation Mystery Crashes (warning: wall of text)

CPU: Threadripper 3960x
Mobo: TRX40 Gigabyte Aorus Pro Wifi
GPU: Gigabyte 3090 Gaming OC
RAM: 256GB (8x32GB) G.Skill Ripjaw V 3600 (F4-3600C18D-64GVK)
PSU: 1000W Corsair RM1000x 80 Plus Gold
UPS: Cyberpower 1500VA/1000W
OS: Windows 10 Pro


The past few months I’ve been having these mystery crashes.

The timing of the crashes is unpredictable: sometimes it’s during a game, sometimes when rendering in Blender or editing in Resolve, sometimes just watching YouTube.

Up until now they’ve been sparse, once every couple weeks. But yesterday it reached a critical mass, with over a dozen black screen shutdowns as I tried to troubleshoot. I’ve been doing my best to track down the source, but I’m hoping someone can offer any insights that come to mind.


Potential clues:
-It rarely Bluescreens (when it does, it always gives the code ‘IRQL not less or equal’). But more often it’s a total black screen crash.

-Almost every time it black screens, the BIOS resets, requiring me to redo my fan and RAM speed settings.

-Occasionally the computer has posted with major graphical artifacts, dancing pixels, colors, etc. However, these artifacts are ONLY present in the BIOS. When it boots into Windows, the artifacts disappear.

-Several times, the Windows Event Viewer has shown “Event ID 14, nvlddmkm”, occurring right before a crash. This seems to be related to an NVIDIA driver.

-When it crashes, the RGB on the motherboard and GPU remains on, but the system and power button are otherwise completely unresponsive, and I need to cut the power in order to restart.

-The motherboard Status LED for DRAM has lit up in several instances. However I also get a VGA status light sometimes.
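
For anyone wanting to pull those nvlddmkm entries without clicking through Event Viewer, here is a rough Python sketch that shells out to the built-in wevtutil tool. It assumes the events live in the System log under a provider named nvlddmkm (that matches the source name shown above, but treat it as an assumption), and the function names build_query and recent_events are just made up for this example.

```python
import platform
import subprocess


def build_query(provider="nvlddmkm", event_id=14, count=10):
    """Build a wevtutil command listing the newest matching System-log events."""
    # XPath filter: provider name is an assumption based on the Event Viewer source.
    xpath = f"*[System[Provider[@Name='{provider}'] and (EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",
        f"/c:{count}",  # return at most this many events
        "/rd:true",     # read newest events first
        "/f:text",      # plain-text output instead of XML
    ]


def recent_events(**kwargs):
    """Run the query on Windows; return None on any other OS."""
    if platform.system() != "Windows":
        return None
    result = subprocess.run(build_query(**kwargs), capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    print(recent_events() or "Not on Windows; nothing to query.")
```

Comparing the timestamps in the output against the times of the crashes should show whether the driver event really precedes each black screen or only some of them.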


Thoughts:
I’ve gone back and forth between suspecting the RAM, GPU, and motherboard.

The IRQL error seems to suggest a memory issue. It’s worth noting that, for some reason, my system was not able to run the RAM at the default XMP profile of 3600 MT/s, and I had to switch to manual settings and drop down to 3533. For the timings and voltage, I kept the XMP suggested settings.

The graphical artifacts and Event Viewer logs however point to a GPU issue. I’ve sent this card in once before when it completely died. Gigabyte repaired and returned it and I haven’t had any problems with it for over two years.

On the other hand, the BIOS resetting feels like a motherboard problem.


What I’ve tried, to no effect:

Changing CMOS battery
Updating the BIOS (previously FA, updated to FD)
Reseating the GPU
Updating GPU drivers
Reseating the RAM
Dropping the RAM further down to 3200.
Running memtest86+ on all the DIMMs individually. At one point, it actually crashed during the memtest!


As a final measure, I’ve swapped the 3090 out with my old GTX 1080. So far I’ve had no crashes or other irregularities, but I’m going to give it a couple days of regular use to put it to the test.


Thanks for sticking with this post so far, any input or thoughts would be very much appreciated.

I hope you’ve had stability with the GTX 1080, I would guess that the RTX 3090 or power, possibly even a mix of the two, are involved in the failures.

Getting the questions about the PSU out of the way: How old is the PSU? Is the GPU power connector using distinct rails from the CPU +12v connector?

I think that those error codes, IRQL not less or equal and Event 14, nvlddmkm, point at the RTX 3090 failing. Perhaps it’s stopped communicating with the motherboard. Can you try other drivers before another RMA?

K3n.

I had a similar issue with my 3960x on the same mobo, and after changing every single thing on it, it turned out it was the CPU itself all along. Did an RMA and haven’t had a single BSOD since.

Thanks for the reply. After a couple days of no crashes on the GTX1080, I was starting to think I found the issue. But lo and behold I had another black screen on the third day.

My next step was to adjust the RAM speed, as I had realized that my RAM was set too high for what my CPU supports, and I have had a couple RAM related bluescreens in the past. After reducing it to 3200, the crashes stopped for a good week, and I once again thought I had solved it.

However today I got an identical black screen. I can now recreate it pretty reliably with one particularly heavy Blender render.

I’ve spent this evening disabling and enabling various settings: Adjusting the TDR delay in the registry, disabling PBO in the BIOS, and disabling hardware accelerated GPU scheduling. I stopped short of granting full permissions for the nvlddmkm.sys file in System32, as that would require me to grant user permissions for the entire C drive which seems like a bit of a security risk.
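
In case it helps anyone retracing these steps, this is a small Python sketch for checking what the TDR delay is currently set to before or after editing it. It assumes the standard TdrDelay value under the GraphicsDrivers registry key, only reads and never writes, and the function name read_tdr_delay is just a placeholder for this example.

```python
try:
    import winreg  # standard library, but only available on Windows
except ImportError:
    winreg = None

# Key where Windows keeps its Timeout Detection and Recovery settings.
GRAPHICS_DRIVERS_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay():
    """Return TdrDelay in seconds, or None if it is unset or this is not Windows."""
    if winreg is None:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, GRAPHICS_DRIVERS_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except FileNotFoundError:
        # The value (or key) does not exist, so Windows uses its default.
        return None


if __name__ == "__main__":
    delay = read_tdr_delay()
    if delay is None:
        print("TdrDelay is not set (the Windows default of 2 seconds applies).")
    else:
        print(f"TdrDelay = {delay} s")
```

Reading the value first makes it easy to confirm a registry change actually took effect after a reboot, and to put it back to the default later by deleting the value.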

Finally, I tried rolling back to the old 551.23 NVIDIA driver from January. And guess what: so far, the render is proceeding with no problems and no crashes!

Knock on wood, but it seems like the issue was that the 555.85 driver (and perhaps even going back to 552.12) is faulty.

I do feel a bit foolish that rolling back wasn’t my first solution, but I’ve almost never had a bad driver experience with NVIDIA, so it was literally the last thing I thought of.

I’ll update if anything changes, but for now I’m calling this case closed. The nvlddmkm error seems to be a common one with Ryzen users, and the causes and solutions I’ve read about have been very wide ranging.

IRQL errors can be a lot of things really, from an outdated or improper device driver
to faulty hardware or even malware.
But if the system has worked fine before, then the first thing I would check is the device drivers.
It could be that a buggy or outdated driver tries to access the wrong part of the memory.

Did you check with “BlueScreenView” by Nirsoft? You can pinpoint exactly what driver is giving you what error for the Blue Screen Crashes.