Windows Server 2025! Blue Screen Hunt - Who's the culprit?

Been trying to find any rhyme or reason to this issue for the longest time. After I migrated to Windows Server 2025, I have been trying to enable Memory Integrity, but whenever I do - my system crashes every day at least once; but if I leave it with Memory Integrity off, it can last a couple of weeks or so without any crash. Of course I get a BSOD and I have a selection of dumps to look at:
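For anyone who wants to put numbers on the "daily vs. every couple of weeks" pattern, here is a minimal Python sketch that reads the timestamps of the dump files and prints the gaps between them. It assumes the dumps sit in the default `C:\Windows\Minidump` folder and that each file's modification time is close to the crash time; both are assumptions, and `crash_gaps` is just a name made up for this example.

```python
from datetime import datetime, timedelta
from pathlib import Path


def crash_gaps(dump_dir):
    """Return (timestamps, gaps) for the *.dmp files in dump_dir, oldest first."""
    stamps = sorted(
        datetime.fromtimestamp(p.stat().st_mtime)
        for p in Path(dump_dir).glob("*.dmp")
    )
    # Gap between each dump and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return stamps, gaps


if __name__ == "__main__":
    # Assumed default minidump folder; adjust if dumps are written elsewhere.
    stamps, gaps = crash_gaps(r"C:\Windows\Minidump")
    for stamp in stamps:
        print(stamp.isoformat(sep=" ", timespec="minutes"))
    if gaps:
        average = sum(gaps, timedelta()) / len(gaps)
        print(f"{len(stamps)} dumps, average gap {average}")
```

Lining the output up against the dates Memory Integrity was switched on or off would show whether the crash rate really follows that setting.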

Before any of this, I have actually tested the RAM through the Windows Memory Diagnostic utility, so I’m not sure at this time which component is the one with the issue. I just know it’s not the RAM itself for now. The section below highlights all the things I’ve tried to do with no success, and what I’ve been able to do with WinDbg. Anyone else here with more experience that could help me figure this one out?? :slight_smile:


I’m trying to hunt down the culprit behind this problem. 99% of the crashes are Bug Check 0x18B: SECURE_KERNEL_ERROR, with or without Memory Integrity enabled. However, I am at a true loss on how to navigate this bug. This is what I’ve been able to piece together using WinDbg on the MEMORY.DMP file:
image

  • Windows crashes with 0x18B - SECURE_KERNEL_ERROR. However, that’s not the true reason. The actual reason lies in Arg1:
  • There was a crash because of 0x18C - HYPERGUARD_VIOLATION. In particular Arg2 states that this crash was due to HYPERGUARD_VIOLATION_PAGE_HASH_MISMATCH_1002_IMAGE_win32k.sys.
  • Just like BlueScreenView reported, there was corruption in the memory loaded from win32k.sys, and that’s what caused the crash. However, what is actually causing it?
  • Looking at the contiguous memory section tells me nothing, as these belong to other Windows drivers:
  • But what about the stack? It just states that it crashed while being Idle.
    image
  • ChatGPT cruise control highlighted that other drivers could be interacting with win32kfull.sys and that it may be anyone’s guess as to what’s happening. One of the suggested culprits was a driver from MSI Dragon Center (MsIo64.sys), which should not be there because I removed the thing ages ago. Needless to say, this needed to go regardless.
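The "real reason is in Arg1" step above can be sketched in Python for anyone comparing several dumps: save the debugger's analysis output to a text file and pull out the `ArgN:` values. This is only a sketch. The `SAMPLE` text is hand-written for illustration (only the 0x18B/0x18C codes come from this thread, the other values are zero placeholders), and the lookup table holds just the two bug check names mentioned here.

```python
import re

# Bug check names taken from this thread; extend the table as needed.
BUGCHECK_NAMES = {0x18B: "SECURE_KERNEL_ERROR", 0x18C: "HYPERGUARD_VIOLATION"}

# Matches lines such as "Arg1: 000000000000018c, ..." (backticks tolerated).
ARG_LINE = re.compile(r"^\s*Arg([1-4]):\s*([0-9a-fA-F`]+)", re.MULTILINE)


def parse_args(analyze_text):
    """Return {arg_number: value} for the 'ArgN: <hex>' lines in saved output."""
    return {
        int(number): int(value.replace("`", ""), 16)
        for number, value in ARG_LINE.findall(analyze_text)
    }


def describe(analyze_text):
    """Name the inner bug check carried in Arg1, if the table knows it."""
    inner = parse_args(analyze_text).get(1)
    if inner is None:
        return "no Arg1 found"
    return f"Arg1 = {inner:#x} ({BUGCHECK_NAMES.get(inner, 'unknown')})"


# Hand-written illustration, NOT real debugger output.
SAMPLE = """\
SECURE_KERNEL_ERROR (18b)
Arguments:
Arg1: 000000000000018c
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000
"""

if __name__ == "__main__":
    print(describe(SAMPLE))
```

Running this over each saved analysis makes it quick to see whether every 0x18B crash wraps the same inner code or whether the failures vary, which would point more toward hardware than toward a single driver.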

Things I’ve tried:

  • Updating the motherboard BIOS for the AM4 MSI MAG B550 Tomahawk + Ryzen 3950X CPU
  • Ran the Windows Memory Diagnostic utility, no errors on RAM
  • Configured Driver Verifier with standard settings - crashes with the same bug check code
  • Removed MsIo64.sys - because why not, doesn’t do anything in there.
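To double-check that a leftover driver such as MsIo64.sys is really gone, one option is to search the output of the built-in `driverquery` tool. A rough Python sketch, assuming `driverquery /v /fo csv` is available and that the driver would show up under a name containing "msio64" (an assumption; a stray .sys file that is no longer registered as a driver would not be listed at all):

```python
import csv
import io
import subprocess
import sys


def find_drivers(csv_text, needle):
    """Return the rows of driverquery CSV output that mention needle in any column."""
    needle = needle.lower()
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row; its column names vary by locale
    return [row for row in reader if any(needle in cell.lower() for cell in row)]


def query_drivers():
    """Run driverquery (Windows only) and return its CSV output as text."""
    result = subprocess.run(
        ["driverquery", "/v", "/fo", "csv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__" and sys.platform == "win32":
    hits = find_drivers(query_drivers(), "msio64")
    print(hits if hits else "no matching driver registered")
```

Searching every column instead of a named one avoids depending on the exact header text, at the cost of the occasional false match in a description field.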

I’ve read through all the evidence you got there and the issues can only be either on the hardware side (bad RAM or even memory controller) or on the software side.

On the hardware side I’d try disabling EXPO (or XMP/A-XMP, as it’s called on AM4 boards) and see if anything changes. It’s easy and painless, worth trying.
The fact that it’s crashing at idle is somewhat of a sign, in my experience, of some hardware fuckery going on.
A more “adventurous” solution is messing with voltages: since it doesn’t seem core related, maybe upping VSOC a smidge and seeing if that has any effect on stability could do the trick. Second in line is CPU voltage. I wouldn’t mess with any other voltage.

On the software side I’d unplug all the storage and install from scratch on a different drive, keeping 3rd party software to a minimum. Since it’s crashing at idle you won’t need much time to figure out if it keeps crashing due to software you installed.

Hardware:
Welp - RAM was always at stock, never really bothered with EXPO, same as CPU.

I did get an X570 MoBo in good condition (so I can do some GPU virtualization shenanigans; this B550 disables ports depending on what you connect), so hopefully this’ll give me a chance to run MemTest86+ as I migrate the hardware around. The PSU is a be quiet! Dark Power Pro 12, so I don’t think it’s going flaky.

Software:
Yeah this is an install from 2012 really, just being updated over time. Windows Server has been so nice to have because it has handled everything like a champ, it would be a shame to do a clean install. In any case, I’ll monitor the driver situation after removing that MSI Dragon Center component, and check what happens. If it crashes again in a month, then shame - there’s some serious testing that needs to happen.

I would definitely run MemTest86. The built-in Windows memory tester isn’t as consistent, in my experience.

PSU is from a reliable brand, but doesn’t mean it doesn’t have any issues. Sometimes a component within is bad, out of tolerance, or something was missed in QC/QA.

Definitely worth starting fresh. Windows is a lot more resilient these days, but that doesn’t mean everything is playing nice under the hood on the software side. I’d probably just pop another drive in there with a clean install and test with that for the software side. Most of the time those BSODs are driver related and it can be a real pain to track down the culprit. You could always try this to capture what is happening prior to a crash: Process Monitor - Sysinternals | Microsoft Learn

I dunno about Windows Server, but if that was a Windows install I would:

Reseat connections. Especially RAM and Storage.
Run Memtest86/SMART checks
Reinstall Windows
Replace RAM/Storage

Well, I did reseat the RAM; maybe it got loose due to thermal expansion??? (mayyyyyyyyyyyyyybeeeeeeeeeeeeee?) but let’s see. I’ve noticed that it just locks up when it’s not running with Memory Integrity, and it wasn’t generating memory dumps, so I’ve activated it to see what happens.

Running Windows Memory Diagnostics and boy I think the RAM is not good:
image

This is the first time ever I’ve seen the tool display something as egregious as this.

Swapped the memory banks around and now it’s failing quicker! :slight_smile: I am so glad I doubled down on a new set of RAM banks yesterday.

And to wrap this story up nicely: yes, it was the RAM. Ended up swapping the motherboard from the B550 to a (used) X570, got support for IOMMU/SRV without any tricks, and now I can look into getting a dedicated 10 Gbps NIC. 200 bucks down and I can push this setup for the next 5-8 years!
