DoA MSI Z390 Motherboard or 9900kf?

TechTangents · November 25, 2020, 12:50am

I have been troubleshooting a new build all month and I’m at my wits’ end now. It’s built around an i9-9900kf and I went with an MSI Z390 board (full specs and models below). It’s been so much so fast that I don’t remember how it responded when I first set it up. But as it is right now, the system has a problem where it will freeze so hard that the reset button isn’t able to restart the system. On Windows I thought it didn’t have the issue, but after 4 days of stable use it froze once, then froze again the next day. If I load any Linux distro the problem will usually happen within minutes of loading the desktop environment, but sometimes it can last longer before it happens.

Here are the diagnostic steps I’ve taken so far:

Install Kubuntu 20.04, crashes
Install CentOS 7.3, crashes
Migrated existing Windows 10 1909 install, crashes
Disabled XMP, crashes
Updated/Wiped BIOS, crashes
Memtest86 test of 4 passes for 5 hrs, all pass
Tested each RAM stick independently, crashes
Intel Processor Diagnostic Tool, all pass
Replaced PSU, no change (both PSUs are not new but came out of working systems)
Removed all PCIe cards, crashes
Replaced GTX 1080ti with GTX 970, crashes (cannot run without GPU due to being an “F” chip)
Booted from M.2/SATA/USB, crashes

Nothing has changed the crashing. Temps are all fine (I’m using a Noctua NH-D15). No overclocking, no RAM changes (so it’s running at 2133MHz). BIOS settings are all pretty much out-of-the-box.

Another thing about the problem is that I haven’t been able to identify one particular trigger for it. The only possible correlation I’ve seen is that it may be more likely to crash after the system has been stressed. Not during the stress usually, but a minute or so after. Or it will crash if I restart after stress testing. All the stress testing has been done under Linux, I haven’t been able to reliably crash it in Windows so I haven’t tested it as much there.

I don’t know what is causing the freezing. But since I’ve had it fail with none of the cards, and with different RAM, I feel like it must be either the CPU, the motherboard, or how I have it configured. I would really appreciate advice on how to proceed.

Parts:
CPU: Intel i9-9900kf
MOBO: MSI Z390 GAMING EDGE AC
RAM: 32GB (2x16GB) Corsair LPX 3200MHz
GPU: EVGA GTX 1080 ti FTW3

Boot Drive: Samsung 970 Evo M.2
Bulk Storage HDDs: 5x Seagate IronWolf 2TB NAS (BTRFS RAID 5 with WinBTRFS driver)

Other PCIe cards:
Intel X520-T2 10Gb NIC
Marvell 88SE9215 based 4 port SATA controller

I have been trying to avoid returning items on a whim to test because I’m not a huge fan of generating e-waste. I can return the motherboard to Amazon, where I bought it, but I will have to RMA the CPU with Intel.

GigaBusterEXE · November 25, 2020, 1:11am

whoa is this THE tech tangents?

you might have overheating VRMs?
you can check the temps of them using HWinfo64

you could also try reseating the CPU

try running ram at 1.35v, you can run DDR4 up to around 1.45v daily on overclocks, I personally run 1.39v

TechTangents · November 25, 2020, 1:22am

Hah, I don’t know if I qualify as being known as “THE” Tech Tangents, but here’s a video of the issue happening that will verify that I’m at least the one from YouTube. It looks like I can’t add links, but it’s at https://youtu.be/vKw-AbEPKX0

I’ve never used HWinfo64 but I think this is the area that would show those temps? Everything looks good there to me:
[screenshot: mobotemps]

I’ll restart and bump up the RAM voltage to that and try to get it to crash again in Linux.

GigaBusterEXE · November 25, 2020, 1:26am

I bumped you up a level. New users can’t post links, which limits spammers. I can’t get it to embed though. I think MOS is the VRM temps, which were fine, but you were also probably at idle. Keep an eye on them during load.
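For watching those temperatures from the Linux side (where the freezes are easier to trigger), here is a minimal sketch that reads whatever sensors the kernel exposes through the sysfs hwmon interface. It assumes a Linux system with a sensor driver loaded for the board's monitoring chip; whether a "MOS" reading shows up at all, and under what label, depends on that driver, so the names in the example comment are hypothetical.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input found under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        # Each hwmon device normally has a "name" file identifying the driver/chip.
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # An optional temp*_label file gives a human-readable sensor name.
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name[: -len("_input")]
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            temps[f"{chip}/{label}"] = millidegrees / 1000.0
    return temps


# Example use: print every sensor once.
# for sensor, value in read_hwmon_temps().items():
#     print(f"{sensor}: {value:.1f} C")
```

Calling this in a loop while a stress test runs gives a rough Linux equivalent of watching the HWinfo64 sensor page under load.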

misiektw · November 25, 2020, 1:29am

To diagnose a hardware problem with PC stuff, the fastest and often cheapest way is to use more stuff. But use the least amount of stuff necessary to run it; removing variables is the first step.

That being said, judging from your description:

I would start with the CPU first too. I can’t remember ever having a situation in which hardware reset doesn’t work, unless maybe the CPU or BIOS died.
AFAIK this indicates that reset vector isn’t called properly, so basically cpu is dead until power cycle.

Also, one memory stick not working but all of them working seems like the memory controller, which is also in the CPU. Try one stick in different slots, if your mobo supports running one stick only.

So start with only CPU, mobo, memory and PSU (outside the case), and in your case (KF) also some low-end GPU. Boot from USB, disconnect anything else.

Then if problem persists I would escalate as such:

Reseat CPU like Gigabuster said
Put CPU in another mobo, best in known good working computer
Get another mobo and put current cpu in.

Those 3 steps will tell you if cpu/mobo combo is ok, which is fundamental for troubleshooting anything else. Otherwise you will be running in circles.

Also there’s small chance that your reset button is just broken

GigaBusterEXE · November 25, 2020, 1:33am

last time it happened to me was when my friend told me to “go up one” on voltage of my new Athlon II X2, apparently he meant not an entire extra volt, ah the old days of being a noob

misiektw · November 25, 2020, 1:34am

Yeah, so basically there was nothing to reset really

SgtAwesomesauce · November 25, 2020, 1:46am

So this (being unable to reset-button out of a crash) usually indicates the OOB management controller is broken. You know, that little ARM CPU that brings up the board for the x86 CPU to take over from?

I have a nagging suspicion that you’ve got a bad motherboard.

Is there anything interesting in the syslog?

You can quick-upload with: dmesg | curl -F 'f:1=<-' ix.io

Other folk: We should find the source of the crashes, rather than just assuming that chucking more voltage at it will solve the problem. Since it happens in Linux (I got the impression, at least) more frequently than Windows, I don’t know that cranking up the vcore will fix anything.

TechTangents · November 25, 2020, 1:54am

I figured it was something like that, thanks!

I may not have phrased my troubleshooting there well. The crash can happen with both sticks installed and with either of the two individually installed.

I do need to try reseating it still. I don’t have any other motherboards or CPUs this new, my last new system build was in 2013. I was thinking about getting a replacement for the motherboard next but wanted to make sure I wasn’t being premature on that.

I have done this inside the case essentially. When I reseat the CPU I can do it again as a sanity check.

Heh, it does work but it would almost be funny if that was it at this point. But it still freezes anyway.

This would be the easiest to fix so I’m hoping that’s it.

When it crashes it is a full stop. Nothing gets logged. But here is that still: http://ix.io/2FnL

This part is what has been driving me nuts the most. It will seem stable and then just crash out of nowhere.

Linux does seem more prone to crashes from my experience troubleshooting it.

I adjusted the RAM voltage to 1.35v. Since I still don’t really have a way to force it to crash, I’m just using my highly scientific method of running a couple of games while rendering a video to try and stress everything all at once. If it doesn’t crash during that, a restart after 30m of stress usually trips it.

SgtAwesomesauce · November 25, 2020, 1:56am

Right, that’s expected. Sometimes we do see errors in syslog that can indicate a driver is misbehaving before a crash.
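Since a hard freeze leaves nothing in the logs, one way to at least pin down when the machine stopped (for example, how long after a stress test ended) is a heartbeat file that is forced to disk on every write. The sketch below is only an illustration of that idea, not something anyone in the thread ran; the function names and the log path are made up for the example. Note that `os.fsync` asks the OS to flush the file to the device, but a drive's own write cache can still lose the final line.

```python
import os
import time


def write_heartbeat(log_file, note=""):
    """Append one timestamped line and push it to disk before returning."""
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {note}\n"
    log_file.write(line)
    log_file.flush()              # flush Python's buffer to the OS
    os.fsync(log_file.fileno())   # ask the OS to flush to the storage device
    return line


def last_heartbeat(path):
    """Return the last non-empty line of the heartbeat log, or None if empty."""
    with open(path) as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    return lines[-1] if lines else None


# Example use (hypothetical file name): write one heartbeat per second.
# with open("heartbeat.log", "a") as log:
#     while True:
#         write_heartbeat(log, "alive")
#         time.sleep(1)
```

After the next freeze and power cycle, `last_heartbeat("heartbeat.log")` gives the last moment the system was known to be alive, which can be compared against when the stress test was closed.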

misiektw · November 25, 2020, 2:02am

Yes, try that. Also there are plenty of reasons why getting rid of case is good idea.

Some time ago one reason was that some bug crawled between mobo and case and was shorting something.
Other time someone screwed mobo with one screw post in wrong place.

Point is, you want to remove as many variables as possible.

SgtAwesomesauce · November 25, 2020, 2:03am

I’m not seeing anything suspicious in the syslog. It’s all looking good.

I highly suspect it is hardware.

TechTangents · November 25, 2020, 2:05am

It crashed at 1.35V for the RAM still. And like I said before, it made it through the whole stress test without any errors and then crashed after I closed everything.

I tried to reboot into Windows as quickly as I could after that to get a better idea of VRM temps and they were about 10C higher at max. I’ll stress it in Windows and watch those now.

That’s hilarious, and I totally get eliminating variables so I understand doing that still.

SgtAwesomesauce · November 25, 2020, 2:06am

Is it crashing as the CPU’s ramping down, or more of “in the moments after”?

If it takes 3 seconds to crash, that’d be moments after.

TechTangents · November 25, 2020, 2:07am

Moments after for sure. This was probably about a minute after this time.

SgtAwesomesauce · November 25, 2020, 2:09am

Okay. I ask because I’d blame bad LLC tuning on the crash if it were as it’s ramping down.

At this point, if it were me, I’d take the following steps:

Try booting a bare minimum with the mobo on the box, rather than in the case
RMA the board

misiektw · November 25, 2020, 2:13am

Just in case, put mobo on something non conducting. I usually use box that mobo came in if it doesn’t have silver embossing.
But any piece of cardboard will be fine.

I’m not judging, but again past experience shows that people sometimes have weird ideas.

TechTangents · November 25, 2020, 2:23am

I couldn’t get it to crash on Windows, which is par for the course again. But I ran Prime95 for a minute or so on it while tracking temps. VRMs do get a lot hotter but so does the whole chip.

(not that it really matters in this case, but ambient is 21C right now)

I’m going to pull the motherboard now.

Yeah, the box is fine for this. I can understand people not thinking about putting it on the foil ESD bags they come in sometimes, but that will lead to a bad day.

MisteryAngel · November 25, 2020, 2:34am

Well, I do have to mention that unfortunately your particular motherboard isn’t really a great board for overclocking i7s or i9s.
The VRM is a bit on the weak side, with a discrete MOSFET setup.
I noticed a MOS temp of 86°C in your screenshot,
which I assume is the MOSFET temperature.
It’s technically fine, however not really ideal for stock.
There’s likely not a whole lot of overclocking headroom with that board.

I also noticed that your CPU temp is pretty high,
so my guess is that the crashes might be related to that.
But if you can still return the motherboard and get a refund,
then it might not be a bad idea to invest a little bit more in a better board.
And another question: which CPU cooler do you use?

Also definitely try to re-seat the cpu.

TechTangents · November 25, 2020, 3:13am

Ah, that’s not very encouraging. I should add then that I think I can hear coil whine related to CPU stress, so maybe I won’t exchange it; I’ll just return it and get something else.

I don’t really plan on overclocking it, but I’d rather have something more appropriate for the CPU. I didn’t put a lot of thought into the board for CPU compatibility.

Well, that was with Prime95 running a torture test. The temps have been really good so far.

A Noctua NH-D15. I was trying to spec the system for OC’ing but I don’t really plan on doing it. I just want a lot of headroom and to have it run a little quieter.

I just finished doing that now.