PC sometimes failing to POST after upgrade

LukaLumen · January 31, 2022, 11:31pm

Hi everyone!

I’m hoping to find someone who has an idea how to fix an issue I’m having.

Here’s a PC I’ve been using for 2 years with no issues whatsoever:

Asus Prime TRX40 Pro motherboard
4x16GB Corsair Vengeance LPX 3200MHz C16-18-18-36
AMD Threadripper 3970X (32core)
Cooler Master Wraith Ripper cooler
Asus GeForce GTX 1660 Super
Sabrent Rocket Q 1TB NVMe SSD
TP-Link WiFi 6 AX3000 PCIe WiFi Card
EVGA 210-GQ-0750-V1 750 GQ, 80+ GOLD 750W PSU
2x Cooler Master Sickle Flow RGB fans
2x NZXT fans (that came with the case)

I work in video production/motion graphics/3D and have decided to upgrade my graphics card. About 2 weeks ago I got a new Gigabyte 3060 Ti OC from BestBuy and replaced the 1660 Super with it. Everything was fine.
Feeling good about upgrading it given the current state of the market, I wanted to keep upgrading my system…

I got the XPG Gaming GAMMIX S70 BLADE 2TB NVMe, but BestBuy delivered it in a crushed flattened box packed in an unprotected envelope and forced into my mailbox. Even the inside plastic packaging was flattened, it probably had something heavy sitting on it in transport. The SSD didn’t look broken, so I decided to try to install it. It was working fine for a day or two, but then while unzipping a large file on it the computer froze for a few seconds and the drive was no longer functional, it wouldn’t initialize at all. I replaced it with a new one at BestBuy and it’s working perfectly now in the same M.2 slot.

One of the NZXT fans was making a weird chirping noise, so I replaced both of those with another pair of Cooler Master Sickle Flow RGB fans. Each pair of fans is connected to one header via a splitter, and so are the RGB cables. So 4 fans on 2 fan headers and 2 RGB headers.

Then I got another 4x16GB Corsair Vengeance LPX 3200MHz C16-18-18-36 to boost it to 128GB. So now I have 8x16GB. The motherboard manual says this particular RAM is supported in configs of 1/2/4 sticks. So technically 8 sticks are not supported, but I’ve never overclocked any of my system components and the Asus motherboard is said to have good headroom when it comes to power, so I’ve decided to try it and it does work beautifully with 128GB at 3200MHz 16-18-18-36 - standard DOCP profile. It’s showing up in the task manager and works great.

Now here’s where the issues start…

If I do manage to boot into Windows the system is super stable and cool when stress testing with everything stock, including RAM at 3200. After 10-15 minutes of Prime95 with RAM and CPU saturated at 100%, the CPU is at a cool 70degC and stable. I never got a blue screen or a system crash even if doing 3D renders or heavy stress tests with FurMark, Ryzen Master or Prime95.

Problem is, sometimes it doesn’t want to POST. Most of the time actually. It gets stuck at code 94 or sometimes it gets in a code loop.
I thought it was RAM since it’s not officially supported, but then I tried removing the new sticks and it still wouldn’t POST. I tried MemTest86 and in the beginning it would show errors, then freeze, but after several tries it would just freeze around halfway through testing. And that’s with 8 sticks, 4 sticks or just 1 stick, so I don’t think it’s an issue with RAM. Especially since RAM is doing great once I manage to boot into Windows. My system just doesn’t like MemTest86 for some reason and it never finishes. I tried it at 2133, 2933, 3000 and 3200. All of them work fine in Windows, but BIOS is still not POSTing most of the time.
I tried moving the GPU to PCIe1 (recommended by the MB) since I always kept my card in PCIe3. It’s pushing slightly on the Wraith Ripper cooler if in PCIe1… It didn’t help. I tried rearranging the power supply cables to it as well. Still not POSTing.
I tried removing M.2 SSD and leaving just the original SSD in, but it’s still not fixed.
I unplugged the RGB headers since that’s the only thing left that changed, but nothing.

So now I’m getting frustrated and I don’t know what else to try. Since I’ve made all of these upgrades in the same timeframe, I feel like it could be anything. That faulty SSD could have fried something on the MB maybe? But now the replacement works fine in the same M.2 socket. It could be the RAM, but even if I go back to how it previously was, the problem is still there. The PSU is good quality and 750W should be fine and it’s stable when stress testing in Windows.

I’m sorry for the long post, but I wanted to give a detailed explanation. Does anyone have any ideas?

Thank you!

TheSkuffedKnerd · January 31, 2022, 11:44pm

Have you tried a single stick of RAM in each slot, separately? Put a stick in Slot 1, try Memtest. If it passes, move to slot 2, repeat. It will be a long process but that will help eliminate if it’s a physical slot problem. Once the 1st stick of RAM passes on all slots, put it aside and try a different stick and repeat the process.

My guess is that eventually you will find either a physical slot problem, or a RAM stick problem. If for some reason they all pass individually, then you MAY have a motherboard issue.

Alternatively, you could just take that RAM and toss it in another machine and see if it boots fine and passes Memtest on another PC.
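The one-stick-per-slot routine above can be sketched as a small Python helper that lists every combination to run and then flags whatever never passed. This is only a bookkeeping sketch: the slot labels are assumed placeholders (use the names printed on the board), the stick names are made up, and the pass/fail results have to come from actual Memtest runs.

```python
from itertools import product

# Assumed slot labels for an 8-slot board; replace with the names from the manual.
SLOTS = ["A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2"]
# Hypothetical names for the 8 sticks; label them with tape so they stay identifiable.
STICKS = [f"stick{n}" for n in range(1, 9)]


def single_stick_plan(sticks=STICKS, slots=SLOTS):
    """List every (stick, slot) pair: one stick installed at a time, walked through every slot."""
    return list(product(sticks, slots))


def suspects(results):
    """Given {(stick, slot): passed}, return (sticks, slots) that never passed a single run."""
    sticks = {stick for stick, _ in results}
    slots = {slot for _, slot in results}
    bad_sticks = sorted(
        s for s in sticks
        if not any(ok for (stick, _), ok in results.items() if stick == s)
    )
    bad_slots = sorted(
        s for s in slots
        if not any(ok for (_, slot), ok in results.items() if slot == s)
    )
    return bad_sticks, bad_slots
```

If a single stick or a single slot shows up as a suspect, that is the likely culprit. If every stick and every slot comes back as a suspect, the fault is probably somewhere else (board, CPU, or the test setup itself).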

LukaLumen · January 31, 2022, 11:49pm

Thanks for your reply!

Whenever I would try just one stick, I would try from slot A1. It’s a recommended setup for just one stick…
I’ll do what you said first thing in the morning and I’ll test each one in different slots. I don’t want to turn my computer off today as I have a lot of work to do and while I’m in Windows it’s running great. I’ll report back tomorrow.

FinOxy · February 1, 2022, 8:24am

Hi
I just checked the manual, the post code 94 is “PCI BUS Enumeration issue”
Do you have another GPU to test this with? I don’t think the Threadripper has an iGPU in it.
If so, place it into the PCIE_1 slot.

Hmm
so you cannot get a post at all in its current state?
Edit:
Hmm, so you can get it to post. Can you find any specific action that would stop it from posting?
Here are some common-sense things that are always worth checking out.
I would personally start with the basics:
-PSU-cables
Try to remove and reseat all of the PSU cables.
-Is it a display issue?
Correct input from the display, wrong output port on the pc, etc.

As I already said, I think you should check the things I listed above in this message. If they help, great!
This might not solve the issue, but these are usually good to check.

Draaksward · February 1, 2022, 10:33am

The thing with MemTest86 makes me think in the direction of CPU lanes. I’ve heard a few times (on YT sadly, but still) of similar ill behavior just because the CPU was “sitting incorrectly” (it sounds dumb, I know, but I have my own experience of finding a major problem being caused by something very silly).

You had an M.2, which, although not likely, may have caused some damage. Also the freeze you mentioned. I pulled out TWO winning lottery tickets with regular SSDs (Kingston and WD), which seemed to have a corrupted file system out of the box. This resulted in a full system freeze, but a reset helped (needed a SecureErase+ to fix it two times already). Although I have 2 M.2 drives, I never got that kind of issue there, but would assume “it’s possible” (and what damage it may cause in the case of M.2 lanes is unknown to me).

I don’t think it’s the GPU (you did say you had that 1660 for 2 years, right?). But I would start that funtime game of mobo on the table, CPU, PSU, 1 trusty RAM stick and a GPU. If that scenario produces the issue… well, then my suspicions may be something more than those of “that guy who doesn’t know what he’s speaking about”.

LukaLumen · February 1, 2022, 2:25pm

Thanks for your replies!

I don’t have another GPU, but I was thinking of buying the simplest $20 one just to have a display output to test with.

I’ve tried reseating the PSU cables and all the other cables and sockets. The only thing I haven’t reseated is the CPU as it’s been working perfectly fine for two years since I installed it. Is it possible that it moved and it’s sitting incorrectly on its own after 2 years?

I tried it without the new SSD in the M.2 socket and it still does the same thing.

What’s weird to me is that all of the components (CPU, GPU, SSDs, PSU, RAM at full speed) work wonderfully once I manage to boot into Windows. If I never had to restart, I would never have any of these issues.
How can there be a hardware issue that makes the BIOS not post and then it fixes itself once I’m in Windows?
The way it usually goes is, I get the code 94 or sometimes a code loop for 3-4 restarts, then the BIOS starts in safe mode and tells me to change settings, but I don’t change any settings, restart and get to Windows with all the components rock stable. Then it happens again after a shut down more than 50% of the time.

I just woke up and I’ll start with that process of stripping it down and testing one by one now. It’ll be difficult to notice the issue since it sometimes boots fine even now, but I guess I have to try.
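Since the failure is intermittent, one clean boot proves very little. A rough way to put a number on it, assuming each cold boot fails independently with a fixed probability (a simplification, real boots may not behave that way), is to ask how many clean boots in a row are needed before luck becomes an unlikely explanation. The function name and the 5% threshold below are just illustrative choices.

```python
import math


def clean_boots_needed(fail_rate, doubt=0.05):
    """Smallest n such that n clean boots in a row would happen by pure luck
    with probability at most `doubt`, if the fault were still present.

    Assumes independent boots with a constant failure probability `fail_rate`.
    """
    if not 0 < fail_rate < 1 or not 0 < doubt < 1:
        raise ValueError("fail_rate and doubt must be strictly between 0 and 1")
    # (1 - fail_rate) ** n <= doubt  <=>  n >= log(doubt) / log(1 - fail_rate)
    return math.ceil(math.log(doubt) / math.log(1 - fail_rate))


if __name__ == "__main__":
    # "More than 50% of the time" taken as exactly 0.5 for a conservative estimate.
    print(clean_boots_needed(0.5))
```

With a failure rate of about 50%, that works out to 5 clean cold boots in a row per configuration before treating it as "probably fixed", and more if the real failure rate is lower than it looks.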

FinOxy · February 1, 2022, 3:04pm

Hmm
I have also heard of Threadripper socket seating issues (again, sadly from YouTubers), which might affect the PCIe lanes.
Based on that, it might be worth reseating the CPU as well.
Good that you checked the motherboard’s cables; that is always a good step to start with. I would have mentioned it in the original message, but you most likely had so much data to process in your head already. I would have, as well.

LukaLumen · February 1, 2022, 3:12pm

I’ll definitely do that, thanks. I’ll google about it a bit. Luckily I still have some MX-4 thermal paste…

It’s just weird that the CPU would trigger the bios not to post, but in Windows it’s super stable when stress tested. Worth a try though.

ElBawkBawk · February 1, 2022, 3:22pm

LTT had a post a few weeks back where power management on his Threadripper was the issue. If I remember, some setting got reset during a BIOS update? Anyway, it may be worth looking at the process they followed, for other ideas.

TheSkuffedKnerd · February 1, 2022, 4:00pm

Threadripper, I missed that part earlier. So it uses the little torque wrench thing to properly seat the CPU in the socket? There is DEFINITELY only one way to properly tighten those screws so you get the proper torque all the way around. I suppose it’s possible that if that wasn’t followed during the initial install, it could have worked itself loose with all the jostling around with upgrades being done.

LukaLumen · February 1, 2022, 4:48pm

Oh, I remember it. I’m very particular and detailed when it comes to that. I followed the instructions absolutely and used the little torque wrench.

It worked perfectly from the first time I did it until I messed with the upgrades two weeks ago. The only thing I fiddled with on the CPU is the thermal paste… I noticed the CPU was throttling with my Wraith Ripper cooler and the stock paste on it. So quite a while ago, maybe a year ago I cleaned it off and applied the MX-4. Huge improvement. The CPU stays below 70degC even after a while of full load. Of course it gets toasty after a longer period with Prime95, but HWinfo still shows no throttling. I’m super happy with it.

I’ll definitely watch the video now and try reseating it, thanks!

FinOxy · February 1, 2022, 5:00pm

Hmm, all right
Again, I am working off those videos, so take my comments as coming from a person who has never worked with Threadripper. My experience is mostly on consumer-level platforms and a single Intel Xeon E-series CPU, although to me that might as well be a Core i5.
I should have made this more clear…
Sorry about that.

LukaLumen · February 1, 2022, 5:46pm

Thanks ElBawkBawk, I just watched the video and unfortunately his was booting fine, but would randomly crash in Windows. Mine is rock steady in Windows and 100% stable. Just not booting properly :-/ So there are no logs for me to check and c-state is probably not the culprit.

I did do the bios update before installing Windows 11, but that was months before the problem started… Maybe it’s worth a shot re-flashing bios?

I’ll keep digging.

ElBawkBawk · February 2, 2022, 8:08pm

PSU is another thing. That’s near double wattage from the 1660 to the 3060 Ti. Sure that’s at max, but your PSU is old and not getting more-better. It’ll fade over time. Toss in some fans and a power-hungry NVMe drive, and you may have reached its limit (which is surely lower than the day you bought it).

I have a small 5-GPU mining rig to heat the small addition on my house, and when the old PSU started to go, it would not POST, and it would repeatedly reboot when it failed. On the 4th or 5th try, it would go just fine for days. I put a power monitor on it to see what it was drawing, and it seemed that when it hit 80% of its capacity, it would act erratic. I replaced it last month, and haven’t had any issues since.

Check the new PCIe power cables, splitters too. Double-check they are seated in the supply correctly. Then try a different GPU and PSU.

LukaLumen · February 2, 2022, 9:40pm

Thank you for your suggestions!

I had PSU issues in the past and that was something I considered. I made sure all the cables have good connections.
When going through wattage calculators for my whole rig, with 8 sticks of RAM, fans…, everything included, a 650W PSU is recommended. I have 750W and it’s Gold, with Japanese capacitors. If I put maximum strain on it, with FurMark and Prime95 running at the same time, so the CPU and GPU are drawing their full power potential, the PSU is going strong. Considering the problem only happens when POSTing and the PSU is stable in Windows, I counted that out.
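For anyone who wants to redo that budget by hand, here is a rough Python sketch of the arithmetic. Every wattage in it is a nominal placeholder I'm assuming for illustration (spec-sheet TDP or board power for the CPU and GPU, guesses for everything else), not something measured on this machine, so swap in your own numbers.

```python
PSU_WATTS = 750  # EVGA 750 GQ, from the parts list

# (component, count, watts each). All wattages are rough nominal ASSUMPTIONS,
# not measurements; a factory-overclocked GPU may draw more than the reference figure.
PARTS = [
    ("Threadripper 3970X (rated TDP)", 1, 280),
    ("RTX 3060 Ti (reference board power)", 1, 200),
    ("DDR4 DIMM", 8, 4),
    ("case fan", 4, 3),
    ("NVMe SSD", 2, 8),
    ("motherboard, chipset, WiFi card", 1, 50),
]


def total_draw(parts):
    """Sum count * watts over every entry in the parts list."""
    return sum(count * watts for _, count, watts in parts)


def load_fraction(parts, psu_watts):
    """Estimated peak draw as a fraction of the PSU's rated output."""
    return total_draw(parts) / psu_watts


if __name__ == "__main__":
    print(f"estimated peak draw: {total_draw(PARTS)} W")
    print(f"share of PSU rating: {load_fraction(PARTS, PSU_WATTS):.0%}")
```

The resulting share can be compared against the 80% mark mentioned above, keeping in mind that short power spikes and an aging PSU are not captured by a static sum like this.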

Now, I have some updates…
I’ve been testing the system, but it’s hard to say if I narrowed down the problem because the only thing that seems to improve chances of the BIOS POSTing is to remove the SSD (from a slot that experienced SSD failure recently). This is purely anecdotal at this point.

However, another big clue is that Memtest86 is failing absolutely every time I try it. No matter how many sticks of RAM and no matter which slots. Using my old trusted sticks or using the new ones, it shows errors and/or freezes every time. I’m sure not all sticks and all sockets are damaged, so this is weird behavior. Again, when I’m in Windows, RAM is working perfectly with all 8 sticks at 3200MHz. It’s only Memtest86 that freezes.

Could it be the CPU?? It was always great and it would be a huge coincidence if it’s failing now when I upgraded other components. Also, it’s running perfectly when stress tested in Windows.

This is driving me insane :-/

TheSkuffedKnerd · February 2, 2022, 10:46pm

At that point, I would go with mobo or CPU. Have you removed and tried reseating and making sure the proper torque sequence is followed when putting the CPU back in?

LukaLumen · February 2, 2022, 11:10pm

That’s something I left for last. I’ll try doing it now.

Before I do, here’s what happened now:

With everything stripped off other than CPU, GPU and 1 stick of RAM in slot A1, I ran another test with Memtest86… Again, it didn’t show any errors, but it froze around 68%. I left it running for over an hour, but it was just doing nothing.

The weird thing is, the GPU was very hot to the touch and the fans weren’t running at all. I don’t know why the GPU would be stressed with only Memtest86 running from a bootable USB.

Also, there was a code F2 on the display which means “Recovery mode started” and that makes no sense to me. It was frozen in Memtest86, why would a BIOS recovery mode start?

I’ll try reseating the CPU now as the last hail Mary. None of this makes sense :-/

TheSkuffedKnerd · February 2, 2022, 11:36pm

Hmm … I wonder if Memtest is getting confused between system RAM and the GPU VRAM and trying to test the GPU instead of the system RAM? I’ve never seen that happen but anything is possible.

If you happen to have another GPU laying about, try putting it in then run Memtest and see what happens.

This is getting weird and I’m beginning to think you are trolling us

LukaLumen · February 2, 2022, 11:41pm

Oh god, I think my PC is trolling me! I thought I was quite knowledgeable, I’ve been building PCs for myself and friends/family for 20+ years and have never seen anything like it.

At this point I’m suspecting the SSD that failed shorted the M.2 slot and that is giving the PCIe errors when posting (the new drive in that slot works fine in Win though!!!).

And Memtest86 is just not playing well with my PC for some reason and I’ll think of that as an unrelated problem for now.

LukaLumen · February 3, 2022, 1:30am

Ok, the nightmare is over. I think.

After stripping down the PC and testing it component by component, the issue is the M.2 socket that must have gotten damaged when the SSD shorted in it. It’s strange that it would still work fine if I managed to boot into Windows. But it would trip BIOS most of the time and didn’t let it POST.

The thing that seemingly made it harder is that the issue persisted even if I would remove the drive from that socket unless I cleared CMOS while the battery was out. So it wasn’t enough to take the drive out and try rebooting again…

Now after replicating it several times I’m 99% sure it’s the M.2 port.
The good news is, I still have a third M.2 socket I can use for the same drive and it’s working fine.

The only question I still have is - is it worth it to complain to Best Buy? The drive I got from them was in a crushed box and the reason one of my M.2 sockets is unusable is because that drive shorted it. I kind of blame myself for even sticking that drive into my system, but I thought the worst that could happen is that it wouldn’t work.

Thanks again to everyone who contributed! It made me feel better and not so alone…