Cannot boot anything except MemTest & UBCD

Hello LevelOneGeniuses,
I’ve run into a very strange issue today that has me completely stumped and I’m hoping someone here might point me in a new direction. I woke up today to my home server being down and I am now unable to get ANY OS to fully boot.

I’m running a TR Pro 5955WX in an ASUS WRX80E-SAGE WIFI mobo. It has been reliably running TrueNAS for a little over a year, but I am now unable to boot from either of the mirrored boot drives, or even from a USB drive. I’ve tried booting Ubuntu 22.04, a clean TrueNAS image, CentOS (unsure of the exact version atm), and an image of Debian server (11, I think?). All of these images boot successfully on my desktop PC.
I’ve tried resetting the BIOS to an external, known-good backup as well as to defaults, re-flashed the BIOS (1401->1401), tried CSM both enabled and disabled (it was always running disabled), and removed all expansion cards besides the GPU. Nothing results in a successful boot.
The boot process hangs at seemingly random points. 2-3 times it booted into TrueNAS successfully, but then locked up within 3 seconds. When these lockups happen, keyboard input is not recognized and I’m forced to shut down using the front power button. The lockups always happen after GRUB has loaded, while the OS itself is booting.
I’m currently running P95 via UBCD to stress test the CPU, as I’ve seen a report or 2 of CPUs being the problem, but I’m so far 1hr into the P95 test without issues. I’ve also run MemTest without any errors. Temps are not the cause.

At this point I’m stumped. It could still be the CPU, but it seems odd that I’ve been working on this constantly for over 12 hours and can’t get anything to boot reliably, yet Prime has been issue-free for an hour already. Memory seems odd as a cause as well: it passes MemTest but constantly fails to boot an OS?
I unfortunately don’t have any means of acquiring a replacement CPU at this time, so I’m unable to test that, but if anyone has any other ideas I’d be glad to try them out! Hoping to eventually close this post with a resolution :slight_smile: Please let me know if there is any other information I can provide to help diagnose this.

Thanks,
Kyle

Does the IPMI give you any insight into system problems?

Thanks for the response!
I’m not super familiar with all the IPMI stuff, as this was my "let’s learn something" project a while back. I did check IPMI and I didn’t see anything in any of the logs, though I’m not sure if I should expect to see much in there.
I did notice that it is reporting a PSU error. I picked up a new PSU today and swapped it in, tried a CPU burn-in test on UBCD, and got the same PSU failure error again. It’s telling me that PSU1 and PSU2 are reporting errors, though I’ve only ever been using 1 Corsair HX1200.
I’m running another MemTest right now to see if that will cause it to throw any errors. I was wrong in my original post that it had been running P95 for over an hour. It had actually frozen as well, but the cursor on screen kept blinking and the keyboard lights stayed on, whereas usually they don’t, so I thought it was OK.

At this point, is it more likely the motherboard or the CPU? Does anyone know of a good way to figure that out without buying another $1000+ mobo/CPU? Both should be in warranty, but I need to know which one to send in for a warranty claim.

Thank you so much!

Some of the usual suspects would be mounting pressure for the CPU and heat sink.
Do you have the lower PCIe 6-pin power connectors plugged in?
I did find that the newer IPMI firmware makes some errors ‘go away’: fan speed for one, not sure about PSU.

Also, check that all the connections on the mobo are still tight. Make sure the CPU cooler is working.

I’d try disabling even the onboard peripherals like USB controllers as well.

If I’m reading correctly, you’ve only ever tried the same GPU in it? Can you try another? If you have a VGA cable you can connect it to the onboard VGA header connected to the BMC, but you’ll either need the adaptor or some Dupont wires to bodge it.

You can also enable the onboard VGA and use it over the IPMI interface; you don’t need a GPU.

Depending on the screen mode (i.e. a VGA mode), the cursor blink doesn’t need the CPU’s involvement. It’s done entirely by the GPU, so the CPU might still be locked up at that point.

I didn’t see you mention the POST code; that would be an important bit o’ info.