Diagnosing a dead build

So I finally put together the Threadripper system I’ve been planning for a while. Yay!

…it doesn’t work. Like, at all. Shit.

It powers on, sure, but not in a remotely usable state. I can’t even get to the UEFI. So now begins the painstaking process of figuring out where the problem is. Fortunately my motherboard (ASRock X399 Taichi) does have a digital display for error codes on it. Unfortunately the code it’s hanging on is completely unhelpful. The manual just says, “Chipset initialization error. Please press reset or clear CMOS.”

I’ve tried resetting the CMOS; it doesn’t do shit.

Now there are a bunch of other codes that flash by before that one. To be honest I’m not sure how much I should even be paying attention to those (this is my first mobo with one of these displays…).

I’ve tried fiddling with the RAM, installing just one stick at a time, but it doesn’t matter which stick I use. To be honest I’m at a bit of a loss as to how to proceed from here. Since everything in this build is new, I can’t have faith in any of the components…

So how do you guys deal with a situation like this?

Ah, that sucks. It is always exciting doing a new build, and then depressing when it isn’t awesome out of the box.

A couple of things I have done in similar situations:

Pull it all apart, clean it all up, and start again, slowly and methodically.
Only install the parts required to get the system to boot to UEFI. Read the manual and follow the steps in there.
Make sure you have both additional power plugs connected to the motherboard, the 4-pin and the 8-pin.

The codes will change depending on where the boot cycle is at. If it hangs on a particular code, that is usually important.

You can do a little bit of troubleshooting already, since you can probably rule out the power supply. I take it it is not shutting off.

Oh and I’m pretty jelly, I’d love a threadripper server.

Attempt to tighten the screws that hold down the CPU, as well as the four screws that hold the socket to the board itself. If any of those are loose it won’t make proper contact with the CPU and will cause issues.


Also try removing the CPU and RAM and see if the error stays the same. If it does, I would guess it is the board.

Keep a close eye on the “return by” date.
Last time this happened I spent 2 hours narrowing it down to the CPU/motherboard. I was so pissed off that everything went back.

So it looks like, of all things, the problem is with the GPU. Why that would be producing chipset error codes I don’t know, but I tried another card because it was something I could try (I don’t exactly have spare Threadripper CPUs sitting around to give a whirl, but I do have other video cards…) and much to my surprise I got the system to boot. Tried putting the previous one in again and it actually worked for a little while, but then started giving me the same kind of issues. Swapped back and things were fine again.

The ‘worked for a little while’ part leaves me not completely convinced (I like it a lot more when things either work or they don’t…) but it certainly seems like that was the problem.


With the system now at least usable I would recommend a UEFI update.


You really want to make sure it won’t crash during that, however, so at the very least you want it stable sitting on the BIOS screen beforehand.


If it were me, I would use the video card you said is working, boot up Ultimate Boot CD or whatever you want to use that has a system stress test on it, and run that for a good hour or so to make sure you are stable enough to do a firmware update. If the board has dual UEFI like Gigabyte used to/still does (been a while, not sure), you could YOLO it and try the update without doing a stress test, knowing you have a backup of the UEFI. With a new platform like this, a UEFI update should be the first thing you try before you pull your hair out.
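
If you’d rather not hunt down a boot disc, even a few lines of Python will keep every core pinned for as long as you like. This is just a rough sketch; the one-hour duration and the busy-work in the loop are arbitrary placeholders, not a substitute for a proper stress tool.

```python
# Rough CPU burn-in: keep every core busy for a fixed period before flashing.
# The duration and the math in the worker are placeholders; a bootable stress
# tool like the Ultimate Boot CD suggestion above is the more thorough option.
import math
import multiprocessing as mp
import time

DURATION_SECONDS = 60 * 60  # one hour, per the suggestion above

def burn(stop_at: float) -> None:
    # Busy-loop on floating point work until the deadline passes.
    x = 0.0001
    while time.time() < stop_at:
        x = math.sqrt(x * x + 1.0)

if __name__ == "__main__":
    deadline = time.time() + DURATION_SECONDS
    workers = [mp.Process(target=burn, args=(deadline,)) for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("Burn-in finished without the machine falling over.")
```

If the machine resets or hangs partway through, you have your answer before risking a flash.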

Never rule out the power supply. PSU problems are the worst, especially in high-current systems like Threadripper.

Let me tell you about my power supply issue: System ran great for 4 weeks. Then I had one reboot, figured it might be software. Then it happened again, and again, with increasing frequency. I had that sinking feeling in my gut, but I had all quality parts, and my build was meticulous (I’ve been building systems since the turbo XT days). After eliminating shit like USB devices as a culprit (USB is shit!), I suspected power.

Turns out the EVGA 850 P2 power supply was shit when in silent mode. I turned on the fan always mode (it’s a switch on the power supply) and things stayed up. I filed an RMA request with EVGA. I installed the replacement power supply, and it failed within an hour. It was worse than the original one.

So, I went and bought a Corsair RM850 power supply. Arguably not as good, as it’s only Gold rated while the EVGA PSU was Platinum. However, since installing the Corsair everything has been great (100+ days of uptime, never a single crash). Fortunately, Amazon allowed me to return the original EVGA power supply, and I shipped the bad replacement EVGA power supply back at their expense.

I’ll never use EVGA again. Not only were their power supplies crap, but the hoops they make customers jump through are absurd.


Did a UEFI update without issues (well OK, it was actually really flakey about finding the new version on my USB stick, but once it finally did there were no problems…). Set up my RAID arrays, installed Debian, and actually left it running all night rsync’ing stuff from my old machine. No issues as long as I don’t have the bad video card in there.
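
The overnight transfer itself was just plain rsync; a minimal wrapper along these lines (the host name and paths here are placeholders, not my real layout) also makes it obvious the next morning if the run died partway through.

```python
# Minimal sketch of the overnight copy: wrap rsync so a non-zero exit status
# is easy to spot. Host name and paths are made up for illustration.
import subprocess

SRC = "oldbox:/home/me/"    # hypothetical source (old machine over SSH)
DST = "/srv/storage/home/"  # hypothetical destination on the new array

result = subprocess.run(
    ["rsync", "-aHAX", "--partial", "--info=progress2", SRC, DST],
    check=False,
)
print(f"rsync exited with status {result.returncode}")
```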

Time to RMA a video card I guess. Thankfully of all the components that’s probably the easiest one to deal with.

See, that could be a power supply thing.
Taking out the power draw of the video card puts the system below the power threshold.
Run some CPU tests without the video card. Might be illuminating.

Hmm, strange. I figured switching power supplies would either work or shut themselves down when overloaded or hitting other errors. However, that protection probably doesn’t apply to every rail they provide, let alone all the strange ones.

Now, with every man and their dog making power supplies, and them getting cheaper all the time, we need to be more cautious, I guess.

Good news! What sort of card did you have in there?

In a perfect world they would. But depending on the quality and age of the PS, it can cause all sorts of weird problems. @NetBandit is right, they are sometimes the most frustrating things to diagnose because, being the base of the whole system, they often don’t produce any useful error logs. That’s why I tell people to never cheap out on the PS, no matter how tempting it is to save a few bucks.

To rule out the PS, I guess we would need to know what video cards we are talking about here. If the power draw is similar and both cards use the same number of PCIe power connectors, I would comfortably say the PS is good.

I don’t know why you’re fixated on this power supply idea, but the working card draws significantly more power than the broken one, so I really don’t think that’s it.

Perhaps I should explain that I was planning on doing GPU passthrough on this system, so I got a cheap RX 460 for the host and a Vega 64 I was going to use through passthrough. Since I’m still just setting things up, when I first started running into these problems I was running the 460 by itself; the Vega wasn’t even installed. Now I’ve got it running on the Vega, so GPU passthrough will have to wait, but it’s still a pretty nice system…
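
For a rough sense of the numbers, using published board power and TDP figures rather than anything I’ve measured:

```python
# Back-of-the-envelope totals for the two configurations; the board power and
# TDP values are the published ballpark figures, not measurements.
THREADRIPPER_TDP_W = 180   # first-gen Threadripper TDP
RX_460_BOARD_W = 75        # RX 460 typical board power (the misbehaving card)
VEGA_64_BOARD_W = 295      # Vega 64 typical board power (the card that works)
REST_OF_SYSTEM_W = 100     # rough allowance for drives, fans, RAM, and the board

for name, gpu_watts in (("RX 460 config", RX_460_BOARD_W),
                        ("Vega 64 config", VEGA_64_BOARD_W)):
    total = THREADRIPPER_TDP_W + gpu_watts + REST_OF_SYSTEM_W
    print(f"{name}: roughly {total} W peak")
# RX 460 config: roughly 355 W peak
# Vega 64 config: roughly 575 W peak
```

If the supply couldn’t handle the lighter RX 460 load, it would be struggling even harder with the Vega 64, which is why I don’t think it’s the PSU.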


With that information, I think you are good saying the RX460 is the issue for whatever reason. Without knowing what video cards you were trying to run before, it was impossible to say for sure. Does the 460 run in another system?

OK, you have more information than I do. I suppose a video card can go belly up.
It’s just that in my experience, power supplies, RAM, and hard drives are the components most likely to fail. RAM is easy enough to test, and the problem doesn’t sound like a hard drive. Memtest86+ is so easy to use that you should always run it for a couple of passes just to rule out RAM.