Threadripper Windows issues

Ok I’ve really had enough. Can I vent and ask for help?

TL;DR: Are there any BIOS settings specific to this combination of hardware that make Windows stable?

So I’m very busy… too busy to deal with issues that I feel a manufacturer should have known about and dealt with.

I’m the one IT guy for a medium-sized business, and I have a lot on my plate. So when our content creators needed new machines to replace their 2014 iMacs, I suggested we go with Windows machines, as an iMac (or Mac Pro, take your pick) would be 3x the price with 1/2 the performance of a Threadripper (OK, I exaggerate, but you get the picture).

They were all about it, so I was happy, but this is where it all goes wrong. Since I’m so busy, I thought buying a prebuilt gaming PC would be the easiest way to go: it would take the build time off my hands and let me do my normal work. I love to build and always build my own, but for the company just making a purchase makes sense. So I decided to go with a reputable manufacturer that I like (I won’t flame them) to custom build the machines. We bought 2 identical machines.

The specs are:

AMD Ryzen Threadripper 3970X - 32 Cores - 3.7GHz
Manufacturer-branded liquid cooler (it’s a Corsair 240mm system)
NVIDIA GeForce RTX 2080 Ti
ASRock TRX40 Creator
64 GB HyperX Predator RGB - 4 x 16GB - 3200MHz
EVGA 850W SuperNOVA P2
500GB Seagate FireCuda 520 Gen4 M.2 NVMe (OS Drive) m.2 slot 1
2TB Seagate FireCuda 520 Gen4 M.2 NVMe (Scratch Drive) m.2 slot 3

OK, so this has been a nightmare. Shipping damage broke the glass on one of the cases, so I had to clean the case out; I literally had to take everything apart and rebuild it. While that was not the manufacturer’s fault, it was just the beginning. Both PCs had issues. It turns out that both had bad graphics cards… but they made it through QC…? They sent us new graphics cards, and replacing them fixed the completely unstable Windows nightmare, but 1 problem has remained the entire time.

Every day the computer completely dies right after booting and loading Windows. You might get signed in, and it might make it 1 to 3 minutes if you’re lucky, then the entire system dies… just goes black… dead… you have to push the power button to turn it back on. After doing this maybe 1, 2, or 3 times it finally stays running and they can get their work done. This is not right. Are there any BIOS settings or configurations that anyone knows of that can help?

Our current BIOS settings are factory defaults except:
XMP is enabled for the RAM, so it’s set to 3200MHz
Infinity Fabric is set to 1600MHz
Everything else is set to auto, no overclock
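
For anyone checking those two numbers against each other: DDR4 transfers data twice per clock, so a 3200 kit runs a 1600MHz memory clock, and an Infinity Fabric clock of 1600MHz is a 1:1 ratio with it. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not from any tool.

```python
def memory_clock_mhz(ddr_rate: int) -> float:
    """DDR memory transfers twice per clock, so the real clock is half the rated speed."""
    return ddr_rate / 2


def fabric_ratio(ddr_rate: int, fclk_mhz: int) -> float:
    """Ratio of the Infinity Fabric clock to the memory clock (1.0 means 1:1)."""
    return fclk_mhz / memory_clock_mhz(ddr_rate)


# Speeds mentioned in this thread: 2400, 2933 and 3200.
for rate in (2400, 2933, 3200):
    print(f"DDR4-{rate}: memory clock {memory_clock_mhz(rate)} MHz, "
          f"ratio with a 1600 MHz fabric clock = {fabric_ratio(rate, 1600):.2f}")
```

The point of the loop is that a fabric clock pinned manually at 1600MHz is only 1:1 at 3200; if the RAM is slowed down for testing, the fabric setting would need to follow it (or go back to auto) to keep the ratio.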

Any insight, help, or suggestions are appreciated. We’ve only had these systems about 30 days. I’ve spoken with the builder’s support several times and it gets me nowhere. They’re saying things like it’s the power in your building, or to reinstall Windows. I’ve done everything I know to rule out everything except messing with the BIOS settings (which I think will fix this), unless there’s just some incompatible hardware here.

Thanks guys,

Jonathan

I’d slow the RAM down, way down. XMP is overclocking; it is an optimistic rating based on a perfect-world configuration.

Start at 2400MHz and see how things go. A lot of mass-produced machines are still at 2133, so 2400 isn’t terrible.

Rather than reinstalling Windows, swap in another drive and do a fresh install. If stability improves, great; if not, you have your original install to go back to.

Depending on what level of support you paid for from your builder, it should be on them to make sure the machines are reliable. Maybe it’s time to send them back and look for something else.

If this is a medium-sized business (a few thousand employees), you should command a big enough budget to influence a vendor to make sure you’re happy. Losing out on hundreds of thousands in future sales isn’t something even big vendors want to risk.

I had to play hardball with Dell just a few weeks ago over a Precision workstation that arrived with a crap GPU. They probably tested one output, which worked fine, but didn’t test the others.

So I have done a reinstall of Windows on 1 of the systems and the issue continued. I also ran the RAM at the factory default (2933 if I remember right) for a bit and still had the issues, so I bumped it back up to 3200 since there was no difference.

We have about 80 employees and our budget isn’t very big, so it won’t be of help with this vendor. I should have used our Dell rep and just gone with their machines, but I couldn’t get the specs I wanted at the time.

The vendor has tried several times to assure me that the systems went through QC, but I don’t see how that is even possible.

So it’s not just the screen going black, but the fans stop spinning and the lights inside the case go off? Like someone switched off the PSU?

Yes, like the PSU was turned off. They just go dead, everything off: LEDs, fans, all buttons, everything, like it lost power. I’ve seen some really funky things with PSUs going bad over the years and know that it could be a possibility, so I hooked them up to my PSU tester and they’re both fine, no issues with either.

I’d send them back for a refund if that option is still on the table.

Lenovo is coming out with Threadripper based business class machines this winter.

Check the voltages. I have seen the auto voltages on some rails (e.g. PLL) go through the roof when XMP is enabled: a 1.8V rail suddenly read 2.2V.
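
To put a number on that, here is a rough sketch for comparing a reading from the BIOS hardware monitor against the nominal rail voltage. The 5% tolerance is an assumption picked for illustration, not a figure from AMD or the board vendor, and the function names are hypothetical.

```python
def deviation_pct(nominal_v: float, measured_v: float) -> float:
    """Percentage by which a measured rail voltage differs from nominal."""
    return (measured_v - nominal_v) / nominal_v * 100


def out_of_range(nominal_v: float, measured_v: float, tolerance_pct: float = 5.0) -> bool:
    """True if the reading is further from nominal than the (assumed) tolerance."""
    return abs(deviation_pct(nominal_v, measured_v)) > tolerance_pct


# The case described above: a 1.8V rail reading 2.2V.
print(f"{deviation_pct(1.8, 2.2):.1f}% over nominal")
print("out of range:", out_of_range(1.8, 2.2))
```

A reading that far above nominal would be worth setting manually rather than leaving on auto.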

Almost sounds like a thermal issue, maybe the pump isn’t kicking on all the time?

I guess it could be the pump, but after 2 or 3 times it stays solid for most of the day. They have had it cut off on them once during a render, but other than that it’s mostly that 1-2 min window right after startup that it crashes like this.

That’s what we’re talking with them about. I really don’t have the time to troubleshoot that much, but I’m just assuming it’s an issue with one of the pieces of hardware. RAM is what I really think it might be, but I don’t have different sticks to put in it. I’m going to set the frequency down low like gordonthree said and see how that goes.

The fact that the issue is impacting both machines indicates it is an inherent fault with the setup.

The timing of the glitch is interesting: boot-up can draw a lot of power in a very short time, so it may be a momentary overheat. In a modern system, thermal issues are the most common cause of an instant shutoff.

I’m inclined to agree that sending them back now and cutting your losses is the best course of action. At the end of the day these are business workloads and any instability will cost you more in the long run than the hassle of a return.

As others have said there are enterprise vendors (Lenovo mainly) who sell non-gamer kit for the workload you describe, but you may want to also look at reputable boutiques like Puget Systems (depends on your market).

If you are going to persevere with the hardware, I’d investigate (in no particular order):

  • swapping the PSUs for another brand, and considering a 1000W option
  • using an adapter to connect the AIO pump to the PSU directly, in case the motherboard header doesn’t supply enough power (check which header it is on)
  • pointing a fan at the VRMs to see if they are overheating (unlikely but easy to check)
  • dumb check: removing the AIO and making sure they put thermal paste on it
  • swapping out the GPU for a different brand
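
On the 850W versus 1000W question, a back-of-the-envelope budget can be sketched like this. The CPU figure is AMD’s 280W TDP for the 3970X and the GPU figure is roughly the reference board power for a 2080 Ti; the 100W allowance for everything else is purely an assumption. Rated figures are not peak draw, and short transient spikes can go well above them, so treat this as a rough estimate only.

```python
# Rated power figures in watts. The last entry is an assumed allowance, not a measurement.
COMPONENTS_W = {
    "Threadripper 3970X (280 W TDP)": 280,
    "RTX 2080 Ti (about 250 W board power)": 250,
    "board, RAM, NVMe drives, pump, fans (assumed)": 100,
}


def load_fraction(psu_watts: int, components: dict = COMPONENTS_W) -> float:
    """Fraction of the PSU's rated capacity used by the summed component ratings."""
    return sum(components.values()) / psu_watts


for psu in (850, 1000):
    print(f"{psu} W PSU: {load_fraction(psu):.0%} of rated capacity")
```

On these assumed numbers the 850W unit is not obviously undersized for sustained load, which is why a swap is more of a test for transient behaviour or a marginal unit than a guaranteed fix.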

At the end of the day, if the list of actions above costs more (time and materials) than returning the equipment and starting again, you may want to consider ordering new stuff and getting a refund.

Final point. I’ve used Threadripper on my workstation for 2 years. It can be an unstable mess, and simple things like RAID mirrors just don’t work properly. I’m a huge supporter of what AMD has achieved over the last couple of years, but for the data centre and real workloads, we use Xeons. This is just a personal perspective though.

Thanks for all of the feedback, guys.

We’re going to return these machines and go with a Dell solution since we already have that relationship. This morning they really had a freak-out moment because one of the machines died before a complete boot “6 or 7 times” and they had some urgent things to complete.

The vendor assured us that they would make it right: we could ship them the machines and they would rebuild them and make sure that they ran correctly. But with all that we’ve had to deal with, we’ve lost faith.
