Goodbye (and Hello) Threadripper

I was reading through your other topic, and I suppose you have tried pretty much most of the obvious things already.
But have you tried to re-seat the CPU already?

I would not really be surprised if the culprit might actually be the motherboard at this point.

All 3 motherboards?

Ah, he used three boards already?
Then that is less likely, unless really unlucky.

Yeah in the video she summarizes all the parts she has swapped.

3x Motherboards
2x CPUs
Zero to multiple M.2
I think at least 3 different GPUs in different slots.
2 or 3 PSUs (even had the wiring updated to the socket I think)

//edit > fixed pronouns. Apologies, I think the audio in the first video mixed me up and I made assumptions.

Yeah, I was just watching that; I was reading his other topic.
This is really strange.

So sad. Read all your stuff. So weird. I changed my system during production while working for a AAA studio, which is something that made me VERY nervous. But the TRX50 Sage WiFi just booted with my old hard drive. Imagine that.
I am so sad for you.
But yeah, ASUS and AMD leaving us so in the dark with the EXPO thing again does not help.

A thing that was a bit strange to me is how stubborn my Kingston DIMMs were to slide in. Their PCB is a bit rough. Did you double check that your DIMMs sit correctly? You should hear the click sound on both sides!
Also, I don't really want to criticize, but I would at least make a bit more of a clean, roomy space to build such things. Just in case you did that where it stands right now. I basically even grounded my hands every single time I walked through my apartment before continuing to assemble the PC… Call me crazy, but I wear loose outfits at home that charge up quickly. I also did not wear these

Was reading your saga with this build in another post. Unbelievable. Hope the next build is smooth and you can get back to music and other things you love. Good luck.

Have you tried turning it off and on? /s

Seriously, that sucks. Sorry it’s being so stubborn for ya :confused:

I can’t believe the problem is the hardware at this point. There has to be something else going on. I would be interested to see whether the problems persist if the computer is plugged into a good quality, online double-conversion UPS, which would effectively isolate your computer from your house electrical system. The hardware has been ruled out at this point. It’s time to start looking elsewhere.

This was supposed to be addressed in the latest SPR microcode. I had tripped over this issue as well.

Does your board have a BIOS update?

Did not know that. Thanks, I would not have tested again and now I will. Just installed latest BIOS yesterday, 1202, for Pro WS W790E-SAGE SE.

Wild idea: I’ve had this happen exactly once in 30 years.

When I last rebuilt ‘Hulk’, dual Xeon 6154, I removed the CPUs but misplaced the caps. The caps protect the socket from debris, bent pins etc. Not a big deal, I’d be careful and everything would be fine. Reassembled the system to get intermittent CATERRs. Did all the troubleshooting to no avail. Finally decided to rebuild the system again. Upon removing the CPUs I noted a single cat hair managed its way into socket 0, the socket I was getting CATERR on.

I understand you have rebuilt this system several times with new components and done a lot of troubleshooting. Nevertheless, it might be worth inspecting the socket for debris.

Also, I agree with your assessment regarding not achieving 100% solid functionality on TRPro. It is fairly solid, but recoverable PCIe errors are problematic at times. Other than lowering to Gen3, I’ve found no solution.
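
For anyone who wants to see which device is throwing those recoverable PCIe errors on Linux, here is a minimal sketch that tallies them from kernel log text (for example, the output of `dmesg` or `journalctl -k`). It assumes corrected errors appear on lines containing "AER" plus the word "Corrected" or "Correctable", and that the last PCI address on the line is the reporting source. The exact wording differs between kernel versions, so treat the pattern as an assumption. The function name `count_corrected_aer` is hypothetical.

```python
import re
from collections import Counter

# PCI address such as 0000:41:00.0 (domain:bus:device.function)
PCI_ADDR = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")

# Matches "Corrected" / "Correctable" but not "Uncorrected" / "Uncorrectable"
CORRECTED = re.compile(r"\bCorrect(?:ed|able)\b", re.IGNORECASE)


def count_corrected_aer(log_text: str) -> Counter:
    """Count corrected AER log lines per PCI address (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Assumption: a corrected-error report mentions "AER" on the same line.
        if "AER" not in line or not CORRECTED.search(line):
            continue
        addresses = PCI_ADDR.findall(line)
        if addresses:
            # Assumption: the last address on the line is the error source.
            counts[addresses[-1].lower()] += 1
    return counts
```

If one address dominates the tally, that slot or device is a reasonable first candidate for dropping to Gen3 or moving to another slot.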

Best of luck.

ERR codes are really explicit nowadays. /s

I feel for you, going through similar stuff with mine after a year and change of working flawlessly… it started crapping on me…

What’s the exact issue that you are having? Can it even open the BIOS? Or does it crash, or does storage not work?

Have you tried a different cooler? Since you’ve tried so much already: a cooler with just enough of a manufacturing defect to cause some kind of mounting pressure issue when the CPU goes through heat cycles would be a very hard issue to find. It’s an odd thought, but this is an odd problem.

Also check out this post; user inexperience aside, it’s another odd issue with switches on this motherboard. Aside from the switches that PXaZ on Reddit switched, you can also try checking the LN2 switch: try turning it on and off, and make sure the RSVD switch is off.

Just shooting ideas here, you never know.

I have used AMD Threadripper since the release of the 3975WX. Personally, the only things I did not understand when loading up the system were the need for power and exactly which slots on the motherboard are PCIe 4.0 x16 and which ones are not. I learned the hard way that the motherboard only supports four PCIe 4.0 x16 slots. I do not remember which GPUs you have in that workstation when loaded, but that matters.

Also, I did not understand the power it takes to run that motherboard. I remember loading up that motherboard; the PSU was like no way. Then, I realized there were multiple PSU connections, and I needed to have power at all of them. Even though I removed the use of multiple GPUs, I kept the multiple power connections.

And out of respect, you have a lot going on. From my point of view, I would appreciate that kind of input and connections one at a time. I am starting to believe there is something in that system that keeps tripping you up.

Someone mentioned RAM, and for me and this workstation board I have, I only use ECC. At the end of the day, I wanted the stability in the RAM department. I am not saying it is your issue. But maybe consider finding the smallest ECC RAM, without too much cost, at 4GB each in size, and see what happens.

I feel for you. I’m thankful my TR 1950X with 2 GPUs and 2 PCI-E USB cards has been rock solid. Running Linux. I removed one USB card and one GPU a few months ago, but that’s because I was planning for a new build.

Not sure if you already did this, but have you tried running a hypervisor and doing PCI-E passthrough to a Windows VM? Maybe Linux will handle the system better. If Linux is also crashing with your hardware, then we know it is the hardware. And if a Windows VM crashes with the passed-through hardware, we know it’s Windows that is broken.

The experience with these passed-through VMs will be almost as if you’re running Windows natively, with the only side effect being that it takes a bit longer to boot (the hypervisor boots up and you get a black screen, though you can access it via a browser; the Windows VM can start on boot, so you don’t have to interact with the hypervisor).

This is partly why I don’t like niche high-end products. There’s gonna be bugs in everything, but things that are owned by more people will get more bug reports and will have priority over things that are barely used by anyone. That’s why consumer platforms tend to get stable pretty fast.

I remember the days of early TR when it was considered a buggy mess, yet here I am rocking a 1st gen TR now. I bought it from one of the forum members about 3 years ago or so, getting close to 4 now (whew, how time flew by!).

At this stage, I’d remove everything, then reseat everything back in place with a thorough examination of every mount, connector, wire, pin and visible capacitor on the CPU, mobo and RAM with a magnifying glass or zoomed-in phone camera. Shift the RAM modules around to different slots, alternate HDD / SSD / NVMe connectors (if available), apply new thermal paste, check the QVL for all components, the whole megillah. Pay attention to the USB mobo connectors. I had a few with bent / pushed-in / broken pins straight from the manufacturer, including an RMA replacement.

If doing that gives you the same results, then it is time to swap out components one at a time.
