Hello!
All you lovely people were able to help with my last issue with my Threadripper, and I’ve run into another wall.
Current System:
Ubuntu 20.04
Threadripper Pro 3975WX
Gigabyte WRX80 Motherboard
8 x 64 GB NEMIX RAM → drop-in replacement for RAM on the QVL, given the thumbs up by Gigabyte
2 x 3090 Gigabyte Gaming OC GPUs
Crucial P1 NVME 2TB Drive
Custom Loop Watercooling for TR Pro and GPUs → both are under 30C, max at 45C under load
Up until 2 months ago, my Dual Boot (Windows 10/Ubuntu 22.04 LTS) system worked like a charm.
Then, my screen would freeze (mouse could move) for about 20 seconds, and then my computer would reboot. This happened both in Linux and Windows.
I thought perhaps there was an issue with my video cards, so I reinstalled drivers and even removed my dual boot, so that I was only debugging in Linux. I have tried OpenSUSE Tumbleweed, Ubuntu 20.04, Ubuntu 22.04, and Windows 10/11. All encounter the same issue: the screen slows down, then freezes with only the mouse moving, then the system reboots. Also, I am only able to install new OSes with nomodeset in my GRUB loader, and then using the built-in graphics on the ASPEED thingy my motherboard has.
When I boot my computer, I get a lot of BIOS ACPI errors, so I looked to disable that (acpi=off in my GRUB config). I then set the max CSTATE to 1 in my GRUB config too (this seems to have helped some Linux Ryzen users).
After I reinstalled Ubuntu 20.04, my computer seemed to be stable (for 2 days), but then the crashes came back.
Not sure how to go about debugging this and would appreciate anyone’s help!
Given that this is happening under multiple operating systems, you’re almost certainly dealing with a hardware issue. When I hear random full system crashes, my mind instantly goes to RAM or PSU issues.
Under a Linux distro, check dmesg/journalctl for errors after recovering from a crash. They may have useful hints about what is causing the crashes.
Definitely try running a full 4 passes of Memtest86 on your memory; Nemix brand RAM is very hit or miss.
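If it helps, here is a minimal sketch of how the log check could be automated. It reads the kernel messages from the previous boot with `journalctl -k -b -1 --no-pager` and keeps only lines that look like hardware trouble. The pattern list and both function names are my own assumptions for illustration, not an exhaustive set, and it only works if the journal is persistent (otherwise there is no earlier boot to read).

```python
import re
import subprocess

# Substrings that commonly accompany hardware faults in kernel logs.
# This list is an illustrative assumption, not a complete catalogue.
HINT_PATTERNS = [
    r"machine check",
    r"\bmce\b",
    r"hardware error",
    r"\bedac\b",
    r"nvrm: xid",
    r"pcie bus error",
]
HINT_RE = re.compile("|".join(HINT_PATTERNS), re.IGNORECASE)


def find_hardware_hints(lines):
    """Return the log lines that match any of the hint patterns."""
    return [line.rstrip("\n") for line in lines if HINT_RE.search(line)]


def previous_boot_kernel_log():
    """Read kernel messages from the boot before the current one.

    Requires a persistent journal; without one journalctl has no
    earlier boot to show and this returns an empty list.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()
```

After a crash and reboot, `find_hardware_hints(previous_boot_kernel_log())` gives the shortlist to paste into the thread. An empty result does not prove the hardware is fine: an instant power loss usually leaves nothing in the log at all.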
I’ll get memtest back on my OS tonight or tomorrow and get the journalctl from the last crash.
There are two types of crashes I am experiencing:
an instant black screen → reboot
slowdown → screen freeze → only mouse able to move → then reset.
I have a ton of memory (512GB), so it would take about 1 day to do one iteration. I highly doubt my computer will survive that long before crashing, but I will try it using Passmark’s memtest.
I unfortunately do not have another PSU, but other users have experienced some issues with the Silverstone ST1500-TI in the past. I would have to get a 1500W+ PSU and those ain’t cheap. Would a power supply tester be a good idea instead?
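For planning purposes, the time budget can be sketched as below. It takes the "about 1 day per pass at 512GB" figure from this thread as the reference point and assumes time per pass scales linearly with capacity, which is only a rough approximation. The function name and defaults are made up for illustration.

```python
def memtest_hours(capacity_gb, passes, ref_hours_per_pass=24.0, ref_capacity_gb=512):
    """Estimate total test time in hours.

    Assumes time per pass grows linearly with installed capacity,
    anchored to a reference measurement (here: ~24 h for 512 GB).
    """
    hours_per_pass = ref_hours_per_pass * capacity_gb / ref_capacity_gb
    return hours_per_pass * passes


one_pass = memtest_hours(512, passes=1)
four_passes = memtest_hours(512, passes=4)
print(f"1 pass:   {one_pass:.0f} h")
print(f"4 passes: {four_passes:.0f} h ({four_passes / 24:.0f} days)")
```

So the full four passes suggested above would be roughly a four-day run on this much memory, if the one-day estimate holds.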
For Memtest you don’t need anything more than the memory and the fans plugged in. You can use the onboard video out for display. Removing components that aren’t completely necessary for basic function is a good place to start troubleshooting overall, even if it does pass a full round of Memtest.
With all that extra stuff removed, especially the GPUs, you can get away with a much lower wattage PSU if you have one of those around.
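To put rough numbers on that, here is a back-of-the-envelope power budget. All wattages are nominal ratings that I am assuming for illustration (280W TDP for the 3975WX, 350W reference board power per 3090, and a pure guess of 150W for everything else), not measured draw, and factory-overclocked cards and transient spikes can go higher. The 30% headroom is also just an assumption.

```python
# Nominal figures, assumed for illustration only: vendor ratings, not measured draw.
COMPONENT_WATTS = {
    "Threadripper Pro 3975WX": 280,  # rated TDP
    "RTX 3090 #1": 350,              # reference board power; OC cards may be higher
    "RTX 3090 #2": 350,
    "board, RAM, NVMe, pump, fans": 150,  # rough guess
}


def psu_budget(components, headroom=0.3):
    """Return (summed nominal watts, suggested PSU watts with headroom)."""
    total = sum(components.values())
    return total, round(total * (1 + headroom))


# Same build with the GPUs pulled, as suggested for the Memtest run.
without_gpus = {k: v for k, v in COMPONENT_WATTS.items() if "3090" not in k}

full_total, full_psu = psu_budget(COMPONENT_WATTS)
bare_total, bare_psu = psu_budget(without_gpus)
print(f"full build:   ~{full_total} W nominal, ~{full_psu} W PSU")
print(f"without GPUs: ~{bare_total} W nominal, ~{bare_psu} W PSU")
```

Under these assumptions the stripped-down system fits comfortably on a far smaller supply than the full build needs, which makes a borrowed or spare PSU a cheap way to rule the current one in or out.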
I feel your pain on how long it will take, I am going to have 256GB worth to test next week once my new modules are delivered, but it’s a necessary step when sussing out issues. Personally I don’t even bother booting into an OS when I have new RAM until it survives a full round of Memtest. Hopefully (or not hopefully, depending on how you look at it) you will see memory errors popping up within the first 24 hours.
A consumer-grade PSU tester really only proves “does this rail currently show the correct number when plugged into a tester”. Your PSU might have an issue that only shows up when it is fully loaded with all those high-powered components, which is much harder to prove. A more capable PSU tester is likely going to cost much more than a high-wattage PSU.
coz when i did an apt update the other day and then ran a kerboroaster, my system locked up with a black screen and then bsod. took about 10 seconds in total.
everything was working fine beforehand…
end result: i had to do an apt full-upgrade/apt-get dist-upgrade.
everything seemed fine after.
I’m unfortunately away for the holidays now. But when I get back will take apart the build, memtest, test the gpus and look for possible leaks that could have affected it.
I’m actually considering un-watercooling/SFXing this build since it’s really a pain in the butt to work with and I won’t be traveling much anymore (yay remote work).
Have a happy holidays and I’ll update this post with what I find.
Ah I have tried installing 4 distros from scratch - Windows, Ubuntu 22.04/20.04, OpenSUSE Tumbleweed, and Clear Linux. All had the same crashing tendencies