My system had a water cooler installed that started to malfunction. As a result the system crashed a few times due to an overheated CPU before I had the chance to figure this out and replace it with a good air cooler.
However, since then my system reboots or shuts down at arbitrary times, sometimes even when I am only in the BIOS settings. No log ever gives a clue about what is going on (Ubuntu 18.04; the log files don’t mention anything like a kernel panic before the reboot/shutdown).
System is X299, ASUS TUF Mark I, 7820X, 32 GB, 1080Ti, and I use it solely for machine-learning projects. There seems to be no direct link to high CPU usage (I tried to reproduce it with tools like stress-ng but failed).
Is there anything else besides a malfunctioning CPU that could explain this (before I replace the wrong part)?
P.S. The power supply is a Corsair AX860i, and at least its self-test gives a green light.
I want to say that’s unlikely, hopefully, because CPUs will thermally shut off if they are in trouble and save themselves.
What was the malfunction with the old AIO specifically?
Is the new one mounted properly? Could still be going over temp and shutting down out of safety?
Have you tried reseating the CPU (take it out of the socket and put it back; it's a good time to check the underside of the CPU and the pins in the socket, just in case) and clearing the BIOS?
Systems should shut down before the CPU gets hot enough to die.
I’d agree with the other poster. Reseat your CPU. Check if any pins are borked, re-apply your CPU cooler.
If you’re still having no luck, maybe try a different CPU (or your CPU in a different mobo).
I’d even take the RAM out and test each stick individually. Maybe you took it out and put it back in, or knocked it when changing the cooler. RAM can be so finicky that even a tiny bit of dust in the slot can keep a system from POSTing. Worth a try.
The malfunction was indeed only the old AIO not working. Before that it was a very stable system. When I investigated the AIO problem, I saw that the CPU had been shutting down at temperatures above 95 degrees Celsius.
I’m now monitoring the CPU all the time ;), and it crashes even when the temperature is below 40 degrees. The air cooler is doing its job fine.
It also crossed my mind that I made some installation mistake when I installed the new Dark Rock Pro cooler, but since it is working fine and I didn’t reseat the CPU, I’m not sure what could have gone wrong.
I guess that is indeed the least expensive path. I will buy some extra cooling paste and see if I can fix it with just some cleaning up and reseating. Thanks!
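For anyone who wants a record of the last temperature before a sudden power-off, here is a minimal sketch of a logger, not necessarily how the monitoring above was done. It assumes a Linux system that exposes sensors as `/sys/class/hwmon/hwmon*/temp*_input` files (values in millidegrees Celsius); the function names and the `temps.log` file name are made up for illustration.

```python
import glob
import os
import time
from datetime import datetime


def millidegrees_to_celsius(raw: str) -> float:
    """hwmon temp*_input files hold an integer in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_temperatures(pattern: str = "/sys/class/hwmon/hwmon*/temp*_input") -> dict:
    """Return {sensor file path: temperature in Celsius} for readable sensors."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = millidegrees_to_celsius(f.read())
        except (OSError, ValueError):
            continue  # sensor missing or unreadable right now; skip it
    return readings


def log_forever(log_path: str = "temps.log", interval_s: float = 1.0) -> None:
    """Append the hottest reading once per interval (hypothetical log file name)."""
    with open(log_path, "a") as log:
        while True:
            temps = read_temperatures()
            hottest = max(temps.values()) if temps else float("nan")
            log.write(f"{datetime.now().isoformat()} max={hottest:.1f}C\n")
            # Flush and fsync so the last line survives an abrupt power loss.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a terminal and reading the tail of `temps.log` after the next crash shows whether the machine was anywhere near a thermal limit when it went down. It obviously cannot cover crashes that happen inside the BIOS.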
I don’t think you did because the system shuts down before any real damage is done.
Those X299 chips are pretty big so I’d say reset the BIOS to default, re-seat the CPU and make sure the torque on all the cooler’s screws is somewhat equal, on both the mounting and the cooler itself.
The only thing that might’ve gotten damaged is the memory controller. On my old laptop, which ran too hot for a while, memtest86+ sometimes spits out errors even with a new RAM kit, but the laptop never crashes during use. So if you have doubts, try running memtest and see what happens.
I get a shutdown/reboot even when just using the BIOS => nothing to do with Linux or drivers.
I get the reboot even with only 1 memory stick => memory itself is fine.
Tested the power supply separately under load, no issues => power supply unit is fine.
Open test bench with an extra fan on the VRM => very low temperatures but still a reboot. Not a heating issue.
I also noticed that the first restart usually happens after around 5-7 minutes (regardless of whether I go into the BIOS or the OS). The next one comes after fewer minutes. So this hints more at a motherboard/capacitor issue than a CPU issue, I would guess? Investigation continues…
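If someone wants to put numbers on the "time until restart" observation instead of watching the clock, a rough sketch is a heartbeat file that records the uptime every few seconds. It assumes Linux (`/proc/uptime` and `/proc/sys/kernel/random/boot_id`), only covers crashes that happen inside the OS, and the function names and `heartbeat.log` file name are hypothetical.

```python
import os
import time


def append_heartbeat(path: str, session_id: str, uptime_s: float) -> None:
    """Append one 'session uptime' line and force it to disk."""
    with open(path, "a") as f:
        f.write(f"{session_id} {uptime_s:.0f}\n")
        f.flush()
        os.fsync(f.fileno())  # so the line survives an abrupt power cut


def last_uptime_per_session(lines) -> dict:
    """Return {session_id: last recorded uptime in seconds}, in first-seen order."""
    last = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines truncated by a power cut
        session, uptime = parts
        try:
            last[session] = float(uptime)
        except ValueError:
            continue  # skip lines with a damaged number
    return last


def run(path: str = "heartbeat.log", interval_s: float = 5.0) -> None:
    """Write a heartbeat every interval until the machine goes down."""
    # The kernel generates a fresh boot_id on every boot, so it works as a session key.
    with open("/proc/sys/kernel/random/boot_id") as f:
        session_id = f.read().strip()
    while True:
        # The first field of /proc/uptime is seconds since boot.
        with open("/proc/uptime") as f:
            uptime_s = float(f.read().split()[0])
        append_heartbeat(path, session_id, uptime_s)
        time.sleep(interval_s)
```

After a few crashes, `last_uptime_per_session(open("heartbeat.log"))` gives roughly how long each boot survived, which makes it easy to see whether the interval really shrinks from one restart to the next.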
Just a quick update: it indeed turned out not to be a CPU issue; the power supply was to blame. Apparently it is a known issue with the Corsair AX860i that it can be a little too fussy and decide to shut down when it shouldn’t. A simple self-test, even with some load, won’t show this malfunction.
I have now temporarily connected a very old power supply, and everything is running fine again. Perhaps power supplies just aren’t meant to be smart.
Last update from my side. I reported the issue to Corsair, and even though I did not have the receipt anymore, they sent me a refurbished AX1200i unit.
Will have to make some small alterations to my case due to the larger size, but everything is up and running. Also happy with Corsair customer support for helping me out.