[SOLVED] Please help me diagnose random system crashes on a home server

I have a server that has been crashing randomly for some time, which is very annoying. I finally got kdump working to try to capture a core dump of these crashes, but then, for once, I happened to actually witness a crash. kexec was never invoked; it’s as if the kernel never panicked at all. The system just reset for no apparent reason. There were no POST errors, the BIOS LED codes all read normal, there were no thermal warnings, and all memory passes memtest. I’m at an absolute loss as to how to identify the cause of the crash and reset, given that the system never actually panics and thus never invokes kdump. In my years of working with computers, I’ve never encountered a system that resets randomly without throwing any error: no beep codes, no BIOS LED abnormalities, and no panic. So strange.

Hopefully somebody here more experienced than I am has seen this kind of thing before and can point me in the right direction for diagnosing it.

ASUS Prime X399-A.
Threadripper 1950X.
Linux 5.18.

A lot could be the issue - try to isolate elements - for example use one memory stick at a time (if possible). Narrow it down as much as you can and don’t swap 2 things at once…

All the memory sticks are less than a month old. Replacing all of them was the first thing I tried.

Surely memtest already ruled out memory failure, right?

Given how random these crashes are, taking the server down by removing needed memory for days on end, just to wait (possibly in vain) for a crash that may or may not have anything to do with memory, seems like a very irrational way to go about it. These resets only happen about once a week or so.

I don’t mean to argue; I’m just hesitant to take my server down for so long with so little chance of finding the cause.

Oops - I missed the memtest part (I was providing tech support to my son at the same time). But note that it only suggests the memory sticks themselves aren’t at fault (it could still be memory settings, connectors, etc.).
Don’t mind the “arguing” - we are both challenging assumptions on this! And I know the “downtime issue”.

So now I’m thinking: try different software - would BSD or Windows work for a test? As a worst case, a much later Linux? I personally run the latest (which may not suit you) - but a large-ish OS difference might help pin it down. Still so many variables - I feel your pain…

The system has never been compatible with Windows. I’ve never worked with BSD before, and I’m not sure it would pick up any failures that Linux wouldn’t. I am running with AER enabled. I’ll look into the feasibility of it.

Perhaps it’s time to upgrade the system to Linux 6.1. I shall start the backup scripts now, and I’ll begin configuring and compiling the new kernel tomorrow afternoon; the backups should be done by then. Assuming all goes well, the new kernel should be running by Thursday morning.

I’ve decided I’m not compatible with Windows, so…
I remember that, very many years ago, the Windows NT beta worked flawlessly where Windows 3.x didn’t, so I’m always a fan of a different OS as a test, if feasible. I feel the system is somewhat more “production”, so I see the challenges. Good luck!

Well, it’s worth noting that until this problem started appearing, the system ran perfectly for years, on kernels 4.10 and every major release since.

PSU age?

About 5 years, going on 6.
EVGA SuperNova series.
It’s fed by an APC SmartUPS I got refurbished from Schneider Electric. The battery is less than a year old. The UPS just self-tested this morning, and all seems good. Mains power issues are not the cause; I would see the anomalies in the UPS’ logs.

If the PSU were failing, wouldn’t it just not turn on?

Wouldn’t AGESA throw some sort of POST error if something wasn’t getting enough power?

The PSU is modular, so replacing it is feasible, so long as I don’t also have to replace all the PSU cables inside the system. That would be a several-hour-long job. There is a lot going on inside this server, with lots of cables routed through cable management in hard-to-reach places.

Is there any safe way I could test the PSU without downing the system for very long? Does a PSU equivalent of Memtest exist?

I’ve never seen a PSU half-fail before, and I’m not sure if I’m seeing it now. Every time I’ve encountered a PSU death, it just stopped working altogether.

Update: can’t seem to compile Linux 6.1 (or newer) on Debian Bullseye. I’d imagine it to be unwise to attempt to upgrade to Bookworm when the system could reset without warning at any second.

Weird questions,

  • where is the server placed?
  • do you happen to sit on a chair that gives you electrostatic discharges when you stand up?

The server is in its own room, a concrete basement/foundation. I access it remotely from my office, which is on a different floor of the building. I try to go near the server as little as possible, because it’s loud. Temperatures are fine; I put a lot of effort into keeping that basement cool.

The server rack is black powder coated steel, and all walls and the floor of the server room are concrete. No fabric or cloth anywhere to be found.

I have a foam air filter on the intake that should further dehumidify the intake air. I check and clean this filter regularly. This is in addition to the building’s own HVAC filters.

I feel like static damage is quite unlikely, as it would’ve made itself obvious, and the misbehavior of the system would be more severe, right?

I had this once. The PC would suddenly do a hard power-off. Then, after 20 seconds or so, when I pressed the power button, it would turn on again.
The problem went away when I installed a new PSU.

A flaky PSU could cause something like this - are you close to using its rated power? Another thought is updating the BIOS - perhaps there was an instability you never encountered until recently (different kernel, usage or something). I tried looking it up, but the Asus website failed me.
Given that downtime is an issue, can you build a temporary replacement? I can’t think of anything more that won’t take a lot of downtime to discover.

OK, that rules out electrostatic discharges (they bit me with my last build and took me forever to diagnose).

The next suspect is going to be memory, then… Do you have sensors for the memory temps? Can you try pointing a fan at them and see if it helps?
The PSU is another suspect, but given that the server reboots itself rather than powering down, and you are not triggering protections, I’d first test the memory thermals.

If you don’t have a fan handy, just opening the case may reduce the inside temps a little.

If that doesn’t help, I would start trying fewer memory modules at a time, but that would take a lot of time and a lot of shutdowns…
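
On the question of memory temperature sensors: a rough way to see what the kernel exposes is to walk the hwmon tree in sysfs, where each `temp*_input` file holds a reading in millidegrees Celsius. The sketch below is only an illustration; the function name `read_hwmon_temps` is made up here, and whether the DIMMs show up at all is an assumption that depends on the modules having a thermal sensor and on a suitable driver (for example jc42) being loaded.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/label": degrees_C} for every readable temp input."""
    root = Path(root)
    readings = {}
    if not root.is_dir():
        return readings
    for chip in sorted(root.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for temp_file in sorted(chip.glob("temp*_input")):
            base = temp_file.name[: -len("_input")]  # e.g. "temp1"
            label_file = chip / (base + "_label")
            label = label_file.read_text().strip() if label_file.exists() else base
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
            readings[f"{chip.name}:{chip_name}/{label}"] = millideg / 1000.0
    return readings


if __name__ == "__main__":
    for sensor, temp_c in read_hwmon_temps().items():
        print(f"{sensor}: {temp_c:.1f} C")
```

Running it from cron every minute and appending the output to a file would leave a temperature trail leading up to the next reset, without taking the server down.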

I will contact EVGA and see if my PSU is still under warranty. If not, it’ll be some time before it can be replaced.
Update: yeah, it’s out of warranty. I’ll update you guys in about a month when I get the PSU swapped out.

You can get 6.1 from backports - enable bullseye-backports in your APT sources (/etc/apt/sources.list or a file under /etc/apt/sources.list.d/) and install with:
sudo apt install linux-image-amd64 -t bullseye-backports

I’m betting on the power supply + motherboard combo not handling transients well.

You can try systester or mprime or stress. … and you can try running them wrapped with the cpulimit tool, which just uses SIGSTOP / SIGCONT. With these tools you can create a scenario where you’re drawing 300W+ for 50 ms, then near nothing for 50-100 ms, then 300W+ for a brief interval, then almost nothing.

Having lots of power up / power down transients should “press the issue”.
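
If the stress tools plus cpulimit are awkward to set up, the same on/off pattern can be approximated with a small script. The following is a minimal sketch, not the cpulimit approach itself: each worker process duty-cycles on its own, and all workers key off the wall clock so their busy windows line up. The names `in_busy_phase`, `burst_worker` and `run_bursts` are made up for this example, and a plain Python spin loop will likely draw less power than mprime, so treat the 50 ms / 100 ms defaults as a starting point rather than a guaranteed 300 W swing.

```python
import multiprocessing as mp
import time


def in_busy_phase(now, busy_s, idle_s):
    """True if wall-clock time `now` falls in the busy part of the cycle."""
    return (now % (busy_s + idle_s)) < busy_s


def burst_worker(busy_s, idle_s, duration_s):
    """Alternate between spinning and sleeping until duration_s has passed."""
    period = busy_s + idle_s
    end = time.time() + duration_s
    while True:
        now = time.time()
        if now >= end:
            break
        if in_busy_phase(now, busy_s, idle_s):
            continue  # spin: this loop itself keeps the core busy
        # Idle part of the cycle: sleep until the next busy window starts.
        time.sleep(period - (now % period))


def run_bursts(workers, busy_s=0.05, idle_s=0.10, duration_s=60.0):
    """Run `workers` bursting processes in parallel; return elapsed seconds."""
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    start = time.monotonic()
    procs = [
        ctx.Process(target=burst_worker, args=(busy_s, idle_s, duration_s))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.monotonic() - start


if __name__ == "__main__":
    # One worker per logical CPU, bursting for ten minutes.
    elapsed = run_bursts(workers=mp.cpu_count(), duration_s=600.0)
    print(f"finished after {elapsed:.0f} s")
```

If the box resets within minutes under this kind of load but survives a steady full load, that points toward transient handling rather than total power draw.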

When the new unit comes in, is it advisable to just change the PSU itself and none of the cables? Or is it a big deal to use fresh cables with a fresh PSU?

There are a lot of these power cables, all routed through cable management inside this case. It would take several additional man-hours to change them out. Would it be wise to avoid that, for the sake of avoiding downtime? This is my first time swapping a PSU in a downtime-sensitive situation.

Cables can be specific to brand (possibly model, not sure about date of manufacture), so not a good idea.

As you are looking for PSU issues, you could leave all cables in situ, disconnect mobo and temporarily use the new PSU to power that. (so using two PSUs at the same time).

If it makes a difference, you can then worry about cable management…

Try disabling C-states and/or power saving in the BIOS and see if the problem goes away.
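
Before rebooting into the BIOS, it can help to see which idle states Linux is actually using. The kernel lists them per CPU under `cpuidle` in sysfs, and this sketch just reads them out. The function name `list_idle_states` is invented for the example, and note the assumption that this only covers the OS-visible idle states, not every power-saving option the firmware offers.

```python
from pathlib import Path


def list_idle_states(cpu=0, root="/sys/devices/system/cpu"):
    """Return one dict per cpuidle state of the given CPU, ordered by index."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []  # no cpuidle driver active, or a non-Linux system

    def read(state_dir, field, default):
        try:
            return (state_dir / field).read_text().strip()
        except OSError:
            return default

    states = []
    prefix = len("state")
    for state_dir in sorted(base.glob("state*"), key=lambda p: int(p.name[prefix:])):
        states.append({
            "index": int(state_dir.name[prefix:]),
            "name": read(state_dir, "name", "?"),
            "disabled": read(state_dir, "disable", "0") != "0",
            "usage": int(read(state_dir, "usage", "0")),  # times entered
        })
    return states


if __name__ == "__main__":
    for s in list_idle_states():
        flag = "disabled" if s["disabled"] else "enabled"
        print(f'state{s["index"]} {s["name"]}: {flag}, entered {s["usage"]} times')
```

Each state directory also has a writable `disable` file; writing 1 to it as root turns that state off for that CPU until the next reboot, which is a low-downtime way to try this suggestion before changing BIOS settings.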
