Debugging a crashy server

Hey,
New to the forum, sorry if this is in the wrong topic.

I am running a Linux server that suffers regular crashes (recently daily), and I have been driven to frustration trying to debug it over the last year.

Luckily this is not a remote machine, so resetting it is a single button press, but it is getting very inconvenient.

Does anyone have any tips for determining what is crashing my server? The systemd logs have naturally proved useless; they just stop shortly before the server does.

Failing that, does anyone know how to enable a watchdog to reset it when the OS borks?

Many thanks in advance.
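
On the watchdog question, one possible route is systemd's own hardware watchdog support. What follows is a minimal sketch, not a tested recipe for this machine: it assumes the board exposes a watchdog device (/dev/watchdog) with a working driver, and the drop-in file name 10-watchdog.conf, the 30-second timeout, and the write_watchdog_conf helper are all illustrative choices.

```shell
#!/bin/sh
# Sketch: arm the hardware watchdog through systemd.
# Assumptions: /dev/watchdog exists; the file name and 30s timeout are arbitrary.
CONF_DIR="${CONF_DIR:-/etc/systemd/system.conf.d}"

write_watchdog_conf() {
    mkdir -p "$CONF_DIR" || return 1
    # RuntimeWatchdogSec= makes systemd ping the watchdog device regularly.
    # If the system hangs and the pings stop, the hardware resets the machine.
    printf '[Manager]\nRuntimeWatchdogSec=30s\n' > "$CONF_DIR/10-watchdog.conf"
}

# Usage (as root): write_watchdog_conf, then reboot so the setting is applied.
```

A watchdog only shortens the downtime; it does nothing to explain the crash, so it is best treated as a stopgap while debugging continues.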

Well, since systemd isn’t any help, the only other advice I can give you is to find out whether you have a hardware or a software problem. If I knew which Linux distro you are running and what components are in your computer, I might be able to come up with more ideas.

I am not sure whether it is software or hardware; maybe it is a weird interaction between the two.

Software:
Arch Linux, no AUR packages, archzfs repos added.
Docker
QEMU/KVM hypervisor running too

Hardware:

Ryzen 5 1600,
ASUS x370 prime motherboard,
16GB of Corsair non-ECC RAM,
(I was on a tight budget)
Corsair FORCE MP 500 NVMe disk
4 x 4TB drives forming a single RAID-Z1 pool
Asus 10G ethernet adapter
XFX 500W PSU

GPU? - ha… this thing doesn’t even have a video out!

Needs more info.

OS?
last update?
running services? (out of defaults)
hw info?
output of journalctl -b -1 -p err after a crash (logs from previous boot)

ECC doesn’t matter anyway; you don’t need it.

I wouldn’t advise this for a server at all, nor ZFS from a third-party repo.

I have Arch running on other servers without issue; I’m confident this is something unique to this machine.

As for ZFS, it’s too late now; there’s over 6TB of data stored on it that I have nowhere else to put.

That doesn’t change the fact that most people who care about stability and uptime don’t run Arch on servers.

That’s perfectly fine. If you switch to Ubuntu, it has native support for ZFS from Canonical.

As far as actual help though, do you have a spare disk somewhere that you can install Ubuntu or CentOS on? If you can, go do that and see if you still get daily crashes. I’m inclined to blame Arch, but don’t want to until we rule out an upstream or hardware issue. Have you run a memtest and system bench?

Are you configured for persistent systemd logs so you can check previous sessions? If not, do that and look again at journalctl -b -1 for your culprit. It may not show, but that might help.
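
For reference, a minimal sketch of making the journal persistent so that logs from the crashed boot survive the reset. It assumes a stock journald setup; the drop-in name 10-persistent.conf and the enable_persistent_journal helper are made up for illustration.

```shell
#!/bin/sh
# Sketch: keep journald logs across reboots via a drop-in file.
# Assumption: journald reads drop-ins from /etc/systemd/journald.conf.d.
JOURNALD_DIR="${JOURNALD_DIR:-/etc/systemd/journald.conf.d}"

enable_persistent_journal() {
    mkdir -p "$JOURNALD_DIR" || return 1
    # Storage=persistent stores logs under /var/log/journal instead of /run.
    printf '[Journal]\nStorage=persistent\n' > "$JOURNALD_DIR/10-persistent.conf"
}

# Usage (as root):
#   enable_persistent_journal && systemctl restart systemd-journald
# After the next crash and reboot:
#   journalctl --list-boots     # confirm that older boots are now kept
#   journalctl -b -1 -p err     # errors logged during the boot that crashed
```

Here -b -1 selects the previous boot, and -p err limits the output to messages of priority "err" and worse.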

Additionally, get a display and hook it up. If your crash is a kernel panic, and you don’t have a display server running, it will probably print out a panic message that should help you trace the problem module.

Both of these require me to hook up a GPU and run some tests. Looks like I will be running some tests over the weekend. Any recommendations on a system bench utility?

As for running Ubuntu… I used to. I explicitly installed Arch because… why did I install Arch? I forget…

So there’s the GCC compile test that will tell you if you have the Ryzen segfault bug. Since you’re encountering regular problems with a 1600, I recommend trying that.

You could also just simulate load (to see if it’s a stable CPU in general) with stress-ng.
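
A hedged sketch of such a load test. The run_cpu_stress wrapper and its DRY_RUN switch are hypothetical conveniences; the stress-ng options themselves (--cpu, --verify, --timeout, --metrics-brief) are standard ones. The 30-minute default is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: CPU load test with result verification.
# Assumptions: stress-ng is installed; worker count defaults to all CPUs.
run_cpu_stress() {
    workers="${1:-$(nproc)}"   # one CPU worker per online processor by default
    duration="${2:-30m}"       # how long to keep the load running
    # --verify makes the stressors check their own results, which can expose
    # silent miscalculations in addition to outright crashes.
    set -- stress-ng --cpu "$workers" --verify --timeout "$duration" --metrics-brief
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"              # only print the command that would be run
    else
        "$@"
    fi
}

# Usage: run_cpu_stress            # all CPUs, 30 minutes
#        run_cpu_stress 4 10m      # 4 workers, 10 minutes
```

If the box locks up or stress-ng reports verification failures under load, that points more towards hardware (CPU, RAM, or PSU) than towards the distro.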

Other benchmark tools include Blender and Unigine, but those typically require a display server and a GPU.

Sounds like a case of Stockholm syndrome. :stuck_out_tongue:

Yea… this is the second Ryzen 5 1600 I’ve had in this server. The first one did have the segfault bug; the second one is a replacement from AMD, and was thoroughly tested for the bug when I installed it.

Okay, so it’s not that.

I would try out Ubuntu or CentOS, just to see if it’s an Arch problem.

The BIOS has an option about the power supply… regular standby or low-power standby. Set it to regular standby. Update the UEFI.

If that fails, disable the C5/C6 low-power states. A common issue with Ryzen is that the chips use less power than power supplies expect, so things get wonky.

/me rolls eyes.

Could be a memory issue - amongst other tests, I’d run memtest (often installed as one of the GRUB menu options) for 24 hours. I’d also try and get the details of the messages that occur just before or during the crash; without a video card, you’re looking at setting up a serial port console and using a separate laptop to capture the output.
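
A serial console is typically enabled with kernel command line parameters such as console=tty0 console=ttyS0,115200n8. That assumes the board has a serial port that shows up as ttyS0 and a 115200 baud link, neither of which is confirmed for this machine. The small helper below, serial_console_args (a hypothetical name), only checks whether a kernel was booted with such a parameter.

```shell
#!/bin/sh
# Sketch: list serial console arguments on a kernel command line.
# Assumption: the serial port is one of the ttyS* devices.
serial_console_args() {
    # Reads /proc/cmdline by default; pass another file to inspect it instead.
    # Prints each console=ttyS... argument, and fails if there is none.
    grep -o 'console=ttyS[0-9][^ ]*' "${1:-/proc/cmdline}"
}

# Usage: serial_console_args || echo "no serial console configured"
```

On the capturing laptop, a terminal program such as screen or minicom attached to its end of the cable at the same baud rate should then show kernel messages up to the moment of the crash.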

Arch may not be the best choice for a server, but it shouldn’t be crashing that frequently.

This isn’t a mission-critical server… obviously. Nor is ECC strictly required for ZFS, again, because it’s not mission critical.

ECC isn’t even required for mission-critical ZFS. If anything, ZFS makes ECC less necessary.

ECC is a bit of a trigger for me - I’d argue that ECC memory is important whenever you’re creating and storing data. But if you disagree with me, that’s fine; I’ll just not trust my data with anything you run :slight_smile:

You overvalue ECC. It only corrects certain types of errors, and doesn’t even warn against all errors either, so while it’s definitely nice to have, I’ve run plenty of systems without ECC and I’ve been just fine. If you’re working in a place with lots of cosmic rays, radiation or powerful magnetic fields, it’s probably “required”, but otherwise it’s an option.

That said, running a memtest is never a bad thing.

Well, technically I ‘overvalue’ my data.

Call me mad, but I’d rather trust my nearly 30 years of running servers over the opinion of some stranger on the Internet. I’ve seen (in log files) many, many corrected errors.

That’s fine, I’m sure you run a stable OS, with stable and trusted packages, with hardware that has the right support and redundant hardware. But we’re not talking about your setup.

For this setup, it’s obviously not really needed. Nothing wrong with that necessarily.

Well Shit!

That’s not a kernel panic,

It took several days to get this error, and it crashed again this morning; this time there were no errors, just a locked-up console.

OK, what do I do with this error?