Hey,
New to the forum, sorry if this is in the wrong topic.
I am running a Linux server that crashes regularly (recently daily), and a year of trying to debug this thing has left me thoroughly frustrated.
Luckily this is not a remote machine, so resetting it is a single button press, but it is getting very inconvenient.
Does anyone have any tips for determining what is crashing my server? The systemd logs have naturally proved useless; they just stop shortly before the server does.
Failing that, does anyone know how to enable a watchdog to reset it when the OS borks?
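On the watchdog question: systemd can ping a hardware watchdog itself if the board exposes one. Here is a rough sketch, assuming a /dev/watchdog device shows up on this machine (not verified for this board) and that a 30 second timeout is acceptable. The function name and the drop-in filename are made up for illustration.

```shell
# Hypothetical helper: write a systemd drop-in that enables the hardware watchdog.
# $1 is the target directory; on a real system that would be
# /etc/systemd/system.conf.d (needs root).
write_watchdog_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # RuntimeWatchdogSec makes systemd (PID 1) ping the watchdog device regularly;
    # if the pings stop for 30s, the hardware resets the machine.
    cat > "$dir/10-watchdog.conf" <<'EOF'
[Manager]
RuntimeWatchdogSec=30s
EOF
}

# Check first that a watchdog device is present at all.
ls /dev/watchdog* 2>/dev/null || echo "no watchdog device found"
```

Run the function as root with /etc/systemd/system.conf.d as the argument, then reboot so the setting takes effect. This only resets the box when systemd itself stops responding, so it is a workaround for the inconvenience, not a diagnosis.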
Well, since systemd isn't any help, the only other advice I can give you is to find out whether you have a hardware or a software problem. If I knew which distro of Linux you are running and what components are in your computer, I might be able to come up with more ideas.
I am not sure whether it is software or hardware; maybe a weird interaction between the two.
Software:
Arch Linux, no AUR packages, archzfs repos added.
Docker
QEMU/KVM hypervisor running too
Hardware:
Ryzen 5 1600,
ASUS x370 prime motherboard,
16GB of Corsair non-ECC RAM,
(I was on a tight budget)
Corsair Force MP500 NVMe disk
4x 4TB drives forming a single RAID-Z1 pool
Asus 10G ethernet adapter
XFX 500W PSU
GPU? Ha... this thing doesn't even have a video out!
That doesn't change the fact that most people who care about stability and uptime don't run Arch on servers.
That's perfectly fine. If you switch to Ubuntu, it has native ZFS support from Canonical.
As far as actual help goes, though, do you have a spare disk somewhere that you can install Ubuntu or CentOS on? If you can, go do that and see if you still get daily crashes. I'm inclined to blame Arch, but don't want to until we rule out an upstream or hardware issue. Have you run a memtest and a system bench?
Are you configured for persistent systemd logs so you can check previous sessions? If not, do that and look again at journalctl -b -1 (the boot before the current one) for your culprit. It may not show anything, but it might help.
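For reference, a minimal sketch of checking for persistent logs and pulling up the tail end of the previous boot. It assumes journald is on its default Storage=auto setting, and the helper name is made up. Note that a negative offset is what selects earlier boots: -b -1 is the boot before the current one.

```shell
# Hypothetical helper: show the last lines logged before the previous shutdown or crash.
# -b -1 selects the boot before the current one (-b 0 is the current boot).
prev_boot_tail() {
    journalctl -b -1 -n "${1:-200}" --no-pager
}

# With the default Storage=auto, journald only keeps logs across reboots
# if /var/log/journal exists.
if [ -d /var/log/journal ]; then
    echo "persistent journal directory exists"
else
    echo "journal is volatile; create /var/log/journal as root and restart systemd-journald"
fi
```

After the next crash, run prev_boot_tail (optionally with a line count) and read upwards from the end. journalctl --list-boots shows which boots the journal actually has.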
Additionally, get a display and hook it up. If your crash is a kernel panic, and you don't have a display server running, it will probably print out a panic message that should help you trace the problem module.
Both of these require me to hook up a GPU and run some tests. Looks like I will be running some tests over the weekend. Any recommendations on a system bench utility?
As for running Ubuntu... I used to. I explicitly installed Arch because... why did I install Arch? I forget...
So there's the GCC compile test that will tell you if you have the Ryzen segfault bug. Since you're encountering regular problems with a 1600, I recommend trying that.
You could also just simulate load (to see if it's a stable CPU in general) with stress-ng.
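Something along these lines would do for the stress-ng run. Treat it as a sketch: the worker counts and memory sizes are guesses for a 6-core, 16GB box rather than tuned values, and the wrapper name is made up.

```shell
# Hypothetical wrapper around stress-ng; adjust the numbers to the machine.
burn_in() {
    duration="${1:-30m}"
    # --cpu 0        : one CPU worker per online CPU
    # --vm 2         : two memory workers, each working on 2G
    # --verify       : check results where the stressor supports it
    # --metrics-brief: print a short summary at the end
    stress-ng --cpu 0 --vm 2 --vm-bytes 2G --verify \
        --timeout "$duration" --metrics-brief
}
```

Call it as burn_in 1h and keep dmesg -w open in a second terminal so that any kernel complaints are on screen if the box goes down mid-run.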
Other benchmark tools include Blender and Unigine, but those typically require a display server and a GPU.
Yeah... this is the second Ryzen 5 1600 I've had in this server. The first one did have the segfault bug; the second one is a replacement from AMD, and was thoroughly tested for the bug when I installed it.
Could be a memory issue - amongst other tests, I'd run memtest (often installed as one of the GRUB menu options) for 24 hours. I'd also try to get the details of the messages that occur just before or during the crash; without a video card, you're looking at setting up a serial port console and using a separate laptop to capture the output.
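A rough sketch of the serial console setup, with some assumptions: the server has a usable serial port at ttyS0, the laptop uses a USB serial adapter at /dev/ttyUSB0, and both ends run at 115200 baud. The variable and function names are made up for illustration.

```shell
# Kernel command line additions for the server (add them in the bootloader config).
# ttyS0 and 115200 baud are assumptions; use whatever port the board actually has.
SERIAL_CONSOLE_ARGS="console=tty0 console=ttyS0,115200n8"
echo "append to kernel command line: $SERIAL_CONSOLE_ARGS"

# On the capturing laptop: hypothetical helper that attaches to the adapter.
# screen -L also writes the session to a log file, so the panic text survives.
capture_console() {
    screen -L "${1:-/dev/ttyUSB0}" 115200
}
```

Leave capture_console running on the laptop until the next crash; the last lines in the log are the interesting ones.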
Arch may not be the best choice for a server, but it shouldn't be crashing that frequently.
ECC is a bit of a trigger for me - I'd argue that ECC memory is important whenever you're creating and storing data. But if you disagree with me, that's fine; I'll just not trust my data with anything you run.
You overvalue ECC. It only corrects certain types of errors, and doesn't even warn against all errors either, so while it's definitely nice to have, I've run plenty of systems without ECC and I've been just fine. If you're working in a place with lots of cosmic rays, radiation or powerful magnetic fields, it's probably "required", but otherwise it's an option.
That said, running a memtest is never a bad thing.
Call me mad, but I'd rather trust my nearly 30 years of running servers over the opinion of some stranger on the Internet. I've seen (in log files) many, many corrected errors.
That's fine, I'm sure you run a stable OS, with stable and trusted packages, on hardware that has the right support and redundancy. But we're not talking about your setup.
For this setup, it's obviously not really needed. Nothing wrong with that necessarily.