I recently set up a server running Jellyfin and a Caddy reverse proxy on Ubuntu Server 23.04.
However, I noticed that the system randomly crashes about once a day.
Specs are as follows:
Intel Core i5-9400F
ASUS B360-F
8 GB DDR4, single stick, 2400 MT/s
2x WD Green 120GB SSD (media is mounted on this host via an SMB share)
Intel Arc A750 GPU
FSP Hexa 85+ 350 W PSU
What might be the problem and how do I troubleshoot this thing?
If you look closer on that page there’s a test section.
Re: forcing/triggering crashes… not sure. We don’t know whether it’s hardware or software… you could try stressing the system with other burn-in tools.
Since you mentioned it getting stuck: once you’ve checked the crash logs, look for watchdog messages in them.
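A rough sketch of that log check, in Python. It assumes the journal is persistent, so that `journalctl -k -b -1` can show kernel messages from the boot that crashed. The keyword list and the function names are my own guesses at what is worth flagging, nothing official.

```python
import re
import subprocess

# Keywords that often show up around hangs and hardware faults.
# This list is a guess; extend it as needed.
PATTERNS = re.compile(
    r"watchdog|soft lockup|hard lockup|kernel panic|machine check|mce:"
    r"|hung_task|blocked for more than",
    re.IGNORECASE,
)


def find_suspicious(lines):
    """Return the lines that match any of the keywords above."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def previous_boot_kernel_log():
    """Kernel messages from the previous boot (empty if journalctl is missing)."""
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "-1", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    for hit in find_suspicious(previous_boot_kernel_log()):
        print(hit)
```

If nothing shows up, that does not rule anything out: a hard hang can kill the box before the last messages reach disk.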
Are you using the Intel Arc for transcoding? You could also try stressing the GPU part of the system by starting and stopping Jellyfin playback a lot somehow… not sure how to script it.
Right now, I’m leaning towards either the power supply and motherboard not handling big transients well… or the Intel Arc drivers… but this is wild speculation.
Yes, but it has never crashed while I was playing something back.
The first time I triggered a crash from the console, the system hung immediately.
The second time I triggered a crash, it got stuck at "pstore: crypto_comp_compress failed, ret = -22!".
The GRUB config file mentioned in that section for changing the memory size doesn’t exist on my system.
Leaning more towards hardware and power stuff. That thing is supposed to return the length of the compressed data, so -22 makes no sense unless it’s some wonky undocumented error code.
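One thing worth checking before reading too much into it: kernel functions commonly report failure as a negated errno value, so -22 may simply be an error code rather than a length. A small Python sketch to decode it (the helper name `describe_kernel_ret` is made up for illustration):

```python
import errno
import os


def describe_kernel_ret(ret):
    """Interpret a negative kernel return value as a negated errno."""
    code = -ret
    name = errno.errorcode.get(code, "unknown")
    return f"{ret} -> {name} ({os.strerror(code)})"


# On Linux this prints: -22 -> EINVAL (Invalid argument)
print(describe_kernel_ret(-22))
```

If that reading is right, the message may point at how pstore compression is set up rather than at a corrupted value, though that is a guess from the number alone.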
Can you fiddle with the power options or the CPU governor, so it doesn’t go all the way down to C10 or those fancy power-saving states? Do you have a screen and keyboard, so you can check the UEFI for CPU power states?
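To see from the OS side what the box is currently doing, something like the sketch below could help. It only reads the standard cpufreq and cpuidle sysfs files; it assumes those directories exist on this kernel, and the function names are mine. The idle state names you get back depend on the cpuidle driver.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_text(path):
    """Read a sysfs file, returning None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def state_index(state_dir):
    """Numeric suffix of a 'stateN' directory, so state10 sorts after state2."""
    suffix = state_dir.name[len("state"):]
    return int(suffix) if suffix.isdigit() else -1


def power_summary(root=CPU_ROOT, cpu="cpu0"):
    """Collect the governor and the idle states for one CPU."""
    base = Path(root) / cpu
    states = []
    for state_dir in sorted(base.glob("cpuidle/state*"), key=state_index):
        states.append({
            "dir": state_dir.name,
            "name": read_text(state_dir / "name"),
            "disabled": read_text(state_dir / "disable") == "1",
        })
    return {
        "governor": read_text(base / "cpufreq" / "scaling_governor"),
        "idle_states": states,
    }


if __name__ == "__main__":
    summary = power_summary()
    print("governor:", summary["governor"])
    for s in summary["idle_states"]:
        print(s["dir"], s["name"], "disabled" if s["disabled"] else "enabled")
```

This does not replace looking in the UEFI, since firmware can limit package C-states independently of what the kernel lists.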
(btw, changing this just as a diagnostic thing, not as a permanent solution).
Clearing the UEFI settings might also be a thing you want to try (not sure if you were tuning it to save on your power bill).
Maybe something is really off with the crash handling and triggering code (oh Ubuntu how you disappoint).
C10 is a save-a-lot state. Try setting the CPU governor to performance, disable C8/C10, and just use the box for a day or two to see what happens (unless someone has a better suggestion).
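If the UEFI doesn’t expose those options, a sketch like this could do the same from the running system through sysfs. Assumptions: the idle states are really named "C8" and "C10" by the cpuidle driver on this box, "performance" is an available governor, and the script runs as root. The function name and the dry-run default are my own choices. Nothing here survives a reboot, which suits a temporary diagnostic change.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")
DEEP_STATES = {"C8", "C10"}  # assumed names; check the "name" files first


def apply_diagnostic_settings(root=CPU_ROOT, governor="performance",
                              deep_states=DEEP_STATES, dry_run=True):
    """Set the governor and disable the named idle states on every CPU.

    Returns the list of (path, value) writes. With dry_run=True nothing
    is written, the list only shows what would change.
    """
    writes = []
    for cpu in sorted(Path(root).glob("cpu[0-9]*")):
        gov = cpu / "cpufreq" / "scaling_governor"
        if gov.exists():
            writes.append((gov, governor))
        for state in cpu.glob("cpuidle/state*"):
            name_file = state / "name"
            if name_file.exists() and name_file.read_text().strip() in deep_states:
                writes.append((state / "disable", "1"))
    if not dry_run:
        for path, value in writes:
            path.write_text(value)  # needs root on a real system
    return writes


if __name__ == "__main__":
    # Dry run by default: print the plan instead of touching sysfs.
    for path, value in apply_diagnostic_settings(dry_run=True):
        print(f"would write {value!r} to {path}")
```

Run it once as a dry run, look at the plan, then call it with `dry_run=False` if the list looks sane.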
Try looking up other burn-in options in the meantime.
I think this works as a two-for-one ddx: you spend two days, and if crashes increase you know it’s not software, and if crashes decrease you also know it’s not software, since either change would track the power-state settings… (shitty that it’d take two days).