Need Help, System Randomly crashing

I set up a Jellyfin server a while ago, and the host has been randomly crashing and going offline.
The operating system is Ubuntu 23.04 Server.

System specs are as follows:

  • Intel Core i5-9400F
  • 8GB DDR4 RAM
  • ROG Strix B360-F Motherboard
  • 2x WD Green 120GB SSD
  • Intel Arc A750
  • FSP Hexa 85+ 350W Power Supply

Troubleshooting done so far:

  • Memtest86+ came back without errors
  • fsck on the boot drive came back with no errors found
  • Disabling Ubuntu’s automatic reboot for updates didn’t help

Ongoing attempts to catch the error:
I have set up a YouTube livestream capturing its output at the link below:
The timestamp in the top right is in UTC+8.

The screenshot below shows the system logs from before it last crashed.

Can you get kdump working?

How do I do this? Can you provide a link to a guide on how to do this on Ubuntu?

Here’s a starting point.

These may be helpful in the future as well:

@NightMoose @mathew2214
Thanks for the help, I have gotten kdump to work.
Is there anything I can do outside of waiting for it to crash now?
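While waiting, one thing that could be checked is whether the crash kernel is actually armed. Below is a minimal Python sketch, assuming a Linux host that exposes /sys/kernel/kexec_crash_loaded and /proc/cmdline; the function names are made up for illustration. It only confirms that kdump is loaded, not that a dump will be written for every kind of failure.

```python
from pathlib import Path


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the kernel reports that a crash (kdump) kernel is loaded."""
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as "not loaded".
        return False


def crashkernel_reserved(cmdline_path="/proc/cmdline"):
    """Return True if the boot command line carries a crashkernel= parameter."""
    try:
        params = Path(cmdline_path).read_text().split()
    except OSError:
        return False
    return any(p.startswith("crashkernel=") for p in params)


if __name__ == "__main__":
    print("crash kernel loaded:", crash_kernel_loaded())
    print("crashkernel= on command line:", crashkernel_reserved())
```

If either check prints False, memory for the crash kernel was probably never reserved or the kdump service did not load it, so a panic would not be captured.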

After about 2 months, it finally crashed again.
It likely wasn’t a kernel panic, since kdump didn’t kick in even though the system did in fact crash; it just hung.

After a few days it crashed again. This time I was able to catch it, and it is stuck on this screen (viewed through the capture card I attached).

All drives are detected in this screenshot

Have you tried updating the BIOS? It likely crashed because of something else, but it could be anything from a video driver issue to a hardware fault.

Update: I finally managed to catch the actual crash happening on video.
It was… uneventful

Hi, I know you tested your RAM, but have you tried more RAM or a different set? Have you tried turning off XMP and running the RAM at stock? RAM overclocking can sometimes cause glitches in a system.

Have you checked the OS or the OS drive? There might be some corrupted data or a bad block on the SSD.

I noticed that the server login said Jellyfin. 8GB of RAM is the minimum for Jellyfin, and you might need a little extra. If it is doing transcoding work, you will definitely need more RAM.

You might also consider a bigger power supply. A quick trip over to PCPartPicker to model your system (Part List - Intel Core i5-9400F, Arc A750 - PCPartPicker) shows that the max power draw on those components is 385W. If it is doing transcoding work, it will sail past your power supply’s 350W.
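The arithmetic behind that comparison can be sketched in a few lines of Python. The 385W and 350W figures are the ones quoted above; the function names are made up for illustration, and the estimate is a worst-case total rather than a measured draw.

```python
def psu_headroom_w(estimated_peak_w, psu_rating_w):
    """Watts left over at the estimated peak (negative means over the rating)."""
    return psu_rating_w - estimated_peak_w


def load_fraction(estimated_peak_w, psu_rating_w):
    """Estimated peak draw as a fraction of the PSU's rated output."""
    return estimated_peak_w / psu_rating_w


if __name__ == "__main__":
    peak_w = 385    # estimated maximum draw quoted from the parts list
    rating_w = 350  # rated output of the power supply
    print(f"headroom: {psu_headroom_w(peak_w, rating_w)} W")
    print(f"load at peak: {load_fraction(peak_w, rating_w):.0%}")
```

A negative headroom only says the estimated worst case exceeds the rating; it does not show that the system ever actually reaches that draw.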

Just some things I would try if I was chasing down a bug causing a crash.

If the system runs out of RAM, it shouldn’t reboot (though it may crash or be very unhappy in general); otherwise the Linux kernel’s memory management is broken :wink:

Transcoding doesn’t take up much RAM in general, so 8GB should be fine. However, since the Arc series is still pretty new, you might be hitting some kind of driver bug that takes down the system, so a newer kernel and video driver might resolve your issue. I have no idea how you change that on Ubuntu, though, or if it’s even possible without breaking the package system. That being said, it might very well be some kind of heat or power-related issue. The prompt doesn’t seem to be the one where it prints kernel output or such, so without a log it’s hard to determine what’s going on.

I haven’t, as nothing pointed to an issue with the BIOS until a few days ago, and now I don’t have physical access, so I can’t do it.

The RAM stick doesn’t support XMP and is running at its JEDEC speed.

I would’ve liked to, but I don’t know how. Keep in mind that I don’t currently have physical access, so I can only do it from the OS. However, it won’t let me, since afaik I can’t unmount my root file system while the system is in use, and I can’t run fsck without unmounting.

Even with all that, I don’t think the problem is with the SSD. This SSD was used for something else, and it would have already caused problems then if it was actually bad.

8GB is a recommended number, not a hard requirement. I wrote that page, I know. If the system is running out of memory, the OS would kill processes to free memory instead of just crashing.
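To rule memory pressure in or out, available memory could be sampled from /proc/meminfo. This is a minimal Python sketch, assuming a Linux host; the function names are made up for illustration and nothing here is specific to Jellyfin.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is ignored.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel estimates is still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling available_fraction(read_meminfo()) periodically and logging the result would show whether memory was nearly exhausted before a hang.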

Every time it crashes it’s sitting idle doing a whole lot of nothing, and it never cut out in the middle of transcoding something for Jellyfin. If the power supply actually got overloaded it would restart instead of crash. I don’t think this is the problem.

I’ve performed multiple full system updates, including updating the kernel and the same behavior still persists.

There are no traces of the event in the system logs, so I really can’t pinpoint the issue. This is why I have a capture card attached to the system in the first place, so that I can (hopefully) extract some useful information from the screen when it crashes.
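When the logs show nothing, a heartbeat file can at least narrow down when the system stopped responding. Here is a minimal Python sketch, assuming a Unix-like host; the function names, log path and interval are all made up for illustration. Each line is flushed and fsynced so that the last entry survives a hard hang.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, now=None):
    """Append one timestamped line (with the 1-minute load average) and sync it."""
    now = now or datetime.now(timezone.utc)
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        load1 = float("nan")  # load average not available on this platform
    line = f"{now.isoformat()} load1={load1:.2f}\n"
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # push the line to disk before a possible hang
    return line


def last_heartbeat(path):
    """Return the last non-empty line of the heartbeat file, or None if empty."""
    with open(path) as f:
        lines = [entry for entry in f.read().splitlines() if entry.strip()]
    return lines[-1] if lines else None


def run(path, interval_s=10, count=None):
    """Write heartbeats every interval_s seconds; loop forever if count is None."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        time.sleep(interval_s)
    return written
```

After a crash, last_heartbeat() on the file gives the latest time the system was known to be alive, which can be matched against the capture card footage.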

I do not know about Jellyfin’s encoding procedure, but Plex uses more RAM when it transcodes, and 8GB is Jellyfin’s base minimum. The Intel Arc is likely using QuickSync, but you are right that the kernel might need to be updated. The RAM, however, is the quickest hardware change to make, and RAM can do weird intermittent things. As my uncle used to say, “it couldn’t hurt, and it might help.”

Currently I suspect something is up with the motherboard.

I’m having the people who are physically there swap the CPU/MB/RAM as a set (it makes it easier for them).

Is your power clean, and do you have a spare, reliable power supply? This looks like one of those weird power issues or a very low-level hardware problem.

The kind of problem where you never get an answer as to how and why; it just went away after replacing the power supply, or after installing a UPS it is stable again :slight_smile:

Check whether your random resets correlate with high power usage (CPU + GPU); your combined TDP might be too much for your PSU. It might not be able to supply 250-300W on the 12V line.

It’s “bronze” certified, i.e. about the lowest possible tier if certified. The cheapest available gold/platinum Seasonic is always a good bet.

Power quality issues aren’t really a thing in my area. I have another system using the exact same model of power supply and it isn’t exhibiting the same behavior. This power supply was also a known good unit.

They don’t. Also, this PSU is rated for its full 350W output on the +12V rail.

I don’t think they would’ve bothered to use all Japanese caps and a DC-DC design if it was supposed to be a bargain basement grade unit.

Well, I have run out of patience, and the system is yours to fix, so good luck.

On that note … you do know that the Japanese capacitor thing is a marketing ploy? Sure, there are great Japanese capacitors, but just like everything manufactured by any component manufacturer, there are also some merely fair Japanese capacitors. Also … how do you get a DC-DC design? There has to be an AC portion in the design somewhere.

The CPU, MB and RAM have been replaced (as a set).

I’ll see how it goes.

I also installed a smart plug so I can reset it more easily.

The thing with a broken file system is that it can be the cause, but it can also be the result of all this.

It’s also rated 350W on the 12V rail and 350W total on all rails. That “and” is the point here: there is very little slack. Your GPU alone is known to power spike up to 310W (ref. with a different model here).

The generally recommended PSU size for your config is around 500-550W. You are way below that. I would risk your PSU size if it were a high-efficiency unit from a well-known manufacturer, but it definitely isn’t.

Historically, Fortron bronze-certified units did not even pass basic conformance testing under normal load, so you get exactly what you pay for: little. There are, however, no reviews or any actual data for this model, so we are going by well-deserved reputation and price point only.

It’s not worth the risk for the price difference between actually proven good units and these cheap ones.
Look, it’s just a 100 USD delta to this 500W platinum unit, which will last you years.

TLDR: Cheaping out on a PSU is risky; cheaping out on a “server” PSU is even more unwise.

Try running CPU and GPU stress tests simultaneously and see if that triggers a random unexpected shutdown. Proper power testing is out of the question, since nobody here has either the tools or the expertise.
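For the CPU half of such a test, a crude load generator can be put together in Python without extra packages. This is only a sketch with a made-up function name: it spawns one busy-looping interpreter per worker and stops them after a fixed time. It is not a calibrated stress tool, and the GPU load would have to come from something else running at the same time, such as a hardware transcode.

```python
import os
import subprocess
import sys
import time


def burn_cpu(seconds, workers=None):
    """Keep `workers` cores busy for roughly `seconds`, then stop the workers.

    Each worker is a separate Python process spinning in a tight loop.
    Returns the number of workers that were started.
    """
    workers = workers or os.cpu_count() or 1
    procs = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(workers)
    ]
    try:
        time.sleep(seconds)
    finally:
        # Always clean up, even if interrupted with Ctrl+C.
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
    return len(procs)
```

Something like burn_cpu(600) would load every core for about ten minutes; if the machine drops out only while both CPU and GPU are loaded, that points toward power or heat.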

Power testing is a very underserved area in tech journalism, outside of a few enthusiast sites. It’s really a shame, especially once you see how deep the rabbit hole really goes.

There is really nothing else to be done here besides running basic tests and replacing components one by one.

I can’t find any CLI tools for stress testing Intel GPUs under Linux.