Computer crashes intermittently on Linux but not Windows

I’ve been having this issue for a few years now. It first started way back when I was running Ubuntu (16.04 I think) and after not being able to find the root cause, I found some workarounds that made it stop crashing… or at least, made it crash much less.

The problem is now rearing its head on my Debian 10 (root on ZFS) install.

The crashes only occur when I’m away from the system. They also only occur when the system is not doing certain tasks… let me explain. Back when my Ubuntu install was crashing, the symptoms were basically that the screen would remain blank when waking or switching the monitor back on, or everything apart from the mouse would be frozen and unresponsive. Sometimes though, not even the mouse would work. I couldn’t switch TTYs… even SysRq shortcuts wouldn’t work. My only recourse was to do a hard reboot or power cycle.

I can’t remember the reasoning behind it, but I tried running Rhythmbox (Ubuntu’s default music application) and leaving it playing an internet radio station. That drastically reduced the crashes, to the point where I can’t be certain it ever crashed again. When I said it only crashed when not doing certain tasks, those tasks were anything with continual user input (basically any physical interaction with the mouse and keyboard) or media playback such as music. The system only crashed when left to transfer or hash files, download torrents, etc.

Fast forward a few years to my Debian install: after a few months without issue, it started freezing when I left it transferring and hashing files between local and NAS storage. There’s a deadline on transferring and hashing the files, so I installed Rhythmbox and tuned into an internet radio station as a short-term “fix”, which helped for a couple of days, but now it has crashed again. Only this time, the system was left idling. A Firefox window was open, I was using vi to make a shopping list, and the internet radio was playing. I left the system for maybe fifteen minutes and when I came back, there was a blank screen, my GPU fans were running (they should be in 0rpm mode), and one of my two CPU fans had stopped.

The thing is, though, these issues have never happened on my Windows install (which makes me think the issue is related to Linux not playing nicely with my hardware). I’m a real noob when it comes to diagnosing this sort of thing on Linux and would really appreciate some guidance, please.

Here are my hardware specs:

CPU: Ryzen 7 1700x
CPU Heatsink: Noctua NH-D15
Mobo: Gigabyte AX370 Gaming 5
GPU: MSI 1660 Super Gaming X
Case: SilverStone FT02 (3x 180mm fans and 1x 120mm fan)
Boot - Linux: WD Black 500GB NVMe
Boot - Microsoft Spyware 10: Samsung 850 500GB

Defunct hardware

CPU Heatsink: Scythe Kotetsu
GPU: Asus GTX 670 DirectCU II TOP
Case: Fractal Design XL R2
Boot - Linux: Crucial MX200
Boot - Microsoft Spyware 10: Samsung 830 250GB

[NOTE] Back when I was using my old Fractal case and the GTX 670, I stress tested the system for reasons not related to the (or any) crashes. I do have fan curves that err on the side of silence, but cooling was not an issue then, and although I haven’t properly stress tested the updated system in the FT02 case, I have no reason to believe cooling is an issue now, given that I can play games and transcode Blu-rays on my Windows install without any issues.

The fact that this happens when the system is idle seems to point towards C-states or something to do with CPU frequency scaling.

As a starting point, try disabling C-states in the BIOS and see if that solves the crashing under Linux.

EDIT:

Maybe try setting your CPU governor to performance first. Failing that, disable C-states.

As root:

cpupower frequency-set -g performance
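
To confirm the governor change actually took effect, it can be read back from sysfs. This is only a minimal sketch, assuming the kernel exposes cpufreq under /sys/devices/system/cpu; the function name show_governors and its optional directory argument are made up for illustration.

```shell
# Print each distinct scaling governor and how many cores are using it.
# The optional first argument overrides the sysfs root (a hypothetical
# parameter, handy for trying the function against a fake directory tree).
show_governors() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded glob when cpufreq is not exposed at all.
        [ -r "$f" ] && cat "$f"
    done | sort | uniq -c
}
```

Running show_governors with no argument after the cpupower call should list only performance. Note that a governor set with cpupower frequency-set does not survive a reboot, so it needs to be reapplied after each boot while testing.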


I have a Ryzen 1700 on a Gaming K7 board and 2x16GB of Corsair LPX DDR4 2400MHz RAM. I would also suffer from odd system freezes in Linux from time to time under idle or near idle conditions. Never in Windows. It drove me crazy as stress tests like Prime95 would show no error with the CPU or RAM.

After much researching and playing with settings I came to the conclusion that I was affected by that weird bug in the first batches of first gen Ryzen CPUs and I could do nothing except RMA, which had a fat chance of happening due to where I lived.

I did some more testing and found out that it was my RAM when I set it to run at its rated speed (the SPD default of 2133MHz had no issues). In the end I bumped the RAM voltage by +0.05V over the stock value and the problem went away just like that.
No later BIOS revision ever solved this; I always had to run my RAM overvolted if I wanted to run it at its rated speed and timings.

Maybe it’s just my RAM, or the CPU itself. Either way, with that simple fix the issue went away for me. Still weird that it only happens on Linux and not on Windows.


Given that the symptoms are display freezes, I would guess it’s a display driver issue. You didn’t mention which drivers you are using. Since you have an NVIDIA card, I would guess you are using the proprietary drivers? You could look at the journal/log. After a crash, when you reboot, you can look at the journal from the previous boot in the terminal like this:

$ journalctl -b -1

If your regular user is not allowed to see the journal, you might need to run it as root, so just use sudo or su -. You can also look at earlier boots where you experienced a crash by first finding the boot id:

$ journalctl --list-boots

The output should be something like this:

-2 caa49d8474384420842713609d45cffa Mon 2020-03-23 11:49:39 CET—Thu 2020-04-02 03:32:54 CEST
-1 e9b68d4f2ed342c6bb1f0f35197d5ec9 Thu 2020-04-02 08:04:52 CEST—Thu 2020-04-02 09:05:47 CEST
0 9573803bb5044fc1a860ba2185a9752f Thu 2020-04-02 11:06:36 CEST—Thu 2020-04-02 09:18:57 CEST

You can then query the journal for a specific boot by giving the boot id or the offset to the -b option; like this:

$ journalctl -b e9b68d4f2ed342c6bb1f0f35197d5ec9

If --list-boots doesn’t properly discover all the journals, you can always play with the offset to find the relevant journals; negative numbers count backwards from the current boot, so -3 would be 3 boots before the current one.
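
Reading a whole boot’s journal by eye is tedious, so it may help to filter it for lines that often accompany a hang. The sketch below is only a starting point: the function name scan_crash_hints is made up, and the pattern list is my own assumption about what is worth looking for, not an exhaustive set.

```shell
# Filter kernel log text on stdin for lines that often accompany a hang.
# The pattern list is illustrative, not exhaustive:
#   "NVRM: Xid"             error reports from the proprietary NVIDIA driver
#   "soft lockup"           kernel watchdog noticing a stuck CPU
#   "blocked for more than" tasks stuck in uninterruptible sleep
#   "machine check" / "hardware error"  CPU or memory faults reported via MCE
scan_crash_hints() {
    grep -E -i 'NVRM: Xid|soft lockup|blocked for more than|machine check|hardware error'
}

# Usage (kernel messages only, previous boot):
#   journalctl -k -b -1 | scan_crash_hints
```

Two caveats. Looking at a previous boot only works if the journal is persistent; with journald’s default Storage=auto setting, logs are kept across reboots only when the /var/log/journal directory exists. Also, a hard freeze can leave nothing behind at all, because the last messages may never reach the disk, so an empty result does not rule anything out.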

I hope this helps you diagnose the problem. I haven’t used proprietary drivers in many years, so if this is indeed a display driver issue, I can’t help further.

I haven’t been on Ubuntu or Debian in a while, but do they sleep/hibernate when left inactive? I remember laptop users complaining about issues resuming from sleep, and I’m not sure if ZFS on Linux supports hibernation. Also, how much RAM do you have?

I’m really shooting in the dark here, but it’s worth a look.
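
To check the sleep angle, something like the following could show which sleep states the kernel offers and whether the previous boot ended around a suspend attempt. It is a rough sketch; the function name list_sleep_states and its optional file argument are hypothetical, and it assumes /sys/power/state is available.

```shell
# List the sleep states the kernel advertises, one per line.
# The optional argument overrides the sysfs file (a hypothetical
# parameter, useful for trying the function on a sample file).
list_sleep_states() {
    state_file="${1:-/sys/power/state}"
    [ -r "$state_file" ] || { echo "no sleep states exposed" >&2; return 1; }
    tr ' ' '\n' < "$state_file" | grep -v '^$'
}

# Did the previous boot end around a suspend attempt? (usage, as root if needed)
#   journalctl -b -1 | grep -i -E 'suspend|hibernat' | tail
#
# To rule sleep out for a while, the systemd sleep targets can be masked:
#   systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```

If the freezes stop with the targets masked, that points at suspend/resume rather than the GPU driver or CPU power states; systemctl unmask with the same target names reverses it.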

In case it helps - my dad’s computer randomly started freezing a few months back (he runs Ubuntu). We assumed power saving at first too, since it only happened when he left it for a few hours, and always had the screen off when he returned.
It actually turned out to be xscreensaver - or more accurately, one specific module (he’s got it on random, and no I can’t remember which module it was). So if you’re using xscreensaver, it might be worth looking at that too.
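
If xscreensaver is in use, a quick way to see whether it is cycling through random modules is to read the mode out of its settings file. This is a small sketch that assumes the settings live in ~/.xscreensaver with a "mode:" line; the function name xscreensaver_mode is made up for illustration.

```shell
# Print the configured xscreensaver mode (e.g. off, blank, one, random).
# The optional argument overrides the settings file path (a hypothetical
# parameter so the function can be tried on a sample file).
xscreensaver_mode() {
    conf="${1:-$HOME/.xscreensaver}"
    [ -r "$conf" ] || return 1
    # Keep only the value after "mode:" and any following whitespace.
    sed -n 's/^mode:[[:space:]]*//p' "$conf"
}
```

If this prints random, switching to a blank-screen-only mode for a few days should show whether one of the modules is involved.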