I’ve been using Linux daily for half a year now. So far, it has been a painless experience. For about a month, however, it has been unusable. The OS boots correctly, but after 5 to 10 minutes of use the screen freezes for 10 to 15 seconds and then the computer reboots.
At first, I thought it was a hardware-related issue, but I dual boot with Windows 11 and it works flawlessly.
So far, I haven’t tried many things because, to be honest, I don’t really know what to look for.
I ran memtest to make sure the issue wasn’t related to the RAM. No issue on that side.
I checked the kernel logs and the dmesg logs and didn’t see anything conclusive. I can attach the logs if you want to take a look at them.
I tried updating my GPU drivers but I kept having the same issue.
I tried updating my BIOS.
My config is the following:
CPU: Ryzen 5950X
GPU: GTX 1060 6GB
RAM: 2 x 16 GB
PSU:
Motherboard: ASUS TUF GAMING B550-PLUS WIFI II
I tried two flavors of Ubuntu with the same results:
Have you tried using an older kernel version or an older nvidia driver version? Do you have btrfs/timeshift set up? Can you roll back to a previous known-good date and see if that fixes the problem?
Can you post kernel logs from the previous boot that crashed? On a systemd system, journalctl --list-boots to get the boot offset, then journalctl -k -b <boot_offset> should do. You may want to see if one of them contains the RIP: line near the end of the log.
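If it helps, here is a rough sketch of that log check as a small shell helper. The function name crash_markers and the pattern list are my own assumptions: RIP: comes from the suggestion above, and the other strings are ones that commonly appear around kernel oopses, machine checks, and Nvidia Xid errors. Only the journalctl options are standard.

```shell
#!/bin/bash
# Sketch: filter a kernel log (read from stdin) for lines that often
# accompany a crash. crash_markers is a hypothetical helper name and the
# pattern list is an assumption that can be extended.
crash_markers() {
    # -n prefixes each hit with its line number; "|| true" keeps the exit
    # status clean when nothing matches (grep returns 1 in that case).
    grep -n -E 'RIP:|Oops|Kernel panic|NVRM: Xid|mce:' || true
}

# Usage on a systemd system (boot offset -1 is the previous boot):
#   journalctl --list-boots
#   journalctl -k -b -1 --no-pager | crash_markers
```

An empty result doesn't prove much on its own: with a hard reset, the last few lines may never have been written to disk.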
After posting my message, I tried xserver-xorg-video-nvidia and I was able to watch a YouTube video for 2 hours without any stability issue. However, when I use this driver, my whole interface feels laggy and I’m not willing to keep using it.
After that, I tried switching back to nvidia-driver-535 and I could not even reach my desktop. The login screen would show up, but as soon as I logged in, it crashed. I had to switch to the console and go back to nvidia-driver-390 to be able to log in.
However, I still have the same stability issue and the system crashed after a little while.
There really doesn’t seem to be anything here. If it was a GPU driver-related crash, I expected there to be something related to the GPU there, but there wasn’t. On the Windows side, do you see anything in system event logs in the Event Viewer? (Event Viewer → Windows Logs → System).
Also, it might be worth monitoring the temperature of the CPU and GPU.
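A minimal sketch of how the temperatures could be logged so the last reading survives a sudden reboot. It assumes the CPU sensor is exposed by the k10temp hwmon driver (usual for Ryzen) and that nvidia-smi from the proprietary driver is installed. The log path and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: append CPU and GPU temperatures to a file every few seconds and
# flush to disk, so the last line is still there after a hard reset.
LOG="${LOG:-$HOME/temps.log}"   # hypothetical log location

# hwmon reports millidegrees Celsius; convert to whole degrees.
millideg_to_c() { echo $(( $1 / 1000 )); }

# Assumption: the CPU sensor is the hwmon device named k10temp.
cpu_temp() {
    local d
    for d in /sys/class/hwmon/hwmon*; do
        if [ "$(cat "$d/name" 2>/dev/null)" = "k10temp" ]; then
            millideg_to_c "$(cat "$d/temp1_input")"
            return
        fi
    done
    echo "n/a"
}

# Temperature of the first GPU in degrees Celsius (empty if nvidia-smi fails).
gpu_temp() {
    nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits 2>/dev/null | head -n 1
}

# Loop forever; the optional argument is the interval in seconds (default 2).
log_temps() {
    while true; do
        printf '%s cpu=%s gpu=%s\n' "$(date +%T)" "$(cpu_temp)" "$(gpu_temp)" >> "$LOG"
        sync
        sleep "${1:-2}"
    done
}

# Usage: call log_temps in a terminal, reproduce the crash, then read
# the tail of "$LOG" after the reboot.
```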
iirc, that’s just a generic VESA driver, and cannot actually interact with the card properly. It’s stuck at lowest clocks, doesn’t have acceleration for things, it just carries a frame to the display for you. Basically like running windows without a GPU driver.
Thanks for the explanation, I didn’t take the time to check what it was exactly.
However, given that the system is stable with it, doesn’t it mean that the Nvidia driver is likely to be the culprit?
It implies that the nvidia driver having full feature access to the GPU is causing a problem somewhere in the chain. But, it doesn’t really narrow it down much.
It could be partial GPU failure, it could be a bad GPU driver from nvidia, it could be a kernel module incompatible with nvidia’s driver, it could be a kernel bug, it could be a power issue…
We know it happens in two Debian installs with two versions of the nvidia driver and not much else.
Does it happen on any Arch based distros?
Does it happen on a fresh install?
Does it happen in Nouveau? Does it happen in Nouveau when you reclock it to use 3D clocks?
Are you using X11 or Wayland based DEs? Can you try another desktop environment that uses the other?
I’ve tried a fresh install of Fedora 36 and I had the same issue. So it isn’t specific to Debian-based distros, nor to the installs not being fresh.
I’ve tried KDE, XFCE and Gnome without any difference.
I will verify, but I think they were all using Wayland.
I also have to verify, but if I remember correctly, my PSU is a 750W unit from a reputable brand. Is there a way to test whether the PSU is the problem?
For the GPU, my girlfriend also has a GTX 1060, so I could try swapping them to see if it’s hardware related. However, I don’t understand why Windows works flawlessly.
If it is related to the drivers, should I try to contact Nvidia?
I didn’t try an Arch-based distro. Even if it works, it won’t solve my problem, because I use Yocto, which isn’t tested on Arch-based distros.
What is the Nouveau you are talking about?
Do you have a link about reclocking I could dig into?
I don’t think XFCE supports wayland yet, so it sounds like it’s… Possibly hardware failure that the windows driver is masking?
Nouveau is the open source nVidia driver. It barely works, and for most GPUs it requires manually setting the power state to be non-idle; I’d be surprised if it didn’t crash. I’m not sure Pascal even supports reclocking at all. For Kepler, it needs to be done manually via the terminal.
In terms of testing Arch, it’s just another step to seeing what works and what doesn’t. It probably won’t work, but if it did, theoretically, it would mean that there’s some problem that isn’t just that nvidia’s linux drivers have suddenly become completely broken, such as the distros not properly blacklisting nouveau, or the kernel not loading nvidia drivers properly on boot.
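One quick way to check the last two possibilities is to see which kernel driver is actually bound to the card. This is only a sketch: lspci -k is a standard tool, but the helper name gpu_driver_in_use and the parsing are my own.

```shell
#!/bin/bash
# Sketch: read `lspci -k` output on stdin and print the kernel driver
# bound to each VGA/3D controller. gpu_driver_in_use is a hypothetical
# helper name.
gpu_driver_in_use() {
    awk '
        # A device header for a graphics card starts a GPU section.
        /VGA compatible controller|3D controller/ { gpu = 1; next }
        # Any other unindented line is the header of a different device.
        /^[^[:space:]]/ { gpu = 0 }
        # Inside a GPU section, the last field of this line is the driver.
        gpu && /Kernel driver in use:/ { print $NF }
    '
}

# Usage:
#   lspci -k | gpu_driver_in_use
# "nouveau" here while the proprietary driver is installed would suggest
# the blacklist is not taking effect.
```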
It would be even more useful to have a known-good/working/legacy install to see if the problem persists with old software. If old was-working software is no longer working, that implies some change to the hardware causing it to not work. If you have any old disk images of a working linux install, it would be good to test those. If not, a live environment that has baked-in nvidia drivers from last year sometime could also do.
Swapping your girlfriend’s GPU in and seeing if that works would also be good to try, as would booting a liveimage in her computer to see if it has the same problems.
I have no idea what’s actually wrong here or where to look, so I’m just throwing out anything I can think of that might produce some kind of information to go off.
You’ll probably need to start from scratch to even use Nouveau, but it should work in a liveimage, since it’s the open source driver and plays nicely with the rest of linux, though not with the nVidia GPUs it’s supposed to be working with.
It doesn’t look like pascal supports reclocking, though, so it may not be a very useful test.
This doesn’t help troubleshoot his problem with the GPU, but the nvidia drivers do in fact allow reclocking of both memory and core clocks if you set them up with what is known as the “coolbits tweak”.
It works on every generation of Nvidia card back to Maxwell. It takes a simple command line statement to enable it, then a reboot. After that you can either set your desired overclock from the terminal with nvidia-settings or adjust the clocks manually in the Nvidia X Server Settings application that gets installed by default with the proprietary drivers.
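For reference, the enable step is usually done with nvidia-xconfig, and I’m assuming that is the command meant here. As far as I know from the Coolbits option in the driver README, 4 enables manual fan control and 8 enables clock offsets. The variable and function names below are just for illustration.

```shell
#!/bin/bash
# Sketch: build the Coolbits value from the individual feature bits.
# Bit meanings are taken from the Coolbits option in the Nvidia driver
# README; the names below are hypothetical.
COOLBITS_FAN=4      # manual fan control
COOLBITS_CLOCKS=8   # clock offsets (overclocking)

coolbits_value() {
    echo $(( COOLBITS_FAN | COOLBITS_CLOCKS ))
}

# Usage (writes the option into the X config, so it only applies under
# X11; restart X or reboot afterwards):
#   sudo nvidia-xconfig --cool-bits="$(coolbits_value)"
```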
Just an FYI. I run all my Nvidia cards with an appropriate overclock to get them back to the equivalent of the P0 power state after the driver imposes a downclock penalty for running compute loads on consumer cards.
#!/bin/bash
# Enable persistence mode so the driver stays loaded and the limits below stick
/usr/bin/nvidia-smi -pm 1
# Set the power limit of each card to 200 W
nvidia-smi -i 0 -pl 200
nvidia-smi -i 1 -pl 200
nvidia-smi -i 2 -pl 200
# PowerMizer mode 1 = prefer maximum performance
/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
/usr/bin/nvidia-settings -a "[gpu:1]/GPUPowerMizerMode=1"
/usr/bin/nvidia-settings -a "[gpu:2]/GPUPowerMizerMode=1"
# Take manual control of the fans on each card and set their target speeds (percent)
/usr/bin/nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
/usr/bin/nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=85"
/usr/bin/nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=90"
/usr/bin/nvidia-settings -a "[gpu:1]/GPUFanControlState=1"
/usr/bin/nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=85"
/usr/bin/nvidia-settings -a "[fan:3]/GPUTargetFanSpeed=90"
/usr/bin/nvidia-settings -a "[gpu:2]/GPUFanControlState=1"
/usr/bin/nvidia-settings -a "[fan:4]/GPUTargetFanSpeed=85"
/usr/bin/nvidia-settings -a "[fan:5]/GPUTargetFanSpeed=90"
# Apply memory and core clock offsets at performance level 4
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=800" -a "[gpu:0]/GPUGraphicsClockOffset[4]=60"
/usr/bin/nvidia-settings -a "[gpu:1]/GPUMemoryTransferRateOffset[4]=800" -a "[gpu:1]/GPUGraphicsClockOffset[4]=60"
/usr/bin/nvidia-settings -a "[gpu:2]/GPUMemoryTransferRateOffset[4]=800" -a "[gpu:2]/GPUGraphicsClockOffset[4]=60"
This script applies the overclock: it sets the power limit on each card, sets the fan speeds, and sets the core clock and memory clock offsets for each card.
I would also recommend not using Wayland. The Nvidia drivers are more stable with the X server. 22.04 defaults to Wayland now, I believe. You can revert to X by uncommenting the:
WaylandEnable=false
statement in the /etc/gdm3/custom.conf file (remove the leading #), then save the file and reboot to get an X11 DE.
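To confirm which session type you actually end up in, and whether the GDM setting is active, something along these lines should do. The helper name wayland_disabled is made up. It relies on GDM honoring WaylandEnable=false only when the line has no leading #.

```shell
#!/bin/bash
# Sketch: succeed only when the given GDM config file contains an active
# (not commented out) WaylandEnable=false line. wayland_disabled is a
# hypothetical helper name.
wayland_disabled() {
    grep -q -E '^[[:space:]]*WaylandEnable[[:space:]]*=[[:space:]]*false' "$1"
}

# Usage:
#   echo "$XDG_SESSION_TYPE"   # x11 or wayland for a graphical session
#   wayland_disabled /etc/gdm3/custom.conf && echo "GDM will start Xorg sessions"
```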