So after swapping the GPU with my girlfriend’s, it works flawlessly. I left my computer running the whole afternoon and there was no crash.
I’ll try to swap back tomorrow and keep you updated
Regarding the PSU, it is a 750W one from Corsair
Yes, that would be a good further T/S step. It could be the original card was not fully seated in the slot. Or there could have been a bit of crud in the slot fingers that caused the issue. Removing and installing your gf’s card may just have cleaned out the slot, or the original card may now be fully seated.
So I swapped GPUs again this morning and now everything works fine.
I have no clue what happened, since it had been working like this for 6 months, but anyway. It’s good that it has been fixed.
Thanks everybody for your help!
EDIT : It just happened again
I wonder if the PSU may be the culprit here, but the fact that it doesn’t happen on Windows is weird. I don’t think the 1060 has a lot of transient power spikes either (and if it were spikes, it should happen during a heavy load, e.g., a gaming session).
Do you hear any clicking sound from the PSU when the machine turns off?
PCIe ASPM L1 is also known to cause a crash with some devices, but I don’t think that’s enabled by default (I don’t think an AMD board even exposed an option to enable it until very recently, either).
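If you want to rule ASPM out, here is a quick check and a way to disable it (just a sketch; it assumes a standard sysfs layout and GRUB):

```bash
# Show the current ASPM policy; the active one is shown in [brackets]
cat /sys/module/pcie_aspm/parameters/policy

# To rule ASPM out entirely, boot once with it disabled by adding
#   pcie_aspm=off
# to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub and reboot.
```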
If I understand the above correctly, “it” (the machine) crashed even with an entirely different GPU?
Maybe a hardware issue with the motherboard? Have you tried putting a GPU (preferably a definitely-good one) in a different motherboard PCIE slot? Double check power cable connections to the card?
And it will happen again and again…
I had for a while a 1030/2080 combination in my system, the 1030 for the host system and the 2080 as a passthrough GPU for VMs.
Almost all problems I had during this time had to do with the Nvidia drivers and no, I’m not saying that there isn’t a stable combination of distro, kernel and driver version.
But as an Nvidia user, before upgrading your system, make sure you have SSH access and use BTRFS or ZFS snapshots (a rough sketch is at the end of this post); this will certainly save you a lot of time at some point.
It gets better over time as you figure out the tricks each distro uses to mitigate the problem, but it always remains a problem.
The positive is, you will master Linux faster
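To give an idea, the snapshot step can be as simple as this (just a sketch, assuming your root filesystem is on BTRFS and you keep snapshots under /.snapshots; ZFS users would use zfs snapshot instead):

```bash
# Take a read-only snapshot of the root subvolume before a driver/kernel upgrade
sudo mkdir -p /.snapshots
sudo btrfs subvolume snapshot -r / /.snapshots/pre-upgrade-$(date +%F)

# If the upgrade breaks the desktop, you can still reach the box over SSH
# and pull files back out of the snapshot (or roll back, depending on your layout).
```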
Very important questions.
I know you stated you ran memtest, but this doesn’t always expose memory issues. I would suggest you disable any XMP profiles or overclocks (note XMP is an overclock) you might have simply for testing.
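If you want a second opinion on the RAM beyond a boot-time memtest, a userspace pass can sometimes catch things it misses (a rough sketch, assuming the memtester package is available on your distro):

```bash
# Debian/Ubuntu package name; other distros ship it under the same name
sudo apt install memtester

# Lock and test 4 GiB of RAM for 2 passes; keep the size below your free memory
sudo memtester 4G 2
```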
Also have you undervolted your CPU? Many people do this not understanding that as silicon ages it can cause things to start to misbehave. Vendors set the voltages with some headroom to account for ageing and quality of silicon issues, undervolting puts you right on the edge leaving little room if things go wrong.
As for PSU, I highly doubt it’s PSU. If it were you wouldn’t see a reboot, but rather the PSU go into overload protection and shut off entirely. It still can be due to poor rail regulation, but it’s doubtful.
A GPU crash/failure doesn’t normally cause a reboot of a Linux system either, usually just loss of video and a kernel oops. I spent over a year working on VFIO GPU failure issues, not once have I ever seen either AMD or NVIDIA GPUs directly cause a system reboot on failure. A crashed card that falls off the bus, or a hung system yes, but not a reboot.
I’d suggest you SSH into your system from another PC, run dmesg -w, and wait for a reboot/crash again. It’s possible the kernel is logging useful information that is not getting flushed to disk before the reboot, causing you to miss it.
Edit: Perhaps also in another SSH session at the same time, run watch -n0 'sensors; nvidia-smi' to check for a temperature transient that could explain it.
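For reference, from the second PC it can look like this (just a sketch; the user and hostname are placeholders):

```bash
# Terminal 1: follow the kernel log live, so anything printed right before the
# reboot is visible on this machine even if it never hits the disk.
# Add sudo on the remote end if dmesg is restricted on your system.
ssh user@problem-box 'dmesg -w'

# Terminal 2: watch temperatures and GPU state at the same time
# (-t allocates a terminal so watch can refresh the screen)
ssh -t user@problem-box "watch -n0 'sensors; nvidia-smi'"
```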
@Lyama I don’t know if using the Nvidia 535 driver is the cause of your problem, but it appears Nvidia introduced a bug in the Linux version of the 535 graphics drivers. At least, that is what I have heard online. Many people have graphics card issues with that driver; I would switch to 525 or an older version.
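If you are on an Ubuntu-family install, dropping back a branch is usually just a package swap (a sketch only; exact package names depend on your distro and on how the driver was installed):

```bash
# See which NVIDIA driver packages are currently installed
dpkg -l | grep -i nvidia-driver

# Install the 525 branch instead (pulls in the matching userspace bits), then reboot
sudo apt install nvidia-driver-525
sudo reboot
```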
I am using Nvidia 535.86.05 on Kubuntu 23.04 with X11 and can confirm I see bugs that didn’t exist prior to 535. Mostly when exiting games that block compositing on other applications (like CSGO); it’s not constant, but when it happens X11 loses its mind: applications bounce between screens, Plasma crashes, and sometimes applications will display upside down amid screen flickering.
I had issues previously that forced a reinstall (Asus X570, Ryzen 5950X), but it ended up being a bad stick of RAM that memtest eventually revealed.
Hardware fails! This sucks when you don’t have replacement parts. An easy test is to remove and reseat all the parts.
Hi guys,
Thanks for all your feedback. I’ve not resolved my issue yet. It’s starting to drive me nuts because I can’t use my computer for more than 10 minutes and I still have no idea where it could be coming from.
I don’t hear any sound coming from my PSU when the computer crashes.
Regarding the XMP, I’ve already disabled it.
I will try to log in SSH and check dmesg.
I’ve tried Nvidia driver 530 without much success. I will try 525
Before rebooting, the system freezes for around 30-40 seconds.
I don’t know if you can interpret anything from this
Hmmm… I can feel you on this. I ran into a lot of issues lately with my 3070 and it got to the point where I went back to Windows 10, as I don’t have the time to constantly troubleshoot Linux just so I can play my games. Currently trying Pop OS and have had some good luck… for now.
IMO the issue is probably GPU related and it’s most likely the Nvidia drivers/Deb kernel. However, it can be your setup/BIOS settings.
Now if you want to continue troubleshooting check my below questions and answers/recommendations.
Are you using a PCIe riser for the GPU? If so, remove it and test. (The question below may also help: make sure to match the riser’s PCIe version to the BIOS PCIe setting.)
Is the PCIe bandwidth properly set within the BIOS? Go into the BIOS (press F7 at the launch screen to get into advanced settings from basic), toggle over to Advanced in the main menu, then Onboard Devices Configuration > PCIEX16_1 (could be 2 or 3 depending on where you have the GPU plugged in) and set the mode from Gen 4 to Gen 3. This fixed some issues I had with my GPU on Linux.
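You can also verify from Linux that the slot actually negotiated the speed you set (a quick sketch; 01:00.0 is just an example address, check the first command’s output for your card’s real address):

```bash
# Find the GPU's PCI address
lspci | grep -i vga

# Compare the negotiated link speed/width (LnkSta) with what the card supports (LnkCap)
sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'
```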
Have you made sure your packages are up to date/working? You can check by running debsums -c (see the sketch at the end of this post).
Depending on your answers/results I may be able to help you narrow down the issue, or help someone else here with more knowledge troubleshoot.
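For the debsums check, something along these lines should do (assuming a Debian/Ubuntu-based install):

```bash
# Install debsums and verify installed package files against their checksums;
# only changed files are reported, so no output is good news
sudo apt install debsums
sudo debsums -c
```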
Not really, you need to do as I stated above. Connect to the system remotely and start watching the dmesg output (dmesg -w). When the system crashes, if it dumps any information you will see it on the 2nd PC (hopefully), which will help narrow things down.
@xyz all the suggestions you are making would be great if the system were not rebooting; however, the Linux kernel is designed to NOT just reboot in the event of a failure, so this points to a more major issue. I can state for a fact that PCIe bus communication issues can not cause a system reboot.
The kernel may power off if something catastrophic happened such as over temp (amdgpu will do this for example) which doesn’t seem to be the issue here, but not just reboot.
Again, a faulting GPU (or any PCIe device) does not cause a full system reboot. I don’t know why people are ignoring this major fact and huge hint as to the cause. Even the PCIe spec doesn’t allow for a “reboot”, emergency power off yes, but a device can not directly trigger the system to reboot.
As @Lyama pointed out, the reboot takes a bit of time (a hardware watchdog perhaps); this is a good indication that there may be useful logging information to collect that is not getting flushed to disk. Until this is checked for and provided (if it exists) there is ZERO point speculating as to other causes.
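One thing that can improve the odds of catching something on disk, in addition to the remote dmesg -w (a sketch, assuming a systemd-based distro; logs can still be lost in a hard reset):

```bash
# Make the systemd journal persistent so kernel messages survive a reboot
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# After the next crash, inspect the kernel log from the previous boot
journalctl -k -b -1
```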
Also, the fact that Windows works and Linux doesn’t, while a useful metric, is not a guarantee that the hardware is OK. The Linux scheduler and workload are wildly different to Windows and as such may be triggering an edge case which is resulting in this reboot.
This. I have a system that is 100 percent stable in windows and crashes (well, PSU trips and shuts down) due to transient power spikes in Linux. VEGA 64. It also did it with a specific version of the windows driver. Linux and windows power management and power consumption can be wildly different.
I’ll also add that this machine was also stable running CrossFire Vega 64s in Windows (well, given it was CrossFire, it wasn’t tripping the PSU via spikes at least), while a single Vega 64 caused this in Linux.
Have you tried disabling wayland and running on xorg?
Please read through the thread before making suggestions like this. Even if switching to xorg/wayland “fixes” it, it does not inform as to the root cause of the problem, nor does it inform as to possible causes.
I don’t use a PCIe riser.
I’ve set the PCIe version for the GPU slot to 3 but it didn’t yield any better result.
All my packages are up to date. I installed and ran debsums and it didn’t report any error
Sadly, dmesg -w didn’t show anything useful: [ 168.246114] audit: type=1400 audit(1692681331.903:134): apparmor="DENIED" ope - Pastebin.com
As you can see, the temps are also far from crazy: Every 0,1s: sensors; nvidia-smi - Pastebin.com
Both of these logs are the last ones I got from my other computer connected over SSH after the system crashed.