Help with hard server crash - Ubuntu with Nvidia 3090

Hey guys, I am begging for some help tracking down a long-term issue. I have a server I built a year ago for Gen AI. The build list for it can be seen here. The only thing missing is the Nvidia 3090 Ti I got on eBay. I can’t confirm this is the culprit, as troubleshooting this has been the most difficult thing in my life (second only to raising a toddler).

Here’s what happens: at some point (after a few hours, or after a day or two) the system completely freezes. I mean no response from the console at all, and I have to physically power cycle the system. So if I wanted to rule out the GPU, I’d have to unplug its power, let the system run for several days or a week before I could confirm it, then plug it back in and wait a few more days to see if something crashes.

I’ve already run a memtest, which passed with flying colors, and the freeze happens whether I’m running Ubuntu 24.04 or Unraid 7.0.

I’m at a complete loss as to how to troubleshoot this. It’s made even more difficult because I can’t replicate the crash, and I never seem to find anything helpful in the logs. I’m at the point where I’d consider running a series of commands in a cron job every minute if I thought it would help, I just don’t know what I’d look for.
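Something like the sketch below is what I have in mind: a tiny script that cron runs every minute to append temps, GPU state, and recent kernel warnings to a file, so after the next freeze I’d at least have a timestamped trail. The log path and the exact fields are just placeholders I made up.

```bash
#!/usr/bin/env bash
# Hypothetical once-a-minute snapshot; run from root's cron with:
#   * * * * * /usr/local/bin/crash-hunt.sh
# The log path and captured fields are only guesses, adjust as needed.
LOG=/var/log/crash-hunt.log
{
  echo "===== $(date -Is) ====="
  uptime
  sensors 2>/dev/null                      # CPU/board temps (lm-sensors package)
  nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu \
             --format=csv,noheader 2>/dev/null
  dmesg --level=err,warn 2>/dev/null | tail -n 20   # recent kernel complaints
  free -m | head -n 2                      # memory pressure at a glance
} >> "$LOG"
```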

I’d think it’s a heat issue somewhere, but this case has plenty of airflow, and right now the case is open with a fan blowing right into it. It’s also running in my basement where temps average like 65F.

Maybe the GPU has a fault somewhere, but it runs perfectly right up until it crashes. I can throw all of the inference tasks I want at it. It’s run Stable Diffusion, Ollama, Oobabooga, etc. in the past, all without fail.

The only thing I haven’t really tried is downgrading the Nvidia drivers. I just did a fresh install (Ubuntu 24.04 with the latest Nvidia drivers, etc.), but the driver naming is so complex I can’t make heads or tails of what I should install instead. Should I use the Server, Open, or some other suffix?

I’m desperate for any help on commands to run, log files to look into, apps I can install to help troubleshoot etc. I’ve got a lot of money in this and I need it to work.


Which Nvidia driver are you using? There are four options for any given version.

Have you tried installing Ubuntu Desktop and using the nouveau driver, to see if the card works and it’s just the NVIDIA driver?

That screams “IT professional hands on”, frankly.

For starters, set up the Phoronix test suite and let it run for a while.

I’d suggest swapping out the boot SSD to preserve your current system, then installing a minimal Debian Stable on the new SSD and running the Phoronix test suite. Debian Stable is just that: stable. It’s really intended for server use, so anything going astray is most likely a hardware fault of some sort.
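If it helps, getting it going is only a couple of commands; the test profile below (pts/stress-ng) is just an example, pick whichever ones hammer the parts you suspect:

```bash
# The package is in the Debian/Ubuntu repos; otherwise grab it from phoronix-test-suite.com
sudo apt install phoronix-test-suite

# See what's on offer, then install and run a stress-oriented profile
phoronix-test-suite list-available-tests
phoronix-test-suite benchmark pts/stress-ng
```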

HTH!

That’s half the battle, so many options.

Currently on nvidia-driver-570-open

Thanks for this, I’ve never even heard of it.

I’m also curious about your phrasing of Debian as “stable” vs Ubuntu. While I’m not a hardcore Linux guy, I believe Ubuntu and Debian are both pretty reliable and ‘stable’ at this point? I’m genuinely curious if you’re saying Debian is more stable, and if so, what makes it that way in your opinion?

Try the open server variant; I’m wondering if something in the regular variant is crashing due to a missing desktop env and other deps:

nvidia-driver-###-server-open
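On Ubuntu you can list every variant it knows about for your card and then pull in the server-open one by name, roughly like this, assuming you stay on the 570 branch you’re already running:

```bash
# Show the driver packages Ubuntu recommends for the detected GPU
ubuntu-drivers devices
ubuntu-drivers list

# Switch to the server/open variant of the same branch, then reboot
sudo apt install nvidia-driver-570-server-open
sudo reboot
```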

I found you a few answers to that question, in no particular order:

https://www.reddit.com/r/debian/comments/12e8ly0/why_choose_debian_over_ubuntu/

There are more, but these will occupy a large portion of your evening :wink:

Not even SSHing into it works? If so, I don’t think the GPU is the issue, since a GPU crash usually leaves the rest of the system running.

Just to be sure, you’ve checked both dmesg and journalctl?
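For reference, after a forced power cycle it’s the previous boot’s journal that matters; something along these lines:

```bash
# Kernel ring buffer for the current boot (this does NOT survive a power cycle)
sudo dmesg --level=err,warn,crit

# Journal from the boot *before* the freeze, errors and worse only
journalctl -b -1 -p err

# Make sure journald is actually keeping logs across reboots,
# otherwise -b -1 will have nothing to show
journalctl --list-boots
```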

I was having the same issue with Debian 12 and Ubuntu 24.04 but was similarly unable to narrow down the problem.

The system and all QEMU VMs would lock up at random, completely regardless of system load or uptime, and absolutely nothing was printed to logs. Multiple fresh installs did the same while sitting idle on a desktop. This would happen at least once per day, so the system was unusable.

I tried swapping all my U.2 and SATA storage, all 12 DIMMs, and even reseating my processors to no avail.

I’m running a dual socket Cascade Lake machine with an RTX A5000 (~3080). On Debian, I was able to reproduce the issue with kernels from 5.10-6.12 from the bullseye and backports repos.

This led me to install another distro, and CachyOS with the x86-64-v4 repos and LTS kernel has not crashed on me in the past two months of use.

Even from REISUB? I usually know a console is gone because the underscore stops flashing.

For the future, do you know about the Linux REISUB keys? They’re the fullest test of whether your Linux system has crashed: there’s a core loop in most Linux kernels listening for Alt+SysRq plus one of a handful of keypresses. Ignoring the R for the moment, this lets you send SIGTERM so all processes except PID 1 end themselves (E), force-kill all tasks except PID 1 (I), sync cached data to disk (S), unmount filesystems and remount them read-only (U), and reboot the whole system (B). The search term is ‘REISUB’; the ‘R’ step is there because some setups need it to switch the keyboard back to a raw, kernel-handled mode.
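One caveat: stock Ubuntu ships with the SysRq mask partially restricted, so it’s worth confirming the keys are actually enabled before you rely on them. A sketch of what I mean (the sysctl file name is arbitrary):

```bash
# Show the current SysRq mask; Ubuntu defaults to 176, which disables some functions
cat /proc/sys/kernel/sysrq

# Enable all SysRq functions, persistently across reboots
echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/90-sysrq.conf
sudo sysctl --system

# The same functions can be triggered without a keyboard, e.g. a sync:
echo s | sudo tee /proc/sysrq-trigger
```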

For logging, you might send logs to a network-connected device (if you have a spare) so the data isn’t lost when the system locks up. The systemd suite running as PID 1 on contemporary Linux systems has systemd-journal-remote and systemd-journal-upload (see this from DigitalOcean, still relevant to Debian 12 and 13), or there’s rsyslog (see this on Stack Overflow), either of which can send to other devices.
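As a quick illustration of the rsyslog route (the address is a placeholder for whatever spare box you have; the journal-upload approach is similar in spirit):

```bash
# On the crashing machine: forward everything to another box over TCP.
# A single @ means UDP, a double @@ means TCP; 192.168.1.10 is a placeholder.
echo '*.* @@192.168.1.10:514' | sudo tee /etc/rsyslog.d/90-forward.conf
sudo systemctl restart rsyslog

# On the receiving box, enable the TCP listener in /etc/rsyslog.conf, then restart rsyslog:
#   module(load="imtcp")
#   input(type="imtcp" port="514")
```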

Good luck finding something to help you diagnose this, and may you also find a Linux environment that’s stable on your hardware.

K3n.

Glad to hear I’m not alone. I ended up installing the server-open drivers recommended by someone else here and I’ve had 5 days of uptime…which is a new record…sadly.

NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8

Wanted to come say thanks for the suggestion. I tried the server-open drivers and I’m at 5 days of uptime and counting!

It’s really sad when 5 days is good news…but hey…here I am :slight_smile: