Hey guys, I am begging for some help tracking down a long-term issue. I have a server I built a year ago for Gen AI. The build list for it can be seen here. The only thing missing from that list is the Nvidia 3090 Ti I got on eBay. I can’t confirm the card is the culprit, as troubleshooting this has been the most difficult thing in my life (second only to raising a toddler).
Here’s what happens: At some point (after a few hours, or after a day or two) the system will completely freeze. I mean no response from the console at all, and I have to physically power cycle the system. So if I wanted to confirm the GPU is the problem, I’d have to unplug its power and let the system run for several days or a week, then plug it back in and wait a few more days to see if it crashes again.
I’ve already run a memtest, which passed with flying colors, and the freeze happens whether I run Ubuntu 24.04 or Unraid 7.0.
I’m at a complete loss as to how to troubleshoot this. It’s made even more difficult because I can’t replicate the crash on demand, and I never seem to find anything helpful in the logs. I’m at the point where I’d consider running a series of commands in a cron job every minute if I thought it would help; I just don’t know what I’d look for.
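A minimal sketch of what such a per-minute job might look like, assuming `nvidia-smi` is on the PATH and the script runs as root. The log path, script name, and choice of fields are hypothetical; the key idea is to `fsync` each line so the last reading before a hard freeze survives the power cycle.

```python
#!/usr/bin/env python3
"""Append one line of GPU and system health data per run, forced to disk."""
import os
import subprocess
import time

LOG_PATH = "/var/log/freeze-watch.log"  # hypothetical path, pick your own

# Query fields accepted by `nvidia-smi --query-gpu`
GPU_FIELDS = "temperature.gpu,power.draw,utilization.gpu,memory.used"


def read_gpu():
    """Return nvidia-smi's CSV reading, or a short error marker on failure."""
    cmd = ["nvidia-smi", f"--query-gpu={GPU_FIELDS}", "--format=csv,noheader"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing or hung nvidia-smi is itself worth recording.
        return f"gpu-error: {type(exc).__name__}"
    if result.returncode != 0:
        return f"gpu-error: exit {result.returncode}"
    return result.stdout.strip().replace("\n", " | ")


def read_load():
    """Return the 1, 5 and 15 minute load averages as text."""
    try:
        return " ".join(f"{value:.2f}" for value in os.getloadavg())
    except OSError:
        return "load-error"


def build_line(now, gpu, load):
    """Format one log record with a local timestamp."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} load={load} gpu={gpu}"


def append_durably(path, line):
    """Append a line and fsync it so it is on disk before the next freeze."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    append_durably(LOG_PATH, build_line(time.time(), read_gpu(), read_load()))
```

Saved as, say, `/usr/local/bin/freeze_watch.py` (a made-up location), it could be scheduled from root's crontab with `* * * * * /usr/bin/python3 /usr/local/bin/freeze_watch.py`. After a freeze, the last few lines of the log show whether temperature or power draw was climbing, or whether `nvidia-smi` had already stopped answering.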
I’d think it’s a heat issue somewhere, but this case has plenty of airflow, and right now the case is open with a fan blowing right into it. It’s also running in my basement, where temps average around 65°F.
Maybe the GPU has a fault somewhere, but it runs perfectly up until the moment it crashes. I can throw all of the inference tasks I want at it. It’s run Stable Diffusion, Ollama, Oobabooga, etc. in the past, all without fail.
The only thing I haven’t really tried is downgrading the Nvidia drivers. I just did a fresh install (Ubuntu 24.04 with the latest Nvidia drivers, etc.), but the driver packaging is so complex I can’t make heads or tails of what I should install instead. Should I use the Server variant, the Open variant, or some other suffix?
I’m desperate for any help on commands to run, log files to look into, apps I can install to help troubleshoot etc. I’ve got a lot of money in this and I need it to work.
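As one possible starting point for the log side of this, here is a rough sketch that pulls the kernel messages from the boot before the freeze and keeps only lines that often accompany hardware trouble. It assumes systemd's journal is persistent, so that `journalctl -k -b -1` can see the previous boot. The pattern list is an illustrative guess, not an exhaustive catalogue, and a hard freeze may leave nothing in the log at all.

```python
#!/usr/bin/env python3
"""Filter the previous boot's kernel log for hardware-related messages."""
import re
import subprocess

# Illustrative patterns only: NVIDIA driver Xid reports, machine check
# events, PCIe error reports, and kernel lockup warnings.
PATTERNS = re.compile(
    r"NVRM: Xid|machine check|mce:|hardware error|AER:|soft lockup|hard lockup",
    re.IGNORECASE,
)


def suspicious_lines(log_text):
    """Return the log lines that match any of the patterns above."""
    return [line for line in log_text.splitlines() if PATTERNS.search(line)]


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot, or '' if unavailable."""
    cmd = ["journalctl", "-k", "-b", "-1", "--no-pager"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    hits = suspicious_lines(previous_boot_kernel_log())
    if hits:
        print("\n".join(hits))
    else:
        print("No matching kernel messages found for the previous boot.")
```

Running it right after the power cycle matters, since `-b -1` always refers to the boot immediately before the current one.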