Hi guys, I’m getting repeatable system freezes whenever I use PyTorch. After around 5-20 minutes of training, my system freezes (no mouse/keyboard, no tty, no remote SSH, no displayed error messages, reset button required). I’m using 4 workers in the data loaders, so the CPU sits at around 20% usage during training.
Here are my system details: Ubuntu 18.04, Nvidia driver 410 (installed from ppa:graphics-drivers/ppa), CUDA 10.0, cuDNN 7.4.2. 3 GPUs (1x 1080, 2x 2080 Ti). Thermals (from nvidia-smi) seem normal. MB: Asus Zenith Extreme, CPU: Threadripper 2950X, RAM: 128GB of Corsair Vengeance LPX (3200MHz sticks but running at 2666MHz). PSU rated to 1500W. The images used for training are stored on a Samsung SSD connected via SATA.
Here’s what I’ve tried: Updated the Linux kernel to 4.19 (didn’t help, so I went back to the Ubuntu default). Reseated all the GPUs & RAM. Tested each GPU individually. Did a fresh install of Ubuntu. Tried both stock Python 3.6 & Anaconda 3.7. I also ran 1 hour of 100% CPU usage to rule out a CPU/RAM problem, and GPU burn seems fine too.
I’m thinking it could be some sort of PCIe issue, since it only occurs under GPU load (regardless of which GPU and PCIe slot I use). Or it could be a multithreading issue, since I’m using 4 workers (I’m currently running with 1 worker and will see if that still causes a freeze).
Any ideas as to what I should try next, and which logs would help diagnose the issue?
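On the logging side: since the machine shows nothing when it locks up, one thing I could run alongside training is a small script that polls nvidia-smi and forces every sample to disk, so the last readings before the freeze survive the reset. This is only a rough sketch, not something I have confirmed catches the problem. The file name gpu_freeze_log.csv, the function names and the one-second interval are placeholders I made up; the nvidia-smi query fields are the standard ones.

```python
import os
import subprocess
import time

# Fields requested from nvidia-smi for every GPU on each poll.
QUERY = "timestamp,index,temperature.gpu,power.draw,utilization.gpu,memory.used"


def query_gpus():
    """Return one CSV line per GPU, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6 too
        timeout=10,
    )
    return result.stdout.strip().splitlines()


def log_samples(path, sample_fn=query_gpus, count=None, interval=1.0):
    """Append samples to `path`, syncing to disk after every poll.

    count=None polls forever (until the freeze or Ctrl+C).
    Returns the number of polls written.
    """
    written = 0
    with open(path, "a") as f:
        while count is None or written < count:
            for line in sample_fn():
                f.write(line + "\n")
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to write it to the disk
            written += 1
            if count is None or written < count:
                time.sleep(interval)
    return written


if __name__ == "__main__":
    # Hypothetical output file; run this in a second terminal during training.
    log_samples("gpu_freeze_log.csv")
```

After a freeze and reset, the tail of that file would show temperature, power draw and utilization from the last second or so before the lockup, which might help tell a power or thermal spike apart from something else.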
Update: I still got a freeze with a single worker, so it’s unlikely to be a multithreading issue.