Hoping someone can give me some guidance on troubleshooting a new AI/DL EPYC 7402P server with dual NVIDIA Tesla A100 40GB GPUs. This is my first EPYC server (moving from Xeon E5-2680 v3s… I know, very old, but they work well as GPU servers since all of the load is on the GPUs), and I had a ton of problems with the original Threadripper 2920X back on Ubuntu 16.04 when it launched…
It had been running for a few weeks with no issues, but this morning while SSH'd in I hit this error. I searched online for solutions but found nothing that pinpoints the issue. The BMC shows no health event logs or component problems. From what I can see in the logs, the problem happens randomly, and only while starting a new Python job via a Docker container… maybe RAM and swap space are filling up?
This is a research server for a university and I need it to stay stable; it is already booked for the next few months, and I need to finish my own research work.
No passthrough or VMs, everything is running on bare metal… I create a new user with elevated privileges and assign an XFS pquota usage limit for the user's Docker container data.
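For anyone curious about the per-user quota setup, here is a rough sketch of the commands involved. This is an assumption-heavy illustration, not my exact setup: the mount point `/data`, the user name, the project ID, and the `500g` limit are all placeholders, and the filesystem must already be mounted with the `pquota` option for any of it to take effect.

```python
# Sketch: build the XFS project-quota commands for one user's Docker data
# directory. All names/paths/limits below are hypothetical examples.
def pquota_cmds(user: str, proj_id: int, path: str, limit: str) -> list[str]:
    return [
        # register the directory as an XFS "project"
        f"echo '{proj_id}:{path}' >> /etc/projects",
        # give the project a human-readable name
        f"echo '{user}:{proj_id}' >> /etc/projid",
        # initialize the project on the mounted filesystem
        f"xfs_quota -x -c 'project -s {user}' /data",
        # apply a hard block limit to the project
        f"xfs_quota -x -c 'limit -p bhard={limit} {user}' /data",
    ]

for cmd in pquota_cmds("alice", 42, "/data/docker/alice", "500g"):
    print(cmd)
```

The nice part is that the quota follows the directory tree, so everything the user's containers write under that path counts against the limit.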
They sound similar, and they point to a mismatch between Python/CUDA/PyTorch, and/or a breaking change in the socket libraries used for inter-GPU communication, which might be fixed by setting an environment variable?
Setting NCCL_P2P_LEVEL to NVL made no change; I still get the watchdog: BUG: soft lockup - CPU stuck error.
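For others trying the same thing: these are the real NCCL environment variable names, but whether any of them help depends on the actual fault. They have to be set before NCCL is initialized (i.e., before the first `torch.distributed` collective runs), which is why I set them at the top of the job script rather than after import:

```python
import os

# Diagnostic settings -- suggestions to narrow the problem down,
# not a confirmed fix.
os.environ["NCCL_DEBUG"] = "INFO"       # verbose NCCL logging to stdout
os.environ["NCCL_P2P_LEVEL"] = "NVL"    # allow P2P only over NVLink
# os.environ["NCCL_P2P_DISABLE"] = "1"  # heavier hammer: rule out P2P entirely
```

With NCCL_DEBUG=INFO the ring/transport setup gets logged at startup, so you can at least confirm which path (P2P, shared memory, or sockets) NCCL actually picked before the lockup.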
However, after replicating the issue several times, uninstalling the NVIDIA drivers and CUDA Toolkit, then reinstalling the latest full CUDA Toolkit, I got a memory error after a reboot. I replaced the DIMM with a new one and it has been running smoothly for the last day without issues.
Strange, because all of the DIMMs passed a 24-hour memtest with zero errors just a week before…
Posting the solution for future users with similar problems:
After changing the RAM I still experienced problems; swapping the CPU and motherboard made no change. A different server with an EPYC 7742 and 6x RTX 4090s then started crashing with the same watchdog: BUG: soft lockup error.
AMD EPYC systems have a known, serious compatibility issue with the core NVIDIA NCCL library that distributed PyTorch (and other frameworks) rely on.
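From what I have read, the mitigations people report for this class of EPYC problem involve the IOMMU and PCIe ACS settings (e.g. booting with iommu=pt, or disabling P2P in NCCL) -- I can't confirm which one applies to every setup, so treat this as a starting point. A quick way to see what your kernel was booted with is to scan the kernel command line for IOMMU-related flags:

```python
import pathlib

def iommu_flags(cmdline: str) -> list[str]:
    """Return the IOMMU-related tokens from a kernel command line string."""
    return [tok for tok in cmdline.split()
            if tok.startswith(("iommu=", "amd_iommu="))]

p = pathlib.Path("/proc/cmdline")
if p.exists():
    # Prints e.g. ['iommu=pt', 'amd_iommu=on'], or [] if nothing is set
    print(iommu_flags(p.read_text()))
```

If the list comes back empty, the firmware/kernel defaults are in effect, and it is worth checking the BIOS IOMMU setting as well before concluding anything.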