Server Crashing: Epyc 7402P with Nvidia Tesla A100s Ubuntu 20.04

Seeing if someone could give me some guidance on troubleshooting a new AI/DL Epyc 7402P server with dual Nvidia Tesla A100 40GB GPUs. This is my first Epyc server (moving from Xeon E5-2680 v3s… I know, very old, but they work very well as GPU servers since all of the load is on the GPUs), and I had a ton of problems with the original Threadripper 2920X back on Ubuntu 16.04 when it launched…

Motherboard: Supermicro H12SSL
RAM: 256GB ECC 2133 Samsung
CPU: Epyc 7402P
GPU: Dual A100 40 GB
Storage: 1TB PCIe 3.0 NVMe (OS), 4TB PCIe 4.0 NVMe (datasets)

It has been running for a few weeks now with no issues, but this morning while SSHed in I got this error. I searched for solutions online but found nothing that pinpoints the issue. The BMC shows no Health Event Logs or component problems. From what I can see in the logs, the problem happens randomly and only while starting a new Python job via a Docker container… maybe a problem with RAM and swap space filling up?

This is a research server for a university and I need it to stay stable; it is already booked for the next few months, and I need to finish my own research work.

Any help would be greatly appreciated.

Thanks,

Jake

Is the Python Docker container accessing the GPU via VFIO?

No passthrough or VMs, everything is running on bare metal… I create a new user with elevated privileges and assign a pquota usage limit for the user's Docker container data with XFS.
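For reference, the quota setup is roughly the sketch below, done from a small Python helper run as root. The mount point, per-user directory, project name/ID, and the 500g limit are all placeholders for my actual layout, and the filesystem has to be mounted with the pquota option for any of it to take effect:

```python
#!/usr/bin/env python3
"""Rough sketch of assigning an XFS project quota to a user's Docker data
directory. Paths, project name/ID, and the 500g limit are placeholders."""
import subprocess

MOUNT = "/data"                       # XFS filesystem mounted with -o pquota
USER_DIR = "/data/docker/jsmith"      # per-user Docker data directory (placeholder)
PROJ_NAME, PROJ_ID = "jsmith_docker", "42"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register the project in /etc/projects and /etc/projid
with open("/etc/projects", "a") as f:
    f.write(f"{PROJ_ID}:{USER_DIR}\n")
with open("/etc/projid", "a") as f:
    f.write(f"{PROJ_NAME}:{PROJ_ID}\n")

# Initialise the project on the directory, then cap it at 500 GiB
run(["xfs_quota", "-x", "-c", f"project -s {PROJ_NAME}", MOUNT])
run(["xfs_quota", "-x", "-c", f"limit -p bhard=500g {PROJ_NAME}", MOUNT])
```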

Look at this:

and this:

They sound similar and point to a mismatch between Python/CUDA/PyTorch versions, and/or a breaking change in the socket libraries used for inter-GPU communication, which may be fixable by setting an environment variable?
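If it is the NCCL path, something like the sketch below before the process group is created might narrow it down. The variables and values here are guesses on my part, but NCCL_DEBUG=INFO at least gives you logs to compare between runs:

```python
import os

# Guesses, not a known fix: make NCCL verbose and constrain peer-to-peer
# transport so you can see whether the lockup follows P2P traffic.
os.environ["NCCL_DEBUG"] = "INFO"       # print NCCL topology / transport decisions
os.environ["NCCL_P2P_LEVEL"] = "NVL"    # restrict GPU P2P to NVLink-connected pairs
# os.environ["NCCL_P2P_DISABLE"] = "1"  # or disable GPU P2P entirely as a test

import torch
import torch.distributed as dist

# Minimal init just to see what NCCL reports; launch with:
#   torchrun --nproc_per_node=2 this_script.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                      # forces inter-GPU communication
print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")
dist.destroy_process_group()
```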

Setting NCCL_P2P_LEVEL to NVL made no change; I still get the watchdog: BUG: soft lockup - CPU stuck error.

However, after replicating the issue several times, uninstalling the Nvidia drivers and CUDA toolkit, then reinstalling the latest full CUDA toolkit, I got a memory error after a reboot. I replaced the DIMM with a new one and it has been running smoothly for the last day without issues.

Strange, because all of the DIMMs passed a 24-hour memtest with zero errors just a week before…

Posting the solution for future users with similar problems:

After changing the RAM I still experienced problems; swapping the CPU and motherboard made no change. Then a different server with an Epyc 7742 and 6x RTX 4090s started crashing with the same watchdog: BUG: soft lockup problem.

AMD EPYC systems have known, serious compatibility issues with the core Nvidia NCCL library that distributed PyTorch (and other frameworks) rely on.

PyTorch thread:

Nvidia Info:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
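The Nvidia page above essentially says to check whether ACS is enabled on the PCI bridges (lspci -vvv | grep ACSCtl, and look for SrcValid+). A rough Python version of that check is below; it needs to run as root or lspci usually can't read the capability, and parsing lspci output like this is a heuristic, not an official tool:

```python
#!/usr/bin/env python3
"""Rough check for PCI ACS being enabled, per the NCCL troubleshooting page
linked above. Run as root; ACSCtl lines with SrcValid+ suggest ACS is on."""
import subprocess

out = subprocess.run(
    ["lspci", "-vvv"], capture_output=True, text=True, check=True
).stdout

device = None
enabled = []
for line in out.splitlines():
    if line and not line.startswith(("\t", " ")):
        device = line                      # device header, e.g. "40:01.1 PCI bridge: ..."
    elif "ACSCtl" in line and "SrcValid+" in line:
        enabled.append((device, line.strip()))

if enabled:
    print("ACS appears ENABLED on these devices (NCCL docs recommend disabling it):")
    for dev, ctl in enabled:
        print(" ", dev)
        print("   ", ctl)
else:
    print("No ACSCtl entries with SrcValid+ found -- ACS looks disabled.")
```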

Solved the issue by disabling IOMMU and SVM Mode in the motherboard BIOS.
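And a quick way to confirm after the BIOS change that the kernel is really running without the IOMMU (as far as I can tell, with it disabled the kernel creates no IOMMU groups under sysfs):

```python
#!/usr/bin/env python3
"""Sanity check that the IOMMU is actually off after changing the BIOS setting:
with the IOMMU disabled there should be no entries under /sys/kernel/iommu_groups."""
from pathlib import Path

groups_dir = Path("/sys/kernel/iommu_groups")
groups = sorted(p.name for p in groups_dir.iterdir()) if groups_dir.is_dir() else []

if groups:
    print(f"IOMMU still active: {len(groups)} IOMMU groups present")
else:
    print("No IOMMU groups found -- looks like the IOMMU is disabled")

# The boot log is another place to confirm:  dmesg | grep -i -e AMD-Vi -e iommu
```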
