Dual RTX PRO 6000 Blackwell Max-Q | How to make P2P NCCL work?

Hello everyone!

I’m trying to debug a strange NCCL P2P infinite hang on my new WRX90 workstation with 2× RTX PRO 6000 Blackwell Max-Q. Maybe I am doing something completely wrong?

PyTorch reports CUDA P2P as available via torch.cuda.can_device_access_peer() in both directions, and the NCCL test with P2P disabled also works fine, but NCCL with P2P enabled consistently hangs on the first collective, even after disabling ACS in the BIOS.

I’d really appreciate any insight from people who have more experience with NCCL, or maybe someone has already overcome this issue.


Hardware

  • CPU: AMD Ryzen Threadripper PRO 9965WX
  • Motherboard: ASUS WRX90E-SAGE SE, BIOS with ACS disabled in setup
  • GPUs: 2× NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    • GPU0: PCI bus 0000:c1:00.0 | VBIOS 98.02.6A.00.03
    • GPU1: PCI bus 0000:f1:00.0 | VBIOS 98.02.6A.00.03

The GPUs run in PCIe Gen 5 x16 mode. Resizable BAR and IOMMU are enabled. I also tried different PCIe slot combinations, but nothing worked for me.

GPU Topology via nvidia-smi topo -m:

        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-47           0               N/A
GPU1    NODE     X      0-47           0               N/A

Software

  • OS: Ubuntu 24.04
  • Kernel: 6.14.0-29-generic
  • NVIDIA driver: 580.95.05
  • CUDA version: 13.0 (as reported by nvidia-smi)
  • nvcc reports:
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

NCCL built from current GitHub master:

=== NCCL packages ===
ii  libnccl-dev                                   2.28.9-1+cuda13.0                        amd64        NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                      2.28.9-1+cuda13.0                        amd64        NVIDIA Collective Communication Library (NCCL) Runtime

libcuda / libnccl symlinks:

=== libcuda / libnccl symlinks ===
lrwxrwxrwx 1 root root       12 Oct  1 12:17 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Oct  1 12:17 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.580.95.05
-rw-r--r-- 1 root root 96276264 Oct  1 12:17 /usr/lib/x86_64-linux-gnu/libcuda.so.580.95.05
lrwxrwxrwx 1 root root        12 Dec 12 00:28 /usr/lib/x86_64-linux-gnu/libnccl.so -> libnccl.so.2
lrwxrwxrwx 1 root root        17 Dec 12 00:28 /usr/lib/x86_64-linux-gnu/libnccl.so.2 -> libnccl.so.2.28.9
-rw-r--r-- 1 root root 189918680 Dec 12 00:28 /usr/lib/x86_64-linux-gnu/libnccl.so.2.28.9

PyTorch: 2.9.1+cu130 (official cu130 wheels)

nccl-tests: current GitHub (nccl-tests version 2.17.6).


What works:

  1. Simple PyTorch P2P test between GPU0 and GPU1:
import torch

print("Torch:", torch.__version__)
print("torch.version.cuda:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {p.name}, {p.total_memory/1024**3:.1f} GB, CC {p.major}.{p.minor}")

print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))
print("P2P 1->0:", torch.cuda.can_device_access_peer(1, 0))

returns correct information:

=== CUDA P2P test via PyTorch ===
Torch: 2.9.1+cu130
torch.version.cuda: 13.0
CUDA available: True
Device count: 2
GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, 95.0 GB, CC 12.0
GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, 95.0 GB, CC 12.0
P2P 0->1: True
P2P 1->0: True
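
Note that can_device_access_peer() only queries the capability; it does not actually move any data. A minimal sketch of an actual cross-device copy test (the helper name check_p2p_copy is my own; it returns None when PyTorch or a second GPU is unavailable):

```python
import importlib.util

def check_p2p_copy(nbytes=1 << 20):
    """Try an actual cross-device tensor copy, not just a capability query.

    Returns True/False for match/mismatch, or None when torch or a second
    GPU is unavailable. Helper name is illustrative only.
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch not installed
    import torch
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return None  # need two visible GPUs

    src = torch.arange(nbytes // 4, dtype=torch.float32, device="cuda:0")
    dst = torch.empty_like(src, device="cuda:1")
    dst.copy_(src)                 # goes over P2P when the driver allows it
    torch.cuda.synchronize(0)
    torch.cuda.synchronize(1)
    # A hang or mismatch here points at the P2P path itself
    return bool(torch.equal(src, dst.to("cuda:0")))

print("P2P copy check:", check_p2p_copy())
```

If this plain copy already hangs, the problem is below NCCL, in the driver/IOMMU path.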

  2. NCCL single-GPU all_reduce (g=1), with P2P either enabled or disabled:
Executing ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 1 works without issues:

# nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 67108864 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  10657 on  wlad-hedt device  0 [0000:c1:00] NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
NCCL version 2.28.9+cuda13.0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
           8             2     float     sum      -1     2.62    0.00    0.00       0     0.09    0.08    0.00       0
          16             4     float     sum      -1     2.64    0.01    0.00       0     0.10    0.16    0.00       0
          32             8     float     sum      -1     2.58    0.01    0.00       0     0.10    0.33    0.00       0
          64            16     float     sum      -1     2.58    0.02    0.00       0     0.10    0.65    0.00       0
         128            32     float     sum      -1     2.58    0.05    0.00       0     0.10    1.33    0.00       0
         256            64     float     sum      -1     2.60    0.10    0.00       0     0.10    2.60    0.00       0
         512           128     float     sum      -1     2.56    0.20    0.00       0     0.09    5.50    0.00       0
        1024           256     float     sum      -1     2.63    0.39    0.00       0     0.10   10.59    0.00       0
        2048           512     float     sum      -1     2.68    0.76    0.00       0     0.09   22.24    0.00       0
        4096          1024     float     sum      -1     2.62    1.56    0.00       0     0.09   43.74    0.00       0
        8192          2048     float     sum      -1     2.59    3.16    0.00       0     0.09   87.99    0.00       0
       16384          4096     float     sum      -1     2.62    6.25    0.00       0     0.06  272.61    0.00       0
       32768          8192     float     sum      -1     2.57   12.74    0.00       0     0.06  532.38    0.00       0
       65536         16384     float     sum      -1     2.62   25.00    0.00       0     0.06  1090.4    0.00       0
      131072         32768     float     sum      -1     2.82   46.49    0.00       0     0.09  1392.2    0.00       0
      262144         65536     float     sum      -1     2.87   91.43    0.00       0     0.14  1817.9    0.00       0
      524288        131072     float     sum      -1     3.03  173.32    0.00       0     0.09  5628.4    0.00       0
     1048576        262144     float     sum      -1     3.06  342.21    0.00       0     0.10   10963    0.00       0
     2097152        524288     float     sum      -1     3.30  636.28    0.00       0     0.10   20520    0.00       0
     4194304       1048576     float     sum      -1     4.79  875.97    0.00       0     0.10   41880    0.00       0
     8388608       2097152     float     sum      -1     6.78  1238.0    0.00       0     0.10   84605    0.00       0
    16777216       4194304     float     sum      -1     9.69  1731.7    0.00       0     0.09  181082    0.00       0
    33554432       8388608     float     sum      -1    16.76  2002.5    0.00       0     0.17  198254    0.00       0
    67108864      16777216     float     sum      -1    31.30  2144.2    0.00       0     0.06   1e+06    0.00       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
# Collective test concluded: all_reduce_perf
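
For reference, the algbw column in nccl-tests output is just bytes moved per unit time, which can be sanity-checked in a couple of lines of Python:

```python
def algbw_gbs(size_bytes, time_us):
    # nccl-tests reports algorithm bandwidth as bytes per unit time:
    # algbw (GB/s) = size (B) / time (us) / 1e3
    return size_bytes / time_us / 1e3

# Last row of the g=1 run above: 64 MiB in 31.30 us
print(round(algbw_gbs(67108864, 31.30), 1))  # ~2144, matching the table
```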

  3. NCCL multi-GPU all_reduce with P2P disabled (g=2, NCCL_P2P_DISABLE=1):
Works as expected:

# nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 2 minBytes 8 maxBytes 67108864 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  10695 on  wlad-hedt device  0 [0000:c1:00] NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
#  Rank  1 Group  0 Pid  10695 on  wlad-hedt device  1 [0000:f1:00] NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
NCCL version 2.28.9+cuda13.0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
           8             2     float     sum      -1     7.32    0.00    0.00       0     7.21    0.00    0.00       0
          16             4     float     sum      -1    15.18    0.00    0.00       0    10.83    0.00    0.00       0
          32             8     float     sum      -1    17.41    0.00    0.00       0    12.91    0.00    0.00       0
          64            16     float     sum      -1    10.57    0.01    0.01       0    16.86    0.00    0.00       0
         128            32     float     sum      -1    17.19    0.01    0.01       0     7.79    0.02    0.02       0
         256            64     float     sum      -1     7.42    0.03    0.03       0     8.39    0.03    0.03       0
         512           128     float     sum      -1     8.15    0.06    0.06       0     8.43    0.06    0.06       0
        1024           256     float     sum      -1     8.41    0.12    0.12       0     8.38    0.12    0.12       0
        2048           512     float     sum      -1     8.35    0.25    0.25       0     8.42    0.24    0.24       0
        4096          1024     float     sum      -1     8.45    0.48    0.48       0     8.36    0.49    0.49       0
        8192          2048     float     sum      -1     8.66    0.95    0.95       0     8.65    0.95    0.95       0
       16384          4096     float     sum      -1     8.67    1.89    1.89       0     8.66    1.89    1.89       0
       32768          8192     float     sum      -1     9.75    3.36    3.36       0    18.54    1.77    1.77       0
       65536         16384     float     sum      -1    12.25    5.35    5.35       0    11.89    5.51    5.51       0
      131072         32768     float     sum      -1    16.97    7.72    7.72       0    23.75    5.52    5.52       0
      262144         65536     float     sum      -1    26.98    9.72    9.72       0    26.82    9.77    9.77       0
      524288        131072     float     sum      -1    47.34   11.07   11.07       0    46.57   11.26   11.26       0
     1048576        262144     float     sum      -1    59.87   17.51   17.51       0    59.81   17.53   17.53       0
     2097152        524288     float     sum      -1    91.29   22.97   22.97       0    91.83   22.84   22.84       0
     4194304       1048576     float     sum      -1   223.24   18.79   18.79       0   157.01   26.71   26.71       0
     8388608       2097152     float     sum      -1   414.08   20.26   20.26       0   283.28   29.61   29.61       0
    16777216       4194304     float     sum      -1   725.69   23.12   23.12       0   546.07   30.72   30.72       0
    33554432       8388608     float     sum      -1  1249.43   26.86   26.86       0  1072.41   31.29   31.29       0
    67108864      16777216     float     sum      -1  2166.72   30.97   30.97       0  2122.39   31.62   31.62       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.94738 
#
# Collective test concluded: all_reduce_perf
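
As an aside on reading these tables: per the nccl-tests performance notes, bus bandwidth for all_reduce is algbw scaled by 2(n-1)/n, so with two ranks the two columns are identical:

```python
def allreduce_busbw(algbw_gbs, n_ranks):
    # From the nccl-tests performance docs: for all_reduce,
    # busbw = algbw * 2 * (n - 1) / n
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks

# With n = 2 the factor is exactly 1, which is why the algbw and busbw
# columns are equal in the g=2 table above.
print(allreduce_busbw(30.97, 2))  # 30.97
```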

What is broken:

NCCL multi-GPU all_reduce with P2P ENABLED (g=2, NCCL_P2P_DISABLE=0)
Using:

export NCCL_SOCKET_IFNAME=lo       # also tried default, same result
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=INFO
./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2

This leads to an infinite hang with both GPUs at 100% load.

The NCCL_DEBUG=INFO logs show that it always hangs immediately after:

  • P2P is detected as fully direct (isAllDirectP2p=1, isAllCudaP2p=1),
  • channels are set up via P2P/direct pointer,
  • and after Connected all rings, use ring PXN 0 GDR 1.

Same behavior with:

  • ACS enabled vs disabled in BIOS,
  • VPN on/off.
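
One way to narrow this down further (a sketch, not something I have verified on this exact box): NCCL_P2P_LEVEL is a finer-grained knob than NCCL_P2P_DISABLE, so sweeping it shows at which topology distance the hang appears. Running all_reduce_perf with a timeout also turns the infinite hang into a finite failure:

```python
import os

# Sweep NCCL_P2P_LEVEL: LOC effectively disables P2P, SYS allows it
# across the CPU. These are the documented NCCL_P2P_LEVEL values.
P2P_LEVELS = ["LOC", "PIX", "PXB", "PHB", "SYS"]

def make_env(level):
    env = dict(os.environ)
    env.update({
        "NCCL_P2P_LEVEL": level,
        "NCCL_DEBUG": "INFO",
        "NCCL_DEBUG_SUBSYS": "INIT,P2P",  # focus the log on P2P setup
    })
    return env

cmd = ["./build/all_reduce_perf", "-b", "8", "-e", "64M", "-f", "2", "-g", "2"]
for level in P2P_LEVELS:
    env = make_env(level)
    print(level, "->", env["NCCL_P2P_LEVEL"])
    # subprocess.run(cmd, env=env, timeout=120)  # uncomment on the test box
```

If the hang only appears at PHB/SYS, that points at traffic crossing the host bridge (i.e. the IOMMU path) rather than at NCCL itself.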

Questions

  1. Has anyone seen a similar NCCL P2P hang on WRX90 / Threadripper PRO + RTX 6000 Blackwell (or other Blackwell Pro GPUs)?

  2. Is this a known issue (or expected limitation) with NCCL 2.28.x + CUDA 13.0 + driver 580.95.05 on Blackwell?

  3. Am I missing something obvious? :smiling_face_with_tear:

I would be very grateful for any recommendation, or for a tools/driver version combination that makes NCCL P2P work on your RTX PRO Blackwell GPUs :pray:

What is your grub boot line?

I am currently running this:

GRUB_CMDLINE_LINUX_DEFAULT="nosplash text pci=realloc amd_iommu=on iommu=pt"

with ACS OFF in the BIOS. It took me a good number of attempts to get my P2P stuff working, and there are bugs with our RTX PRO 6000 cards in certain LLM-related libraries as well.
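
For anyone comparing configs: a quick sketch to tell which IOMMU setup a box is actually running (the classification logic is my own; /sys/class/iommu is typically empty when the BIOS disables the IOMMU entirely):

```python
from pathlib import Path

def iommu_mode(cmdline):
    """Classify the IOMMU setup from a kernel command line string."""
    opts = cmdline.split()
    if "amd_iommu=off" in opts:
        return "off (cmdline)"
    if "iommu=pt" in opts:
        return "passthrough"
    return "default"

# Live check (Linux only)
cmdline_path = Path("/proc/cmdline")
if cmdline_path.exists():
    print("mode:", iommu_mode(cmdline_path.read_text()))
    iommu_dir = Path("/sys/class/iommu")
    print("iommu groups present:",
          any(iommu_dir.iterdir()) if iommu_dir.exists() else False)
```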

CUDA and driver version: (screenshot of nvidia-smi output)


Well, after nearly a week of debugging, I finally found the solution! Disabling IOMMU (ACS is currently off) solved the problem, and now P2P works as it should.

Currently have the super default one:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

It’s interesting that we reached working setups via different approaches: I had to disable IOMMU completely in BIOS, while your configuration works with IOMMU enabled in passthrough mode.

Do you have ACS off rn?

ACS is the only thing in the BIOS I flat-out disabled; my IOMMU is on and working, but in PT mode via GRUB.

Also, during my training run last weekend I was hovering around 72°C and needed to take drastic measures:


insert “something something onlyfans joke” here