WIP: Blackwell RTX 6000 Pro / Max-Q Quickie Setup Guide on Ubuntu 24.04 LTS / 25.04

EDIT: Solved! I put the card in a DL360 Gen10 (Xeon Scalable Gen2), installed the driver, and it just worked. So it comes down to the PCIe root ports and related platform support: you need a capable system. I could not get it to work on a Dell 7920 (also Gen2 Scalable Intel), but those workstations are a bit limited on the PCIe side; it also did not work on an Asus ROG Maximus Hero Z790 or a DL580 Gen9.


On an MSI Z690 board, it wouldn’t boot without disabling ReBAR first.


I had a similar issue, in fact. An ITX board would not boot with a 32 GB GPU (and only the 32 GB model) because it could not map enough address space. There were no BIOS settings to change that, and ReBAR was enabled; only dmesg reported mapping errors for the GPU.

The board works fine with lower-VRAM cards, and the GPU works fine on other boards with more expensive chipsets.


Thank you.

Wanted to circle back on a few things discussed in this thread.

Thanks to @Phaedawg (and @Panchovix for coming up with the numbers in the first place), I tried LACT for training on my 16 RTX 6000 Pros. With a clock offset of 1000 and the peak frequency still limited, my training code still maxes out on power (and occasionally thermals), so the headroom converts into performance instead: roughly 15% better at 600 W.

Inference is much less demanding, so there the gains more readily convert into power savings. However, I can power-limit during training to ~450 W and get back to where I was before LACT at 600 W. That's a pretty decent compromise, and much easier to cool as well.

LACT also boosts the power-limited mamf-finder numbers posted earlier in this thread. I take back what I said about the Max-Q edition: you can get very close with the WS edition plus LACT, and keep headroom to max out performance when needed.
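If you'd rather not run LACT just for the power cap, the stock NVIDIA tooling can set one too. A minimal sketch (the 450 W figure is just the value mentioned above; `-pl` requires root, and the value must fall within the card's enforced min/max limits):

```shell
# Show current draw plus the min/max/default enforceable limits for GPU 0
nvidia-smi -i 0 -q -d POWER

# Cap GPU 0 at 450 W (not persistent across reboots unless re-applied)
sudo nvidia-smi -i 0 -pl 450
```

Note this only sets the power limit; the clock-offset part of the LACT tuning has no nvidia-smi equivalent here.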

Also, there was some discussion of PCIe gear for driving several of these GPUs. I moved from the Gigabyte MZ33-AR1 motherboard to the ASRock Rack GENOAD24QM32. The AR1 had poor PCIe signal quality, worse than its predecessor the AR0 (though it could be my specific unit). The GENOAD24QM32 drives everything at Gen5 even at 75 cm passively through any of its PCIe outputs (MCIO, slot, OCP), while the Gigabyte is barely stable at 50 cm. The GENOAD24QM32 is also the only board I know of that gives you full access to all 128 Gen5 PCIe lanes. That means you can technically run 8 GPUs at x16 with no retimers, redrivers, or switches, or 16 GPUs with x8/x8 bifurcation (though then you'd have no lanes left for NVMe).

EDIT: Forgot one thing. We also discussed getting NCCL working with 16 GPUs on one machine (really anything above 8 GPUs). I realized you can get it to work:

export NCCL_CUMEM_ENABLE=0
export NCCL_WIN_ENABLE=0
export NCCL_P2P_LEVEL=SYS

It’s a bit slower than doing my own P2P copies in C++/CUDA, but it’s obviously a lot cleaner to work with. Some inference engines that crashed with more than 8 GPUs might work again with these settings. It’s doing real P2P, not a host bounce. For the RTX 6000 Pro that just works; for a 4090/5090 you need the modded driver.
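If you’re launching from Python rather than exporting in a shell, the same workaround can be applied in-process, as long as it happens before NCCL initializes. A sketch (the three variable names are the ones above; everything else is illustrative):

```python
import os

# These must be set before NCCL is initialized, i.e. before
# torch.distributed.init_process_group(backend="nccl") runs.
os.environ["NCCL_CUMEM_ENABLE"] = "0"   # disable cuMem-based buffer allocation
os.environ["NCCL_WIN_ENABLE"] = "0"     # disable NCCL symmetric-memory windows
os.environ["NCCL_P2P_LEVEL"] = "SYS"    # allow P2P across the whole system topology

# ... then initialize as usual, e.g.:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```

Setting them in the parent shell works just as well; the only requirement is that they are in the environment before the first NCCL communicator is created.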


Hi all, can anybody help with this problem? Many thanks!!!

This is what happened when I ran the code on 2 x Pro6000 GPUs.

torch.nn.parallel.comm.gather doesn’t produce the expected result when the output device is a CUDA device.
This in turn causes nn.DataParallel to produce incorrect results.

import torch
from torch.nn.parallel import comm

tensor1 = torch.rand(2, device=torch.device("cuda:0"))
tensor2 = torch.rand(2, device=torch.device("cuda:1"))
print(f"tensor1: {tensor1}, tensor2: {tensor2}")
print(f'combined tensor - Cuda: {comm.gather([tensor1, tensor2], destination=torch.device("cuda:0"))}')
print(f'combined tensor - CPU: {comm.gather([tensor1, tensor2], destination=torch.device("cpu"))}')

Expected result:

tensor1: tensor([0.6706, 0.7866], device='cuda:0'), tensor2: tensor([0.0953, 0.0895], device='cuda:1')
combined tensor - Cuda: tensor([0.6706, 0.7866, 0.0953, 0.0895], device='cuda:0')
combined tensor - CPU: tensor([0.6706, 0.7866, 0.0953, 0.0895])

Observed result:

tensor1: tensor([0.6706, 0.7866], device='cuda:0'), tensor2: tensor([0.0953, 0.0895], device='cuda:1')
combined tensor - Cuda: tensor([0.6706, 0.7866, 0.0000, 0.0000], device='cuda:0')
combined tensor - CPU: tensor([0.6706, 0.7866, 0.0953, 0.0895])

Works for me. Does p2pBandwidthLatencyTest show P2P working (though that doesn’t verify data integrity)?
If you didn’t post it there already, it might be this issue: torch.nn.parallel.comm.gather produces incorrect results on cuda devices · Issue #166360 · pytorch/pytorch · GitHub

Otherwise, does this code work? (It is essentially what comm.gather does under the hood anyway.)

import torch

tensor1 = torch.rand(2, device=torch.device("cuda:0"))
tensor2 = torch.rand(2, device=torch.device("cuda:1"))
print(f"tensor1: {tensor1}, tensor2: {tensor2}")
destination_cuda = torch.device("cuda:0")
tensor1_on_cuda = tensor1.to(destination_cuda)
tensor2_on_cuda = tensor2.to(destination_cuda)
combined_tensor_cuda = torch.empty(tensor1_on_cuda.numel() + tensor2_on_cuda.numel(), device=destination_cuda)
combined_tensor_cuda[:tensor1_on_cuda.numel()].copy_(tensor1_on_cuda)
combined_tensor_cuda[tensor1_on_cuda.numel():].copy_(tensor2_on_cuda)
print(f'combined tensor - Cuda: {combined_tensor_cuda}')
destination_cpu = torch.device("cpu")
tensor1_on_cpu = tensor1.to(destination_cpu)
tensor2_on_cpu = tensor2.to(destination_cpu)
combined_tensor_cpu = torch.empty(tensor1_on_cpu.numel() + tensor2_on_cpu.numel(), device=destination_cpu)
combined_tensor_cpu[:tensor1_on_cpu.numel()].copy_(tensor1_on_cpu)
combined_tensor_cpu[tensor1_on_cpu.numel():].copy_(tensor2_on_cpu)
print(f'combined tensor - CPU: {combined_tensor_cpu}')

If it doesn’t, you can pinpoint exactly what fails, e.g. the CUDA peer copy.

Thank you!

The code you provided does not work as expected for me. The output is as follows; it simply duplicates tensor1:

tensor1: tensor([0.2940, 0.5223], device='cuda:0'), tensor2: tensor([0.5463, 0.5249], device='cuda:1')
combined tensor - Cuda: tensor([0.2940, 0.5223, 0.2940, 0.5223], device='cuda:0')
combined tensor - CPU: tensor([0.2940, 0.5223, 0.5463, 0.5249])

I found this is because tensor2_on_cuda = tensor2.to(destination_cuda) fails to move tensor2 to cuda:0.

It seems disabling the IOMMU fixes the issue. Thank you!
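For anyone hitting the same symptom: on an Intel box, disabling the IOMMU is typically done from the kernel command line. A sketch (check your distro’s docs; `iommu=pt` is a softer passthrough alternative that sometimes suffices without fully disabling it):

```shell
# /etc/default/grub - append to the existing GRUB_CMDLINE_LINUX_DEFAULT line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

# then regenerate the GRUB config and reboot:
# sudo update-grub
```

Keep in mind this disables DMA isolation system-wide, which matters if you also rely on VFIO/passthrough on the same machine.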

Ah, I should have thought of that! Disable ACS too while you’re at it.

I just got a Blackwell 6000 this afternoon, and finally got it working with CUDA 13.0 and the latest 580 driver.
I’m on a consumer motherboard with a 14900KF and 192 GB RAM. I had to add “pcie_aspm=off intel_iommu=on iommu=pt” to the kernel command line and turn off HMM via “options nvidia_uvm uvm_disable_hmm=1” in /etc/modprobe.d.

I don’t know which specific combination is required, just that this is the state it’s in: PyTorch works in a Docker container and CUDA works on the host.

Hey Wendell, love your setup! can you share what hardware you are using? Any issues with power or cooling and general reliability?

This was a SilverStone tower case and a SilverStone AIO; those are great. This build was ephemeral for this project, though, FWIW.

Anyone else have vBIOS 98.02.52.00.03 on the Workstation edition? I’ve tried RTXPro6000WSv9802810007, but I get a message that the device is unsupported for update.


Can someone please share the magic runes for getting past the Linux installer (tried latest Ubuntu / Proxmox 9.1 / Proxmox 8.0) with just a Blackwell GPU in the system? I’ve just got an RTX Pro 4500 Blackwell and I can’t get beyond GRUB without it locking up. I’ve done a bit of googling and tried nomodeset and the other usual switches, but still no joy. I even swapped GPUs to do the install, which worked, but I get the same hang after the ‘loading initial ramdisk’ message when I swap the 4500 Blackwell back in :frowning:

Is everyone else running a separate display-GPU / IPMI for the installation / console output?

A friendly man on YouTube just told me I should join this forum during “those tumultuous times” while he was handling DDR5 ECC memory that doubled in price while he was talking …

So … I guess I am here now. Hello.


Update the BIOS, make sure ReBAR is enabled, and disable CSM if it’s enabled.

So my guess here is that Ubuntu, like other distributions, tries to load the open-source nouveau driver by default. I never had good success with it; it had too many rough edges to be useful. Normally I install the system by other means, then blacklist the nouveau driver and install the open-source driver from NVIDIA, following the instructions on their website.

Here is an example for RHEL-based distributions that I followed in the past. You’d need to do something similar for Ubuntu-based distributions: make sure nouveau is not used, then make sure the NVIDIA driver is installed.

The instructions to install the Nvidia driver can be found here by the way: Ubuntu — NVIDIA Driver Installation Guide
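On Ubuntu, the nouveau blacklist usually looks something like this (a sketch; the file name is arbitrary, only the `.conf` suffix under /etc/modprobe.d matters):

```shell
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# then rebuild the initramfs so the blacklist applies at early boot:
# sudo update-initramfs -u
```

After a reboot, `lsmod | grep nouveau` should come back empty before you run the NVIDIA installer.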

Welcome to the forums, have a look around. If you are interested feel free to search for existing topics and if you have questions that bug you, feel free to open your own thread too. There’s lots to be learned if you want.


Thanks for the suggestions, but it didn’t seem to help, unfortunately!