I’m really curious if anyone knows the actual answer to this already. Obviously the RTX Pro 6000 Blackwell card does not have a physical bridge or NVLink option. But since this is a PCIe Gen 5 card, is PCIe fast enough to support linking two cards for LLM inference through vLLM? Splitting something like a Llama 3 70B across two of them and inferencing. Has anyone confirmed whether it’s possible, or does it choke?
~64 GB/s each direction for PCIe 5.0 x16 (32 GT/s × 16 lanes, before encoding overhead). A 70B runs fine on one card even with a large context window.
Cool!
So I’m currently a Bachelor’s student in Comp Sci, and am (by a good bit of luck, I should say) heading an AI lab with a Threadripper 7995WX and an RTX A6000 (Ampere) in a Lenovo P8, with 256 GB of RAM.
Given this PC is, most of the time, used by multiple researchers in our own rapid prototyping lab, we were considering upgrading to, or adding, an RTX Pro 6000 as well, since some of our current workloads are VRAM-limited (we work on industrial collaboration projects with very short timelines).
Would love to hear your thoughts!
In my tests for inference I haven’t seen much TX/RX. For a single request with vLLM on 2x 5090s (each connected at PCIe 4.0 x16) using -tp 2, it’s in the hundreds of MB/s, around 200-300.
It peaks at 3-4 GB/s for 10 parallel requests though:
Test with this command:
VLLM_FLASH_ATTN_VERSION=2 CUDA_VISIBLE_DEVICES=2,4 python benchmarks/benchmark_latency.py \
--model /mnt/llms/models/mistralai/Mistral-Nemo-Instruct-2407 \
--tensor-parallel-size 2 \
--input-len 1024 --output-len 256 \
--batch-size 10 \
--n 1 --num-iters-warmup 0
vLLM installed from source with a custom PyTorch 2.7.0 environment to support the 5090s.
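If anyone wants to watch the bus traffic while reproducing this, something along these lines in a second terminal should do it (a sketch, assuming a reasonably recent driver; the indices follow nvidia-smi’s ordering, which may not match CUDA_VISIBLE_DEVICES):
# per-GPU PCIe RX/TX throughput in MB/s, sampled roughly once a second
nvidia-smi dmon -s t -i 2,4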
Ayo @wendell I don’t mean to beg but Nvidia support doesn’t exist for mortals - would you mind asking your contacts over there about MIG support if the opportunity arises?
Docs say it’s definitely a thing for Blackwell 6000.
It even spells out 575.51.03 as the minimum driver you need - which is what’s in the Nvidia Ubuntu apt repo right now, but that version definitely does not allow you to turn MIG on. There’s a newer .run driver (also doesn’t work), and it’s the same deal on Windows (576.something driver). Tried with 2 different 6000s on 3 different systems without luck.
Yeah thanks! That was the “newer .run driver” I was referring to, but I was too lazy to check and quote the exact version.
I did pull it from the repo just now to see if maybe they snuck in some last-minute changes before publishing it in the repo, but even if they did it’s not a fix for MIG:
fwiw, that command doesn’t work on a secondary compute card either:
# nvidia-smi -mig 1 -i 1
Unable to enable MIG Mode for GPU 00000000:E1:00.0: Not Supported
Treating as warning and moving on.
All done.
Might be like CC (Confidential Computing), which has prerequisites to enable. Might be worth digging through the open-source driver to check what it needs, if anything.
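Not a fix, but a quick sanity check on what the driver thinks the MIG state is (assuming your driver exposes these query fields; mig.mode.pending shows the requested state, mig.mode.current only changes after a GPU reset):
nvidia-smi --query-gpu=index,name,mig.mode.current,mig.mode.pending --format=csv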
I remember, Wendell, you were talking about how they’re going to squeeze more performance out with newer drivers. Just ran with 575.57.08, recently released on RPM Fusion, and I’m getting really impressive numbers. Would love to see your numbers with the (not that much) newer drivers.
Thanks for checking it on your end but what GPU is that? MIG has very narrow support even among the datacenter SKUs.
It enables just fine on this lowly A30 (I just plugged it in instead of the 6000, it’s the same box, software stack, everything):
$ sudo nvidia-smi -mig 1
Enabled MIG Mode for GPU 00000000:41:00.0
Warning: persistence mode is disabled on device 00000000:41:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
$ nvidia-smi
Sun Jun 8 18:50:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 Off | 00000000:41:00.0 Off | On |
| N/A 65C P0 45W / 165W | 0MiB / 24576MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
6000 blackwell
Ah okay, I thought you meant a datacenter card that’s compute-only. But no, just casually doubling up on the 6000s… Mind if I ask what PSU is powering that rig?
The only way I could get a quote for two cards in one box was either to wait for the Max-Q or use a 240V PSU, and neither appealed much. The only 240V outlet I have in the workshop is taken by a lab oven, though I do have 30A outlets for the computers.
Does anyone know if power-limiting the 600W card to 300W should yield essentially the same performance as the Max-Q version?
Wondering if the Max-Q might be binned differently for lower-power operation or something like that.
I’ve done this test, and the performance is similar. It’s why I chose the workstation cards: you can still step them down to 300W if you have heat/power issues, but if you want to crank them up, you can.
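If you want to try it yourself, the stock nvidia-smi power-limit knob is enough - a sketch, assuming 300 W is within the card’s allowed power-limit range (check with nvidia-smi -q -d POWER first):
# keep the driver loaded so the limit sticks between jobs
sudo nvidia-smi -pm 1
# cap GPU 0 at 300 W; resets to the default limit on reboot
sudo nvidia-smi -i 0 -pl 300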
Hi, I’m new here and saw your video on the RTX 6000 Pro. I have just ordered one for a new build with a Gigabyte TRX50. I spoke to some of the nice people at computer components yesterday, but they couldn’t really answer my question. I have another RTX waiting to be delivered, either for a colleague of mine or to add to my system. As NVLink doesn’t work anymore, what is going to be your solution for connecting the two, and do I need another hardware part to make this connection? Much appreciated. D
The cards will work together across the PCIe bus in software.
If anyone is looking for an RTX 6000 Pro Blackwell Workstation, I’m an authorized reseller and have 4 left in stock ready to ship. Cards are brand new, with warranty, unopened. Shoot me a DM, these will be the last cards I get for the rest of 2025.
Just set up 9 of them in an open-air mining rig and they’re chugging along nicely. I’m seeing some larger gains for things like LLM training (>1.8x over a similar Ada 6000 configuration). This is without using any of the new fancy features like microscaling, FP4, or even FlashAttention 3. Just CUDA 12.8 and PyTorch 2.7, no Python code rewritten. It can only get better.
Dropping to 450W loses ~15%, and dropping to 300W loses ~40%, relative to 600W. For inference the drop vs. power is less extreme, around ~20% between 300W and 600W (as is the case for most non-saturating workloads).
At 300W the dual flow-through design runs a lot cooler than my Ada blowers, despite the cards eating each other’s backwash: around ~60C (in my 22C room) vs 80C, and it’s a lot quieter too.
At 600W, the backwash eating is apparent, with the worst card in the conga line hitting 85C at max fans (two empty slots of spacing between cards, 4-5 cards in a row). That’s still acceptable. Placing baffles between them will probably help. If you’re stacking cards you probably don’t want to put these in a case (or get the Max-Q instead!).
I did not have trouble getting many cards at MSRP. No reason to go for the eBay prices.
As NVLink doesn’t work anymore, what is going to be your solution for connecting the two, and do I need another hardware part to make this connection?
No extra hardware: the (renamed) Quadro line can do PCIe P2P without going through the host if the cards are under the same PCIe root complex (you need ACS, IOMMU, etc. set up correctly), and unlike the hacked 4090 variant they have 2 copy engines, so they can actually do full-duplex PCIe. On Blackwell, that’s potentially full Gen5 x16 bandwidth in each direction.
NCCL, PyTorch’s .to(), etc. will all automatically use the peer memcpy under the hood if it’s available. On paper you don’t get a big bandwidth benefit (NVLink is much faster), but you get latency benefits (no CPU involved) and you don’t need to stage through pinned DRAM buffers (so you’re not limited by system RAM speed/channels). If you have a PCIe switch (like a PLX), the GPUs hanging off it will talk directly without any traffic going up to the host at all.
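A quick way to sanity-check that P2P is actually in play, sketched from the shell (can_device_access_peer just asks the CUDA runtime; the topology matrix shows whether the cards sit within the PCIe fabric together - PIX/PXB - or have to cross the CPU - PHB/SYS):
# does the CUDA runtime think GPU 0 can map GPU 1's memory directly?
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
# physical topology between the GPUs
nvidia-smi topo -m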
Even without this feature though, moving data over pinned buffers via the CPU is still pretty fast for a lot of applications, and you don’t really miss NVLink. I switch to data parallel & context/sequence parallel with PCIe P2P instead of tensor parallel (TP) for almost no loss in LLM training. For inference, yeah, you’ll take a slight hit if you run a TP engine, but it’s totally usable. For rendering, very little needs to move between GPUs (e.g., a Blender render), though you lose the old VRAM pooling.
Would love to see a pic!
Curious if anyone has the Max-Q edition and can comment on fan noise levels? I’ve been trying to find a sound test of sorts in various sources to see how loud it can get. This is for a local AI server for development at home. I had the RTX Pro workstation variant and it was decently loud under heavy use. But I’m looking to go with 2x Max-Q variants for a new build and wondering whether the sound is like some blower fans, which can be screechingly loud under load (e.g., a modded 48GB RTX 4080).