That would be my question as well: run the Looking Glass client on the host OS with the display attached to it, and a Windows VM with access to a MIG device and the Looking Glass server installed, then see what the performance is like with GPU acceleration on both host and guest from the same card.
It has the same limitations as the other pro-series cards NVIDIA has put out, like Ada: if you want to partition the card, you need a vGPU license.
However, they did fix the MSI bug that required a registry hack to get the Ada cards working through VFIO, so you can pass the entire card into the VM with vfio without any registry hackery. It's nice that they cleaned that up for this generation; a bit of progress for us plebs who don't want to pay for vGPU.
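For anyone trying the whole-card route, the usual trick is binding the card to vfio-pci at boot so the host driver never claims it. A minimal sketch, assuming a Debian-style system; the 10de:aaaa / 10de:bbbb IDs are placeholders for your card's actual vendor:device pairs:

# Find the GPU and its audio function (the IDs below are placeholders)
lspci -nn | grep -i nvidia
# Have vfio-pci claim those IDs before the nvidia driver loads
echo 'options vfio-pci ids=10de:aaaa,10de:bbbb' | sudo tee /etc/modprobe.d/vfio.conf
echo 'softdep nvidia pre: vfio-pci' | sudo tee -a /etc/modprobe.d/vfio.conf
# Rebuild the initramfs and reboot (update-initramfs -u here; dracut -f on Fedora)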
Even using the RTX 6000 for display output on the host and acceleration in a guest at the same time? Are you sure? It read to me like you need to pass through the entire card. Or can you pass a vGPU, or a partition or whatever you want to call it, through to the VM while keeping display output on the host?
There's two ways to do it. The free way is to have two video cards: one card uses the vfio driver and gets passed entirely into the VM, while the other card drives the actual display. Then you have two ways to view the VM: some sort of remote desktop, or something like Looking Glass, which does DMA transfers between the two cards for ultra-low latency.
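If you go the Looking Glass route, the frame transfer runs over an IVSHMEM shared-memory device you add to the VM. A minimal sketch of the libvirt XML, per the Looking Glass docs; the 64M size here is an assumption, and the correct value depends on the guest's resolution:

<shmem name='looking-glass'>
  <model type='ivshmem-plain'/>
  <size unit='M'>64</size>
</shmem>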
The other way is the single-card approach, which lets you use the same card for both the host and the VM by passing a vGPU into the virtual machine. This one costs money in licensing fees.
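For the curious, once the licensed host vGPU manager driver is installed, the VM gets a mediated device rather than the whole PCI function. A rough sketch via sysfs; the PCI address and the nvidia-xxx profile name are placeholders, and on Ampere and newer cards you first enable SR-IOV virtual functions with NVIDIA's sriov-manage script and create the mdev on a VF instead:

# List the vGPU profiles the card exposes (address and profile are placeholders)
ls /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/
# Create a vGPU instance with a fresh UUID, then hand that UUID to libvirt
uuid=$(uuidgen)
echo "$uuid" | sudo tee /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-xxx/create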
The former setup is what I have right now, so that's the one I know well.
Okay, so I get that as long as you pay for a vGPU license, you can use the RTX PRO 6000 series for both host and guest at the same time. That actually sounds good. I've found this document; do I understand correctly that one perpetual workstation license costs $450? Given the cost of the card and the number of VMs I need, I think that's bad, but not too bad.
A 680-billion-parameter model running at 45 tokens a second:
./build/bin/llama-cli \
  --model /mnt/models/llama/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
  --n-gpu-layers 62 \
  --temp 0.6 \
  -c 163840 \
  --prompt '<|User|>What is the value of pi accurate to the 1,000 decimal? You must give all 1,000 values. Feel free to use scripts, calculators, or other resources to accurately calculate the number. Show your work calculating and verify the number for accuracy.\n<|Assistant|>' \
  -no-cnv -fa \
  -sys 'This assistant is DeepSeek-R1, created by DeepSeek.\nToday is Sunday, June 1st, 2025.'
llama_perf_sampler_print: sampling time = 214.13 ms / 6351 runs ( 0.03 ms per token, 29659.55 tokens per second)
llama_perf_context_print: load time = 19414.90 ms
llama_perf_context_print: prompt eval time = 516.14 ms / 55 tokens ( 9.38 ms per token, 106.56 tokens per second)
llama_perf_context_print: eval time = 138112.28 ms / 6295 runs ( 21.94 ms per token, 45.58 tokens per second)
llama_perf_context_print: total time = 139185.50 ms / 6350 tokens