RTX Pro 6000 MIG & OctoCrysis Traveler



What Is This Beast!??!?

Here’s our Falcon Northwest Talon PC, featuring:

  • Threadripper Pro 9995WX from AMD
  • 2x RTX Pro 6000 Blackwell
  • 4x CM7 15.36TB from Kioxia
  • 20TB local M.2 storage pool (5x) featuring Kingston Fury Renegade Gen5
    … and oh so much more.

Background

Multi-Instance GPU on the RTX Pro 6000 is awesome, but it has been a journey.

This required an updated VBIOS. I flagged the issue on my retail cards in May, but it wasn’t until August that the updated VBIOS was distributed. So if you have an early card, double-check that you have an up-to-date VBIOS.
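If you’re not sure which VBIOS you’re on, nvidia-smi reports it in its full query output. A quick check (assuming the card is GPU 0):

nvidia-smi -q -i 0 | grep -i "vbios"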

There are many other threads on this forum if you want to read about the long journey up to this point.

Setting It Up

To enable MIG, you must use the displaymodeselector tool from Nvidia and set the card to compute mode. Doing so disables the DisplayPort outputs on the card; these cards are meant to be used headless or via your hypervisor.

For the Falcon setup, I simply used the onboard ASPEED VGA as the host GPU. From there, there were 8 GPU instances available to assign to virtual machines.

This makes 1x RTX Pro 6000 with 96GB show up as 4x RTX Pro 6000s with 24GB of VRAM each.

Eight 24GB VRAM GPU instances. Madness!!
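For reference, this is roughly what the flow looks like on one card. Treat it as a sketch rather than a copy-paste recipe: the exact displaymodeselector argument and the MIG profile name below are assumptions, so check what your tool and driver actually report.

# Put the card into compute mode with Nvidia's displaymodeselector tool
# (exact flag naming may vary by tool version; this is the step that disables the DP outputs)
sudo ./displaymodeselector --gpumode compute

# Enable MIG on GPU 0, then reset the GPU or reboot
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the card actually offers
sudo nvidia-smi mig -i 0 -lgip

# Create four instances plus their default compute instances
# ("1g.24gb" is assumed here; use whatever ~24GB profile -lgip reports)
sudo nvidia-smi mig -i 0 -cgi 1g.24gb,1g.24gb,1g.24gb,1g.24gb -C

# Confirm: nvidia-smi -L should now list four MIG devices under the card
nvidia-smi -L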

Note: I am still experimenting with a Windows/Hyper-V path.

Fixes in the RTX Pro 6000 VBIOS

In addition to the MIG fixes, the updated VBIOS also resolved issues with BAR space allocation and device compatibility on WRX90. The USB4/ASMedia controller was non-functional due to an I/O resource overlap, and these cards did not work properly on the Threadripper WRX90 platform until the VBIOS was updated past 2.52.
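If you suspect you’re hitting the same resource overlap on an older VBIOS, the kernel log is a reasonable place to look; failed BAR assignments usually show up as “no space for” or “failed to assign” messages. A quick, generic check (not the exact diagnostic used here):

sudo dmesg | grep -iE "BAR .*(no space|failed to assign)"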

7 Likes

Along with the build come the benchmarks. Here’s how the Threadripper Pro held up against its counterparts!

Productivity / Artificial

  • 3DMark: CPU (1, 2, 4, 8, 16, and max threads)
  • 3DMark: DirectX Raytracing
  • 3DMark: Fire Strike
  • 3DMark: Port Royal
  • 3DMark: Speed Way
  • 3DMark: Steel Nomad
  • 3DMark: Time Spy
  • AIDA64 (9970X, 9980X, 9995WX)
  • Blender: CPU (Classroom, Junkshop, Monster)
  • Cinebench R23 (single-core, multi-core)
  • Cinebench 2024 (single-core, multi-core)
  • CPU-Z (single-thread, multi-thread)
  • Geekbench 6: CPU (single-core, multi-core)
  • PugetBench: Photoshop
  • PugetBench: Premiere Pro
  • PugetBench: After Effects

Gaming

  • Assassin’s Creed Shadows (1080p, 1440p, 4K)
  • Borderlands 3 (1080p, 1440p, 4K)
  • The Callisto Protocol (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Clair Obscur: Expedition 33 (1080p, 1440p, 4K)
  • Cyberpunk 2077: Phantom Liberty (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Final Fantasy XIV: Dawntrail (1080p, 1440p, 4K)
  • Monster Hunter Wilds (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • The Elder Scrolls IV: Oblivion Remastered (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Silent Hill 2 (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Shadow of the Tomb Raider (1080p, 1440p, 4K)

3 Likes

In truth, I hesitate to comment… but at $17,780.00 (before tax), with no keyboard, mouse, or (sigh) display of any kind, I MUST ask.

Does this build make any kind of sense to anyone? A 9070 XT? You mean to say that at this price, I’m not getting a 5090?

For $18,000, I can build or buy pre-built a great many things.

With that said, in fairness, can you take a second and put up a comparison to, say… I dunno, a dual Ice Lake Xeon server with a terabyte of DDR4 RDIMMs? I mean… I’m just curious how close a 3rd-gen Xeon system with a 9070 XT, even with only PCIe 4.0 lanes (for days), would come to this thing in these same tests. I think I’d have enough money left over to build a decent 100GbE gaming rig with a 7800X3D and a 5090 to smoke it in that regard, and still come in under your $18,000 (without a display) budget. Or am I totally wrong here?

Sorry to be critical, but… $18,000.00. I mean… wow.

Could you run some LLM benchmarks on CPU only, like 100GB, 200GB, and 400GB models?

Sadly, the machine in question only has 128GB of RAM, so… um… no.

Sure, I think I could do 512GB of local memory, and 1TB via 512GB of local memory + 512GB of CXL.

(I don’t have 8x 128GB at the moment.)

Oh, and if you want to run hybrid CPU+GPU (https://www.youtube.com/watch?v=T17bpGItqXw), having 96 cores of fast AVX-512 is nice for AI. The same way we used hybrid CPU+GPU on the “small” system in that video, this “enormous” system can run models typically reserved for $100k+ servers. It’ll run slower, sure, but not unusably slow. Plenty fast for research purposes.
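For anyone who wants to try the same hybrid approach with llama.cpp, the usual trick is to offload all layers to the GPU but pin the large MoE expert tensors back to system RAM with --override-tensor, so VRAM holds attention and KV cache while the CPU cores chew through the experts. A rough sketch only; the model path, thread count, and tensor-name pattern are placeholders, not the exact setup used in the video:

./llama-cli \
  --model /path/to/your-moe-model.gguf \
  --n-gpu-layers 99 \
  --override-tensor "exps=CPU" \
  --threads 96 \
  --prompt "Hello"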

1 Like

There’s also a lot of free performance sitting in some of these inference engines. I found a 15x speedup with BF16 in flash attention, and picked up a free 10% FP16 performance gain while fixing BF16.

It’s still the wild wild west; we’ve got to enjoy it while we can!

2 Likes

Would you be able to provide a reference for where to find that quad U.2/U.3 carrier card?

Quad E3 carrier; it’s on Amazon for $100.

1 Like

What a beast this thing is. Gaming benchmarks are one thing, but I think a lot of why people are interested in these rigs nowadays is for compiling and for language models.

Benchmarking for compilers is pretty straightforward on Linux: GitHub - nordlow/compiler-benchmark: Benchmarks compilation speeds of different combinations of languages and compilers.

Benchmarking for LLMs is more difficult, since there are multiple inference engines that you’d want to test. I think for a mixed CPU/GPU inference setup, starting with llama.cpp would be a good representation of the system’s capabilities. llama-bench is a binary within the default build target of llama.cpp. Here’s an example invocation for

  • 10, 15, and 20 GPU layers offloaded
  • 4096, 8192, and 16384 token prompt processing
  • 1024 and 2048 token generation
  • 5 repetitions of each test

Some variables are multiplicative, e.g. prompt processing and token generation are repeated for each number of GPU layers.

./llama-bench \
--model ~/models/unsloth/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q4_K_M-00001-of-00009.gguf \
--mmap 1 \
--n-gpu-layers 10,15,20 \
--n-prompt 4096,8192,16384 \
--n-gen 1024,2048 \
--repetitions 5

more details

usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <-1|0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
  --no-warmup                               skip warmup runs before benchmarking

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -dt, --defrag-thold <f>                   (default: -1)
  -t, --threads <n>                         (default: 24)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensor <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)

Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.

If you want to make me really happy, I’d appreciate it if you could try a smaller or more heavily quantized model and compare 4-channel and 8-channel memory results. I’d be more than willing to collaborate on developing a testing procedure for LLMs.

llama.cpp is not a performant inference engine for GPUs. It is a CPU inference program that has begrudgingly added GPU support.

vLLM wins for sure when using a pure GPU setup, but most homebrew LLM setups are mixed GPU/CPU. I’m particularly interested in how much the 96GB of VRAM speeds up prompt processing.

Where do you get the VBIOS, though, if it’s an Nvidia-manufactured card?

1 Like

I think the official way is to ask your reseller to get it from Nvidia, if they have not already done so, and provide it to you. It comes as a binary providing a minimal CLI tool that includes the VBIOS.

I have a card from PNY, so I am not sure, but when I flashed mine it appeared the tool ran a small viability check to see whether it could actually flash the card. I can’t say whether it would also work on a card from Nvidia, because I have no idea whether there are even differences between manufacturers.