RTX Pro 6000 MIG & OctoCrysis Traveler



What Is This Beast!??!?

Here’s our Falcon Northwest Talon PC, featuring:

  • Threadripper Pro 9995WX from AMD
  • 2x RTX Pro 6000 Blackwell
  • 4x CM7 15.36TB from Kioxia
  • 20TB local M.2 storage pool (5x) featuring Kingston Fury Renegade Gen5
    … and oh so much more.

Background

Multi-Instance GPU on the RTX Pro 6000 is awesome, but it has been a journey.

This required an updated VBIOS. I flagged the issue on my retail cards in May, but it wasn’t until August that the updated VBIOS was distributed. So if you have an early card, double-check that you have an up-to-date VBIOS.
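If you’re not sure which VBIOS you’re on, nvidia-smi reports it in its full query output. A quick check (assuming the card is GPU 0):

nvidia-smi -q -i 0 | grep -i "vbios"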

There are many other threads on this forum if you want to read about the long journey up to this point.

Setting It Up

To enable MIG, you must use the displaymodeselector tool from Nvidia and set the card to compute mode. Doing so disables the DisplayPort outputs on the card; these cards are meant to be used headless or via your hypervisor.

For the Falcon setup, I simply used the onboard ASPEED VGA as the host GPU. From there, there were 8 GPU instances available to assign to virtual machines.

This makes 1x RTX Pro 6000 with 96GB show up as 4x RTX Pro 6000s with 24GB of VRAM each.

Eight 24GB VRAM GPU instances. Madness!!
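For reference, this is roughly what the flow looks like on one card. Treat it as a sketch rather than a copy-paste recipe: the exact displaymodeselector argument and the MIG profile name below are assumptions, so check what your tool and driver actually report.

# Put the card into compute mode with Nvidia's displaymodeselector tool
# (exact flag naming may vary by tool version; this is the step that disables the DP outputs)
sudo ./displaymodeselector --gpumode compute

# Enable MIG on GPU 0, then reset the GPU or reboot
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the card actually offers
sudo nvidia-smi mig -i 0 -lgip

# Create four instances plus their default compute instances
# ("1g.24gb" is assumed here; use whatever ~24GB profile -lgip reports)
sudo nvidia-smi mig -i 0 -cgi 1g.24gb,1g.24gb,1g.24gb,1g.24gb -C

# Confirm: nvidia-smi -L should now list four MIG devices under the card
nvidia-smi -L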

Note: I am still experimenting with a Windows/Hyper-V path.

Fixes in the RTX Pro 6000 VBIOS

In addition to the MIG fixes, the updated VBIOS also resolved issues with BAR space allocation and device compatibility on WRX90. The USB4/ASMedia controller was non-functional due to an I/O resource overlap, and these cards did not work properly on the Threadripper WRX90 platform until the VBIOS was updated past 2.52.
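If you suspect you’re hitting the same resource overlap on an older VBIOS, the kernel log is a reasonable place to look; failed BAR assignments usually show up as “no space for” or “failed to assign” messages. A quick, generic check (not the exact diagnostic used here):

sudo dmesg | grep -iE "BAR .*(no space|failed to assign)"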

7 Likes

Along with the build come the benchmarks. Here’s how the Threadripper Pro held up against its counterparts!

Productivity / Artificial

  • 3DMark: CPU (1, 2, 4, 8, 16, and max threads)
  • 3DMark: DirectX Raytracing
  • 3DMark: Fire Strike
  • 3DMark: Port Royal
  • 3DMark: Speed Way
  • 3DMark: Steel Nomad
  • 3DMark: Time Spy
  • AIDA64 (9970X, 9980X, 9995WX)
  • Blender: CPU (Classroom, Junkshop, Monster)
  • Cinebench R23 (single-core, multi-core)
  • Cinebench 2024 (single-core, multi-core)
  • CPU-Z (single-thread, multi-thread)
  • Geekbench 6: CPU (single-core, multi-core)
  • PugetBench: Photoshop
  • PugetBench: Premiere Pro
  • PugetBench: After Effects

Gaming

  • Assassin’s Creed Shadows (1080p, 1440p, 4K)
  • Borderlands 3 (1080p, 1440p, 4K)
  • The Callisto Protocol (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Clair Obscur: Expedition 33 (1080p, 1440p, 4K)
  • Cyberpunk 2077: Phantom Liberty (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Final Fantasy XIV: Dawntrail (1080p, 1440p, 4K)
  • Monster Hunter Wilds (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • The Elder Scrolls IV: Oblivion Remastered (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Silent Hill 2 (Ultra and Ultra w/ Raytracing, each at 1080p, 1440p, 4K)
  • Shadow of the Tomb Raider (1080p, 1440p, 4K)

3 Likes

In truth, I hesitate to comment… but at $17,780.00 (before tax), with no keyboard, mouse, or (sigh) display of any kind, I MUST ask.

Does this build make any kind of sense to anyone? A 9070 XT? You mean to say that at this price, I’m not getting a 5090?

For $18,000, I can build or buy pre-built a great many things.

With that said, in fairness, can you take a second and put up a comparison to, say… I dunno, a dual Ice Lake Xeon server with a terabyte of DDR4 RDIMMs? I mean… I’m just curious how close a 3rd-gen Xeon system with a 9070 XT, even with only PCIe 4.0 lanes (for days), would come to this thing in these same tests. I think I’d have enough money left over to build a decent 100GbE gaming rig with a 7800X3D and a 5090 to smoke it in that regard, and still come in under your $18,000 (without a display) budget. Or am I totally wrong here?

Sorry to be critical, but… $18,000.00. I mean… wow.

Could you run some LLM benchmarks on CPU only, like 100GB, 200GB, and 400GB models?

Sadly, the machine in question only has 128GB of RAM, so… um… no.

Sure, I think I could do 512GB of local memory, and 1TB via 512GB of local memory + 512GB of CXL.

(I don’t have 8x 128GB at the moment.)

Oh, and if you want to run hybrid CPU+GPU (https://www.youtube.com/watch?v=T17bpGItqXw), having 96 cores of fast AVX-512 is nice for AI. The same way we used hybrid CPU+GPU on the “small” system in that video, this “enormous” system can run models typically reserved for $100k+ servers. It’ll run slower, sure, but not unusably slow. Plenty fast for research purposes.
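For anyone who wants to try the same hybrid approach with llama.cpp, the usual trick is to offload all layers to the GPU but pin the large MoE expert tensors back to system RAM with --override-tensor, so VRAM holds attention and KV cache while the CPU cores chew through the experts. A rough sketch only; the model path, thread count, and tensor-name pattern are placeholders, not the exact setup used in the video:

./llama-cli \
  --model /path/to/your-moe-model.gguf \
  --n-gpu-layers 99 \
  --override-tensor "exps=CPU" \
  --threads 96 \
  --prompt "Hello"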

1 Like

There’s also a lot of free performance sitting in some of these inference engines. I found a 15x speedup with BF16 in flash attention, and picked up a free 10% FP16 performance gain while fixing BF16.

It’s still the wild wild west; we’ve got to enjoy it while we can!

2 Likes

Would you be able to provide a reference for where to find that quad U.2/U.3 carrier card?

Quad E3 carrier; it’s on Amazon for $100.

1 Like

What a beast this thing is. Gaming benchmarks are one thing, but I think a lot of why people are interested in these rigs nowadays is for compiling and for language models.

Benchmarking for compilers is pretty straightforward on Linux: GitHub - nordlow/compiler-benchmark: Benchmarks compilation speeds of different combinations of languages and compilers.

Benchmarking for LLMs is more difficult, since there are multiple inference engines that you’d want to test. I think for a mixed CPU/GPU inference setup, starting with llama.cpp would be a good representation of the system’s capabilities. llama-bench is a binary within the default build target of llama.cpp. Here’s an example invocation for

  • 10, 15, and 20 GPU layers offloaded
  • 4096, 8192, and 16384 token prompt processing
  • 1024 and 2048 token generation
  • 5 repetitions of each test

Some variables are multiplicative, e.g. prompt processing and token generation are repeated for each number of GPU layers.

./llama-bench \
--model ~/models/unsloth/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q4_K_M-00001-of-00009.gguf \
--mmap 1 \
--n-gpu-layers 10,15,20 \
--n-prompt 4096,8192,16384 \
--n-gen 1024,2048 \
--repetitions 5

more details

usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <-1|0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
  --no-warmup                               skip warmup runs before benchmarking

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -dt, --defrag-thold <f>                   (default: -1)
  -t, --threads <n>                         (default: 24)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensor <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)

Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.

If you want to make me really happy, I’d appreciate it if you could try a smaller or more heavily quantized model and compare 4-channel and 8-channel memory results. I’d be more than willing to collaborate on developing a testing procedure for LLMs.

llama.cpp is not a performant inference engine for GPUs. It is a CPU inference program that has begrudgingly added GPU support.

vLLM wins for sure when using a pure GPU setup, but most homebrew LLM setups are mixed GPU/CPU. I’m particularly interested in how much the 96GB of VRAM speeds up prompt processing.

Where do you get the VBIOS, though, if it’s an Nvidia-manufactured card?

1 Like

I think the official way is to ask your reseller to get it from Nvidia, if they have not already done so, and provide it to you. It comes as a binary providing a minimal CLI tool that includes the VBIOS.

I have a card from PNY, so I am not sure, but when I flashed mine it appeared the tool ran a small viability check to see whether it could actually flash the card. I can’t say whether it would also work on a card from Nvidia, because I have no idea whether there are even differences between manufacturers.