One Spark Two Spark Red Strix Blue Arc?

I am just making some notes here for an upcoming video. What model(s) would you suggest testing via llama-bench?

Spark Platforms in test:

- MSI EdgeXpert (no thermal limits!)
- Nvidia DGX Spark (repasted with PTM7950)

Strix Halo Platforms in test:

Nvidia DGX Spark

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |       1608.25 ± 5.97 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         48.51 ± 0.33 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1557.78 ± 4.35 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         45.99 ± 0.16 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1506.69 ± 8.17 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         43.34 ± 0.77 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1308.31 ± 9.09 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         40.03 ± 0.19 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1061.15 ± 7.27 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         34.46 ± 0.14 |

MSI EdgeXpert DGX Spark

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1729.39 ± 19.50 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         52.59 ± 1.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1680.80 ± 8.96 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         49.10 ± 1.35 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |      1581.90 ± 13.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         46.88 ± 0.69 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |     1356.77 ± 130.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         42.87 ± 0.75 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |     1073.65 ± 138.28 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         37.25 ± 0.36 |

Strix Halo, using the “Strix Halo Toolboxes”

bash-5.3#  llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf   -fa 1 -d 0,4096,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |           pp512 |       760.45 ± 54.88 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |           tg128 |         51.81 ± 0.00 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |   pp512 @ d4096 |        687.37 ± 1.62 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |   tg128 @ d4096 |         44.72 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |   pp512 @ d8192 |        599.81 ± 5.67 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |   tg128 @ d8192 |         37.45 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |  pp512 @ d16384 |        462.82 ± 2.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |  tg128 @ d16384 |         28.91 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |  pp512 @ d32768 |        304.50 ± 1.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |  1 |    0 |  tg128 @ d32768 |         19.81 ± 0.02 |

Strix Halo Like For Like

bash-5.3#  llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf   -fa 1 -d 0,4096,8192,16384,32768  -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |          pp2048 |       1008.28 ± 0.97 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |            tg32 |         51.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |        868.00 ± 2.51 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         42.42 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |        681.53 ± 1.87 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         36.58 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |        492.90 ± 1.55 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         28.95 ± 0.02 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |        309.00 ± 0.73 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         19.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |          pp2048 |       1061.18 ± 1.80 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |            tg32 |         51.87 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |        869.10 ± 1.53 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         42.59 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |        681.64 ± 1.61 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         36.60 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |        492.73 ± 0.77 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         28.92 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |        309.35 ± 0.23 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         19.84 ± 0.03 |

Gemma 3 12B IT QAT Q4

 ./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1782.56 ± 13.32 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         26.95 ± 0.22 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |      1639.08 ± 10.67 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         25.39 ± 0.23 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1498.74 ± 8.98 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         23.38 ± 0.24 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |      1328.67 ± 11.42 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         22.52 ± 0.46 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1112.42 ± 8.28 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         19.24 ± 0.10 |
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |          pp2048 |        987.75 ± 2.10 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |            tg32 |         27.13 ± 0.00 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |        731.58 ± 5.75 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         24.22 ± 0.02 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       554.30 ± 10.88 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         21.66 ± 0.01 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |        307.08 ± 2.10 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         20.23 ± 0.01 |
Memory access fault by GPU node-1 (Agent handle: 0x85d7ed0) on address 0x79bc94306000. Reason: Page not present or supervisor privilege.

It was very time-consuming to get these results; while trying to get comparative numbers with llama-bench, I kept hitting issues like:

[48476.062006] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48476.062016] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48478.734485] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48478.734488] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48481.406821] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48481.406826] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48484.078855] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48484.078857] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48486.754327] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48486.754333] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48489.426367] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48489.426369] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48492.099739] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48492.099744] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48494.771692] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48494.771695] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48497.443736] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48497.443739] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48500.115835] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48500.115837] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48502.793322] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48502.793328] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48505.466503] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48505.466510] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48515.976288] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
[48515.977056] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
[48515.977076] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 timeout, signaled seq=129, emitted seq=132
[48515.977088] amdgpu 0000:c3:00.0: amdgpu: Starting sdma0 ring reset
[48515.977106] amdgpu 0000:c3:00.0: amdgpu: reset sdma queue (0:0:0)
[48516.211306] amdgpu 0000:c3:00.0: amdgpu: failed to wait on sdma queue reset done
[48516.211310] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[48516.211528] amdgpu 0000:c3:00.0: amdgpu: Ring sdma0 reset failure
[48516.211530] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[48662.409460] INFO: task kworker/u132:15:4723 blocked for more than 122 seconds.
[48662.409482]       Not tainted 6.14.0-1015-oem #15-Ubuntu
[48662.409486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

(vLLM was more stable, FWIW.)

bash-5.3#  llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |          pp2048 |        989.10 ± 1.74 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |            tg32 |         27.21 ± 0.01 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |        734.50 ± 2.93 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         24.27 ± 0.11 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       556.55 ± 14.63 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         21.70 ± 0.01 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |        308.02 ± 0.67 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         20.24 ± 0.01 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |        150.48 ± 1.51 |
| gemma3 12B Q4_0                |   6.64 GiB |    11.77 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         15.57 ± 0.01 |

Qwen3 Coder 30B A3B Instruct Q8


spark2@spark-015f:~/llama.cpp$ ./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      2704.86 ± 47.29 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         54.56 ± 1.32 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |      2335.24 ± 37.37 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         49.22 ± 1.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       2095.88 ± 2.55 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         41.98 ± 0.40 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |      1630.78 ± 17.80 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         34.56 ± 0.09 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |      1168.68 ± 13.35 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         25.99 ± 0.33 |

bash-5.3#  llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |          pp2048 |        910.48 ± 1.57 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |            tg32 |         52.57 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |        597.82 ± 0.86 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         39.77 ± 0.02 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |        432.54 ± 0.45 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         34.15 ± 0.02 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |        265.82 ± 1.40 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |        108.33 ± 1.09 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | ROCm       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         16.35 ± 0.01 |

Two Sparks Are Not Fast

So it is possible to run something like Qwen3 235B across two Sparks. This Will Not Be Fast. I am going to put the numbers here for that, but I would encourage you to think about this agentically rather than ‘moar bigger models’: you could run a larger model on this “small” hardware to debug something, but what having 2x 128 GB of memory space really does for you is give you more room for agents in an agentic AI setup.

Imagine being able to run a whole little fleet of 7-35B parameter models that work together versus one monolithic model. That’s the real benefit of scaling to multiple Sparks.
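
To make that concrete, here is a minimal, hypothetical sketch of two small models cooperating across the two boxes. It assumes each Spark is serving a model through an OpenAI-compatible endpoint (e.g. llama-server); the hostnames and model names are placeholders, not my actual setup:

```python
# Hypothetical sketch of small models cooperating instead of one big model:
# a "coder" model on one Spark drafts code, a "reviewer" model on the other critiques it.
# Endpoints and model names are placeholders, not my actual configuration.
import json
import urllib.request

def chat(url, model, prompt):
    body = json.dumps({"model": model, "max_tokens": 512,
                       "messages": [{"role": "user", "content": prompt}]}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

CODER = ("http://spark-a:8080/v1/chat/completions", "qwen3-coder-30b-a3b")
REVIEWER = ("http://spark-b:8080/v1/chat/completions", "gemma-3-12b-it")

draft = chat(*CODER, "Write a Python function that parses llama-bench markdown tables.")
review = chat(*REVIEWER, f"Review this code for bugs and suggest fixes:\n\n{draft}")
print(review)
```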

Also, being constrained in this way forces you, the developer, to make good choices about scaling your app, versus ‘pay more, throw more hardware at it’ (IMHO).

 python3 level1_trensorrt_bench.py

=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 64, concurrency: 8
Wall time (batch): 226.905 s
Requests: 64
Prompt tokens total:     131584
Completion tokens total: 64
Total tokens:            131648
Prompt tokens/sec:       579.91
Completion tokens/sec:   0.28
Total tokens/sec:        580.19
Latency stats (s):
  mean: 28.154
  p50 : 0.819
  p90 : 217.427
  p95 : 218.664
  p99 : 219.088

Some math about this:

  • Tokens per request: 131,584 / 64 ≈ 2,056 tokens per prompt.
  • With concurrency 8 and 64 requests, one has 8 “waves” of requests.
    Wall time per wave: 226.9 / 8 ≈ 28.4 seconds.
  • That matches our mean latency ≈ 28.15 s. So roughly:
    • Each 2k-token prompt takes ~28 s end-to-end.
    • Token throughput per request: 2,056 / 28.15 ≈ 73 tokens/s.
    • With concurrency 8 → ~580 tokens/s aggregate, which matches the printed total tokens/sec.

So the numbers are self-consistent:

  • Roughly 73 tokens/s of prompt ingestion per request, aggregated to ~580 tokens/s with concurrency 8.
  • The ugly p90/p95/p99 ~ 217–219 s means some requests sit in the queue for several waves before they actually get processed (classic long tail under load).
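
A quick sanity check of that arithmetic in plain Python, using only the numbers printed above:

```python
# Sanity-check the pp2048-like run (64 requests, concurrency 8) from the printed totals.
prompt_tokens_total = 131_584
requests = 64
concurrency = 8
wall_time_s = 226.905
mean_latency_s = 28.154

tokens_per_prompt = prompt_tokens_total / requests        # ~2056 tokens per request
waves = requests / concurrency                            # 8 "waves" of 8 requests
wall_per_wave_s = wall_time_s / waves                     # ~28.4 s, close to the mean latency
per_request_tok_s = tokens_per_prompt / mean_latency_s    # ~73 tokens/s of prompt ingestion
aggregate_tok_s = prompt_tokens_total / wall_time_s       # ~580 tokens/s, matches the report

print(tokens_per_prompt, wall_per_wave_s, per_request_tok_s, aggregate_tok_s)
```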

**This is “slow-ish” for a big model on an HTTP-served distributed setup, but also not insane for a 235B-style monster across 2 nodes, especially with chat overhead + network + dynamic batching. 64 requests at concurrency 8 is right out.**

32/4 looks a bit better:

=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 32, concurrency: 4
Wall time (batch): 4.365 s
Requests: 32
Prompt tokens total:     65792
Completion tokens total: 32
Total tokens:            65824
Prompt tokens/sec:       15072.58
Completion tokens/sec:   7.33
Total tokens/sec:        15079.91
Latency stats (s):
  mean: 0.542
  p50 : 0.565
  p90 : 0.572
  p95 : 0.572
  p99 : 0.574

With the reduced load (32 requests, concurrency 4), the same ~2,056-token prompts are going through about 26× faster end-to-end: mean latency dropped from ~28.2 s per request to ~0.54 s, and aggregate prompt throughput jumped from ~580 tok/s to ~15,073 tok/s. In the original run, 8 “waves” of 8 requests were effectively serialized by queueing, which is why mean latency (~28 s) matched wall-time per wave and p90+ latencies blew out to ~217–219 s — the system was badly overloaded and requests were waiting multiple waves in the queue. In the lighter run, p50–p99 latencies are all tightly clustered around ~0.56 s, so I think that’s closer to a “true” steady-state performance of the 2-Spark deployment instead of a queueing-dominated long-tail mess.

=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 32, concurrency: 4
Wall time (batch): 408.873 s
Requests: 32
Prompt tokens total:     2304
Completion tokens total: 1024
Total tokens:            3328
Prompt tokens/sec:       5.63
Completion tokens/sec:   2.50
Total tokens/sec:        8.14
Latency stats (s):
  mean: 51.101
  p50 : 8.277
  p90 : 11.301
  p95 : 343.857
  p99 : 343.857

And messing with concurrency some more, because the results don’t 100% make sense to me:

root@spark-2903:/app# python3 tensorrt_llm.py

=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 16, concurrency: 2
Wall time (batch): 340.913 s
Requests: 16
Prompt tokens total:     32896
Completion tokens total: 16
Total tokens:            32912
Prompt tokens/sec:       96.49
Completion tokens/sec:   0.05
Total tokens/sec:        96.54
Latency stats (s):
  mean: 42.608
  p50 : 0.499
  p90 : 0.511
  p95 : 337.385
  p99 : 337.385

=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 16, concurrency: 2
Wall time (batch): 52.042 s
Requests: 16
Prompt tokens total:     1152
Completion tokens total: 512
Total tokens:            1664
Prompt tokens/sec:       22.14
Completion tokens/sec:   9.84
Total tokens/sec:        31.97
Latency stats (s):
  mean: 6.493
  p50 : 6.000
  p90 : 7.261
  p95 : 9.026
  p99 : 9.026

Here’s the script I cobbled together for testing multiple Sparks in case it helps someone:

tensorrt_llm_test.py.txt (4.1 KB)
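
For anyone who just wants the gist without downloading the attachment, something in the same spirit would look roughly like this. This is a hypothetical sketch, not the attached script; the endpoint URL, model id, and prompt sizing are placeholders:

```python
# Illustrative concurrency benchmark against an OpenAI-compatible endpoint
# (not the attached script; URL, model id, and prompt sizing are placeholders).
import concurrent.futures
import json
import statistics
import time
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "qwen3-235b"             # placeholder model id
PROMPT = "word " * 2048          # roughly pp2048-like prompt
REQUESTS, CONCURRENCY, MAX_TOKENS = 32, 4, 1

def one_request(_):
    body = json.dumps({"model": MODEL, "max_tokens": MAX_TOKENS,
                       "messages": [{"role": "user", "content": PROMPT}]}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        usage = json.load(resp).get("usage", {})
    return time.perf_counter() - t0, usage.get("prompt_tokens", 0)

t_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(REQUESTS)))
wall = time.perf_counter() - t_start

latencies = sorted(r[0] for r in results)
prompt_tokens = sum(r[1] for r in results)
print(f"Wall time: {wall:.3f} s  prompt tok/s: {prompt_tokens / wall:.2f}")
print(f"mean {statistics.mean(latencies):.3f}  "
      f"p50 {latencies[len(latencies) // 2]:.3f}  "
      f"p90 {latencies[int(0.9 * (len(latencies) - 1))]:.3f}")
```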

Samsung 7 million parameter model

GitHub - SamsungSAILMontreal/TinyRecursiveModels

This is what we were training and testing on our dual spark setup. It worked fantastically well!


more to come.


I guess GLM 4.5 Air could be nice.

HF link? GGUFs available? (I could do vLLM again, I suppose.)

  1. Original repo - zai-org/GLM-4.5-Air · Hugging Face
  2. Unsloth quants - unsloth/GLM-4.5-Air-GGUF · Hugging Face

Adding a note here for training with two Sparks:

DDP or FSDP2 (torchtitan)

At this size, the gradients align “okay” with the memory bandwidth of the Sparks.
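
For anyone wanting to try the two-Spark training path, here is a minimal DDP skeleton as a starting point. It is not the TinyRecursiveModels trainer; the model is a stand-in and the torchrun flags in the comment are just the usual two-node launch pattern:

```python
# Minimal two-node DDP skeleton (illustrative only, not the TinyRecursiveModels trainer).
# Assumes the Sparks can reach each other; launch one process per node with e.g.:
#   torchrun --nnodes=2 --nproc-per-node=1 --node-rank=<0|1> \
#            --master-addr=<first-spark-ip> --master-port=29500 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # gradient all-reduce over the inter-node link
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                   # stand-in for the real (small) model
        torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
    ).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(100):
        x = torch.randn(32, 512, device="cuda")
        loss = model(x).pow(2).mean()              # dummy objective, just to exercise gradients
        opt.zero_grad()
        loss.backward()                            # grads averaged across both Sparks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```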

I have a Dell GB10 and no idea what to do with it. Yes, there is a lot of getting-started content, but no clear, desirable target so far. I have onsite Nvidia training in 2 days, though.

Found 3 L1T forum threads for GB10 so far.

Did the Steam install (3DMark Time Spy not working, and emulation like in the old days, with a border frame). Tried the Win11 Insider build (I saw Geekbench results with W11 but am not able to install it; it doesn’t detect any mouse or keyboard, see screenshot).

Anything spoon-fed (like a pfsense-style setup guide) for AI agents would be very welcome and good for getting more people interested.

Thanks for doing this :slight_smile:


The Jupyter notebook examples are handy and practical. Is a coding assistant useful to you? Image gen? DGX Spark

I see coding in the cloud. Everything local is probably sensitive info, like intellectual property or personal data.

My dream is to have a QNAP private cloud (Drive, Photos, Mail, (Maps history?)), searchable by a local AI secretary. If open source and well used (like a pfsense subscription), it could also have write access to the NAS. Agentic, it would have modules like Enterprise Research, AI Lawyer and such: not forgetting things after 800 pages, linking sources, using the web, hallucinations curbed. It doesn’t need to be fast! I think it can take some hours to help me with the next step in a legal case.

build.nvidia.com is a great recipe book, but nothing mindblowing like running Cyberpunk with 3 clicks (install Steam script + CP2077 install + Start button :smiley: ). Nvidia devs were so excited about it that they bought me the Crysis trilogy on the spot to run it, and it did. Sure, you could buy a gaming rig for that money, but not in this form factor, and certainly not at this performance per watt. This is a Steam Machine on steroids. This is the n+1 gen gaming desktop / notebook (N1X / SD X2).

They do have great infrastructure demos, like Build an AI Agent for Enterprise Research Blueprint by NVIDIA | NVIDIA NIM, which runs for 3.5 USD/h.

I have yet to find a model that I can apply in a productive show-and-tell to make me and other people go “woah”.

It also leaves some maintenance questions open, like mass-deployment and management features, and out-of-band management.

Also, I had it happen that I logged in after boot but the screen stayed black with an “X” mouse pointer; after a power-button click it changed to a regular mouse pointer, but still no desktop. The workaround was to wait about 10 minutes until the session locked for inactivity; the second login then worked. The final solution was reinstalling DGX OS. I am sure we lack some Ubuntu skills here to solve it, but this was not very satisfying.

Lastly, as a German I am usually sceptical when there is no real root access and I am asked to entrust it with sensitive data. Sure, it’s “just” the Nvidia ecosystem, and there should be some governance around AI development. German automotive in particular has a troubled history of trusting big tech.

Now, I know these guys are persona non grata around here, but it’s relevant to say that they have been working on some sort of distributed ML and inference over Thunderbolt 5.

Awesome videos and tests on the Sparks! I also have two ASUS GX10 units (same GB10 chip, different chassis) and a MikroTik CRS812 DDQ sitting here waiting to be tested, but I haven’t yet had the time to dive deeper into the topic beyond the obligatory Nvidia playbooks.
If anyone wants me to run something on the ASUS for comparison, let me know.