I am just making some notes here for an upcoming video. What model(s) would you suggest testing via llama-bench?
Spark Platforms in test:

- MSI EdgeXpert (no thermal limits!)
- Nvidia DGX Spark (repasted with PTM7950)

Strix Halo Platforms in test:
Nvidia DGX Spark
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1608.25 ± 5.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 48.51 ± 0.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1557.78 ± 4.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 45.99 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1506.69 ± 8.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 43.34 ± 0.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1308.31 ± 9.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 40.03 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1061.15 ± 7.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 34.46 ± 0.14 |
MSI EdgeXpert DGX Spark
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1729.39 ± 19.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 52.59 ± 1.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1680.80 ± 8.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 49.10 ± 1.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1581.90 ± 13.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 46.88 ± 0.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1356.77 ± 130.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 42.87 ± 0.75 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1073.65 ± 138.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 37.25 ± 0.36 |
Strix Halo (using the “Strix Halo Toolboxes”)
bash-5.3# llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 760.45 ± 54.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 51.81 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d4096 | 687.37 ± 1.62 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d4096 | 44.72 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d8192 | 599.81 ± 5.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d8192 | 37.45 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d16384 | 462.82 ± 2.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d16384 | 28.91 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d32768 | 304.50 ± 1.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d32768 | 19.81 ± 0.02 |
Strix Halo Like For Like
bash-5.3# llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 1008.28 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 51.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 868.00 ± 2.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.42 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 681.53 ± 1.87 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.58 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 492.90 ± 1.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 28.95 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 309.00 ± 0.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 1061.18 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 51.87 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 869.10 ± 1.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.59 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 681.64 ± 1.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.60 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 492.73 ± 0.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 28.92 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 309.35 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.84 ± 0.03 |
Gemma 3 12B IT QAT Q4
./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1782.56 ± 13.32 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 26.95 ± 0.22 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1639.08 ± 10.67 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 25.39 ± 0.23 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1498.74 ± 8.98 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 23.38 ± 0.24 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1328.67 ± 11.42 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 22.52 ± 0.46 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1112.42 ± 8.28 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.24 ± 0.10 |
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 987.75 ± 2.10 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 27.13 ± 0.00 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 731.58 ± 5.75 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 24.22 ± 0.02 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 554.30 ± 10.88 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 21.66 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 307.08 ± 2.10 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 20.23 ± 0.01 |
Memory access fault by GPU node-1 (Agent handle: 0x85d7ed0) on address 0x79bc94306000. Reason: Page not present or supervisor privilege.
It was very time-consuming to get these results, as I kept running into issues like:
[48476.062006] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48476.062016] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48478.734485] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48478.734488] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48481.406821] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48481.406826] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48484.078855] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48484.078857] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48486.754327] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48486.754333] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48489.426367] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48489.426369] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48492.099739] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48492.099744] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48494.771692] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48494.771695] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48497.443736] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48497.443739] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48500.115835] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48500.115837] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48502.793322] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48502.793328] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48505.466503] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48505.466510] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48515.976288] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
[48515.977056] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
[48515.977076] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 timeout, signaled seq=129, emitted seq=132
[48515.977088] amdgpu 0000:c3:00.0: amdgpu: Starting sdma0 ring reset
[48515.977106] amdgpu 0000:c3:00.0: amdgpu: reset sdma queue (0:0:0)
[48516.211306] amdgpu 0000:c3:00.0: amdgpu: failed to wait on sdma queue reset done
[48516.211310] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[48516.211528] amdgpu 0000:c3:00.0: amdgpu: Ring sdma0 reset failure
[48516.211530] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[48662.409460] INFO: task kworker/u132:15:4723 blocked for more than 122 seconds.
[48662.409482] Not tainted 6.14.0-1015-oem #15-Ubuntu
[48662.409486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
while trying to get comparative numbers with llama-bench. (vLLM was more stable, FWIW.)
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 989.10 ± 1.74 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 27.21 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 734.50 ± 2.93 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 24.27 ± 0.11 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 556.55 ± 14.63 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 21.70 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 308.02 ± 0.67 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 20.24 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 150.48 ± 1.51 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 15.57 ± 0.01 |
Qwen3 Coder 30B A3B Instruct Q8
spark2@spark-015f:~/llama.cpp$ ./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 2704.86 ± 47.29 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 54.56 ± 1.32 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 2335.24 ± 37.37 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 49.22 ± 1.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 2095.88 ± 2.55 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 41.98 ± 0.40 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1630.78 ± 17.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.56 ± 0.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1168.68 ± 13.35 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 25.99 ± 0.33 |
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 910.48 ± 1.57 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 52.57 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 597.82 ± 0.86 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 39.77 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 432.54 ± 0.45 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 34.15 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 265.82 ± 1.40 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 26.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 108.33 ± 1.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 16.35 ± 0.01 |
Two Sparks Are Not Fast
So it is possible to run something like Qwen 235B across two Sparks. This Will Not Be Fast. I am going to put the numbers here for that, but I would encourage you to instead think about this agentically, as opposed to ‘moar bigger models’. You could run a larger model on this “small” hardware to debug something, but what having a 2×128 GB memory space really buys you is more room for agents in an agentic AI setup.
Imagine being able to run a whole little fleet of 7-35B-parameter models that work together, versus one monolithic model. That’s the real benefit of scaling to multiple Sparks.
Also, being constrained in this way forces you, the developer, to make good choices about scaling your app, versus ‘pay more, throw more hardware at it’ (IMHO).
python3 level1_trensorrt_bench.py
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 64, concurrency: 8
Wall time (batch): 226.905 s
Requests: 64
Prompt tokens total: 131584
Completion tokens total: 64
Total tokens: 131648
Prompt tokens/sec: 579.91
Completion tokens/sec: 0.28
Total tokens/sec: 580.19
Latency stats (s):
mean: 28.154
p50 : 0.819
p90 : 217.427
p95 : 218.664
p99 : 219.088
Some math about this:
- Tokens per request: 131,584 / 64 ≈ 2,056 tokens per prompt.
- With concurrency 8 and 64 requests, there are 8 “waves” of requests. Wall time per wave: 226.9 / 8 ≈ 28.4 seconds, which matches our mean latency of ≈ 28.15 s.

So roughly:
- Each 2k-token prompt takes ~28 s end-to-end.
- Token throughput per request: 2,056 / 28.15 ≈ 73 tokens/s.
- With concurrency 8 → ~580 tokens/s aggregate, which matches the printed total tokens/sec.
So the numbers are self-consistent:
- Roughly 73 tokens/s of prompt ingestion per request, aggregated to ~580 tokens/s with concurrency 8.
- The ugly p90/p95/p99 of ~217–219 s means some requests sit in the queue for several waves before they actually get processed (classic long tail under load).
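The wave arithmetic can be sanity-checked in a few lines, using the numbers from the pp2048-like run above:

```python
# Sanity-check the queueing math from the pp2048-like run (64 requests, concurrency 8).
requests = 64
concurrency = 8
wall_s = 226.905
prompt_tokens = 131_584
total_tokens = 131_648
mean_latency_s = 28.154

tokens_per_prompt = prompt_tokens / requests          # ~2,056 tokens per prompt
waves = requests / concurrency                        # 8 waves of 8 requests
wall_per_wave_s = wall_s / waves                      # ~28.4 s, matches mean latency
per_request_tps = tokens_per_prompt / mean_latency_s  # ~73 tokens/s ingestion per request
aggregate_tps = total_tokens / wall_s                 # ~580 tokens/s, matches the log

print(tokens_per_prompt, wall_per_wave_s, per_request_tps, aggregate_tps)
```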
**This is “slow-ish” for a big model on an HTTP-served distributed setup, but also not insane for a 235B-style monster across 2 nodes, especially with chat overhead + network + dynamic batching. 64 requests at concurrency 8 is right out.**
32/4 looks a bit better:
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 32, concurrency: 4
Wall time (batch): 4.365 s
Requests: 32
Prompt tokens total: 65792
Completion tokens total: 32
Total tokens: 65824
Prompt tokens/sec: 15072.58
Completion tokens/sec: 7.33
Total tokens/sec: 15079.91
Latency stats (s):
mean: 0.542
p50 : 0.565
p90 : 0.572
p95 : 0.572
p99 : 0.574
With the reduced load (32 requests, concurrency 4), the same ~2,056-token prompts go through far faster: mean latency dropped from ~28.2 s per request to ~0.54 s, and aggregate prompt throughput jumped roughly 26×, from ~580 tok/s to ~15,073 tok/s. In the original run, 8 “waves” of 8 requests were effectively serialized by queueing, which is why mean latency (~28 s) matched the wall time per wave and the p90+ latencies blew out to ~217–219 s: the system was badly overloaded and requests were waiting multiple waves in the queue. In the lighter run, p50–p99 latencies are all tightly clustered around ~0.56 s, so I think that’s closer to the “true” steady-state performance of the 2-Spark deployment, instead of a queueing-dominated long-tail mess.
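One detail worth separating out: the throughput gain and the latency gain between the two runs are not the same number, and the gap is the queueing delay. A quick check on the printed figures:

```python
# Compare the heavy run (64 requests / concurrency 8) against the lighter run
# (32 requests / concurrency 4), using the figures printed by the benchmark.
heavy = {"mean_latency_s": 28.154, "prompt_tps": 579.91}
light = {"mean_latency_s": 0.542, "prompt_tps": 15072.58}

throughput_gain = light["prompt_tps"] / heavy["prompt_tps"]       # ~26x
latency_gain = heavy["mean_latency_s"] / light["mean_latency_s"]  # ~52x

# Throughput scaled ~26x, but per-request latency improved ~52x: the extra
# factor is queueing -- in the heavy run most of each request's 28 s was spent
# waiting behind earlier waves, not being processed.
print(round(throughput_gain), round(latency_gain))
```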
=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 32, concurrency: 4
Wall time (batch): 408.873 s
Requests: 32
Prompt tokens total: 2304
Completion tokens total: 1024
Total tokens: 3328
Prompt tokens/sec: 5.63
Completion tokens/sec: 2.50
Total tokens/sec: 8.14
Latency stats (s):
mean: 51.101
p50 : 8.277
p90 : 11.301
p95 : 343.857
p99 : 343.857
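One way to read that tg32-like latency distribution: a mean of ~51 s next to a p50 of ~8 s implies a handful of requests stalled for the full ~344 s. A toy reconstruction (the 28/4 split is my guess from the stats, not measured):

```python
import statistics

# Hypothetical latency mix: if ~28 requests finish in ~8.3 s and ~4 stall for
# ~343.9 s, the summary stats land very close to what the run printed.
latencies = sorted([8.3] * 28 + [343.9] * 4)

mean = statistics.mean(latencies)            # ~50 s   (run printed 51.1)
p50 = latencies[len(latencies) // 2]         # 8.3 s   (run printed 8.28)
p95 = latencies[int(0.95 * len(latencies))]  # 343.9 s (run printed 343.86)
print(mean, p50, p95)
```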
And some more messing with concurrency, because the results don’t 100% make sense to me:
root@spark-2903:/app# python3 tensorrt_llm.py
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 16, concurrency: 2
Wall time (batch): 340.913 s
Requests: 16
Prompt tokens total: 32896
Completion tokens total: 16
Total tokens: 32912
Prompt tokens/sec: 96.49
Completion tokens/sec: 0.05
Total tokens/sec: 96.54
Latency stats (s):
mean: 42.608
p50 : 0.499
p90 : 0.511
p95 : 337.385
p99 : 337.385
=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 16, concurrency: 2
Wall time (batch): 52.042 s
Requests: 16
Prompt tokens total: 1152
Completion tokens total: 512
Total tokens: 1664
Prompt tokens/sec: 22.14
Completion tokens/sec: 9.84
Total tokens/sec: 31.97
Latency stats (s):
mean: 6.493
p50 : 6.000
p90 : 7.261
p95 : 9.026
p99 : 9.026
Here’s the script I cobbled together for testing multiple Sparks in case it helps someone:
tensorrt_llm_test.py.txt (4.1 KB)
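For anyone who can’t grab the attachment, here is a minimal sketch of the same idea (not the attached script): fire N concurrent chat requests at an OpenAI-compatible endpoint and print the same mean/p50/p90/p95/p99 summary as the logs above. The URL, payload shape, and prompt construction are assumptions.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def time_request(url: str, prompt: str, max_tokens: int) -> float:
    """POST one chat completion and return wall-clock latency in seconds."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - t0

def summarize(latencies: list[float]) -> dict:
    """The same mean/p50/p90/p95/p99 summary the logs above print."""
    xs = sorted(latencies)
    pick = lambda q: xs[min(int(q * len(xs)), len(xs) - 1)]
    return {"mean": statistics.mean(xs), "p50": pick(0.50),
            "p90": pick(0.90), "p95": pick(0.95), "p99": pick(0.99)}

def run_benchmark(url: str, n_requests: int = 32, concurrency: int = 4) -> dict:
    """Send n_requests ~2k-token prompts with the given concurrency."""
    prompts = ["word " * 2048] * n_requests  # crude ~2k-token prompt
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        lats = list(pool.map(lambda p: time_request(url, p, 1), prompts))
    return summarize(lats)
```

Usage would be something like `run_benchmark("http://localhost:8000/v1/chat/completions")` against whatever endpoint your serving stack exposes.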
…more to come.