I am just making some notes here for an upcoming video. What model(s) would you suggest testing via llama-bench?
Spark Platforms in test:

- MSI EdgeXpert (no thermal limits!)
- Nvidia DGX Spark (repasted with PTM7950)

Strix Halo Platforms in test:
Nvidia DGX Spark
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1608.25 ± 5.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 48.51 ± 0.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1557.78 ± 4.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 45.99 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1506.69 ± 8.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 43.34 ± 0.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1308.31 ± 9.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 40.03 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1061.15 ± 7.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 34.46 ± 0.14 |
MSI EdgeXpert DGX Spark
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1729.39 ± 19.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 52.59 ± 1.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1680.80 ± 8.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 49.10 ± 1.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1581.90 ± 13.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 46.88 ± 0.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1356.77 ± 130.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 42.87 ± 0.75 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1073.65 ± 138.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 37.25 ± 0.36 |
Strix Halo (using the “Strix Halo Toolboxes”)
bash-5.3# llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 760.45 ± 54.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 51.81 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d4096 | 687.37 ± 1.62 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d4096 | 44.72 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d8192 | 599.81 ± 5.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d8192 | 37.45 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d16384 | 462.82 ± 2.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d16384 | 28.91 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 @ d32768 | 304.50 ± 1.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 @ d32768 | 19.81 ± 0.02 |
Strix Halo Like For Like
bash-5.3# llama-bench --mmap 0 -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 1008.28 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 51.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 868.00 ± 2.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.42 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 681.53 ± 1.87 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.58 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 492.90 ± 1.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 28.95 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 309.00 ± 0.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.84 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 1061.18 ± 1.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 51.87 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 869.10 ± 1.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.59 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 681.64 ± 1.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.60 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 492.73 ± 0.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 28.92 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 309.35 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.84 ± 0.03 |
Gemma 3 12B IT QAT Q4
./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1782.56 ± 13.32 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 26.95 ± 0.22 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1639.08 ± 10.67 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 25.39 ± 0.23 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1498.74 ± 8.98 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 23.38 ± 0.24 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1328.67 ± 11.42 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 22.52 ± 0.46 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1112.42 ± 8.28 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 19.24 ± 0.10 |
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 987.75 ± 2.10 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 27.13 ± 0.00 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 731.58 ± 5.75 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 24.22 ± 0.02 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 554.30 ± 10.88 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 21.66 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 307.08 ± 2.10 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 20.23 ± 0.01 |
Memory access fault by GPU node-1 (Agent handle: 0x85d7ed0) on address 0x79bc94306000. Reason: Page not present or supervisor privilege.
It was very time-consuming to get these results, as I kept running into issues like:
[48476.062006] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48476.062016] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48478.734485] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48478.734488] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48481.406821] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48481.406826] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48484.078855] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48484.078857] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48486.754327] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48486.754333] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48489.426367] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48489.426369] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48492.099739] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48492.099744] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48494.771692] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48494.771695] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48497.443736] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48497.443739] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48500.115835] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48500.115837] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48502.793322] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48502.793328] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48505.466503] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[48505.466510] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[48515.976288] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
[48515.977056] amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
[48515.977076] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 timeout, signaled seq=129, emitted seq=132
[48515.977088] amdgpu 0000:c3:00.0: amdgpu: Starting sdma0 ring reset
[48515.977106] amdgpu 0000:c3:00.0: amdgpu: reset sdma queue (0:0:0)
[48516.211306] amdgpu 0000:c3:00.0: amdgpu: failed to wait on sdma queue reset done
[48516.211310] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[48516.211528] amdgpu 0000:c3:00.0: amdgpu: Ring sdma0 reset failure
[48516.211530] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[48662.409460] INFO: task kworker/u132:15:4723 blocked for more than 122 seconds.
[48662.409482] Not tainted 6.14.0-1015-oem #15-Ubuntu
[48662.409486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
while trying to get comparative numbers with llama-bench. (vLLM was more stable, FWIW.)
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--gemma-3-12b-it-qat-GGUF/snapshots/05c2df468ad7a0bb1284b3d6fe2bdf495a885567/gemma-3-12b-it-qat-Q4_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 989.10 ± 1.74 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 27.21 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 734.50 ± 2.93 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 24.27 ± 0.11 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 556.55 ± 14.63 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 21.70 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 308.02 ± 0.67 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 20.24 ± 0.01 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 150.48 ± 1.51 |
| gemma3 12B Q4_0 | 6.64 GiB | 11.77 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 15.57 ± 0.01 |
Qwen3 Coder 30B A3B Instruct Q8
spark2@spark-015f:~/llama.cpp$ ./build-cuda/bin/llama-bench -m ~/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 2704.86 ± 47.29 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 54.56 ± 1.32 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 2335.24 ± 37.37 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 49.22 ± 1.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 2095.88 ± 2.55 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 41.98 ± 0.40 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1630.78 ± 17.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.56 ± 0.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1168.68 ± 13.35 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 25.99 ± 0.33 |
bash-5.3# llama-bench -m /workspace/.cache/huggingface/hub/models--ggml-org--Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/snapshots/d2a13024f16ef1985cb8347c3dbad6114e0ecef6/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 910.48 ± 1.57 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 52.57 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 597.82 ± 0.86 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 39.77 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 432.54 ± 0.45 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 34.15 ± 0.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 265.82 ± 1.40 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 26.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 108.33 ± 1.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 16.35 ± 0.01 |
Two Sparks Are Not Fast
So it is possible to run something like Qwen 235B across two Sparks. This Will Not Be Fast. I am going to put the numbers here for that, but I would encourage you to instead think about this agentically, as opposed to ‘moar bigger models’. You could run a larger model on this “small” hardware to debug something, but what having a 2×128 GB memory space really buys you is more room for agents in an agentic AI setup.
Imagine being able to run a whole little fleet of 7-35B-parameter models that work together, versus one monolithic model. That’s the real benefit of scaling to multiple Sparks.
Also, being constrained in this way forces you, the developer, to make good choices about scaling your app, versus ‘pay more, throw more hardware at it’ (IMHO).
python3 level1_trensorrt_bench.py
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 64, concurrency: 8
Wall time (batch): 226.905 s
Requests: 64
Prompt tokens total: 131584
Completion tokens total: 64
Total tokens: 131648
Prompt tokens/sec: 579.91
Completion tokens/sec: 0.28
Total tokens/sec: 580.19
Latency stats (s):
mean: 28.154
p50 : 0.819
p90 : 217.427
p95 : 218.664
p99 : 219.088
Some math about this:
- Tokens per request: 131,584 / 64 ≈ 2,056 tokens per prompt.
- With concurrency 8 and 64 requests, there are 8 “waves” of requests. Wall time per wave: 226.9 / 8 ≈ 28.4 seconds, which matches our mean latency of ≈ 28.15 s.

So roughly:
- Each 2k-token prompt takes ~28 s end-to-end.
- Token throughput per request: 2,056 / 28.15 ≈ 73 tokens/s.
- With concurrency 8 → ~580 tokens/s aggregate, which matches the printed total tokens/sec.
So the numbers are self-consistent:
- Roughly 73 tokens/s of prompt ingestion per request, aggregated to ~580 tokens/s with concurrency 8.
- The ugly p90/p95/p99 of ~217–219 s means some requests sit in the queue for several waves before they actually get processed (classic long tail under load).
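The wave arithmetic can be sanity-checked in a few lines, using the numbers from the pp2048-like run above:

```python
# Sanity-check the queueing math from the pp2048-like run (64 requests, concurrency 8).
requests = 64
concurrency = 8
wall_s = 226.905
prompt_tokens = 131_584
total_tokens = 131_648
mean_latency_s = 28.154

tokens_per_prompt = prompt_tokens / requests          # ~2,056 tokens per prompt
waves = requests / concurrency                        # 8 waves of 8 requests
wall_per_wave_s = wall_s / waves                      # ~28.4 s, matches mean latency
per_request_tps = tokens_per_prompt / mean_latency_s  # ~73 tokens/s ingestion per request
aggregate_tps = total_tokens / wall_s                 # ~580 tokens/s, matches the log

print(tokens_per_prompt, wall_per_wave_s, per_request_tps, aggregate_tps)
```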
**This is “slow-ish” for a big model on an HTTP-served distributed setup, but also not insane for a 235B-style monster across 2 nodes, especially with chat overhead + network + dynamic batching. 64 requests at concurrency 8 is right out.**
32/4 looks a bit better:
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 32, concurrency: 4
Wall time (batch): 4.365 s
Requests: 32
Prompt tokens total: 65792
Completion tokens total: 32
Total tokens: 65824
Prompt tokens/sec: 15072.58
Completion tokens/sec: 7.33
Total tokens/sec: 15079.91
Latency stats (s):
mean: 0.542
p50 : 0.565
p90 : 0.572
p95 : 0.572
p99 : 0.574
With the reduced load (32 requests, concurrency 4), the same ~2,056-token prompts go through far faster: mean latency dropped from ~28.2 s per request to ~0.54 s, and aggregate prompt throughput jumped roughly 26×, from ~580 tok/s to ~15,073 tok/s. In the original run, 8 “waves” of 8 requests were effectively serialized by queueing, which is why mean latency (~28 s) matched the wall time per wave and the p90+ latencies blew out to ~217–219 s: the system was badly overloaded and requests were waiting multiple waves in the queue. In the lighter run, p50–p99 latencies are all tightly clustered around ~0.56 s, so I think that’s closer to the “true” steady-state performance of the 2-Spark deployment, instead of a queueing-dominated long-tail mess.
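One detail worth separating out: the throughput gain and the latency gain between the two runs are not the same number, and the gap is the queueing delay. A quick check on the printed figures:

```python
# Compare the heavy run (64 requests / concurrency 8) against the lighter run
# (32 requests / concurrency 4), using the figures printed by the benchmark.
heavy = {"mean_latency_s": 28.154, "prompt_tps": 579.91}
light = {"mean_latency_s": 0.542, "prompt_tps": 15072.58}

throughput_gain = light["prompt_tps"] / heavy["prompt_tps"]       # ~26x
latency_gain = heavy["mean_latency_s"] / light["mean_latency_s"]  # ~52x

# Throughput scaled ~26x, but per-request latency improved ~52x: the extra
# factor is queueing -- in the heavy run most of each request's 28 s was spent
# waiting behind earlier waves, not being processed.
print(round(throughput_gain), round(latency_gain))
```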
=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 32, concurrency: 4
Wall time (batch): 408.873 s
Requests: 32
Prompt tokens total: 2304
Completion tokens total: 1024
Total tokens: 3328
Prompt tokens/sec: 5.63
Completion tokens/sec: 2.50
Total tokens/sec: 8.14
Latency stats (s):
mean: 51.101
p50 : 8.277
p90 : 11.301
p95 : 343.857
p99 : 343.857
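One way to read that tg32-like latency distribution: a mean of ~51 s next to a p50 of ~8 s implies a handful of requests stalled for the full ~344 s. A toy reconstruction (the 28/4 split is my guess from the stats, not measured):

```python
import statistics

# Hypothetical latency mix: if ~28 requests finish in ~8.3 s and ~4 stall for
# ~343.9 s, the summary stats land very close to what the run printed.
latencies = sorted([8.3] * 28 + [343.9] * 4)

mean = statistics.mean(latencies)            # ~50 s   (run printed 51.1)
p50 = latencies[len(latencies) // 2]         # 8.3 s   (run printed 8.28)
p95 = latencies[int(0.95 * len(latencies))]  # 343.9 s (run printed 343.86)
print(mean, p50, p95)
```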
And some more messing with concurrency, because the results don’t 100% make sense to me:
root@spark-2903:/app# python3 tensorrt_llm.py
=== Mode: pp2048-like ===
Prompt target tokens (approx): 2048, max_tokens: 1
Total requests: 16, concurrency: 2
Wall time (batch): 340.913 s
Requests: 16
Prompt tokens total: 32896
Completion tokens total: 16
Total tokens: 32912
Prompt tokens/sec: 96.49
Completion tokens/sec: 0.05
Total tokens/sec: 96.54
Latency stats (s):
mean: 42.608
p50 : 0.499
p90 : 0.511
p95 : 337.385
p99 : 337.385
=== Mode: tg32-like ===
Prompt target tokens (approx): 64, max_tokens: 32
Total requests: 16, concurrency: 2
Wall time (batch): 52.042 s
Requests: 16
Prompt tokens total: 1152
Completion tokens total: 512
Total tokens: 1664
Prompt tokens/sec: 22.14
Completion tokens/sec: 9.84
Total tokens/sec: 31.97
Latency stats (s):
mean: 6.493
p50 : 6.000
p90 : 7.261
p95 : 9.026
p99 : 9.026
Here’s the script I cobbled together for testing multiple Sparks in case it helps someone:
tensorrt_llm_test.py.txt (4.1 KB)
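For anyone who can’t grab the attachment, here is a minimal sketch of the same idea (not the attached script): fire N concurrent chat requests at an OpenAI-compatible endpoint and print the same mean/p50/p90/p95/p99 summary as the logs above. The URL, payload shape, and prompt construction are assumptions.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def time_request(url: str, prompt: str, max_tokens: int) -> float:
    """POST one chat completion and return wall-clock latency in seconds."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - t0

def summarize(latencies: list[float]) -> dict:
    """The same mean/p50/p90/p95/p99 summary the logs above print."""
    xs = sorted(latencies)
    pick = lambda q: xs[min(int(q * len(xs)), len(xs) - 1)]
    return {"mean": statistics.mean(xs), "p50": pick(0.50),
            "p90": pick(0.90), "p95": pick(0.95), "p99": pick(0.99)}

def run_benchmark(url: str, n_requests: int = 32, concurrency: int = 4) -> dict:
    """Send n_requests ~2k-token prompts with the given concurrency."""
    prompts = ["word " * 2048] * n_requests  # crude ~2k-token prompt
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        lats = list(pool.map(lambda p: time_request(url, p, 1), prompts))
    return summarize(lats)
```

Usage would be something like `run_benchmark("http://localhost:8000/v1/chat/completions")` against whatever endpoint your serving stack exposes.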
…more to come.