Ahh thanks, that one is easy enough then, even if TTFT isn't that useful without knowing the prompt/context size.
So I finally have a moment this morning, after some coffee, to look at the fine print of those Intel SGLang benchmark results. To keep it simple, let's look only at the DeepSeek-R1-671B INT8 quant results:
| MODEL | DATA TYPE | SOCKETS | llama.cpp TTFT (ms) | llama.cpp TPOT (ms) | SGLang TTFT (ms) | SGLang TPOT (ms) | Speedup TTFT | Speedup TPOT |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-671B | INT8 | 2 | 24546.76 | 172.01 | 1885.25 | 67.99 | 13.0x | 2.5x |
So in their testing they claim llama.cpp is getting about 1/0.17201 ≈ 5.81 tok/sec generation speed while the updated SGLang is getting about 1/0.06799 ≈ 14.71 tok/sec.
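(TPOT is just the reciprocal of decode throughput, so the conversion from the table is a one-liner:)

```python
# Convert the table's TPOT (ms per output token) to tokens/sec.
def tpot_to_tps(tpot_ms: float) -> float:
    """Decode throughput in tokens/sec from time-per-output-token in ms."""
    return 1000.0 / tpot_ms

print(f"llama.cpp: {tpot_to_tps(172.01):.2f} tok/sec")  # ~5.81
print(f"SGLang:    {tpot_to_tps(67.99):.2f} tok/sec")   # ~14.71
```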
In my own benchmarking almost four months ago I could get 7.5 tok/sec generation on an old version of ik_llama.cpp running full-fat Q8_0 quantization on a single socket, using the same SNC=Disable BIOS setting they use (which I had also suggested previously). Mainline llama.cpp came in slightly lower at 6.2 tok/sec despite having some AMX extension support.
If we look at the fine print even more closely, the report says:
> As llama.cpp does not implement Multi Numa Parallelism we mentioned above, running 1 instance on dual sockets is even slower than on a single socket. To be fair, we use 2 instances for llama.cpp on dual sockets, 1 for each, and collect metrics of TTFT and TPOT.
So by the same logic, to be fair I should double my benchmark result, since each socket could independently achieve the rate I measured: 7.5 + 7.5 = 15 tok/sec aggregate. That would make four-month-old ik_llama.cpp slightly faster than their new SGLang implementation in terms of aggregate tok/sec generation speed across a two-socket rig.
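Spelling out that comparison (the 7.5 tok/sec per-socket figure is from my own ik_llama.cpp run above, not from their report):

```python
# Aggregate decode throughput on a dual-socket rig, following the report's
# own methodology of running one llama.cpp instance per socket.
ik_llama_per_socket = 7.5           # tok/sec, Q8_0, single socket, SNC=Disable
sglang_dual_socket = 1000 / 67.99   # tok/sec, one instance spanning both sockets

ik_llama_aggregate = 2 * ik_llama_per_socket  # two independent instances
print(f"ik_llama.cpp aggregate: {ik_llama_aggregate:.1f} tok/sec")  # 15.0
print(f"SGLang aggregate:       {sglang_dual_socket:.1f} tok/sec")  # ~14.7
```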
My other observation is that they chose INT8 and BF16 dtypes for their quants, likely because the Xeon 6980P's lscpu flags include `amx_bf16 amx_int8 amx_tile`. So without one of these AMX-accelerated dtypes it would presumably be even slower.
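If you want to check whether your own CPU advertises AMX, lscpu surfaces these same feature bits from `/proc/cpuinfo`, so a quick sketch like this works on Linux:

```python
# Check for AMX feature flags on Linux. Only the first "flags" line is
# inspected, since the flags are uniform across logical CPUs.
def amx_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return {fl for fl in line.split(":", 1)[1].split()
                        if fl.startswith("amx_")}
    return set()

print(amx_flags())  # e.g. {'amx_tile', 'amx_int8', 'amx_bf16'} on a Xeon 6980P
```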
One last important detail they acknowledge is:
> Since the decoding phase tends to be memory bandwidth bound, the speedup ratio in TPOT is much smaller
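That matches a simple roofline estimate: each decoded token has to stream the active weights through memory, so bandwidth sets a hard ceiling regardless of compute. A rough sketch, where the ~37B active parameters of DeepSeek-R1 and the per-socket bandwidth figure are my own assumptions, not numbers from their report:

```python
# Rough roofline for decode: every token must stream the active expert
# weights from memory. Numbers are illustrative assumptions.
active_params = 37e9     # DeepSeek-R1 activates ~37B of its 671B params/token
bytes_per_param = 1.0    # INT8 weights
bw_per_socket = 500e9    # assumed achievable bytes/sec (~500 GB/s) per socket

bytes_per_token = active_params * bytes_per_param
ceiling = bw_per_socket / bytes_per_token
print(f"~{ceiling:.1f} tok/sec bandwidth ceiling per socket")  # ~13.5
```

Both the 5.81 and 7.5 tok/sec single-socket numbers sit comfortably under that ceiling, which is consistent with decode being bandwidth bound.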
So the tl;dr here for me: while it's nice that they can approach ik_llama.cpp speeds for a single generation spanning both CPU sockets (using an AMX-friendly dtype), the overall aggregate tokens/second generation seems no faster, and possibly even slightly slower, if I'm reading this right.
Interesting.