Ahh thanks, that one is easy enough then, even if TTFT isn't that useful without knowing the prompt/context size.
So I finally have a moment this morning, after some coffee, to look at the fine print of those Intel SGLang benchmark results. To keep it simple, let's look only at the DeepSeek-R1-671B INT8 quant results:
| MODEL | DATA TYPE | SOCKETS | llama.cpp TTFT (ms) | llama.cpp TPOT (ms) | SGLang TTFT (ms) | SGLang TPOT (ms) | Speedup TTFT | Speedup TPOT |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-671B | INT8 | 2 | 24546.76 | 172.01 | 1885.25 | 67.99 | 13.0x | 2.5x |
So in their testing they claim llama.cpp is getting about 1/0.17201 ≈ 5.81 tok/sec generation speed while the updated SGLang is getting about 1/0.06799 ≈ 14.71 tok/sec.
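(TPOT is just the reciprocal of decode throughput, so the conversion from the table is a one-liner:)

```python
# Convert the table's TPOT (ms per output token) to tokens/sec.
def tpot_to_tps(tpot_ms: float) -> float:
    """Decode throughput in tokens/sec from time-per-output-token in ms."""
    return 1000.0 / tpot_ms

print(f"llama.cpp: {tpot_to_tps(172.01):.2f} tok/sec")  # ~5.81
print(f"SGLang:    {tpot_to_tps(67.99):.2f} tok/sec")   # ~14.71
```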
In my own benchmarking almost four months ago I could get 7.5 tok/sec generation on an old version of ik_llama.cpp running full-fat Q8_0 quantization on a single socket, using the same SNC=Disable BIOS setting they use (which I had also suggested previously). Mainline llama.cpp came in slightly lower at 6.2 tok/sec despite having some AMX extension support.
If we look at the fine print even more closely, the report says:
> As llama.cpp does not implement Multi Numa Parallelism we mentioned above, running 1 instance on dual sockets is even slower than on a single socket. To be fair, we use 2 instances for llama.cpp on dual sockets, 1 for each, and collect metrics of TTFT and TPOT.
So by the same logic, to be fair I should double my benchmark result, since each socket could independently achieve the rate I measured: 7.5 + 7.5 = 15 tok/sec aggregate. That would make four-month-old ik_llama.cpp slightly faster than their new SGLang implementation in terms of aggregate tok/sec generation speed across a two-socket rig.
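Spelling out that comparison (the 7.5 tok/sec per-socket figure is from my own ik_llama.cpp run above, not from their report):

```python
# Aggregate decode throughput on a dual-socket rig, following the report's
# own methodology of running one llama.cpp instance per socket.
ik_llama_per_socket = 7.5           # tok/sec, Q8_0, single socket, SNC=Disable
sglang_dual_socket = 1000 / 67.99   # tok/sec, one instance spanning both sockets

ik_llama_aggregate = 2 * ik_llama_per_socket  # two independent instances
print(f"ik_llama.cpp aggregate: {ik_llama_aggregate:.1f} tok/sec")  # 15.0
print(f"SGLang aggregate:       {sglang_dual_socket:.1f} tok/sec")  # ~14.7
```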
My other observation is that they chose INT8 and BF16 dtypes for their quants, likely because the Xeon 6980P's lscpu flags include `amx_bf16 amx_int8 amx_tile`. So without one of these AMX-accelerated dtypes it would presumably be even slower.
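If you want to check whether your own CPU advertises AMX, lscpu surfaces these same feature bits from `/proc/cpuinfo`, so a quick sketch like this works on Linux:

```python
# Check for AMX feature flags on Linux. Only the first "flags" line is
# inspected, since the flags are uniform across logical CPUs.
def amx_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return {fl for fl in line.split(":", 1)[1].split()
                        if fl.startswith("amx_")}
    return set()

print(amx_flags())  # e.g. {'amx_tile', 'amx_int8', 'amx_bf16'} on a Xeon 6980P
```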
One last important detail they acknowledge is:
> Since the decoding phase tends to be memory bandwidth bound, the speedup ratio in TPOT is much smaller
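That matches a simple roofline estimate: each decoded token has to stream the active weights through memory, so bandwidth sets a hard ceiling regardless of compute. A rough sketch, where the ~37B active parameters of DeepSeek-R1 and the per-socket bandwidth figure are my own assumptions, not numbers from their report:

```python
# Rough roofline for decode: every token must stream the active expert
# weights from memory. Numbers are illustrative assumptions.
active_params = 37e9     # DeepSeek-R1 activates ~37B of its 671B params/token
bytes_per_param = 1.0    # INT8 weights
bw_per_socket = 500e9    # assumed achievable bytes/sec (~500 GB/s) per socket

bytes_per_token = active_params * bytes_per_param
ceiling = bw_per_socket / bytes_per_token
print(f"~{ceiling:.1f} tok/sec bandwidth ceiling per socket")  # ~13.5
```

Both the 5.81 and 7.5 tok/sec single-socket numbers sit comfortably under that ceiling, which is consistent with decode being bandwidth bound.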
So the tl;dr here for me: while it's nice that they can approach ik_llama.cpp speeds for a single generation spanning both CPU sockets (using an AMX-friendly dtype), the overall aggregate tokens/second generation seems no faster, and possibly even slightly slower, if I'm reading this right.
Interesting.