Oh I was interested in this. I am still on Threadripper 5000 with DDR4 and was asking myself how much higher the memory bandwidth would be compared to my setup. Are those 6800MT ECC DIMMs or regular UDIMMs?
Interesting read. I hope this limitation gets resolved in the future, as IMO we need that to keep local inference relevant, seeing how good these big MoE models are.
What can I say? Kimi at Q8 at 40 tok/s? lol
More seriously, last month I would have said DeepSeek somewhere around Q4 (or one of your R4 quants that look promising), as fast as our budget allows.
But things have changed: Kimi got released and we get much, much better performance on our more complex projects. Now we’ve gotten used to it…
For now we are using OpenRouter when we need those big MoE models, and we are using dense models (<125B) locally for confidential projects (and a lot of our own brains, ofc).
We had a budget of around 30k; I was planning to order a 9575F or similar and an RTX Pro and call it a day. But honestly, when I saw what kind of speeds we could expect, I kept postponing that purchase.
We’ve reconsidered our budget seeing what kind of performance we can get from those models. The shortened time to production alone makes me think it could be profitable.
So yeah, to answer your question, I’d say aim at Kimi around Q4, as fast as 50k USD allows. Maybe 4 RTX Pros and… I was thinking a single-socket 9005. And hope it is kind of future-proof.
I’ve seen some GH200s with 141GB of VRAM at around 30k, which are really an H200 with a free CPU and RAM. But I fear having a niche ARM system will get me into trouble I really don’t need, and the mobo I found only exposes 3 PCIe x16 slots, so I’d say no way.
idk about Intel Xeon 6. Wendell’s video was pretty underwhelming, but maybe software support will get better.
Also, there are 3 of us using this system in our office. From my understanding, batching doesn’t yield better results on CPU anyway… so there’s something else.
Especially not the 7995WX lol. What RAM are you using? I couldn’t find anything above 6000 MT/s with more than 32GB per DIMM, IIRC.
V-Color ECC, specifically tuned for WRX90 setups.
Also, now that the new 9000 series are out, I’d be interested in what even the lower core-count SKUs can push as far as GB/s.
see above, ECC from V-Color! Great sticks.
My 5975wx was getting ~100GB/s on 3200 sticks, so it was quite an improvement.
Sure is, but I can’t justify 1TB of memory setting me back $12k at the moment. Maybe further down the road when I have an actual product that makes money I can consider it. It’s not a money making machine for me yet.
have you looked into that?
Yes I have, but my interest has died down a bit because these companies don’t make the best impression, so I wouldn’t use them for anything involving confidential data. Hence I won’t consider them for anything more than basic testing.
Okay, so this is only tangentially relevant, but I need a sanity check here: how useful would a Ryzen 9 9950X be, by itself, as an upgrade over a 7800X3D for this purpose (and other general usage)? At a slight discount, the thing is quite a bit less expensive than an RTX 3090 of questionable vintage, but still not cheap.
Motherboard is the MSI B650 Tomahawk Wifi with 14 CPU phases, which I take is sufficient as long as I don’t overclock (In fact, I would put a reduced temperature limit on it like I do with my current 7800X3D.) Cooler is Thermalright PS120, which is supposed to be “good enough”.
What worried me a little is the memory, which is 4x64GB DDR5, in the dreaded 2R 2DPC configuration, pushed to stable 4800MT/s with the current motherboard/CPU combo. I understand there is a fair chance of the new processor not getting even that speed stable, thereby netting a general performance regression for what is supposed to be an upgrade.
Do I upgrade, or - since I don’t have the room or goodwill for server stuff - save the effort for when a Zen 3 TR Pro and motherboard at last drop close to that price? To be honest, upgrading only a single generation for no new capability (unlike more RAM, which would add some, but it’s already maxed out), for what might be 2-3x prompt processing speed in this specific use case and maybe double the throughput at best in everything else not otherwise bandwidth- or thread-bound, felt extravagant when the current setup already works.
Hey @wendell, @ubergarm check this out:
Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang
Multi-Numa Parallelism via Tensor Parallelism (TP).
I wasn’t aware the intel team was working on this. From that paper:
To be fair, we use 2 instances for llama.cpp on dual sockets, 1 for each, and collect metrics of TTFT and TPOT.
Hah, we did do the same thing, which I commented on in the paper. Ubergarm and I have each been doing wild and crazy experiments on the dual Intel 6980P system (I murdered one, apparently; Gigabyte loaned me another chassis to experiment with, which worked great). Socket-to-socket was indeed a problem at runtime too, and at least I had fun loading two identical copies, one on each NUMA node, so it was always “local” access, but the whole time thinking “at what cosstttttttt” in my head haha.
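For anyone who wants to try the two-copies trick, here’s a rough sketch of what it boils down to: numactl pinning one llama-server per socket. The binary path, model path, ports, and thread count below are placeholders for your own setup.

```python
# Rough sketch: one llama-server per NUMA node so every weight read stays "local".
# Binary/model paths, ports and thread counts are placeholders -- tune for your box.
import subprocess

SERVER = "./llama-server"                  # llama.cpp server binary (placeholder)
MODEL = "/models/DeepSeek-R1-Q8_0.gguf"    # placeholder model path

procs = []
for node, port in [(0, 8080), (1, 8081)]:
    cmd = [
        "numactl",
        f"--cpunodebind={node}",   # run threads only on this socket
        f"--membind={node}",       # allocate this copy of the model from local RAM
        SERVER,
        "-m", MODEL,
        "--port", str(port),
        "-t", "64",                # threads per instance; roughly cores per socket
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```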
I think they might have had even more success using ik_llama.cpp and q8_k_r8: experimental AVX512 version by ikawrakow · Pull Request #610 · ikawrakow/ik_llama.cpp · GitHub
…I kinda want to do a video on this work the Intel team did, since it is “mainline”, but yes… lots of low-hanging fruit for crazy people running on CPU with lots of memory bandwidth.
Need more socket to socket bandwiddthhhhhhhhh lol.
Ouch
that’s because it’s new I guess lol
Don’t worry just the cost to provision ram for the next 1500B moe haha
If you have the time, don’t hesitate; I’m sure there is an audience for that.
Yeah, a guy on reddit sent me that link to that sglang on Intel 6980P paper. I had some discussions early on with Mingfei Ma, the Intel PyTorch developer who was doing that project. He did the AMX support on mainline llama.cpp as well. Back in mid-March he said that after improving support on sglang he would try to improve the situation on mainline llama.cpp as well, in this now-closed issue here: Eval bug: does llama.cpp support Intel AMX instruction? how to enable it · Issue #12003 · ggml-org/llama.cpp · GitHub
At some point I still need to translate their numbers, e.g. TTFT/TPOT (apparently what enterprise uses?), into the “normal” prompt processing and token generation tok/sec numbers that I use for comparisons.
My home rig is a 9950X, and I have an extensive thread elsewhere on this forum about how I clocked 2x48GB DDR5@6400MT/s and tweaked the Infinity Fabric to get “gear 1” and around 86GB/s RAM bandwidth.
The main advantage I see is that the Zen 5 9950X could get some improved prompt processing speed using that experimental PR. But your memory bandwidth would likely be the same, especially with the “verboten” 4x DIMM configuration forcing you to clock lower.
Probably not worth the hassle unless you already have a use case that would benefit from a little more prompt processing speed (guessing maybe 1.5x at best but honestly not sure) but the same token generation speed.
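Back-of-the-envelope math on why the bandwidth side barely changes: the theoretical dual-channel DDR5 peak is just transfer rate × 8 bytes per channel × channels, and sustained numbers land well below that (e.g. the ~86GB/s measured at 6400MT/s above). A tiny sketch:

```python
# Theoretical dual-channel DDR5 peak: MT/s * 8 bytes/transfer * channels.
# Measured/sustained bandwidth lands well below this (e.g. ~86GB/s at 6400MT/s).
def peak_gb_s(mt_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> float:
    return mt_s * 1e6 * bytes_per_transfer * channels / 1e9

print(peak_gb_s(6400))  # ~102.4 GB/s ceiling for 2x48GB @ 6400MT/s
print(peak_gb_s(4800))  # ~76.8 GB/s ceiling for the 4x64GB 2DPC @ 4800MT/s setup
```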
Guess I’d better learn to be happy with what I have, then.
Hopefully the next generation will do something about the 2DPC 2R bandwidth, and about the per-chiplet IF bottleneck too.
And now there’s GLM-4.5 at 3550G params for the full version. That would be… a 724GiB quant at 1.75-bit, and that’s assuming - in llama.cpp terms - that anyone would get a good imatrix off of it for such a deep quant to be any good, and that said quant would be any better than the Q4 quants of mere 1T-scale models which are about the same size. Yeesh.
The “light” version is comparable to Kimi-k2 in size, but with several times more active parameters at 120B. Mistral Large or thereabouts in scale.
EDIT: Off by exactly one order of magnitude. Thank you, machine translation…
Going to be interesting to see how these do in independent benchmarks.
Sorry, just checked: the big one is 355B with 32B active weights, and the Air is 106B with 12B active…
So the big one is about half the size of DeepSeek, and the small one should run in around, idk, 120 gigs at Q4. I can’t seem to find whether it uses MLA or not; I guess not.
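Rough weights-only math for the sizes, if anyone wants a ballpark: just params × bits-per-weight ÷ 8. The 4.5 bpw stand-in for “Q4-ish” is an assumption, real GGUF quants mix bpw per tensor, and you still need headroom for KV cache and context on top.

```python
# Weights-only ballpark for a quant: params * bits-per-weight / 8 bytes.
# 4.5 bpw as a stand-in for "Q4-ish" is an assumption; add headroom for KV cache/context.
def quant_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(quant_gib(355, 4.5))  # big GLM-4.5 (355B)  -> ~186 GiB of weights
print(quant_gib(106, 4.5))  # GLM-4.5-Air (106B)  -> ~56 GiB of weights
```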
I blame machine translation.
What I saw:
GLM-4.5 series models are foundational models designed specifically for agents. GLM-4.5 has a total of 3550 billion parameters, with 320 billion active parameters; GLM-4.5-Air adopts a more compact design, with a total of 1060 billion parameters, and 120 billion active parameters. The GLM-4.5 model unifies reasoning, encoding, and agent capabilities to meet the complex needs of agent applications.
I believe it took the ideograph meaning 100M for 1B. And I thought it looked competent…
Serves me right for trusting the machine translation.
Still, gonna be interesting to see how their claims hold up.
That’s Time To First Token and Time Per Output Token. 1/TPOT to get tg tok/s seems straightforward enough, but it’s hard to convert TTFT into pp tok/s without knowing the initial input size. Also TTFT might contain a tg token duration whereas llama.cpp’s pp time probably-maybe doesn’t.
Ahh thanks, that one is easy enough then even if TTFT is not so useful without knowledge of the prompt/context size.
So I finally have a moment this morning after drinking some coffee to look at the fine print of that intel sglang benchmark results. To keep it simple let’s look only at the DeepSeek-R1-671B INT8 quant report:
| MODEL | DATA TYPE | SOCKETS | llama.cpp TTFT (ms) | llama.cpp TPOT (ms) | SGLang TTFT (ms) | SGLang TPOT (ms) | Speedup TTFT | Speedup TPOT |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-671B | INT8 | 2 | 24546.76 | 172.01 | 1885.25 | 67.99 | 13.0x | 2.5x |
So in their testing they claim llama.cpp is getting about (1/0.17201) = 5.81 tok/sec generation speed while the updated sglang is getting ~14.71 tok/sec.
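In code form, for anyone following along: only the TPOT column converts cleanly; TTFT can’t be turned into pp tok/sec without knowing the prompt length behind it.

```python
# Token generation speed from the paper's TPOT column: tok/s = 1000 / TPOT(ms).
def tg_tok_s(tpot_ms: float) -> float:
    return 1000.0 / tpot_ms

print(tg_tok_s(172.01))  # llama.cpp -> ~5.81 tok/s
print(tg_tok_s(67.99))   # sglang    -> ~14.71 tok/s
# TTFT can't become a pp tok/s number without knowing the prompt length they used.
```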
In my own benchmarking almost four months ago I could get 7.5 tok/sec generation on an old version of ik_llama.cpp running full fat Q8_0 quantization on a single socket using the same SNC=Disable they use (which I had also suggested prior). Mainline llama.cpp was getting slightly less at 6.2 tok/sec despite some AMX extensions support.
If we look at the fine-print even closer the paper says:
As llama.cpp does not implement Multi Numa Parallelism we mentioned above, running 1 instance on dual sockets is even slower than on a single socket. To be fair, we use 2 instances for llama.cpp on dual sockets, 1 for each, and collect metrics of TTFT and TPOT.
So to be fair I should double my benchmark result, since each socket could achieve the rate I measured independently: 7.5 + 7.5 = 15 tok/sec. So four-month-old ik_llama.cpp would be slightly faster than their new sglang implementation in terms of aggregate tok/sec generation speed across a two-socket rig.
My other observation is that they chose dtypes such as int8 and bf16, which is likely because the Xeon 6980P’s lscpu flags include amx_bf16 amx_int8 amx_tile. So without a special quant it would be even slower.
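If you want to check whether your own box exposes those AMX units before chasing an AMX-friendly dtype, a quick sketch that just reads the same flags lscpu reports (Linux only):

```python
# Look for the AMX flags mentioned above (amx_bf16 / amx_int8 / amx_tile) in
# /proc/cpuinfo -- the same information `lscpu` prints under Flags. Linux only.
def amx_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return {flag for flag in line.split() if flag.startswith("amx")}
    return set()

print(amx_flags() or "no AMX flags found on this CPU")
```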
One last important detail they acknowledge is:
Since the decoding phase tends to be memory bandwidth bound, the speedup ratio in TPOT is much smaller
So the tl;dr here for me: while it’s nice that they are able to approach the speeds of ik_llama.cpp for a single generation spanning both CPU sockets when using a special quant dtype, the overall aggregate tokens/second generation seems no faster, or possibly even slightly slower, if I’m reading this right.
Interesting.
Darn, I was hoping that would turn out to be more of a breakthrough. Good to know ik_llama.cpp is putting up a good fight, at least.
I think I am missing something here: what exactly is the reason to benchmark two instances of ik_llama.cpp on the two sockets and compare that? What use are two instances to me at 100% speed each? The big thing here is that they have one instance at 200% speed, isn’t it? I.e., that a single instance can draw on the bandwidth of both sets of memory, or am I getting this wrong?