9950X3D Reviewed and Benchmarked -- Tokens?

I think an interesting benchmark would be llama.cpp: load a decent-sized model (no GPU), run a suite of prompts, and evaluate tokens/sec.
Thoughts?
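
Something like this is what I'm picturing (rough sketch using the llama-cpp-python bindings; the model path, prompts, and thread count are just placeholders):

```python
# Rough sketch of the benchmark idea: CPU-only generation speed via llama-cpp-python.
# Model path and prompts are placeholders; n_threads should match your physical core count.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model-Q4_K_M.gguf",  # hypothetical GGUF path
    n_ctx=4096,
    n_threads=16,    # physical cores usually beat SMT threads here
    n_gpu_layers=0,  # force pure CPU inference
)

prompts = [
    "Explain the difference between L3 cache and DRAM.",
    "Summarize how a mixture-of-experts model works.",
]

total_tokens = 0
total_time = 0.0
for p in prompts:
    start = time.perf_counter()
    out = llm(p, max_tokens=256)
    total_time += time.perf_counter() - start
    total_tokens += out["usage"]["completion_tokens"]

print(f"{total_tokens / total_time:.2f} tok/sec (generation, CPU only)")
```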

I imagine the results would say a lot more about the memory configuration than the CPU.


Yeah, as @tk11 says, the best way to run LLMs is generally entirely in VRAM on a GPU.

That said, I do a lot of CPU/GPU offload inferencing of larger models that are too big to fit into my single 3090 Ti FE's 24GB of VRAM. For example, the unsloth R1 671B UD-Q2_K_XL MoE quant at 3.5 tok/sec on ktransformers, or a dense 70B IQ_XS at maybe 4.5 tok/sec.

In that situation the bottleneck is almost always memory I/O bandwidth, not the CPU.
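
As a rough sanity check (ballpark assumptions, not measurements): a dense model has to stream essentially all of its weights from RAM for every generated token, so tokens/sec is capped at roughly memory bandwidth divided by model size.

```python
# Back-of-envelope ceiling for dense-model CPU decoding:
# every token reads roughly the full weight file, so
#   tok/sec  <=  memory bandwidth / model size.
# Both numbers below are assumptions, not measurements.
bandwidth_gb_s = 89.6   # ~dual-channel DDR5-5600 theoretical peak
model_size_gb = 26.0    # e.g. a 70B model at a ~3-bit quant

print(f"~{bandwidth_gb_s / model_size_gb:.1f} tok/sec ceiling")  # ~3.4 tok/sec
```

Partial GPU offload moves some of that weight traffic into VRAM, which is why real-world numbers can land a bit above a pure-CPU ceiling like this.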

Generally the weights are far too big to benefit from extra CPU cache, though I've seen a few folks manually tuning the exact tensor placement and number of threads with llama.cpp to try to take advantage of it for R1, but it's super experimental.
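
If anyone wants to poke at the thread-count side of that themselves, a crude sweep does the job (again just a sketch with llama-cpp-python and a placeholder model path):

```python
# Crude thread-count sweep to see where tok/sec stops scaling
# (memory bandwidth usually saturates well before core count does).
import time
from llama_cpp import Llama

for threads in (4, 8, 16, 32):
    llm = Llama(model_path="models/some-model-Q4_K_M.gguf",  # hypothetical path
                n_threads=threads, n_gpu_layers=0, verbose=False)
    start = time.perf_counter()
    out = llm("Write a haiku about memory bandwidth.", max_tokens=128)
    toks = out["usage"]["completion_tokens"]
    print(f"{threads} threads: {toks / (time.perf_counter() - start):.2f} tok/sec")
    del llm  # free the model before loading the next configuration
```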

That said, my 9950X is already amazing for gaming, workstation tasks, and LLM inferencing.

Though it would be sweet to have the extra cache for gaming!