You may want to know that llama.cpp is the backend Ollama is ultimately running on; Ollama is more or less an API/model-repo/UI layer on top of it. Recent releases of Ollama ship llama.cpp runtimes built for the newest instruction set your processor supports, as well as GPU and non-GPU variants.
Your UI should be able to show tokens per second and other stats alongside each response, and I think the same numbers show up in the Ollama logs as well.
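If your frontend doesn't surface it, you can also pull the numbers straight from Ollama's HTTP API, which returns `eval_count` and `eval_duration` with each (non-streamed) response. A minimal sketch, assuming Ollama is on its default port and you swap in a model you actually have pulled:

```python
# Rough sketch: compute tokens/s from the Ollama HTTP API timing fields.
# Assumes Ollama is running locally on the default port 11434 and that
# "llama3" (replace with whatever model you have pulled) is available.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain memory bandwidth in one sentence.",
        "stream": False,  # one JSON object, including the timing fields
    },
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"generation speed: {tps:.1f} tokens/s")
```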
My understanding is that llama.cpp, and by extension Ollama, performance on CPU/RAM in a non-NUMA setup like yours is ultimately bottlenecked by memory bandwidth. That's one of the (smaller) reasons why virtually every new large-scale open-weight model released since around May, starting with DeepSeek-R1-0528, has been MoE: it's one of the only ways to get usable performance out of consumer-grade CPU/RAM at the parameter counts those models run at.
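A back-of-the-envelope way to see why: every generated token has to stream the model's active weights through RAM, so tokens/s is roughly bandwidth divided by bytes touched per token. The numbers below (dual-channel DDR5 bandwidth, ~4.5 bits per weight for a Q4-ish quant, ~37B active parameters for R1) are ballpark assumptions, not measurements:

```python
# Back-of-the-envelope ceiling on CPU generation speed:
# tokens/s ~= memory bandwidth / bytes of weights read per token.
# All numbers are rough assumptions for illustration only.

BANDWIDTH_GBPS = 60          # ~dual-channel DDR5-4800, in GB/s
BYTES_PER_PARAM = 4.5 / 8    # ~Q4-ish quantization, ~4.5 bits/weight

def max_tps(active_params_billion: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

# Dense 70B: every weight is read for every token.
print(f"dense 70B: ~{max_tps(70):.1f} tok/s ceiling")
# DeepSeek-R1 (MoE): 671B total, but only ~37B active per token.
print(f"R1 (MoE):  ~{max_tps(37):.1f} tok/s ceiling")
```

Real numbers come in under these ceilings, but the dense-vs-MoE gap is the point.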
As an aside, Ollama's model grouping is confusing. R1-0528 is very different from the original R1 in response style and quality, and only the 671B versions of DeepSeek-V3 and R1 are the "real" DeepSeek models; everything else under those names is a much less capable finetune of some other LLM on DeepSeek outputs. That said, you need at least 128GB of RAM to even start trying a deep quant of the full version.
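To put a rough number on that 128GB figure, here's the weight-only size of a 671B-parameter model at a few quant levels. The bits-per-weight values are approximations I'm assuming for common llama.cpp quant types, and this ignores KV cache and runtime overhead:

```python
# Rough weight-only sizes for a 671B-parameter model at different
# quantization levels (ignores KV cache, context, and runtime overhead).
PARAMS = 671e9

# Approximate effective bits per weight; treat these as ballpark values.
for name, bits in [("Q4_K_M", 4.8), ("Q2_K", 2.6), ("IQ1_S", 1.6)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>7}: ~{gb:.0f} GB")
```

Even the most aggressive quants land in the 130GB+ range for the weights alone, which is why 128GB is about the floor for experimenting with it.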
I’ve been getting better performance on Ubuntu than on Windows, especially once the model size starts to spill over the available RAM.
There’s no such thing as a noob question; we all had to start somewhere.