Ollama on CPU Performance - Some Data and a Request for More

So recently I’ve been evaluating my options for running local LLMs, and I was surprised that I couldn’t find good benchmark data on Threadripper 7000 vs. Threadripper PRO 7000 vs. EPYC.

Now, I understand(ish?) the performance limitations of running models on CPUs versus GPUs. However, given how GPU prices scale with VRAM, it seems that if you want to run larger models (anything over 80GB in size) the best case is paying roughly ~CAD$5,000 for four AMD 7900 XTXs (or 3090s), or ~CAD$10K for two workstation cards with 48GB of VRAM each. So in some scenarios you really do seem to come out ahead with a CPU-only system and a ton of RAM, particularly when it can pull double duty for HPC simulations.

I’ve got a system with a TR 7980X and 256GB of RAM and one with a Ryzen 5 7600 with 32GB of RAM, so I ran some basic tests to check whether performance scales linearly with memory bandwidth. Lo and behold, it does seem to scale directly with the number of memory channels, with larger models getting slightly better than linear gains. The table below shows only the evaluation tokens/second, since that metric captured the results well enough.
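For what it’s worth, the scaling lines up with a simple memory-bandwidth bound: each decoded token has to stream essentially the whole set of weights through the CPU, so tok/s is capped at roughly bandwidth divided by model size. A minimal sketch of that estimate, where the bandwidth and model-size figures are my own rough assumptions (theoretical DDR5-5200 numbers and an approximate Q4 llama3:8b footprint), not measured values:

```python
# Back-of-envelope: decode speed of a memory-bound model is roughly
#   tok/s <= effective memory bandwidth / bytes read per token
# Bandwidth figures are theoretical per-channel DDR5-5200 numbers I assumed,
# not measured; real sustained bandwidth will be lower.
GB = 1e9

systems = {
    "TR 7980X (4ch DDR5-5200)":  4 * 5200e6 * 8,  # ~166 GB/s theoretical
    "Ryzen 7600 (2ch DDR5-5200)": 2 * 5200e6 * 8, # ~83 GB/s theoretical
}

model_bytes = 4.7 * GB  # approx. weights streamed per token for llama3:8b at Q4

for name, bandwidth in systems.items():
    print(f"{name}: upper bound ~{bandwidth / model_bytes:.1f} tok/s")
```

The predicted ~2x gap between the two systems matches the 1.9–2.3x in the table below, and it also suggests why 8- or 12-channel platforms should keep scaling.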

The systems were both running up-to-date Fedora Silverblue, with Ollama running in an Arch toolbx container. The prompt was “write a 200 word summary on category theory and the growing field of applied category theory.”
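In case anyone wants to reproduce the numbers, here’s a minimal sketch of pulling the evaluation rate from Ollama’s HTTP API (roughly what `ollama run --verbose` reports as the eval rate). The script is just an illustration of the method, not the exact harness I used:

```python
import json
import urllib.request

# Assumes the default Ollama endpoint on localhost:11434.
PROMPT = ("write a 200 word summary on category theory and the growing "
          "field of applied category theory.")

def eval_rate(model: str) -> float:
    """Return evaluation tokens/second reported by Ollama for one run."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": PROMPT,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_duration is reported in nanoseconds
    return stats["eval_count"] / stats["eval_duration"] * 1e9

for model in ["llama3:8b", "dolphin-mixtral:8x7b"]:
    print(model, f"{eval_rate(model):.2f} tok/s")
```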

| Model | 7980X + 256GB (tok/s) | 7600 + 32GB (tok/s) | Perf Gain |
|---|---|---|---|
| llama3:8b | 20.49 | 10.72 | 1.91× |
| dolphin-mixtral:8x7b | 15.37 | 6.6 | 2.33× |
| dolphin-mixtral:8x22b | 5.83 | INSF_MEM | N/A |
| llama3.1:405b | 0.61 | INSF_MEM | N/A |

You may notice, however, that this is really only two data points. So I was wondering if anyone in the community has access to a Threadripper PRO system with 8 memory channels, a newer EPYC system with 12 channels, or (even better!) a 2P EPYC system with 24 channels that they’d be able to run a quick test on. I’d love to know how this stuff performs on CPUs with much higher memory bandwidth.

4 Likes
| Model | RTX 3060 (tok/s) | 7950X + 96GB (tok/s) | Perf Gain |
|---|---|---|---|
| llama3:8b | 62.17 | 15.38 | 4.04× |
| dolphin-mixtral:8x7b | 12.17 (+CPU) | 9.4 | 1.29× |

I’ll try to fill in the rest later, since I don’t have Dolphin-Mixtral downloaded. However, GPUs are clearly king: an RTX 3060 costs less than half of a 7950X, excluding memory.

EDIT: I ran dolphin-mixtral:8x7b with partial GPU offloading (10/12 GB of VRAM used, roughly 50% GPU utilization and lower CPU use; Ollama reports 12/33 layers offloaded) and still got a nice performance improvement. GPU offloading seems worthwhile, at least at this model size. It’s hard to estimate what adding a 3090/4090/RTX 6000 Ada would do for llama3.1:70b or 405b, but it would be interesting to see.
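If anyone wants to experiment with how much of the model sits on the GPU, Ollama exposes a `num_gpu` option (the number of layers to offload). A quick sketch of passing it through the API; the layer count here is just an example value, since Ollama normally picks it automatically based on available VRAM:

```python
import json
import urllib.request

# Sketch: request a specific number of offloaded layers via Ollama's API.
# num_gpu = 12 is only an example value, not a recommendation.
body = {
    "model": "dolphin-mixtral:8x7b",
    "prompt": "write a 200 word summary on category theory.",
    "stream": False,
    "options": {"num_gpu": 12},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)
print(f"{stats['eval_count'] / stats['eval_duration'] * 1e9:.2f} tok/s")
```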

Note that the 2-CCD Zen 4 CPUs have significantly higher memory bandwidth than the 1-CCD SKUs (each CCD has its own Infinity Fabric link to the IO die), which may partly explain the results. I’m using 6000 MT/s memory with tight timings (CL30, plus maxed tREFI and tRFC at 500, as low as I can go).
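Since the argument hinges on effective memory bandwidth rather than channel count on paper, a crude way to compare systems is a STREAM-style copy test. A minimal numpy sketch of that idea; it runs in a single process, so it tends to understate what the whole chip can pull, and a proper multi-threaded STREAM run would be the better tool:

```python
import time
import numpy as np

# Crude STREAM-style copy bandwidth estimate (rough lower bound only).
N = 256 * 1024 * 1024   # 256M float64 elements = 2 GiB per array
a = np.ones(N)
b = np.empty_like(a)

best = 0.0
for _ in range(5):
    t0 = time.perf_counter()
    np.copyto(b, a)                       # reads 2 GiB, writes 2 GiB
    dt = time.perf_counter() - t0
    best = max(best, 2 * a.nbytes / dt)   # bytes moved / elapsed time

print(f"~{best / 1e9:.1f} GB/s copy bandwidth")
```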

I’m not running Ollama; instead I’m using LM Studio on Windows.

With my RX 6600 I get 30.56 tok/s with Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf, and it goes all the way down to 9.64 tok/s with the Q8 quant at 24/32 layers offloaded. I don’t do CPU-only runs because right now I’ve got an AMD Ryzen 5 3500 (OEM).

I want to see how this topic unfolds, because I want to build a local server. And not gonna lie, a Mac looks tempting.

1 Like

This makes a lot of sense. I guess looking at the higher-core-count TR PRO CPUs would be needed to get the full memory bandwidth. I came across a Reddit post that seems to confirm this, but it doesn’t actually indicate the specs of the memory being used.

I’d really love to see this tested with a TR Pro 7995WX and 7985WX…

I was having this same thought and kept hoping Apple would announce Mac Studios with M4 chips, but alas, I’m still waiting and haven’t found many useful benchmarks.