@twin_savage What kind of system is that? If it’s a NUMA issue, maybe taskset can help?
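Something like this is what I had in mind, as a rough sketch assuming a Linux box (`./your_workload` is just a placeholder for whatever you’re running):

```bash
# Show the NUMA layout: node count and which cores belong to which node
numactl --hardware
lscpu | grep -i numa

# Keep the process and its memory allocations on one node, e.g. node 0
numactl --cpunodebind=0 --membind=0 ./your_workload

# Or just pin it to a specific set of cores with taskset
taskset -c 0-15 ./your_workload
```

If cross-node memory traffic is the problem, binding memory (not just cores) is usually the part that actually matters.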
A Mac mini M4 (Pro) with TB5 is perhaps the play here. Fast memory, fast network. But Minisforum should work too. Memory is definitely cheaper on the Minisforum, but the M4 Pro gets about 250 GB/s of memory bandwidth.
Outing myself as a noob in this area with this question: which SIMD extensions are actually used if one tries to run DeepSeek on a CPU? Do any of the versions out there use, for example, AVX2 and AVX-512 instructions, and if yes, which version uses which?
Depends on the code you use and how it’s compiled, the most popular option being llama.cpp.
In principle it supports AVX2, AVX-512, and AMX. Last I checked there was not much benefit from AVX-512, but that might have changed more recently. For bigger models you are usually more bottlenecked by memory speed (and smaller ones you’d want to run on a GPU anyway). Even with just AVX2, a modern CPU can process more data than the memory can supply.
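If you want to see what you’re actually getting, here’s a quick sketch (assuming Linux and a recent llama.cpp checkout; the exact CMake option names have moved around between versions, and model.gguf is a placeholder):

```bash
# Which SIMD extensions does the CPU expose?
grep -o -w -E 'avx2|avx512f|avx512_vnni|amx_tile' /proc/cpuinfo | sort -u

# Build llama.cpp with native optimizations so it picks up whatever is available
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j

# The "system_info" line printed at startup lists which paths were compiled in (AVX2, AVX512, AMX, ...)
./build/bin/llama-cli -m model.gguf -p "hello" -n 16
```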
I haven’t disabled specific instructions and done A/B testing/benchmarking though. Again, repeating quilt: you’re likely limited by memory read bandwidth when inferencing big LLMs.
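llama.cpp’s bundled llama-bench makes that kind of A/B run fairly cheap if anyone wants to try; a minimal sketch (model path is a placeholder):

```bash
# Sweep thread counts; if throughput flattens out early, memory bandwidth (not core count) is the wall
./build/bin/llama-bench -m model.gguf -t 8,16,32,64 -p 512 -n 128
```

To A/B the instruction sets themselves you’d rebuild with the relevant GGML_AVX* options toggled and rerun the same command.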
SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB - arxiv SwapMoE
Exciting! If SwapMoE could get those 671 GB full fat weights running in ~220GB RAM, I’d switch to red cap milk in a heartbeat lol…
The closest-looking issue/PR I could find was from a few days ago, ggerganov/llama.cpp#11532, suggesting:
The MoE architecture is interesting and likely presents a lot of opportunities for clever optimizations.
Also hoping #11557 Flash Attention support for R1 lands soon so we can quantize both sides of the KV cache (e.g. --cache-type-v q4_0) to reduce the memory footprint for long contexts.
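If/when that lands, I’d expect the invocation to look roughly like this (just a sketch; flag spellings change between llama.cpp versions, and the model filename is a placeholder):

```bash
# Flash attention is needed for the quantized V cache; q4_0 K/V cuts the cache footprint to roughly a quarter of f16
./build/bin/llama-cli -m DeepSeek-R1-Q8_0.gguf \
  --flash-attn \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -c 16384
```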
I can get close to 10 tokens/sec on Q8 with the X-series V-Cache CPUs in NPS0. Core count doesn’t seem to matter much, but I did overclock the memory to 5600. I was surprised that AVX-512 doesn’t seem to make much difference right now. It seems like there is a lot of room for software optimization.
The tl;dr is: in my opinion and for my use cases, it may have some promise for some hardware configurations, but it needs more time to cook.
It doesn’t seem to support GGUF as well, doesn’t seem to offer data types below 16-bit, the CLI interface is a bit unwieldy, the benchmark executable seems to hang, inference seems to hang after a certain number of generated tokens, and most importantly it didn’t seem any faster in anecdotal testing running fully offloaded to VRAM compared to llama.cpp with as similar a setup as I could get.
Maybe it is good for the Mac “Metal” and “Accelerate” stuff? I dunno…
The general advice as I understand it: if your LLM is too big to fit in VRAM, go with llama.cpp. If you can fit everything across all your GPUs’ VRAM, then go with ExLlamaV2. I also keep tabs on vLLM to see how it is coming along.
Not sure if it’s applicable, but I saw this advice from a llama.cpp dev a few days ago:
One thing you can try is to set NUMA per Socket to NPS4 and enable ACPI SRAT L3 as NUMA domain in BIOS and run llama.cpp with --numa distribute, then try to find the optimal number of threads. (make sure to disable numa balancing first) -gh/llama.cpp #11446
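The host-side pieces of that look roughly like this on Linux (a sketch; the BIOS option names vary by vendor, and the thread count and model path are placeholders):

```bash
# Turn off automatic NUMA balancing so the kernel doesn't migrate pages mid-run
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

# With NPS4 + "ACPI SRAT L3 as NUMA domain" set in BIOS, spread work across the nodes
./build/bin/llama-cli -m model.gguf --numa distribute -t 32 -p "test prompt"
```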
Great news, and it lines up with what I’m grokking from the GitHub llama.cpp issues. They seem to be prioritizing Flash Attention support for R1 to get KV-cache quantization going. Then possibly Multi-head Latent Attention (MLA), which seems to increase generation speed (especially at longer contexts) but possibly at a cost to prompt processing speed at the moment (details in the same GH issue link above).
Yeah, there are a few downstream projects that rely on llama.cpp under the hood. They lower the barrier to entry by making it easier to download a binary and run it, while providing a decent UI experience.