DeepSeek R1 671B backed by fast read IOPS?

How much CPU usage did you see? I wonder about running it on the GPU part of an APU, like what Bobzdar did: https://www.reddit.com/r/LocalLLaMA/comments/1efhqol/testing_ryzen_8700g_llama31/

People have run 96 GB of RAM in the Minisforum UM790 Pro. I wonder if distributed inference with a cluster of UM790 Pros over a USB4 mesh network would work? See “High-speed 10Gbps full-mesh network based on USB4 for just $47.98” – Fang-Pen’s coding note.
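One way to experiment with that idea today might be llama.cpp’s RPC backend; this is only a sketch, the host addresses, port, and model path are placeholders, and I haven’t tried it over USB4 networking myself:

```
# On each UM790 Pro in the mesh (llama.cpp built with -DGGML_RPC=ON):
./rpc-server -H 0.0.0.0 -p 50052

# On the head node, point inference at the workers over the USB4 links:
./llama-cli -m deepseek-r1-q4.gguf \
    --rpc 10.0.0.2:50052,10.0.0.3:50052 \
    -p "Hello"
```

Whether the ~10 Gbps USB4 links are fast enough to keep the tensor traffic between nodes from becoming the bottleneck is the open question.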

@twin_savage What kind of system is that? If it’s a NUMA issue, maybe taskset can help?
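Something along these lines, for example (core ranges, node numbers, thread count, and model path are all placeholders to adapt to your topology):

```
# Inspect the NUMA topology first
numactl --hardware

# Pin both threads and memory allocation to node 0...
numactl --cpunodebind=0 --membind=0 ./llama-cli -m model.gguf -t 32 -p "hi"

# ...or just pin the cores with taskset
taskset -c 0-31 ./llama-cli -m model.gguf -t 32 -p "hi"
```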

A Mac mini M4 (Pro) with TB5 is perhaps the play here: fast memory, fast network. But the Minisforum should work too. Memory is definitely cheaper on the Minisforum, but the M4 Pro gets about 250 GB/s of memory bandwidth.

1 Like

It’s a dual 9480 system in SNC4 mode with no DDR5 installed, so 8 NUMA nodes.

I had tried numactl to wrangle the process into the best configuration I could think of but I didn’t see much improvement.

2 Likes

Outing myself as a noob in this area with this question: which SIMD extensions are actually used if one tries to run DeepSeek on a CPU? Do any of the builds out there use, for example, AVX2 and AVX-512 instructions, and if so, which build uses which?

1 Like

That depends on the code you use and how it’s compiled, the most popular option being llama.cpp.

In principle it supports AVX2, AVX-512, and AMX. Last I checked there was not much benefit from AVX-512, but that might have changed more recently. For bigger models you are usually bottlenecked by memory speed anyway (and smaller ones you’d want to run on a GPU in any case). Even with just AVX2, a modern CPU can process more data than the memory can supply.
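If you want to see or control what gets compiled in, the relevant CMake options look roughly like this (option names as found in recent llama.cpp/ggml; older trees used a LLAMA_ prefix instead of GGML_, so treat this as a sketch):

```
# Default: build for the host CPU, enabling whatever SIMD it supports
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j

# Or pin specific instruction sets explicitly
cmake -B build -DGGML_NATIVE=OFF -DGGML_AVX2=ON -DGGML_AVX512=ON
```

The binary then reports which of these actually made it in via the system_info line printed at startup.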

2 Likes

Yeah, what quilt already said about what gets compiled into llama.cpp. I compiled it on a Ryzen 9950X and here is what it prints on startup:

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

I haven’t disabled specific instruction sets and done A/B benchmarking, though. Again repeating quilt: you’re likely limited by memory read bandwidth when inferencing big LLMs.
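A rough back-of-the-envelope for that ceiling, assuming ~37 B active parameters per token for R1 and that every active weight is read from RAM once per token (the 460 GB/s figure is just 12-channel DDR5-4800 as an example):

```
# tokens/sec ceiling ≈ memory bandwidth / bytes of active weights per token
python3 -c "bw=460e9; act=37e9; print(bw/(act*4.5/8), bw/(act*1.0))"
# ≈ 22 tok/s at ~4.5 bits/weight (Q4_K-ish), ≈ 12 tok/s at 8 bits (Q8_0)
```

Real numbers land below that because of KV-cache reads, attention compute, and NUMA effects, but it is in the same ballpark as what people are reporting.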

If we had some concrete use cases, knowledge distillation could tune a smaller model, which also means faster token generation.

I’ve already seen a few Llama models distilled from it, but a coding-focused one could be neat.

“SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB” – arXiv, SwapMoE

Exciting! If SwapMoE could get those 671 GB full fat weights running in ~220GB RAM, I’d switch to red cap milk in a heartbeat lol…

The closest-looking issue/PR I could find was from a few days ago, ggerganov/llama.cpp#11532, suggesting:

The MoE architecture is interesting and likely presents a lot of opportunities for clever optimizations.

Also hoping #11557 (Flash Attention support for R1) lands soon so we can quantize both the K and V cache (e.g. --cache-type-v q4_0) to reduce the memory footprint for long contexts.
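Once that lands, the invocation would presumably look something like this (flag names from current llama.cpp; whether they actually work for R1 depends on that PR):

```
# Flash Attention plus a quantized KV cache to shrink long-context memory use
# (quantizing the V cache requires flash attention to be enabled)
./llama-cli -m deepseek-r1-q4.gguf -fa \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    -c 32768 -t 32
```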

Also saw an interesting chart of context length vs. tok/sec from a llama.cpp dev, u/fairydreaming. That dev is supposedly getting access to a dual EPYC 9654, but with no IPMI (they want to test a patch against various BIOS NUMA configs). It’s not clear yet how much benefit dual CPUs can deliver over single-socket configs.

I’m not sure what exact model/quant, but one guy over there says:

7 token/sec on my single 9534 (8 ccd, 64 core) with 12 channel memory. [no meaningful speed gains going from using 32 to 64 cores]

I’m gonna dig around for that deepseek-r1-mla-q4_k_s.gguf, but otherwise that is the latest I’ve heard.
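If anyone wants to check that core-count scaling claim on their own box, llama-bench can sweep thread counts in one run (model path is a placeholder):

```
# Compare prompt processing and token generation speed across thread counts
./llama-bench -m deepseek-r1-q4.gguf -t 16,32,48,64 -p 512 -n 128
```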

Also, for anyone using ollama, keep in mind it is just wrapping upstream llama.cpp. So if you want the bleeding edge, go straight to the source, hah.

I can get close to 10 tokens/sec on Q8 with the X-series (3D V-Cache) CPUs in NPS0. Core count doesn’t seem to matter much, but I did overclock the memory to 5600. I was surprised that AVX-512 doesn’t seem to make much of a difference right now either. It feels like there is a lot of room for software optimization.

3 Likes

I kicked the tires on that Rust-based LLM inference engine; here are some scripts and documentation showing how to run it vs. the equivalent llama.cpp CLI commands.

The tl;dr is that, in my opinion and for my use cases, it may hold some promise for certain hardware configurations, but it needs more time to cook.

It doesn’t seem to support GGUF as well, doesn’t seem to offer data types below 16-bit, the CLI is a bit unwieldy, the benchmark executable seems to hang, inference seems to hang after a certain number of generated tokens, and most importantly, in anecdotal testing with full offload to VRAM it didn’t seem any faster than llama.cpp with as similar a setup as I could manage.

Maybe it is good for the Mac “Metal” and “Accelerate” stuff? I dunno…

The general advice, as I understand it: if your LLM is too big to fit in VRAM, go with llama.cpp; if you can fit everything across all your GPUs’ VRAM, then go with ExLlamaV2. I also keep tabs on vLLM to see how it is coming along.
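For the “too big for VRAM” case, llama.cpp’s partial offload is just a layer count (the 20 below is arbitrary; raise it until your VRAM is full):

```
# Keep most of the model in system RAM, offload 20 layers to the GPU
./llama-cli -m deepseek-r1-q4.gguf -ngl 20 -t 32 -p "hi"
```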

1 Like

Not sure if it’s applicable, but I saw this advice from a llama.cpp dev a few days ago:

One thing you can try is to set NUMA per Socket to NPS4 and enable ACPI SRAT L3 as NUMA domain in BIOS and run llama.cpp with --numa distribute, then try to find the optimal number of threads. (make sure to disable numa balancing first) -gh/llama.cpp #11446
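The non-BIOS half of that advice translates to something like this (the thread count is just a starting point to tune from):

```
# Disable automatic NUMA page balancing, then spread threads across nodes
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
./llama-cli -m deepseek-r1-q4.gguf --numa distribute -t 48 -p "hi"
```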

Great news, and it lines up with what I’m grokking from the GitHub llama.cpp issues. They seem to be prioritizing Flash Attention support for R1 to get KV-cache quantization going, then possibly Multi-head Latent Attention (MLA), which seems to increase generation speed (especially at longer contexts) but possibly at a cost to prompt processing speed at the moment (details in the same GH issue linked above).

It’s possible to test the MLA implementation fork, but it requires a custom quantized model. I found some instructions on how to do that here.
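Not having run the fork myself, I’d guess the flow is the usual convert-then-quantize, plus whatever extra steps its instructions add (file and directory names below are made up, and the fork may ship its own convert script):

```
# Convert the HF weights to GGUF, then quantize to Q4_K_S
python convert_hf_to_gguf.py ./DeepSeek-R1 --outfile deepseek-r1-f16.gguf
./llama-quantize deepseek-r1-f16.gguf deepseek-r1-q4_k_s.gguf Q4_K_S
```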

With luck this bleeding edge stuff will coalesce into main within a couple weeks…

1 Like

Oh nice, I use ollama, so I guess I’ll keep using llama.cpp. The max VRAM on a consumer GPU is ~24 GB.

1 Like

Yeah, there are a few downstream projects that rely on llama.cpp under the hood. They lower the barrier to entry by making it easy to download a binary and run it, while providing a decent UI experience:

  • koboldcpp
  • ollama
  • lm-studio
1 Like

Yep, I’ve been using LM Studio and ollama on the side; ollama doesn’t have the licensing concerns that LM Studio does.

I like LM Studio because they actively release new models and have model compatibility filters.

1 Like