DeepSeek R1 671B backed by fast read IOPS?

tl;dr:

Has anyone here with a fast read IOPS disk array with good sized RAM disk cache tried serving the real DeepSeek R1 671B (quantized) model?

I’d love to know how fast you can run inference, and how the price/performance compares against expensive, unobtainable VRAM, given the MoE (mixture of experts) model architecture.

For example, that 256GB RAM ASUS ProArt Z890 build with some kind of PCIe 5.0 ICY DOCK quad NVMe adapter may be able to deliver decent inferencing speeds for these big MoE LLMs at a price more acceptable for a home lab, as compared to, say, 2x H100 80GB VRAM GPUs, which reportedly only manage 10~15 tok/sec for a single inference stream anyway.

Details

Most of us home lab folks have been running various “Distill” versions of the DeepSeek R1 model. These LLMs are actually not R1 at all, but other base models like Qwen2.5-32B which were fine-tuned on some CoT <think></think> outputs from the big R1.

Then a few days ago the unsloth team managed to dynamically quantize the original 671B fp8 model (~600+GiB on disk) into more usably sized (130~212GiB) GGUF models.

For fun I tried to run one on my 96GB RAM + 24GB VRAM AM5 gaming rig but Linux kept firing the OOM killer haha… I got it to work with a giant Linux swap file on a Gen 5 x4 NVMe SSD, but it was slow (0.3 tok/sec), thrashed the disk, and the system was unstable.

Then I learned about the llama.cpp mmap() feature, which leaves the model files on disk and maps them into the process's address space without actually malloc()ing any RAM. So the weights are read directly off the block device at inference time. I had to do some cgroup shenanigans to get it to fire up, but this actually worked fairly well given my 2TB Crucial T700 NVMe SSD has sequential read speeds of up to 12,000 MB/s and ~1.5M random read IOPS.

This has the advantage of letting most of system RAM act as page cache, keeping the most actively used MoE weights warm at ~88GB/s RAM bandwidth. During inferencing I can watch btop, and when the model switches modes (e.g. counting words, doing math, switching language) it burps and slows down while the cache is dumped and refilled with the newly active weights.
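If you want to watch the page cache churn yourself, plain stock tools are enough (just an example, nothing llama.cpp specific):

# watch the buff/cache column grow as the hot expert weights get paged in
watch -n 1 free -h

# or watch block-device reads (bi column) and memory stats once per second
vmstat 1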

I made an r/LocalLLaMA post about it yesterday that blew up to almost 1k likes (I gave a shout-out to Level1Techs too haha).

Benchmarks

While it isn’t fast nor the context long, it is cool to run the actual R1 at home for fun experiments and writing amusing knock-knock jokes lol.

| ctx-size | n-gpu-layers | expert_used_count | Cached high water mark (GiB) | generation (tok/sec) |
|---------:|-------------:|------------------:|-----------------------------:|---------------------:|
| 2048 | 5 | 8 | ~82 | 1.45 |
| 2048 | 5 | 4 | ~82 | 2.28 |
| 2048 | 0 | 8 | ~82 | 1.28 |
| 2048 | 0 | 4 | ~82 | 2.20 |
| 8192 | 5 | 8 | ~67 | 1.25 |
| 8192 | 5 | 4 | ~67 | 2.12 |
| 8192 | 0 | 8 | ~66 | 1.10 |
| 8192 | 0 | 4 | ~66 | 1.81 |

Conclusion

Thanks for humoring me in my quest for 2TB of ~30GB/s “VRAM” given all that sweet sweet GDDR7 is probably earmarked for StarGate if they can get chevron seven to actually lock. xD

Cheers!

4 Likes

I have it running. I am using Ollama with CPU computation on my Threadripper 3995WX. When the model is loaded into main memory it runs and generates about two to three words per second of the answer. Not as fast as a GPU, but on the other hand I can run the full model at all, so I am happy.

The only current downside is that my main SSD is too slow; loading the full four-hundred-something gigabytes takes too long. I am looking to buy a couple of faster PCIe 4.0 SSDs and put them in an mdadm RAID1 to load the models faster.
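For reference, a minimal sketch of that kind of mdadm mirror (device names and mount point are hypothetical, adjust for your drives):

# create a two-drive RAID1 mirror, format it, and mount it for the model files
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/models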

Edit: If you are wondering, I am currently sporting 512GB of DDR4 as main memory.

1 Like

The easy option is to use an API provider like together.ai for personal access to large models.

Engineered solution:
One option would be GPUDirect Storage loading weights directly off PCIe-attached NVMe drives, with the GPUs holding a few “expert” layers. This removes the CPU bottleneck, since the storage is addressed directly.

Oh nice! So maybe 4-5 tok/sec generation on the 4-bit quant, I presume? Inferencing on CPU with DDR4, that is cool to hear. How long of a context are you using?

The unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL weighs in at ~212GiB on disk and might be faster. They claim it is a “dynamic quant”, so it didn’t lobotomize the most important weights as much.

I use exactly that ‘image’ here from Ollama. It says Q4_K_M. I am not sure how I can find out the context length when using Ollama.

Oh cool, so you are talking about DeepSeek-V3. It has the same architecture, but it is not the new R1 CoT reasoning model.

I’m not sure the R1 quant I’m talking about is available yet on Ollama, as I use llama.cpp. There is a post on the Hugging Face repo from someone who just made their own quants to try in Ollama, but that is beyond my current experience.

Your reported generation speed should still be similar I’m guessing!

Right, even Azure and Perplexity are getting in on serving DeepSeek R1 671B, I’ve heard. I’m enjoying trying to run it locally for now. Though if anyone does buy API access, make darn sure you know what quant and exact model parameters they are selling you, because --override-kv deepseek2.expert_used_count=int:4 inferences faster (with likely lower quality output) than the default value of 8 experts used…

Interesting paper, and amusing that some researchers have similar hardware constraints as us home lab enthusiasts hahah…

The experimental results show that LoHan is the first to fine-tune a 175B model on an RTX 4090 and 256 GB main memory

My crystal ball suggests that as LLMs grow we’ll see more optimization to support CPU inference backed by fast IOPS block storage… There isn’t enough GDDR7 to go around the whole world yet haha… xD

I want to try benchmarking my system with this but I’ve been overwhelmed by all the options on how to run it.
I’ve got 128GB of memory that’ll do 1,950 GB/s in STREAM Triad, so theoretically the CPU would be faster than a GPU if the workload is truly memory-bound?

1 Like

Partially OT, but this might be of interest for anyone interested in running DeepSeek who also owns or has access to a 7900 XTX or Radeon Pro:

I was really surprised by the apparently good performance of RDNA3 here. AMD even claims that the 7900 XTX beat the performance of a 4090.

Yeah, the landscape of inference engines and models is overwhelming. I’ve tried many of them out locally, and in general llama.cpp is the upstream source for many of the other projects. It also has some of the best capability for doing partial offload between GPU and RAM and, most importantly for this test, it supports mmap(), letting the big model files sit on disk while taking advantage of the filesystem cache.

Your 128GB is not quite enough to fully hold even the smallest dynamic quant of the real R1, but if you also have a decent disk you could give it a try for fun anyway.

The smallest version of the actual real DeepSeek R1 that I know of is UD-IQ1_S 1.58bpw that weighs in at 131GiB.

You could start with that and give it a try before moving up to a larger sized quant.

If you are interested, give this a spin:

Procedure

First, download all 131GiB of GGUF files and put them into a directory on your fastest drive: unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S
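One way to grab just that quant is the Hugging Face CLI (a hedged example; the local directory here is just a placeholder):

# install the CLI and download only the UD-IQ1_S shards from the unsloth repo
pip install -U "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "DeepSeek-R1-UD-IQ1_S/*" \
    --local-dir /mnt/fastdrive/models/unsloth/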

While that is going, get yourself set up with the llama.cpp inference engine like so:

# download the source (i rebuild daily at this point lol)
git clone git@github.com:ggerganov/llama.cpp.git
# if above doesn't work use the url instead e.g.:
# git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# build for CPU inference only to keep it simple for now (no CUDA etc.)
cmake -B build
cmake --build build --config Release -j $(nproc)

# confirm it built by looking for all the fresh binaries
ls -lah build/bin/

# try to start up the API server (which hosts a nice web interface too)
./build/bin/llama-server \
    --model "/mnt/fastdrive/models/unsloth/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf" \
    --ctx-size 8192 \
    --override-kv deepseek2.expert_used_count=int:8 \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
  • --model points to the first file of whatever GGUF model you decided to download
  • --ctx-size more is better but takes up precious RAM quickly; try between 2k and 16k to start out
  • --override-kv can reduce expert_used_count from default of 8 to say 4 for faster inferencing at a cost of likely lower quality
  • --threads I’d recommend setting to however many physical cores your CPU has (not SMT/hyperthreads); see the snippet just below for a quick way to check
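A quick way to check your physical core count with standard tooling:

# physical cores = Socket(s) x Core(s) per socket; ignore Thread(s) per core for --threads
lscpu | grep -E '^(Socket|Core|Thread)'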

If our good friend the OOM killer keeps killing the process, you can check with sudo dmesg -T. I tried fiddling with the various sudo sysctl -a | grep overcommit settings, but in the end used Linux cgroups to put an artificial RAM cap on the llama-server process, which made it mmap() the weights instead of malloc()ing and barfing.

So, only if you are OOMing, you could try prepending this to the command, e.g. sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server .......
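To sanity-check that the cap actually took effect, standard systemd tooling works (just an example, not required):

# list cgroups ordered by memory usage; the transient llama-server scope should show up
sudo systemd-cgtop --order=memory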

Once it is up and running, it will spit out the IP address and port. Open that up in your web browser, in this case http://127.0.0.1:8080 and you can try it out.

If this is your first time running an LLM, you could start out with a smaller GGUF model like one of the little Distill variants mentioned by @eastcoastpete. Keep in mind the Distills are NOT R1. It’s confusing haha… I’ve had good luck with the IQ4_XS quant of bartowski/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview-GGUF, which is a 32B CoT frankenmerge of some recent models including the Qwen2.5 Distill…

Good luck and curious how you make out! Holler if you get stuck on anything! Cheers!

6 Likes

Show me a benchmark; when I tested flash attention the 7900 XTX was slower than a 3090.

The link to AMD’s site has links to videos. I don’t have a 7900 XTX myself, so I can’t run benchmarks to verify their claim. However, it would open AMD up to ridicule if they were that far off the mark. They also posted something about it on “X”; however, I am off that platform and try to stay away from it.

I went to an ag2 talk with an MIT professor. How does this compare on your hardware versus llama.cpp?

Supposedly it’s written in Rust, so it’s greased lightning.

1 Like

lol Rust, yeah I do use uv and alacritty, which are a couple of slick Rust projects… small, well-written concurrent Rust tools can be darn fast for sure…

I’ll try to test it out this weekend. It seems to support some GGUF models that I already have so I can compare tok/sec against latest llama.cpp.

Regarding the original topic, here is the best server-class build I’ve seen, suggesting ~6 tok/sec on the full unquantized fp8 DeepSeek R1 671B in 768GB of DDR5 RAM and claiming ~$6k hardware cost.

1 Like

I think Luke mentioned either your build or your post last night during the WAN Show, when they were talking about using an EPYC with loads of system memory.

1 Like

Oh hey, interesting, yes, I found the reference where Linus mentions what sounds like the Twitter post above with the EPYC CPUs!

Amusing they mention Wendell too with Coming Soon™, nice teaser!

2 Likes

I do wonder, if you go the route of a bunch of RAM instead of VRAM in GPUs, does it make a difference doing the compute on a GPU or on a CPU? Where is the cutoff, maybe 30% in VRAM or 75% in VRAM?
Bandwidth on PCIe 4.0 x16 would be ~31 GB/s, whereas even dual-channel DDR5 hits over 85 GB/s (compared to, say, the GDDR6X in a 4080 having 716 GB/s bandwidth).
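A rough, hedged back-of-envelope for the memory-bound case: R1 activates roughly 37B parameters per token, which at a ~4-bit quant is on the order of 18-20 GB of weights touched per token, so bandwidth pretty much sets the ceiling:

tok/s ≈ bandwidth ÷ (37B params × ~0.5 bytes) ≈ bandwidth ÷ ~18.5 GB
  dual-channel DDR5 (~85 GB/s)  → roughly 4-5 tok/s ceiling
  4080 GDDR6X (~716 GB/s)       → roughly 35-40 tok/s ceiling (if it all fit in VRAM)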

Example of my own: RTX 4080 on an X570 board with 2x16GB DDR4.
Running the DeepSeek-R1 distill 14B, which fits in VRAM: 14 tokens/s (with 99% GPU core usage)
Running the DeepSeek-R1 distill 32B, where 7.5GB spills into RAM: 3.5 tokens/s (with ~25% GPU core usage)

DeepSeek R1 has arrived on Ollama; downloading deepseek-r1:671b right now. Will be able to report more on it tomorrow.
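For anyone following along at home, pulling it and printing the generation rate is roughly this (hedged; check ollama --help on your version):

# pull the tag, then run with --verbose to get prompt/eval token rates printed after each reply
ollama pull deepseek-r1:671b
ollama run deepseek-r1:671b --verbose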

1 Like

I got llama.cpp working but was only getting about 0.9 T/s with R1-UD-IQ1_S, so I assumed the compiler did something very wrong.
I tried out the Ollama version today and it peaks at about 0.9 T/s on the 70B R1 model as well, so it turns out it wasn’t the compiler.

The inference definitely isn’t memory-bandwidth limited, at least as I ran it. There appear to be some problems in its threading: if I specified any more than 8 threads the performance degraded very steeply, so I left over 100 cores disabled.

It’s possible there is a memory-locality problem rather than a threading problem. I’m fairly new to this, so it’s entirely possible I’m missing some magic argument that lets it run better in a NUMA environment; my background is more HPC than anything else.
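For what it’s worth, two things commonly tried on big NUMA boxes (hedged, I have not verified them on your exact machine): pinning llama-server to one node with numactl, and llama.cpp’s own --numa option.

# pin threads and memory allocations to NUMA node 0 (standard numactl usage)
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server --threads 16 ...

# or let llama.cpp spread/isolate itself; check ./build/bin/llama-server --help for supported values
./build/bin/llama-server --numa distribute ...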

1 Like

I’ve been working on this same line of thinking, reducing what’s actually required for full-fat DeepSeek.

Also looking at some patches to llama.cpp toward making the SwapMoE paper a reality.

The V-Cache CPUs in NUMA node 0 can do around 10 tokens per second.

3 Likes