tl;dr;
Has anyone here with a fast random-read-IOPS disk array and a good-sized RAM disk cache tried serving the real DeepSeek R1 671B (quantized) model?
I’d love to know how fast you can run inference, and how the price/performance compares to expensive, unobtainable VRAM, given the MoE (mixture of experts) model architecture.
For example, that 256GB RAM ASUS ProArt Z890 build with some kind of PCIe 5.0 ICY DOCK quad NVMe adapter might deliver decent inference speeds for these big MoE LLMs at a price more acceptable for a home lab, compared to, say, 2x H100 80GB VRAM GPUs, which reportedly only get 10~15 tok/sec single-stream inference anyway.
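For context, here’s the back-of-envelope math I’m using (a rough sketch, assuming ~37B active parameters per token per DeepSeek’s published figures, that generation is purely read-bandwidth bound, and ignoring page-cache hits and compute time; the bandwidth numbers and bits-per-weight values are just placeholder guesses):

```python
# Rough upper bound on generation speed when you're bound by reading the
# active expert weights for each token. Assumes ~37B active params/token
# (DeepSeek-R1's published figure); ignores cache hits, KV cache, compute.
ACTIVE_PARAMS = 37e9

def tok_per_sec(read_gb_per_s: float, bits_per_weight: float) -> float:
    bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8
    return read_gb_per_s * 1e9 / bytes_per_token

for label, bw in [("single Gen5 NVMe ~12 GB/s", 12.0),
                  ("quad NVMe ~30 GB/s", 30.0),
                  ("dual-channel DDR5 ~88 GB/s", 88.0)]:
    for bpw in (1.58, 2.51):
        print(f"{label:28s} @ {bpw} bpw -> ~{tok_per_sec(bw, bpw):.1f} tok/sec")
```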
Details
Most of us home lab folks have been running various “Distill” versions of the DeepSeek R1 model. These LLMs are actually not R1 at all, but other base models like Qwen2.5-32B that were fine-tuned on CoT <think></think> outputs from the big R1.
Then a few days ago the unsloth team managed to dynamically quantize the original full 671B fp8 model (~600+ GiB of files) down to more usable sizes (130~212 GiB) as GGUF models.
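If you want to grab one of those quants yourself, a minimal sketch using huggingface_hub (the repo id and the UD-IQ1_S filename pattern are what unsloth published at the time of writing; double-check the exact quant names on their Hugging Face page before running):

```python
# Pull just one dynamic-quant GGUF variant instead of the whole repo.
# Repo id and filename pattern are unsloth's naming at the time of writing;
# verify them on the model page before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],  # the ~131 GiB 1.58-bit dynamic quant
)
```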
For fun I tried to run one on my 96GB RAM + 24GB VRAM AM5 gaming rig, but Linux kept firing the OOM killer haha… I got it to work with a giant Linux swap file on a Gen 5 x4 NVMe SSD, but it was slow (0.3 tok/sec), thrashed the disk, and the system was unstable.
Then I learned about the llama.cpp mmap() feature, which leaves the model files on disk and maps them into the process’s address space without actually malloc()'ing any RAM. So the weights are read directly off the block device at inference time. I had to do some cgroup shenanigans to get it to fire up, but this actually worked fairly well given my 2TB Crucial T700 NVMe SSD has sequential read speeds of up to 12,000 MB/s and 1.5M random read IOPS.
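A toy illustration of why that works (not llama.cpp’s actual code, just the mmap() idea in Python; the filename is a placeholder for wherever your GGUF shards live):

```python
# Mapping a huge file reserves address space only; physical pages are
# faulted in from the NVMe (and kept in the kernel page cache) as they
# are actually read, so RSS stays small until you touch the weights.
import mmap, os

path = "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf"  # placeholder path
fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size

with mmap.mmap(fd, length=0, prot=mmap.PROT_READ) as mm:
    print(f"mapped {size / 2**30:.1f} GiB of address space, barely any RSS yet")
    _ = mm[0:4096]                           # touching a page -> read from disk
    _ = mm[size // 2 : size // 2 + 4096]     # random access faults in that page too
os.close(fd)
```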
This has the advantage that most of system RAM acts as disk cache, keeping the most actively used MoE weights warm at ~88 GB/s RAM bandwidth. During inference I can watch btop, and when the model switches modes (e.g. counting words, doing math, a different language) it burps and slows down while the cache is dumped and refilled with the newly active weights.
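If you want to watch the page cache yourself, here’s a quick sketch of the kind of polling I mean (Linux-only; it reads the kernel-wide `Cached:` figure from /proc/meminfo, so it counts all page cache, not just the model file):

```python
# Poll the kernel page cache size once a second and track the high water
# mark while llama.cpp is generating.
import time

def cached_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1]) / 2**20  # value is reported in kB
    return 0.0

peak = 0.0
while True:
    now = cached_gib()
    peak = max(peak, now)
    print(f"page cache: {now:5.1f} GiB (peak {peak:5.1f} GiB)", end="\r")
    time.sleep(1)
```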
I made an r/LocalLLaMA post about it yesterday that blew up to almost 1k likes (I gave a shout-out to Level1Techs too haha).
Benchmarks
While it isn’t fast and the context isn’t long, it is cool to run the actual R1 at home for fun experiments and writing amusing knock-knock jokes lol.
| ctx-size | n-gpu-layers | expert_used_count | Cached high water mark (GiB) | generation (tok/sec) |
|---|---|---|---|---|
| 2048 | 5 | 8 | ~82 | 1.45 |
| 2048 | 5 | 4 | ~82 | 2.28 |
| 2048 | 0 | 8 | ~82 | 1.28 |
| 2048 | 0 | 4 | ~82 | 2.20 |
| 8192 | 5 | 8 | ~67 | 1.25 |
| 8192 | 5 | 4 | ~67 | 2.12 |
| 8192 | 0 | 8 | ~66 | 1.10 |
| 8192 | 0 | 4 | ~66 | 1.81 |
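For reference, each row came from an invocation along these lines (a sketch, not my exact command line; the `llama-server` flag names and the `deepseek2.expert_used_count` metadata key are from llama.cpp builds current at the time of writing, so check `--help` on your build, and the model path is a placeholder):

```python
# Launch llama.cpp's server with a given context size, GPU offload layer
# count, and the MoE expert count overridden via GGUF metadata.
import subprocess

subprocess.run([
    "./llama-server",
    "--model", "DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "--ctx-size", "2048",
    "--n-gpu-layers", "5",
    "--override-kv", "deepseek2.expert_used_count=int:4",
])
```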
Conclusion
Thanks for humoring me in my quest for 2TB of ~30GB/s “VRAM” given all that sweet sweet GDDR7 is probably earmarked for StarGate if they can get chevron seven to actually lock. xD
Cheers!