What GPUs does this work with?
Ideally a 24+ GB GPU, though it can work with a 16 GB GPU; the context window will just be smaller.
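If you're not sure how much VRAM you have, here's a quick check (assuming the NVIDIA driver is already installed):
# list each GPU and its total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv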
Does this work with llama.cpp?
No, you need the “ik” version of llama.cpp, ik_llama.cpp.
…Faster CPU prompt processing for Trellis quants and MoE models…
Love to see these kinds of things in the changelogs.
This project is designed to speed up quants – smaller versions of big models. This is how we can run the “real” 500+ billion parameter models on “only” 128 GB of RAM.
What is the build in the video?
Getting Started
These are the quants to try with ik_llama.cpp:
- ubergarm/DeepSeek-R1-0528-GGUF IQ1_S_R4 130.203 GiB (1.664 BPW) - Full thinking
- ubergarm/DeepSeek-V3-0324-GGUF IQ1_S_R4 130.203 GiB (1.664 BPW) - No thinking
- ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF IQ1_S 132.915 GiB (1.699 BPW) - Short thinking
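As a rough sanity check on those file sizes: DeepSeek-R1/V3 has roughly 671 billion parameters, and 671e9 parameters × 1.664 bits ÷ 8 ÷ 2^30 ≈ 130 GiB, which matches the sizes listed above.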
Ubuntu 24.04 LTS fresh install steps
Ubuntu 24.04 LTS – during install, if you check the box to download drivers automatically, it will install NVIDIA driver 570. That should work fine for 4000 and 5000 series cards.
# install any os dependencies
# might be more than this; if it complains about libcurl, let me know and I'll find the package it wants
apt-get update
apt-get install git build-essential cmake ccache nvidia-cuda-toolkit
# Optional -- also install
# linux-oem-24.04c
# if you have a 5000 series GPU *and* need to run
# 575 or newer drivers directly from nvidia.com
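Before building, it's worth a quick sanity check that both the driver and the CUDA toolkit are visible:
# driver + GPU visible?
nvidia-smi
# CUDA compiler installed?
nvcc --version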
# get code
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
# build it for deepseek on CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# confirm it runs
$ ./build/bin/llama-server --version
version: 3819 (f6d33e82)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
# download model
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00002-of-00003.gguf
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00003-of-00003.gguf
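If you prefer, the same files can be pulled in one shot with the Hugging Face CLI instead of wget (a sketch assuming huggingface_hub is installed via pip):
# optional alternative to the wget commands above
pip install -U "huggingface_hub[cli]"
huggingface-cli download ubergarm/DeepSeek-R1-0528-GGUF --include "IQ1_S_R4/*" --local-dir ./
# note: this puts the files under ./IQ1_S_R4/, so adjust the --model path below accordingly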
# run api-server and connect with a client or open a browser to http://127.0.0.1:8080
# adjust --threads to the number of physical cores, e.g. 8
./build/bin/llama-server \
--model DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
--alias ubergarm/DeepSeek-R1-0528-IQ1_S_R4 \
-fa -fmoe \
-mla 3 -amb 512 \
-ctk q8_0 \
-c 16384 \
-ngl 99 \
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
-ot exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
A brief explanation of each argument:
--model # point it at the first GGUF file of a model
--alias # whatever short name you want to show up in the client
-fa # flash attention gives speed boost
-fmoe # fused mixture of experts optimization speed boost
-mla 3 # which multihead latent attention implementation to use (3 is best for both CUDA and CPU for now)
-amb 512 # 512MiB VRAM buffer size, can reduce to 256 in a pinch
-ctk q8_0 # q8_0 quantized kv-cache uses half the RAM of full fp16
-c 16384 # context length, adjust up as desired using more VRAM
-ngl 99 # number of GPU layers to offload; 99 is a nice big number, larger than the model's actual layer count, so everything gets offloaded
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" # override offloading everything and specify these three routed expert layers to also go onto CUDA VRAM
-ot exps=CPU # override all other routed experts onto CPU instead of VRAM
--parallel 1 # for DeepSeek keep it at 1; for small models you can have concurrent slots dividing up the total context for higher aggregate throughput (tok/sec)
--threads 16 # set to the number of physical cores (see the note after this list)
--host 127.0.0.1 # use 0.0.0.0 to access from different computer
--port 8080 # whatever port you want
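Not sure how many physical cores you have? lscpu will tell you:
# physical cores = Core(s) per socket × Socket(s)
lscpu | grep -E "^(Core|Socket)"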
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" # override offloading everything and specify these three routed expert layers to also go onto CUDA VRAM
What about one or two 16 GB GPUs?
You can experiment with these lines to offload to both CUDA0 and CUDA1:
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1"
…but some experimentation is in order to find the best split.
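While you experiment, it helps to watch how much VRAM each card actually ends up using:
# refresh GPU memory usage every second while the model loads and runs
watch -n 1 nvidia-smi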
What about the 7900 XTX + ROCm?
It can work – it has the VRAM for it – but this isn’t a nut we’ve cracked yet. It throws some NaNs during the perplexity test.
No Free Lunches
Understand that there is a loss of fidelity when quantizing. When using these models for tasks that require precision or precise output, it is easy to run into frustrating situations. These heavily-reduced models are not recommended for code completion or code generation because they tend to make a lot of small mistakes.
At the same time, a model like this can still be a fast and highly useful tool for sharpening your own mind, as we saw in the video.