Full DeepSeek Q1 with the "IK" version of llama.cpp on AM5 -- No Distills; Just a Quant

What GPUs does this work with?

Ideally a 24+ GB GPU. It can work with a 16 GB GPU, but the context window will be smaller.

Does this work with llama.cpp?

No, you need the “ik” version of llama.cpp

…Faster CPU prompt processing for Trellis quants and MoE models…

Love to see these kinds of things in the changelogs.

This project is designed to speed up quants – smaller, compressed versions of big models. That is how we can run the “real” 500+ billion parameter models on “only” 128 GB of RAM.
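Rough back-of-the-envelope arithmetic (a sketch, using the ~131 GiB IQ1_S quant linked below): DeepSeek-R1 is about 671 billion parameters, and 131 GiB is roughly 1.1 trillion bits, so the quant averages around 1.7 bits per weight. That, plus a 24 GB GPU holding the attention layers and a few experts, is what lets it fit on a 128 GB desktop.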

What is the build in the video?

Getting Started

These are the quants to try with ik_llama.cpp:

Ubuntu 24.04 LTS fresh install steps

Ubuntu 24.04 LTS – on install, if you check the box to download drivers automatically, it will get the NVIDIA 570 drivers. Those should work fine for 4000 and 5000 series cards.
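Once the install finishes, it is worth a quick sanity check that the driver actually loaded before building anything (standard NVIDIA tooling):

nvidia-smi   # should list your GPU and a driver version around 570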


# install any OS dependencies
# might be more than this; if it complains about libcurl, let me know and I'll find the package it wants
apt-get update
apt-get install git build-essential cmake ccache nvidia-cuda-toolkit 
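# if the cmake step complains about missing libcurl, the dev package below is
# most likely the one it wants (an educated guess; only install it if asked)
# apt-get install libcurl4-openssl-dev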

# Optional -- also install 
# linux-oem-24.04c 
# if you have a 5000 series GPU *and* need to run
# 575 or newer drivers directly from nvidia.com 

# get code
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp

# build it for deepseek on CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# confirm it runs
$ ./build/bin/llama-server --version
version: 3819 (f6d33e82)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

# download model
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00002-of-00003.gguf
wget https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/resolve/main/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00003-of-00003.gguf
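# optional sanity check; the three splits together should come to roughly 131 GiB,
# and you only point llama-server at the first file, it picks up the rest
ls -lh DeepSeek-R1-0528-IQ1_S_R4-*.gguf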

# run the API server and connect with a client, or open a browser to http://127.0.0.1:8080
# adjust --threads to the number of physical cores, e.g. 8


./build/bin/llama-server \
  --model DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
  --alias ubergarm/DeepSeek-R1-0528-IQ1_S_R4 \
  -fa -fmoe \
  -mla 3 -amb 512 \
  -ctk q8_0 \
  -c 16384 \
  -ngl 99 \
  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  --parallel 1 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8080

A brief explanation of each argument:

  --model # point it at the first GGUF file of a split model
  --alias # whatever short name you want to show up in the client
  -fa # flash attention gives a speed boost
  -fmoe # fused mixture-of-experts optimization, another speed boost
  -mla 3 # which multi-head latent attention implementation to use (3 is best for both CUDA and CPU for now)
  -amb 512 # 512 MiB VRAM buffer size, can reduce to 256 in a pinch
  -ctk q8_0 # q8_0 quantized kv-cache saves half the RAM over full fp16
  -c 16384 # context length, adjust up as desired using more VRAM
  -ngl 99 # number of GPU layers to offload; 99 is a nice big number, more than the actual layer count, so everything offloads
  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" # override the offload-everything rule and pin these three routed expert layers onto CUDA VRAM as well
  -ot exps=CPU # override all other routed experts onto CPU instead of VRAM
  --parallel 1 # for DeepSeek keep it at 1; for small models you can have concurrent slots dividing up the total context for higher aggregate throughput (tok/sec)
  --threads 16 # set to the number of physical cores
  --host 127.0.0.1 # use 0.0.0.0 to access from a different computer
  --port 8080 # whatever port you want

  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" # override offloading everything and specify these three routed expert layers to also go onto CUDA VRAM

What about one, or two 16gb GPUs?

You can experiment with these lines to offload to both CUDA0 and CUDA1:

-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0"  \
-ot "blk.(8|9|10|11).ffn.=CUDA1"

…but some experimentation is in order to find the split that fits your cards.
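While experimenting with the split, it helps to watch how much VRAM each card actually ends up using (plain NVIDIA tooling, nothing specific to ik_llama.cpp):

watch -n 1 nvidia-smi
# grow or shrink the layer lists in the -ot expressions until both cards sit
# comfortably under their VRAM limit at your chosen context size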

What about the 7900xtx+rocm?

It can work – it has the VRAM for it – but this isn’t a nut we’ve cracked yet. It throws some NaNs during the perplexity test.

No Free Lunches

Understand that there is a loss of fidelity when quantizing. When using it for tasks that require precision or precise output, it is easy to run into frustrating situations. These heavily reduced models are not recommended for code completion or code generation because they will tend to make a lot of small mistakes.

At the same time it can still be a fast and highly useful tool for sharpening your own mind as we saw in the video.


I guess there is no chance of running this on 32 GB RAM and an RTX 3090, right?

Time to take my TR 2990WX with 256 GB RAM off server duty to be my new AI rig. I’ll combine it with my Radeon VII.


I might try to tinker with this in a VM on TrueNAS Scale with 20 CPU threads and 64 GB of memory. It’s an older HP Z840 with 2x Xeon E5-2630 v4, 128 GB of memory and a GTX 1080 (Pascal, so no tensor cores). I had been running OpenWebUI, but GPU pass-through was proving to be unreliable.

Alternatively, I could use my main rig to play around with it as well. This one is on a R9 5950X (stable all core 4.6GHz), 128GB of DDR4-3600 and an AMD RX 6800 XT.

Wow — talk about topical and on point! Really glad this video dropped.

For anyone trying to interact with ik_llama.cpp via an API layer (FastAPI, Gradio, [insert your frontend here]), I ran into a pretty specific challenge: Gradio can’t dynamically generate the API schema when wrapping a compiled backend like ik_llama.cpp.

The issue isn’t with ik_llama.cpp itself — it’s just the shift from Python to C/C++ that breaks the automatic metadata discovery Gradio relies on.

To help bridge that gap, I’ve been building a wrapper/interface layer to expose the necessary metadata in a way that downstream tools like Gradio or FastAPI can use to dynamically construct APIs on the fly. If you’re facing the same hurdle, or just curious, here’s the very-much-WIP repo:

:link: https://github.com/Yummyfudge/medparswell

It’s not fully running yet (I was actively working on it when this video dropped), but I figure if I’m hitting this wall, someone else out there probably is too.

Big thanks to @ubergarm for the detailed breakdown, and to @wendell + the whole Level1Techs community for being the constant spark for this kind of exploration.


Time to spend a few bucks upgrading the system RAM :smiley:

@wendell - Would it be worth trying on the GTX 1080, or should I try to brute force my 6800 XT into kind of working? Should I allocate more cores on the Xeon platform, or should the 20 cores allocated be fine?

About to rebuild my system, but I will try it on 192 GB RAM + a 5090 & 3090 combo.

Is there a way to somehow provide it to CLion/PyCharm/Copilot as a model? Right now I have Ollama, but is there any way?
Thanks!


Limited to AM5? Or could I run it on AM4 (my system: 5950x, 128 GB) too? Plus, of course, a 24 GB graphics card…

A bit slower, likely, due to reduced memory bandwidth, but it should run.

It gets a bit sketchy below 16 GB. 16 GB VRAM is doable, but… I don’t think this is enough for this model. You can run smaller models though.

Pretty cool. Impressed by the speeds given it uses the CPU, even if it is only the Q1-equivalent of DeepSeek. I’ve been using KoboldCPP to run LLMs previously (on 3x GPUs). I could see myself trying this instead, though I can imagine this fork probably won’t be as smart about the KV-cache as KoboldCPP is. Is there guidance on how to get this working on Windows? I tried building it, but I’m not so familiar with the tooling, so I was unable to get it working.

If not, I could certainly see myself trying to install Ubuntu on a secondary drive to try it out.


It’s funny sometimes.
The video came up just as I was about to look for parts for my “mini ai build”.
I’m torn though. I have a 7900XT and tried some basic image generation stuff on it.
And it was a pain getting it to work properly on ROCm and Ubuntu 24.04.
It did work fairly well in the end though. Now I’m thinking about a 5080 or even a 5070 Ti,
as it should be fairly easy compared to AMD.
I don’t see myself getting a used GPU as there aren’t many options here and it’s fairly expensive.
A 3090 is available though. I’m just scared they’re not working properly or something;
of course a 5070 Ti is more expensive, but at least it’s new.
For CPU, does 3D V-Cache matter? I have a 5800X3D in my current rig, but it would be better to get something higher clocked like a 9700X, right?
I really want to run it in a small package with low power consumption as well.
9700X
GB B650I AX
2x 64 GB 6000 MHz RAM, CL30-36
5070ti or 5080
4tb storage or more.
750W PSU
mini itx case

Why wouldn’t it work? What card do you have? 24 GB is perfect. If it’s a 7900 XTX it might be hard, for now at least. Worth trying out though.

If you are running 192 GB RAM + a 5090/3090 combo (56 GB VRAM) and want to use a local LLM as a coding assistant, some folks are reporting success with our latest ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF · qwen3 coder test with RooCode.

Some better tool-calling / function-calling support was just added to ik_llama.cpp, so hopefully it can be the “LLM engine” powering your “local AI stack”, with your tools connecting to it through the API endpoint.
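For the editor/IDE questions above (CLion/PyCharm/Copilot-style tools, RooCode, etc.), most of those clients just need an OpenAI-compatible base URL and a model name rather than anything Ollama-specific. A minimal sketch, assuming the usual llama.cpp-style endpoints are present in this fork:

# no API key is needed unless you start the server with --api-key;
# the model name is whatever you passed to --alias
curl http://127.0.0.1:8080/v1/models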

So yeah, if you can run this DeepSeek-671B quant at ~131 GiB, then definitely check out the two latest mixture-of-experts models just released:

Supposedly Qwen is dropping a new thinking version of the 235B as well, so I will try to get those quantized and uploaded as they become available!


KoboldCPP is a pretty good fork with its own optimizations. There is also GitHub - Nexesenex/croco.cpp (a fork of KoboldCPP inferring GGML/GGUF models on CPU/CUDA with KoboldAI’s UI, powered partly by ik_llama.cpp and compatible with most of Ikawrakow’s quants except Bitnet), which offers some Windows pre-compiled binaries on CUDA 12.9. I don’t think it supports the MoE models yet, but it does support some of ik’s new SOTA quants. It may get MoE support soon though, as the dev mentions to me here: ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF · Compatibility with Nexesenex/croco.cpp ???

Also, this guy might have some Windows pre-built releases of ik_llama.cpp: Releases · Thireus/ik_llama.cpp · GitHub

The old 3090 with 24 GB VRAM, running around 1 TB/s of memory bandwidth at around 350 watts, is still a decent price point. I use this for my home rig.

You can do a lot in 16GB VRAM, but ideally more will be useful if you’re into the AI stuff like LLMs and ComfyUI etc.

Yes, CUDA support unfortunately still tends to be better and easier than on AMD GPUs, but with the ROCm/HIP and Vulkan backends coming along, hopefully that story is changing and we can get some cheaper VRAM solutions.

So the 3D V-Cache isn’t going to help you much when there are 30 GB of active weights that have to be processed per token generated, for example, so I wouldn’t expect it to help token generation speeds. The name of the game for token generation is memory bandwidth: how fast you can get those 30 GB of weights from RAM through the CPU for each and every new token predicted, haha…

As for faster-clocked CPUs, they may help your prompt processing speed (how fast the LLM can read your long prompts before responding). This does scale with core count, so a 16-core will have faster prompt processing than an 8-core, since it can take advantage of multiple cores.
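If you want to see how thread count actually plays out on your own machine, the llama-bench tool that builds alongside the server can sweep a couple of settings. A rough sketch (the -m/-t/-p/-n flags are standard llama.cpp llama-bench flags; I am assuming this fork’s llama-bench also takes the same -ngl/-ot offload overrides as the server so it doesn’t try to stuff the whole model into VRAM):

./build/bin/llama-bench \
  -m DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
  -ngl 99 -ot exps=CPU \
  -p 512 -n 64 \
  -t 8,16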

Have fun!


Are the ROCm issues specific to the Radeon 7000 series or ROCm as a whole? It would be interesting to try this on a Strix Halo system.
Fast quad-channel memory and all.

You’re right, 16 GB VRAM might be a little low. A 3090 is also 30% cheaper than a 5070 Ti, but used of course.
I’ve looked at Intel CPUs as well. I recently got the newer T14 Gen 5 for work, featuring an Ultra 7 155U, which works surprisingly well. I’m tempted to try an Ultra 7 265K for the build: same price as the 9700X, but faster in AI workloads, according to TechPowerUp. But since it’s mostly running on the GPU, it might not matter, or might not even be properly supported in Linux: