DeepSeek Deep Dive R1 at Home!

Thanks, just double-checking: is this the one you use? bartowski/Mistral-Large-Instruct-2411-GGUF · Hugging Face

I didn’t find any Unsloth version, but I did stumble on “SOAR” models like julien31/Soar-mistral-123b · Hugging Face, which seem more research-oriented, and for some reason they use 2407 instead of 2411.

It doesn’t sound like they were planning an actually larger model. I’m also doubtful that the claim that their medium is the new large will hold up.

Yes, exactly that one. I run it at either Q5_K_M or Q6_K; I don’t feel much of a difference.

1 Like

Yeah, that might be a big challenge for the coming months. How do you see what you can’t benchmark?

Besides:

  • human preference from Chatbot Arena, which IMHO is kind of useless (I don’t care that people prefer ChatGPT’s way of writing compared to Grok’s or whatever; models are too powerful for that, IMO)
  • Benchmarks that you can RL against
  • Math/code that you can also RL against

Besides math and code (and maybe agentic behavior to some extent), where we might gain performance for some time, for the rest we fall into Goodhart’s law:

“When a measure becomes a target, it ceases to be a good measure”.

Wikipedia: Goodhart’s law

Or rather, it’s that the models themselves are still incapable, and not getting noticeably closer to doing the kinds of jobs that rely on long context, exact understanding, and yes, agentic behaviour, the kinds of things humans are paid well for, despite supposedly being human- or superhuman-level on a lot of benchmarks previously thought difficult. Though I shudder to think of the day they finally can, perhaps not without some fundamental architectural change.


Back on topic: after some memory tuning and a driver and CUDA toolkit update, my current setup does 42 tok/s prompt processing and 6.2 tok/s token generation at 0 context, and 28 tok/s and 5.3 tok/s respectively at 36K, with the ~250GB Q2_K quant of DeepSeek-R1-0528 on the latest ik_llama.cpp. That’s on 4x64GB DDR5-4800@36-40-40-78 with 10µs tREFI, a 1-CCD Ryzen 7 7800X3D (so 64GB/s tops in RAM bandwidth; 62GB/s benchmarked on this setup), and an Nvidia card with 16GB of VRAM.

Until something significantly improves 2R 2DPC on desktop platforms beyond overpriced, blingy G.Skill kits and silicon-lottery wins, there’s probably not much more room for improvement without going to a workstation or used server. Even then, a hypothetical ~90GB/s of read bandwidth on a platform unconstrained by the Infinity Fabric bottleneck and running at maybe DDR5-6400 might get you ~9 tok/s token generation, and not much more prompt processing speed.
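As a rough sanity check on that last estimate: token generation against RAM-resident expert weights scales roughly linearly with read bandwidth, so you can extrapolate from the measured numbers above. A back-of-envelope sketch (ignoring the GPU-resident share and any compute limits):

# scale the measured 6.2 tok/s at ~62 GB/s to a hypothetical ~90 GB/s platform
awk -v tg=6.2 -v bw_now=62 -v bw_new=90 'BEGIN { printf "~%.1f tok/s\n", tg * bw_new / bw_now }'   # ~9.0 tok/s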

3 Likes

Two years ago, I started talking with peers about my worries regarding AI’s development. While AI has gotten steadily better at scoring well on tests and helping users get practical results, we haven’t seen any huge leaps in the core technology; the transformative, “step-function” advances are still missing.

This slow progress is causing problems for companies planning their workforces. Many businesses have already cut staff, assuming AI would soon deliver massive productivity jumps. But those game-changing benefits might still be a decade away. The danger is that people are mistaking small, gradual improvements—or even clever tools like “agentic workflows” (AI systems that automate multi-step tasks)—for revolutionary breakthroughs. For example, some companies are excited about AI that can plan and execute complex tasks automatically. But these systems are often just advanced versions of the same technology that powers smart auto-complete suggestions. They’re helpful, but not fundamentally new. If companies act on unrealistic expectations, they could end up with skill gaps or operational issues that take years to fix.

As time passes without major AI breakthroughs, the gap between what people hope AI can do and what it actually can do will grow clearer.

1 Like

Anyone else using Open WebUI 6.15 with ik_llama.cpp/llama.cpp and not seeing tokens-per-second statistics? In Models, I have also checked “Usage”. It’s the same problem as described in https://www.reddit.com/r/LocalLLaMA/comments/1lt1z1a/how_do_i_see_my_tokens_per_second_speed_im_using/ which prompts me to ask what others see… does it only work for Ollama?

I’ve looked through the GitHub issues a couple of times and tried other things, like explicitly putting “none” in the API key and toggling the model to Local instead of External.

A brief look at the browser’s debug console, with llama-server configured via direct connection, revealed no such info in the HTTP exchanges. All they had were the three (completion, prompt, and total) token counts, all in turn displayed by the UI, even though more information is visible in ik_llama.cpp’s own debug output.

So yes, it’s probably Ollama specific.

Someone might be interested in adding them, though I expect that might end up breaking someone else’s parsers and clients. But yes, an option for extended usage info would be nice to have.

1 Like

Can someone explain to me what these values actually mean in the sweep benchmark?

[Screenshot of the sweep benchmark output, 2025-07-08]

Like, PP is the prompt processing length while TG is the number of tokens to be generated. I figure N_KV is the context size, but what is the difference between T_PP (s) and S_PP (t/s), and the last two for TG?

T_{PP|TG} is the total time taken to do prompt processing and token generation respectively, and S_{PP|TG} is the corresponding throughput in tokens per second.

From the first row: the 512 tokens to PP divided by the 5.574 s taken in total gives you the 91.85 t/s. Same for the TG.
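In other words, the S_ columns are just tokens divided by time; you can reproduce the first row’s number directly with a tiny check:

awk 'BEGIN { printf "S_PP = %.2f t/s\n", 512 / 5.574 }'   # ~91.85 t/s, modulo rounding of the displayed values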

2 Likes

Ahhh now it makes sense, thank you.

@lerner

So ik_llama.cpp is not quite as up to date as mainline llama.cpp in terms of how “OpenAI compliant” its server endpoint is. The usual completions and chat endpoints work fine in my usage; however, anything special like out-of-band tool-calling endpoints generally requires you to run some kind of reverse proxy or OpenAI-compliant pass-through server, e.g. this guy’s FastAgentAPI maybe.

As far as I know, the only way to get the tok/sec is to estimate it on the client side, which I do in my little Python async streaming chat app using this ds_token Python library. It’s only exact for DeepSeek, but it gives a good enough estimate for models that use other tokenizers.
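If you just want a rough number without a client library, a minimal sketch like the following works against any OpenAI-compatible endpoint that returns the usage counts mentioned above. The URL, model name, prompt, and max_tokens are placeholders for your own setup, and the wall-clock time includes prompt processing, so it slightly underestimates pure generation speed:

#!/usr/bin/env bash
# rough client-side tok/s estimate against llama-server / ik_llama.cpp (OpenAI-compatible API)
URL="http://127.0.0.1:8080/v1/chat/completions"

start=$(date +%s.%N)
resp=$(curl -s "$URL" -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":128}')
end=$(date +%s.%N)

tokens=$(echo "$resp" | jq '.usage.completion_tokens')
awk -v t="$tokens" -v s="$start" -v e="$end" \
  'BEGIN { printf "%d tokens in %.1f s -> %.2f tok/s\n", t, e - s, t / (e - s) }'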

I’ve used Open WebUI successfully with ik_llama.cpp for basic features, but haven’t messed around with the multiple additional models and features it tries to use, which might not be fast enough with default settings. I haven’t tried this lately, but last time this is how I started it:

#!/usr/bin/env bash

source venv/bin/activate

# IT DOES NOT HONOR THE HOST AND PORT ENV VARS, SO PASS THEM MANUALLY
# https://docs.openwebui.com/getting-started/env-configuration/#port

export DATA_DIR="$(pwd)/data" # to clean up just do rm -rf ./data
export ENABLE_OLLAMA_API=False
export ENABLE_OPENAI_API=True
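# point the OpenAI connection at the local llama-server / ik_llama.cpp endpoint; the dummy key just keeps the UI happy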
export OPENAI_API_KEY="none"
export OPENAI_API_BASE_URL="http://127.0.0.1:8080/v1"
#export DEFAULT_MODELS="openai/foo/bar"
export WEBUI_AUTH=True
export DEFAULT_USER_ROLE="admin"
export HOST=127.0.0.1
export PORT=3000

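# skip the background generations (chat tags, autocomplete) that would fire extra requests at the slow local model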
export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False

open-webui serve \
  --host $HOST \
  --port $PORT

Hi people! Can anyone explain to me how to estimate how much VRAM the context will use, for example in DeepSeek-R1 with different sizes, other than seeing whether it OOMs or not? Like, how can I estimate how much VRAM a 32K or 64K context would use?

Also, is there a llama.cpp flag to tell llama.cpp which CUDA device it should store the context on?

tl;dr;

For DeepSeek specifically, the VRAM usage will be the amount due to offloaded tensor weights plus a KV cache that grows linearly with context and stays small thanks to MLA. Most other LLMs’ KV caches also grow linearly, just with a much larger per-token footprint, since they don’t use MLA.


I made this chart when experimenting with ubergarm/DeepSeek-V3-0324-GGUF

So this chart will vary depending on the exact model and kv-cache quantization you choose. More below.

The Deets

It varies widely by model, and the easiest way is to start up a few times with different settings and note the debug logs that tell you how much VRAM is being used e.g.

  1. Vary the size of the kv-cache entries e.g. -ctk f16 -ctv f16 or -ctk q8_0 -ctv q8_0 etc.
  2. Vary the context length e.g. -c 8192 and -c 16384 etc.
  3. Note the logs on startup e.g.
# deepseek only uses `-ctk` and ignores `-ctv` since it uses MLA:
llama_kv_cache_init:      CUDA0 KV buffer size =   612.02 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   554.64 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used

# for a non-MLA you'll see something like
llama_kv_cache_init:      CUDA0 KV buffer size =  5120.00 MiB
llama_new_context_with_model: KV self size  = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
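Since the cache grows linearly with context, one measurement is enough to extrapolate from. A small sketch, assuming (my assumption, the log doesn’t show it) that the 1166.62 MiB above came from a 16384-token context:

# linear extrapolation of the logged "KV self size" to a 64K context
awk -v logged_mib=1166.62 -v logged_ctx=16384 -v target_ctx=65536 \
  'BEGIN { printf "~%.0f MiB at %d context\n", logged_mib * target_ctx / logged_ctx, target_ctx }'   # ~4666 MiB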

Also, is there a llama.cpp flag to tell llama.cpp which CUDA device it should store the context on?

tl;dr; I don’t know.

Notice it tells you which CUDA0/CUDA1 device has how much of the buffer, assuming you are using a multi-GPU config.

So I’m not sure if that can be adjusted to put all the KV cache buffer on the single fastest GPU, for example. Looking at some of the arguments available in the help:

    -sm,   --split-mode SPLIT_MODE  how to split the model across multiple GPUs, one of:
                                      - none: use one GPU only
                                      - layer (default): split layers and KV across GPUs
                                      - row: split rows across GPUs
    -ts,   --tensor-split SPLIT     fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. …
    -mg,   --main-gpu i             the GPU to use for the model (with split-mode = none),
                                    or for intermediate results and KV (with split-mode = row) (default: 0)

I never change these settings myself, but maybe using -mg would do something? You could try it and then observe the output logs. I wouldn’t recommend changing -sm, though. And there’s no need to use -ts when using -ot.
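If you want to poke at it, the experiment could look something like this (the model path, context size, and -ot expression below are placeholders, not a recommendation): run it once with -mg 0 and once with -mg 1, then compare the llama_kv_cache_init lines in the startup logs.

# placeholder model path; -ngl 99 offloads what fits to VRAM, -ot exps=CPU keeps the routed experts in RAM
./llama-server \
  -m /models/DeepSeek-R1-0528-IQ2_K_R4.gguf \
  -c 32768 -ctk q8_0 \
  -ngl 99 \
  -ot exps=CPU \
  -mg 0 \
  --host 127.0.0.1 --port 8080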

I swear I saw an experimental PR on mainline llama.cpp, by fairydreaming perhaps, to put all the KV cache on a single GPU or a single NUMA node or something, but it didn’t yield any gains. Ah, here is that reference.

There was some more chatter about CUDA device order here, but I don’t think it answers your question.

What are you trying to achieve by manually placing the kv-cache on one GPU or another?

1 Like

@ubergarm Hey, nice work as always! I’m waiting for DeepSeek-TNG-R1T2-Chimera to see how it goes!

I was wondering, and this is probably a very big ask, but do you have these models at hand? I still want to compare PPL, but for some reason I keep getting NaNs. It is mostly to see how models of these sizes behave.

DeepSeekV3 0324: Q2_K_XL (Unsloth), IQ3_XXS (Unsloth), IQ4_XS (Unsloth), IQ2_K_R4 (yours), and Q8_0 baseline (not sure if it’s out there)
DeepSeek R1 0528: Q2_K_XL (Unsloth), IQ3_XXS (Unsloth), IQ4_XS (Unsloth). You have already published the PPLs for your ones so that’s great for a comparison (IQ2_K_R4, IQ3_K_R4, Q8_0)

I can run any of these (except the Q8_0 ones), and, well, I just want to see how they compare at those quant sizes, without NaNs. If you need a donation or similar for this, I will be glad to help.

For example, Q8_0 is 1.93% better than IQ3_K_R4 (that’s really close for that size lol) and 9.15% better than IQ2_K_R4 for R1 0528. So I wonder how it looks vs the unsloth ones.

Sorry for the long message!