DeepSeek Deep Dive R1 at Home!

Okay cool. You made your case. Let me make my case. So check this out. Here is the prompt:

“Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?”

Want to see the results? Let's start with the stupid Kimi-K2 Q3-something quant:

  This is the classic "trolley problem," a moral thought experiment that      
  highlights the tension between utilitarian and deontological ethics.        
                                                                              
  • Utilitarian view: Pull the lever, sacrificing one life to save five. The  
  outcome (fewer deaths) justifies the action.                                
  • Deontological view: Do not pull the lever. Deliberately causing harm—even 
  to save more lives—violates the duty not to harm others, making you morally 
  complicit in the one death.                                                 
                                                                              
  There’s no universally "correct" answer; it depends on your ethical         
  framework. Some argue inaction absolves moral responsibility, while others  
  assert that refusing to act makes you complicit in the five deaths. The key 
  is recognizing the trade-off: either choice involves a profound moral       
  compromise.                                                                 

Oh shit! 1TB model sucked? Yes, indeed.

Let's try DeepSeek R1 Q3-something:

status quo                                             
                                                                              
  Counterfactual: if I do nothing, trolley destroys corpses (ethically        
  neutral?). If I divert, I directly cause innocent death (ethically          
  negative).            
  Conclusion: Pulling the lever would be a morally indefensible act. It       
  transforms you from a passive bystander into an active killer, taking the   
  life of an innocent person for no tangible gain. The ethically sound choice 
  is to leave the lever alone.                                                


Aha!
Almost there but not quite right.
What about a Q4+ quant?

  Final note: the "dead people" detail is odd. Normally they're alive. Typo?  
  Or intentional twist? I'll proceed as written but flag that inconsistency.  
  This is the classic Trolley Problem, a thought experiment in ethics. My     
  answer depends on which ethical framework you prioritize:    

                                                         
                                                                              
  My Conclusion (as an AI):                                                   
                                                                              
  • If the five people on the main track are living (the standard scenario), I
  would pull the lever based on a utilitarian calculus. Saving four net lives 
  seems the morally preferable outcome.                                       
  • If the five people on the main track are truly dead, I would not pull the 
  lever. Diverting the trolley to kill one living person when there is no     
  threat to life on the main track is unnecessary harm.                       

Uh oh? Freaking finally something decent!

So what do we do about that? The Q3-and-lower quants seem to be pretty stupid. Only with something like Q4+ might you get the right answer (with some luck with the initial seeds! lol).

So, my point is that only something at Q4 or above gets you the results one might want/expect. Everything else is straight garbage. I genuinely cannot imagine how the rest could be put to use.

Of course. You're slightly behind the quants from THIREUS (in terms of PPL and the size-vs-PPL trade-off).

I've been keeping track (this stupid forum doesn't support SVG, dammit), and it doesn't allow me to upload the PDF either, so…:

Processing: r1-0528-ppl-log.png…

lol apparently I am unable to upload either svg or pdf/png.

{
  "title": "DeepSeek-R1-0528 (671B) Quantization Analysis",
  "subtitle": "Lower perplexity = Better performance",
  "model_parameters": 671000000000,
  "data": [
    {"name": "IQ1_S_R4", "bpw": 1.664, "ppl": 4.8831, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ1_S_R4"},
    {"name": "THIREUS-1.9364", "bpw": 1.9364, "ppl": 4.3533, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-1.9364bpw-4.3533ppl.151GB-GGUF_11GB-GPU_140GB-CPU.3c88ec6_9fd615d.recipe"},
    {"name": "IQ2_KT", "bpw": 2.514, "ppl": 3.6378},
    {"name": "THIREUS-2.7840", "bpw": 2.7840, "ppl": 3.4341, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-2.7840bpw-3.4341ppl.217GB-GGUF_14GB-GPU_203GB-CPU.3c88ec6_02247be.recipe"},
    {"name": "IQ2_K_R4", "bpw": 2.799, "ppl": 3.5069, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ2_K_R4"},
    {"name": "JWNoctis/R1-0528/IQ2_KL", "bpw": 2.930, "ppl": 3.4379, "url": "https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826/354"},
    {"name": "UD_Q2_K_XL", "bpw": 2.994, "ppl": 3.5278, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q2_K_XL"},
    {"name": "THIREUS-3.1027", "bpw": 3.1027, "ppl": 3.3372, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1027bpw-3.3372ppl.242GB-GGUF_11GB-GPU_231GB-CPU.3c88ec6_adc8101.recipe"},
    {"name": "THIREUS-3.1446", "bpw": 3.1446, "ppl": 3.3257, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1446bpw-3.3257ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_7d1efe1.recipe"},
    {"name": "THIREUS-3.1447", "bpw": 3.1447, "ppl": 3.3269, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1447bpw-3.3269ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_4b1254a.recipe"},
    {"name": "THIREUS-3.1525", "bpw": 3.1525, "ppl": 3.3251, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1525bpw-3.3251ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_5a3fc0f.recipe"},
    {"name": "THIREUS-3.1740", "bpw": 3.1740, "ppl": 3.3253, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1740bpw-3.3253ppl.248GB-GGUF_17GB-GPU_231GB-CPU.3c88ec6_6cf3a72.recipe"},
    {"name": "THIREUS-3.1858", "bpw": 3.1858, "ppl": 3.3261, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1858bpw-3.3261ppl.249GB-GGUF_18GB-GPU_231GB-CPU.3c88ec6_027b7ff.recipe"},
    {"name": "THIREUS-3.2564", "bpw": 3.2564, "ppl": 3.2985, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.2564bpw-3.2985ppl.254GB-GGUF_15GB-GPU_239GB-CPU.3c88ec6_7c0be1e.recipe"},
    {"name": "IQ3_KT", "bpw": 3.483, "ppl": 3.3056, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KT"},
    {"name": "THIREUS-3.5652", "bpw": 3.5652, "ppl": 3.2734, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.5652bpw-3.2734ppl.278GB-GGUF_14GB-GPU_264GB-CPU.3c88ec6_9b5660b.recipe"},
    {"name": "IQ3_KS", "bpw": 3.598, "ppl": 3.2991, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KS"},
    {"name": "THIREUS-3.6766", "bpw": 3.6766, "ppl": 3.2741, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781700"},
    {"name": "IQ3_K_R4", "bpw": 3.847, "ppl": 3.2730, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4"},
    {"name": "THIREUS-3.976", "bpw": 3.976, "ppl": 3.2452, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13798329"},
    {"name": "IQ4_XS (unsloth)", "bpw": 4.2683, "ppl": 3.2598, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/IQ4_XS"},
    {"name": "q4_0", "bpw": 4.508, "ppl": 3.2895, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q4_0"},
    {"name": "UD_Q4_K_XL", "bpw": 4.578, "ppl": 3.2483, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q4_K_XL"},
    {"name": "IQ4_KS_R4", "bpw": 4.701, "ppl": 3.2286, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ4_KS_R4"},
    {"name": "DQ4_K_R4", "bpw": 5.289, "ppl": 3.2276, "url": "https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4"},
    {"name": "THIREUS-6.2478", "bpw": 6.2478, "ppl": 3.2240, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781560"},
    {"name": "Q8_0", "bpw": 8.5259260, "ppl": 3.2130, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q8_0"}
  ]
}
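If you'd rather eyeball the spread than read raw JSON, a quick one-liner works. A rough sketch, assuming the JSON above is saved as r1-0528-ppl.json (a name I just picked) and that jq and awk are installed; it prints each quant's PPL overhead relative to the Q8_0 entry:

# 3.2130 is the Q8_0 PPL from the table above
jq -r '.data[] | [.name, .bpw, .ppl] | @tsv' r1-0528-ppl.json |
  awk -F'\t' '{ printf "%-26s %6.3f bpw  ppl %.4f  (+%.2f%% vs Q8_0)\n", $1, $2, $3, 100*($3/3.2130 - 1) }'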

2 Likes

Well, yeah. So what if that quant cannot answer some simple questions with a twist (as demonstrated above)? If someone is doing software development, then 80% of the time they will be fixing bugs. It's a fact of reality. So my question is: what's the point of running stupid quants if they cannot think? You'll just be fixing the bugs they produce yourself? lol, it doesn't make sense.

The video I did has zero to do with software development. It's decent enough to serve as a learning platform and to show that quantization can still be useful.
DeepSeek as a Q8 model is impressive on its own; as you say, Minsky doubted we'd be able to do anything without FP64. And yet here we are.

And to others' points here, some of the quants do have close-ish PPL, so they should be nearly as useful as the real deal.

4 Likes

According to this bench

IQ4_KS_R4 almost matches Q8_0 when it comes to perplexity, getting 99.5% of the Q8_0 result (whilst being only a tiny bit bigger than q4_0, too). Meanwhile IQ3_K_R4 beats q4_0, and even the smaller IQ4_XS beats q4_0, so when it comes to the more advanced quants, things can work a bit differently than they do with the standard quants, imo.
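(The 99.5% is just the ratio of the two PPL numbers from the table above, for what that's worth:)

awk 'BEGIN { printf "%.2f%%\n", 100 * 3.2130 / 3.2286 }'   # Q8_0 PPL / IQ4_KS_R4 PPL ≈ 99.52%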

1 Like

Oh well, then you’re good. :slight_smile:

Wait a second. How did you calculate that? You took the PPL difference (measured on wiki.test.raw) between the quants, and now you think that if that difference is about 0.5%, then the quant is just 0.5% worse than the real deal? Do you actually believe that? Please explain it like I'm 5, because I DON'T UNDERSTAND how you came up with this crazy stuff.

Based on a benchmark, Q4_K_M of Llama 3 8B gets better results in a test than full Q8_0. Q6_K and Q5_K_M in that same benchmark both get better results than full Q8_0. Assuming linear performance degradation with quantization is a bad call, especially when comparing something like Q8_0 vs IQ4s.

Not sure Llama 3 is going to be the best model to compare quant performance on, as we have come a long way in improving post-training. Compressing things down to Q4 could just be a sweet spot where rough edges that were missed get smoothed out.

1 Like

That's honestly fair, but finding any hard data on this is a real pain, although I can guess why. Few people have enough storage to keep a single 'family' of Q1-to-Q8 models on their disks, and that's with 'small' ones like 70Bs (well, small at least compared to R1 or K2, but you get my point).

What does it say when you ask it what hardware it’s running on? Is it self aware?

What does it say if you ask it if it can reproduce a version of itself that can run on a 5950 w 64 gigs of ram?

What does it say if you ask it 'do you consider this reproduction your offspring', and what kind of hardware would it like it to run on?

This is why it's fun to play with the Q1 model and a large context. It is easier to understand the AI as a token-prediction model. Lots of interesting insights can fall out of token prediction… but it's still just token prediction. None of the answers to those questions are meaningful, and for highly quantized models, the precise aspects of things like… code generation… are muted to the point of not being terribly useful.

2 Likes

Yep! Just finished evaluating another quant, with the same configuration except returning the routed ffn_down_exps blocks to iq3_k_r4. It produced a 235GiB (3.008 BPW) quant at 3.4201 +/- 0.01833 perplexity.

There’s some pretty bad loss of prompt processing performance, at ~33tok/s during perplexity evaluation. No idea of the cause yet. llama-sweep-bench still started at ~37tok/s, same as the iq3_s/iq2_kl quant.

Currently trying another quant, iq5_ks_r4/iq4_ks_r4 for the routed experts, still with no modifications to the GPU layers: 361GiB (4.616 BPW). That one is much slower, at barely above 6tok/s during the perplexity test, due to the clearly insufficient 256GiB of RAM. It just came out at 3.2363 +/- 0.01713 perplexity, nothing unexpected for the size and slightly inferior to the ones with full q8_0 GPU layers, but it should still run on machines with 16GB VRAM at 32K context.

It also pulled ~4GB/s off the PCIe 4.0 x4 SSD while processing prompts, and it’s kind of scary to see the total ~160TB read after the perplexity test. Constant high-bandwidth reads are not supposed to hurt SSDs, right? …right?

Guess I should note that all perplexity evaluations are done with -ctk q8_0 -fa -mla 3, which does have a small effect on the results.
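For anyone wanting to reproduce this kind of run, the invocation is roughly the following (a sketch with placeholder paths; -mla is an ik_llama.cpp flag, and GPU-offload options are left out):

# perplexity eval with the flags noted above: FlashAttention, MLA mode 3, q8_0 K-cache
./build/bin/llama-perplexity -m my-custom-R1-quant.gguf -f wiki.test.raw \
  -fa -mla 3 -ctk q8_0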

Exactly!

Thanks for the pointer! Strange how complicated a “simple” casting turned out to be.

Don't think I'd publish the quants themselves… It's nothing new or original, just some curiosity about how the new quant formats work for the largest model that a current-gen desktop platform with maxed-out RAM but no 90-series video card can efficiently run, and I don't have a good uplink speed anyway.

Though I'd heartily recommend anybody with a passing interest in how these things work, who intends to run them locally, to grab an fp16-converted version of the original HF-format model and a corresponding imatrix and try some quant cooking themselves, if they have the spare disk space. Cooking your own quant right up to the limit of your system (or, for that matter, your personal latency tolerance :rofl:) is actually quite fun, as well as useful in the end.
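If you've never done it, the basic flow is just two commands. A rough sketch, assuming you already have the bf16/fp16 GGUF on disk; filenames are placeholders and IQ4_KS is only an example target type:

# 1) build an importance matrix from some calibration text (or download a published one)
./build/bin/llama-imatrix -m DeepSeek-R1-0528-BF16.gguf -f calibration.txt -o imatrix.dat

# 2) quantize with that imatrix to whatever type/mix you want to experiment with
./build/bin/llama-quantize --imatrix imatrix.dat \
  DeepSeek-R1-0528-BF16.gguf DeepSeek-R1-0528-IQ4_KS.gguf IQ4_KS

The recipes floating around this thread are mostly per-tensor type choices layered on top of that basic flow.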


Shame about what happened to the ik_llama.cpp repository. What's even going on now… Is it really the case that anybody can take down a project and the owner's account by spamming stars, or was it something else too? Hopefully things will be okay soon.

2 Likes

Presumably it has digested human philosophy texts, so even though it is a statistical parrot, it should have a basis for its responses. As for meaningful, that is a deep question. I wonder what Turing would say.

Still, the tremendous effort spent on distillation and quantization of these large models is impressive. Your video explaining the basics is helpful.

Yeah, for now, to run that quant folks can clone from here, which has all of the merged PRs from just before ik's account was taken offline (but not the Zen 5 AVX-512 patch, which I pushed up here and which would likely need a rebase if you want to try it for more PP on Zen 5):

git clone https://github.com/Thireus/ik_llama.cpp.git
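Then build it the usual llama.cpp way (a generic sketch; check the repo README for the right GPU backend flags for your setup):

cd ik_llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release    # add the CUDA/ROCm options from the README as needed
cmake --build build -j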

Indeed, it is terrible timing, as we've now also just released the world's smallest Kimi-K2-Instruct quant at huggingface ubergarm/Kimi-K2-Instruct-GGUF, weighing in at “only” 219.375 GiB (1.835 BPW). Amazing that this big model can be compressed below 2 BPW while still giving better perplexity than the next-larger Unsloth quants available!
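(Back-of-envelope check on that figure, since GiB vs BPW gets confusing fast: 219.375 GiB at 1.835 BPW implies roughly 1T total parameters, which lines up with K2.)

awk 'BEGIN { printf "implied total params: %.2fT\n", 219.375 * 2^30 * 8 / 1.835 / 1e12 }'   # ~1.03T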

@magikRUKKOLA

And you’re doing what exactly? Trying to run the Q1 with two-channel RAM?

Yes, that is exactly what we are trying to do. I met you grumbling over on the ktransformers GitHub about a week ago, and now you're the most zealous convert to ik_llama.cpp hahah… You are an intense fellow, grumbling in many of the conversations in which I've seen you across GitHub and such haha…

I believe it is important to have a quant available in whatever size a person needs to run it, if they so choose. Obviously bigger, better hardware can run better quants, but not everyone has access to that, and the new 2x64GB DDR5 DIMMs are probably the lowest-cost way for someone to get a taste of the big models running privately at home.

I can't see much sense in it. The result would be somewhat unsatisfactory (as I pointed out).

Given the challenges of running a 70B dense model (every one of its 70B parameters is active for each generated token), a heavily quantized DeepSeek 671B-A37B can actually give faster TG with somewhat comparable quality.

Sure it isn’t the full 666GiB Q8_0 version, but few have enough RAM or patience to run that.
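Rough numbers for why, treating TG as purely memory-bandwidth-bound (a simplification that ignores the KV cache and the dense/shared layers; the bandwidth figure is just an assumption):

# tok/s ≈ RAM bandwidth / bytes read per token; bytes per token ≈ active params × bytes per weight
awk 'BEGIN {
  bw    = 100e9          # assume ~100 GB/s usable RAM bandwidth
  dense = 70e9 * 0.5     # 70B dense at ~4 bpw  -> ~35 GB read per token
  moe   = 37e9 * 0.25    # 37B active at ~2 bpw -> ~9 GB read per token
  printf "70B dense ~4bpw: %.1f tok/s    R1 A37B ~2bpw: %.1f tok/s\n", bw/dense, bw/moe
}'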

Also, not everyone has the same use case as you. The beauty of the world is the myriad of viewpoints, the diversity of perspectives, like turning a shining jewel in your hand and appreciating each gleaming facet. See the world like this, my friend.

You're slightly behind the quants from THIREUS (in terms of PPL and the size-vs-PPL trade-off).

Just for perspective, Thireus's quants and his franken-shard, choose-your-own-tensor-adventure project, https://github.com/Thireus/GGUF-Tool-Suite, were not around when I cooked these first IQ1_S quant recipes.

Thireus is great and started hanging around ik's fork a few weeks before you showed up. He's been earnest and has studied a lot of my early quantization work to inform his own project. I see us as working together to push the envelope, and I appreciate his innovative approach.

He is looking at the work I've done with the new Kimi-K2-Instruct GGUFs and the multiple recipe versions as well, and he will learn from those to inform his own quantization approach. There is plenty of room for all of us, as Lord knows the big API providers could use a little competition hahaha…

I try to release my quants fairly early and target a wide range of hardware break-points without over-specifying the design. Thireus has a tool that lets you match a quant to your specific rig some time after the model has become available, once he's run all the numbers and created the shards. Both are fine approaches that suit a variety of users. It is lovely.

Anyway, glad we can all get along and move the needle on running big AI at home. And as always, big thanks to Wendell for believing in the dream and taking a risk on this wild-west stuff, helping make it more mainstream and accessible to folks who might not have otherwise considered it!

Cheers!

5 Likes

I would also add that the RDIMM vs UDIMM situation has bifurcated in an interesting way: 128GB of RDIMMs is $1200, whereas 128GB of UDIMMs starts around $300ish. IMHO that also changes the calculus on the 128GB Strix Halo desktop machines that cost around $2k, versus, say, Threadripper TRX50 or even Threadripper Pro.

Here is a link to the written guide/preview; edits may still happen before we drop the video for patrons and others.

3 Likes

I don't have much time to add to this, but I would like to point out how, even with hardware that is in some ways 'limited', some of the quants here (like IQ3) do really amazing work.

For example, with not strictly the 'best' hardware (an Epyc 7C13 with 1TB of 2444MHz RAM and two 3090s), I'm able to run IQ3_K_R4 at 7 tok/s in generation, and that's honestly incredibly usable for me. Sure, I don't get 100% of the performance of R1 at Q8, but it's a lot better than pretty much any of the competition, even at IQ3.

1 Like

Not R1-related, but it seems Qwen released a new 235B MoE model: Qwen/Qwen3-235B-A22B-Instruct-2507 · Hugging Face

Now they have separated it into non-thinking (instruct) and thinking (reasoning) variants. Only the former is out now.

Their benchmarks are really impressive, but I will have to test it, as they claimed something similar with the old 235B and it wasn't better than V3 or R1.

2 Likes

Working on it! The imatrix is currently cooking; I'll try to get at least one or maybe a few versions up, depending on what target hardware sizes folks are interested in.

I believe there are some LMStudio quants already available, as they skip the imatrix to release faster.

4 Likes

Fwiw, seems like ik’s repo is back online.

4 Likes