Just released some “SOTA” quants of DeepSeek-V3-0324 up on huggingface (link to r/LocalLLaMA
post about it).
I have a quant comparison showing the exact quants used for which layers, and it is interesting to see the differences across the top quant cookers' offerings.
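If you want to poke at the per-layer quant types yourself, here is a rough sketch using the `gguf` Python package from llama.cpp's gguf-py (the file name is just a placeholder, and mainline gguf-py may not recognize every ik_llama.cpp-specific quant type):

```python
# Rough sketch: list which quant type each tensor in a GGUF uses.
# Assumes the `gguf` Python package (from llama.cpp's gguf-py) is installed.
# Note: mainline gguf-py may not know ik_llama.cpp-specific quant types.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path

counts = Counter()
for tensor in reader.tensors:
    # tensor.tensor_type is a GGMLQuantizationType enum value (Q8_0, Q4_K, etc.)
    print(f"{tensor.name:60s} {tensor.tensor_type.name}")
    counts[tensor.tensor_type.name] += 1

print("\nQuant type totals:", dict(counts))
```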
Also have some perplexity benchmarks showing how these quants deliver performance similar to the full Q8_0
model in a much smaller memory footprint, without sacrificing prompt processing or token generation speed.
These quants do require the ikawrakow/ik_llama.cpp fork, which builds and runs about the same as vanilla mainline llama.cpp and offers the OpenAI-API-compatible llama-server
endpoint that works with all your favorite clients. Just note these GGUFs will not work with regular llama.cpp, ollama, lm studio, koboldcpp etc… (That's because they have additional MLA tensors [which save a ton of space on kv-cache]).
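Since the server speaks the OpenAI API, any standard client works against it. Here is a minimal sketch using the `openai` Python package; the host, port, and model name are assumptions based on llama-server defaults, so adjust them to however you launched the server:

```python
# Minimal sketch: chatting with ik_llama.cpp's llama-server through its
# OpenAI-compatible endpoint. Host/port and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's default port; adjust if needed
    api_key="sk-no-key-required",         # llama-server ignores the key unless you set one
)

resp = client.chat.completions.create(
    model="DeepSeek-V3-0324",  # placeholder; the server typically serves whatever model it loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```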
Thanks a ton to @wendell for all the hardware support and bandwidth going into doing research and cooking big quants! Also thanks to y'all sharing your results and findings in this fast-moving special interest haha…