DeepSeek Deep Dive R1 at Home!

It really does appear that anybody can take down a project and the owner’s account by spamming stars…

2 Likes

Does anyone here know, or have any hints, about adding knowledge to an existing model with a post-training workflow? I am interested in this but have no experience or expertise in this area. Any nudge in the right direction would be much appreciated.

1 Like

You’ll find a number of discussions on the “Fine-Tuning vs RAG” topic on r/LocalLLaMA, which is pretty much what I think you are asking about.

In general, if you want to add “knowledge” to an existing model you can use RAG (Retrieval-Augmented Generation). A RAG setup is essentially a database full of “chunks” of “knowledge” covering whatever you want your model to know. Each of these chunks is keyed by an embedding vector.

When a user makes a query, it is turned into an embedding vector, the closest top-N “chunks” of “knowledge” are looked up, and those chunks are injected into the prompt along with the query. So you can use a smaller LLM with less built-in knowledge, and the RAG layer adds the missing context to the query.
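
As a rough illustration of the retrieval step, here is a minimal sketch assuming a llama-server instance started with --embeddings that exposes the OpenAI-compatible /v1/embeddings route (the host, port, query text, and jq post-processing are placeholders):

    # Turn a query into an embedding vector; the same call is used at indexing
    # time to embed each document chunk.
    curl -s http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"input": "How do I configure the widget?"}' \
      | jq '.data[0].embedding | length'
    # At query time this vector is compared (e.g. cosine similarity) against the
    # stored chunk vectors, and the top-N closest chunks are pasted into the prompt.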

If you want to fine-tune a model, that is more for changing the formatting and tone of the output. It is not so much for learning new facts or data, as that takes a lot of time to train.

This is a kind of high-level overview, but you could probably experiment with one of the many ready-to-go RAG solutions that sit on top of your llama-server LLM. They handle “chunking” your documents and managing the embedding database for semantic search.

If you really want a deep dive, there are some good blog posts by Anthropic on the topic, including adding keyword search and rerankers on top.

4 Likes

Phew, I finally made it to the bottom of this thread.

I have a Xeon w5-3435X with 768GB of DDR5-6000 and 7x RTX A4000s (the 16GB Ampere cards, not the 20GB Ada ones). I’d like to pull one and swap the RTX 6000 Ada from my AM5 desktop into it as well, but that requires messing around with the water loop, so I haven’t bothered yet.

With that setup, I’m able to run DeepSeek-V3 and Kimi K2 IQ4_XS with ik_llama.cpp at 8-8.5 tok/s, which I find just fast enough for interactive use. I’m not sure if that’s the best quant to be using. I have room for larger variants, and I’ve tried the _R4 quant / --run-time-repack, but neither has performed as well as IQ4_XS.

So far I haven’t managed to figure out a better layer offload scheme than --override-tensor exps=CPU, which is leaving my GPUs quite underutilized. Is there a way to ensure that the most important parts are loaded onto GPU, then opportunistically load additional layers by tuning -ngl? I tried something to that effect and it turned out to be slower. I realize that’s not much to go on; I’ll have to try it out again and record results so that I can come up with a better question.

2 Likes

Thank you. I think fine-tuning is in fact what I want to look at, since I want to add quite a bit of knowledge to those models and I am running into the limits of what is possible with RAG. From the Reddit thread I found my way to the Unsloth notebooks, which offer some explanation and easy ways to get started. I am not sure whether it will be feasible for me to generate datasets that produce good fine-tuning quality, but figuring that out will be an important first step.

Afaik the idea here is that -ngl 99 (for example) loads everything to GPU, and then with -ot exps=CPU you tell it to put all expert layers on the CPU. What remains on the GPU should be the context, KV cache, and some special extra dense layers, like the first three layers in DeepSeek-R1 for example.

So basically the most important stuff goes to the GPU and the rest to the CPU. If you can’t offload the entire model, the idea is to add further -ot parameters to pin as many expert layers as possible onto the GPU devices that you have, with anything not explicitly assigned to a GPU again falling back to the CPU.

For example, with DeepSeek-R1 you could use:

    -ot "blk\.(3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4)\.ffn_.*=CUDA1" \

… or even …

 -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
 -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
1 Like

That aligns with my understanding of how those features work, which is good and bad, haha. I swear that generation speed got worse when I tried to offload additional blocks to the GPUs. I will have to try it out again this evening so that I can provide a concrete example of what’s going on.

1 Like

Yeah, I usually start with a baseline command, get a llama-sweep-bench run saved, and then test different combinations. It is a bit tedious and time consuming, but you can dial it in pretty well with that.
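
For example, something like this (just a sketch; the flags mirror the commands used later in this thread, and the model path is whatever you’re benchmarking):

    # Save each sweep to its own log so different offload schemes can be compared.
    ./llama-sweep-bench -t 32 -ngl 99 -c 32768 \
        --model /path/to/model.gguf \
        -ot "exps=CPU" -mla 3 -fa -fmoe \
        2>&1 | tee sweep_baseline.log

    # Later, compare just the result tables of two runs:
    # diff <(grep '^|' sweep_baseline.log) <(grep '^|' sweep_offload.log)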

Okay so 7 CUDA devices with 16GB VRAM each and 768GB RAM nice.

First compile ik_llama.cpp like this now:

# no more need for -DGGML_CUDA_IQK_FORCE_BF16=1 but it might be faster on older CUDA if you want to test
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

This will make sure it says llama_new_context_with_model: pipeline parallelism enabled (n_copies=1) on startup. The default value of 4 will cause VRAM OOMing on multi-GPU.
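
A quick sanity check after rebuilding (a sketch; the binary name and path may differ in your build tree):

    # Confirm the rebuilt binary reports a single pipeline copy on startup.
    ./llama-sweep-bench --model /path/to/model.gguf -ngl 99 2>&1 \
      | grep "pipeline parallelism"
    # expected: llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)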

And yes just as @H-i-v-e says, for more TG speed try offloading more routed expert layers onto your GPUs. Keep in mind DeepSeek and Kimi are slightly different in where the exps/shexp begin.

# Kimi-K2
    -ngl 99 \
    -ot "blk\.(1)\.ffn_.*=CUDA0" \
    -ot "blk\.(2)\.ffn_.*=CUDA1" \
    -ot "blk\.(3)\.ffn_.*=CUDA2" \
    -ot "blk\.(4)\.ffn_.*=CUDA3" \
    -ot "blk\.(5)\.ffn_.*=CUDA4" \
    -ot "blk\.(6)\.ffn_.*=CUDA5" \
    -ot "blk\.(7)\.ffn_.*=CUDA6" \
    -ot exps=CPU \

# for DeepSeek start at -ot "blk\.(3)\.ffn_.*=CUDA0" \

Also, if you want a lot more PP you can typically increase to -ub 4096 -b 4096 for a little extra VRAM, but then you might not be able to offload as much, so there’s a small hit to TG. It’s all trade-offs.

Finally, I just experimented with a tiny draft model that gave almost a 1 t/s speed-up on TG, which seems nice. Just realized I accidentally was starting from 3 on Kimi in that command there lol: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF · Seeing some TG uplift on ik_llama.cpp

Anyway, yeah give it a go and feel free to share your full command for massaging.

I did not look into Kimi; which layers need special treatment with Kimi-K2?

Thanks for the recommendations, when I get home I’ll recompile and give it a shot.

Why is GGML_SCHED_MAX_COPIES=1 so important to avoid OOMs? From reading through PR comments, it seems like pipeline parallelism is only enabled during full GPU offload, which is assumed given -ngl 99, even though we end up offloading tensors with the --override-tensor arguments. Is that correct-ish?

Hold on, I thought about this and now I am less sure of what I assumed I knew before. Do earlier -ot arguments take precedence over later ones? That is, once one argument assigns a tensor to a buffer, it won’t get re-assigned by another? And then -ngl 99 basically takes anything not explicitly assigned and offloads it to GPU?

DeepSeek R1 and V3 have 3 leading dense blocks (0, 1, 2) whereas Kimi K2 has only 1 (block 0). Apparently. This is not something I knew ten minutes ago. :smile:

1 Like

Apparently I had already compiled with -DGGML_SCHED_MAX_COPIES=1.

Anyway, this is with DDR5-6000 CL42 memory on my w5-3435X w/ 4.6GHz all-core OC and 2700MHz mesh clock.

% ./llama-sweep-bench -t 32 -ngl 99 -c 32768 --host 0.0.0.0 --model ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf -a Kimi-K2 -ot "exps=CPU" -mla 3 -fa -fmoe

llama_kv_cache_init:      CUDA0 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =   252.00 MiB
llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.62 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  3209.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA6 compute buffer size =  3260.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   176.01 MiB
llama_new_context_with_model: graph nodes  = 3525
llama_new_context_with_model: graph splits = 128

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   11.001 |    46.54 |   15.145 |     8.45 |
|   512 |    128 |    512 |   15.337 |    33.38 |   17.793 |     7.19 |
|   512 |    128 |   1024 |   13.510 |    37.90 |   20.117 |     6.36 |
|   512 |    128 |   1536 |   11.516 |    44.46 |   22.558 |     5.67 |
|   512 |    128 |   2048 |   15.979 |    32.04 |   23.476 |     5.45 |
|   512 |    128 |   2560 |   14.345 |    35.69 |   32.007 |     4.00 |
|   512 |    128 |   3072 |   27.008 |    18.96 |   24.811 |     5.16 |
|   512 |    128 |   3584 |   33.492 |    15.29 |   25.577 |     5.00 |
|   512 |    128 |   4096 |   29.364 |    17.44 |   29.203 |     4.38 |
[...]

Rebuild with cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1

% ./llama-sweep-bench -t 32 -ngl 99 -c 32768 --host 0.0.0.0 --model ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf -a Kimi-K2 -ot "exps=CPU" -mla 3 -fa -fmoe

[...]

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   13.413 |    38.17 |   15.813 |     8.09 |
|   512 |    128 |    512 |   21.902 |    23.38 |   16.377 |     7.82 |
|   512 |    128 |   1024 |   22.138 |    23.13 |   16.975 |     7.54 |
|   512 |    128 |   1536 |   28.578 |    17.92 |   19.841 |     6.45 |
|   512 |    128 |   2048 |   24.896 |    20.57 |   26.026 |     4.92 |
|   512 |    128 |   2560 |   21.297 |    24.04 |   24.700 |     5.18 |
|   512 |    128 |   3072 |   25.663 |    19.95 |   29.036 |     4.41 |
|   512 |    128 |   3584 |   31.695 |    16.15 |   29.243 |     4.38 |
|   512 |    128 |   4096 |   33.866 |    15.12 |   24.628 |     5.20 |

Let’s try -ub 4096 -b 4096

% ./llama-sweep-bench -t 32 -ngl 99 -c 32768 --host 0.0.0.0 --model ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf -a Kimi-K2 -ot "exps=CPU" -mla 3 -fa -fmoe -ub 4096 -b 4096

llama_kv_cache_init:      CUDA0 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =   252.00 MiB
llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.62 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  6592.02 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  4048.02 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  4048.02 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  4048.02 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  4048.02 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  4048.02 MiB
llama_new_context_with_model:      CUDA6 compute buffer size =  4048.03 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   624.05 MiB
llama_new_context_with_model: graph nodes  = 3525
llama_new_context_with_model: graph splits = 180

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   90.204 |    45.41 |  220.816 |     4.64 |
|  4096 |   1024 |   4096 |   82.390 |    49.71 |  244.025 |     4.20 |
|  4096 |   1024 |   8192 |   93.099 |    44.00 |  239.935 |     4.27 |
|  4096 |   1024 |  12288 |   82.672 |    49.54 |  210.742 |     4.86 |

This one is super slow, and the per-row PP and TG sizes are much larger, so it’s hard to compare to the other runs. I guess it’s quite fast considering the prompt and output sizes, though? I’m going to leave those options off for now because I don’t feel like waiting a long time for results.

Anyway, disregarding the previous experiment, each GPU is at 5GB utilization so far, so I could fit about 10GB more on each one. I will try some offloading.

% ./llama-sweep-bench -t 32 -ngl 99 -c 32768 --host 0.0.0.0 --model ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf -a Kimi-K2 -ot "exps=CPU" -mla 3 -fa -fmoe \
    -ot "blk\.([1-9])\.ffn_.*=CUDA0" \
    -ot "blk\.(1[0-9])\.ffn_.*=CUDA1" \
    -ot "blk\.(2[0-0])\.ffn_.*=CUDA2" \
    -ot "blk\.(3[0-9])\.ffn_.*=CUDA3" \
    -ot "blk\.(4[0-9])\.ffn_.*=CUDA4" \
    -ot "blk\.(5[0-9])\.ffn_.*=CUDA5" \
    -ot "blk\.(6[0-9]|61)\.ffn_.*=CUDA6" \
    -ot exps=CPU

llama_kv_cache_init:      CUDA0 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =   324.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =   252.00 MiB
llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.62 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  3209.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  3260.25 MiB
llama_new_context_with_model:      CUDA6 compute buffer size =  3177.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   176.01 MiB
llama_new_context_with_model: graph nodes  = 3525
llama_new_context_with_model: graph splits = 182

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   15.494 |    33.05 |   15.744 |     8.13 |
|   512 |    128 |    512 |   13.658 |    37.49 |   15.909 |     8.05 |
|   512 |    128 |   1024 |   18.262 |    28.04 |   16.132 |     7.93 |
|   512 |    128 |   1536 |   14.347 |    35.69 |   20.759 |     6.17 |
|   512 |    128 |   2048 |   18.479 |    27.71 |   18.685 |     6.85 |
|   512 |    128 |   2560 |   24.365 |    21.01 |   19.000 |     6.74 |
|   512 |    128 |   3072 |   18.335 |    27.93 |   18.881 |     6.78 |
|   512 |    128 |   3584 |   16.302 |    31.41 |   19.126 |     6.69 |
|   512 |    128 |   4096 |   16.513 |    31.01 |   17.184 |     7.45 |
|   512 |    128 |   4608 |   18.861 |    27.15 |   19.571 |     6.54 |
|   512 |    128 |   5120 |   14.819 |    34.55 |   17.573 |     7.28 |
|   512 |    128 |   5632 |   23.243 |    22.03 |   17.835 |     7.18 |
|   512 |    128 |   6144 |   21.390 |    23.94 |   20.199 |     6.34 |
|   512 |    128 |   6656 |   19.496 |    26.26 |   20.759 |     6.17 |
|   512 |    128 |   7168 |   23.470 |    21.81 |   20.703 |     6.18 |
|   512 |    128 |   7680 |   15.408 |    33.23 |   24.935 |     5.13 |
|   512 |    128 |   8192 |   19.404 |    26.39 |   24.846 |     5.15 |
|   512 |    128 |   8704 |   23.759 |    21.55 |   20.405 |     6.27 |
|   512 |    128 |   9216 |   13.574 |    37.72 |   23.427 |     5.46 |
|   512 |    128 |   9728 |   32.420 |    15.79 |   19.000 |     6.74 |
|   512 |    128 |  10240 |   17.697 |    28.93 |   22.863 |     5.60 |
|   512 |    128 |  10752 |   24.136 |    21.21 |   19.176 |     6.68 |
|   512 |    128 |  11264 |   28.275 |    18.11 |   21.306 |     6.01 |
|   512 |    128 |  11776 |   26.229 |    19.52 |   20.842 |     6.14 |
|   512 |    128 |  12288 |   16.104 |    31.79 |   22.811 |     5.61 |
|   512 |    128 |  12800 |   12.099 |    42.32 |   25.267 |     5.07 |
|   512 |    128 |  13312 |   18.329 |    27.93 |   21.132 |     6.06 |
|   512 |    128 |  13824 |   22.779 |    22.48 |   21.731 |     5.89 |
|   512 |    128 |  14336 |   17.037 |    30.05 |   19.211 |     6.66 |
|   512 |    128 |  14848 |   18.296 |    27.98 |   18.788 |     6.81 |
|   512 |    128 |  15360 |   16.518 |    31.00 |   21.152 |     6.05 |
|   512 |    128 |  15872 |   16.562 |    30.91 |   21.052 |     6.08 |

Neat! What else can I offload? That made a big difference and used almost no memory.

[edit] Actually, I am quite confused by that. The memory usage is basically no different from the first run. Only CUDA6 changed (3260.25 → 3177.00MiB), which is strange considering I told that one to load 11 ffn* tensors vs 10 for all the others.

Ok… Looking at the overrides: at the beginning I only saw overrides assigning exps to the CPU, and now I see overrides assigning ffn* to GPUs and exps to the CPU, so maybe the only difference now is the order in which tensors are offloaded to the GPUs?

[edit edit] I see, CUDA0 gets 9 block matches, CUDA6 gets 11, the rest get 10, and then the KV buffer is 324 MiB for all but CUDA6.

Probably; I’ve never looked into it that far, but I know it is the common solution for the big-compute-buffers-on-multi-GPU issue.

Are your new regexes working correctly? I’d be a bit surprised if you’re fitting ~10 ffn layers onto a 16GB VRAM GPU and still have room to spare. It should print out the tensor overrides on startup; make sure it says what layers are going where and that it matches what you expect your regexes to be doing. You might need to remove the extra (…) “capture groups”? Just spitballing without knowing more from your logs.
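
For example, something along these lines to summarize where things actually landed (a sketch; pipe whatever launch command you’re using):

    # Count how many tensors were overridden onto each device, per the startup log.
    ./llama-sweep-bench <your full command> 2>&1 \
      | grep "buffer type overriden" \
      | awk '{print $NF}' | sort | uniq -c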

I keep it simple to start out:

    -ngl 99 \
    -ot "blk\.(1|2)\.ffn_.*=CUDA0" \
    -ot "blk\.(3|4)\.ffn_.*=CUDA1" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA2" \
    -ot "blk\.(7|8)\.ffn_.*=CUDA3" \
    -ot "blk\.(9|10)\.ffn_.*=CUDA4" \
    -ot "blk\.(11|12)\.ffn_.*=CUDA5" \
    -ot "blk\.(13|14)\.ffn_.*=CUDA6" \
    -ot exps=CPU \

Order matters, yes: the first (top) entries match first, and the last entry is applied last. So in plain English you’re saying:

  1. Offload everything onto GPU -ngl 999
  2. Specifically offload blk.1.ffn_(gate|down|up)_exps.weight onto CUDA0 …
  3. finally any remaining *.ffn_(gate|down|up)_exps.weight to CPU RAM

Kinda like that.

It makes more sense to look inside the model card at the actual tensor/layer names to understand the regexes, e.g. ubergarm/Kimi-K2-Instruct-GGUF · Hugging Face
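
If you’d rather inspect the tensor names locally, here is a sketch assuming the gguf Python package (which ships a gguf-dump helper) is installed:

    # List the ffn-related tensor names from the first shard so you can see the
    # exact strings your -ot regexes are matched against.
    pip install gguf
    gguf-dump ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf \
      | grep -E "ffn_(gate|down|up)_(exps|shexp)"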

My new v0.2 Kimi-K2 quants likely have better perplexity than many of the unsloth ones too, as shown in the model card graph.

Ah, I should have shown the output of the overrides. Here:

Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0

Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.15.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.15.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.15.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA1

Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA1

Okay, that’s a lot to copy and paste so I’ll leave out the rest, but they all follow the same pattern. That seems right given the regex, doesn’t it? The only things still on CPU are “exps”, which is basically the same as -ngl 99 -ot exps=CPU, aside from perhaps forcing the ordering / arrangement. Maybe I should leave that and try to place some of the exps tensors onto GPUs?

The goal is to add additional exps layers onto CUDA, yes. I had thought the overrides I provided with ffn_ etc. would do that, but you get the idea. Yes, fix up your -ot’s and offload as many exps onto each CUDA device as you can without OOMing.

I made a very silly mistake. I had an extra -ot "exps=CPU" before all the others. Sorry! I will have some proper results in a moment, hold on.

% ./llama-sweep-bench -t 32 -ngl 99 -c 32768 --host 0.0.0.0 --model ~/llm-models/unsloth_Kimi-K2-Instruct-GGUF/IQ4_XS/Kimi-K2-Instruct-IQ4_XS-00001-of-00012.gguf -a Kimi-K2 -mla 3 -fa -fmoe \
    -ot "blk\.(1)\.ffn_.*=CUDA0" \
    -ot "blk\.(2)\.ffn_.*=CUDA1" \
    -ot "blk\.(3)\.ffn_.*=CUDA2" \
    -ot "blk\.(4)\.ffn_.*=CUDA3" \
    -ot "blk\.(5)\.ffn_.*=CUDA4" \
    -ot "blk\.(6)\.ffn_.*=CUDA5" \
    -ot "blk\.(7)\.ffn_.*=CUDA6" \
    -ot "exps=CPU"

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    8.105 |    63.17 |   14.210 |     9.01 |
|   512 |    128 |    512 |    8.052 |    63.58 |   14.214 |     9.01 |
|   512 |    128 |   1024 |    8.197 |    62.46 |   14.297 |     8.95 |
|   512 |    128 |   1536 |    8.482 |    60.37 |   14.488 |     8.83 |
|   512 |    128 |   2048 |    8.539 |    59.96 |   14.370 |     8.91 |
|   512 |    128 |   2560 |    8.531 |    60.01 |   14.523 |     8.81 |
|   512 |    128 |   3072 |    8.775 |    58.35 |   14.775 |     8.66 |
|   512 |    128 |   3584 |    8.880 |    57.66 |   14.800 |     8.65 |
|   512 |    128 |   4096 |    8.958 |    57.16 |   14.886 |     8.60 |
|   512 |    128 |   4608 |    8.954 |    57.18 |   15.014 |     8.53 |
|   512 |    128 |   5120 |    9.126 |    56.10 |   15.277 |     8.38 |
|   512 |    128 |   6144 |    9.364 |    54.68 |   15.610 |     8.20 |
|   512 |    128 |   6656 |    9.517 |    53.80 |   15.770 |     8.12 |

Fantastic! Thank you so much for your help, and sorry again for my mistake. Results are still coming in, but this looks great so far. Each card has a little over 13GB offloaded, and the machine is pulling about 975W peak from the wall while running this.

3 Likes

Yes, you are absolutely right! I am sorry for being annoying and stuff. Unfortunately it’s a part of my character that is extremely hard to regulate.

Thanks for letting me know! Not sure why I started to perceive GitHub as a Quake 3 multiplayer arena. :smiley:

Ha. You seem to be reading Viktor Pelevin or something? :slight_smile:

The exact passage, spoken by the fox-spirit A Hu-Li to the werewolf Alexander, is:

  "Imagine that your mind is a diamond with many facets. And each facet is a thought. Reality is the same way. It’s a diamond. And every thought is a facet of this diamond. To see the whole thing, you have to turn it in your hand and examine each facet, one after the other. Then you’ll see that the diamond is shining. That’s how you have to look at reality, Alexander. That’s what diamond consciousness is."
2 Likes

Hello all,

For those familiar with the subject: am I correct to assume that the NUMA domain challenges haven’t been resolved yet, thus making the Threadripper Pro platform a much better price/performance choice compared to a big Epyc or Xeon build?

I ask because, from the few speed numbers I could gather, I don’t see much improvement for Epyc over Threadripper Pro, and I was pretty underwhelmed by the experiment Wendell did on that big Intel 6900P (even with the biggest CPU and fastest RAM).

I feel like we have been stuck on that software challenge for months, and I don’t expect it to change in the near future.

What do y’all think about that? Or am I completely wrong and those server platforms are faster?

2 Likes

@magikRUKKOLA

Haha interesting, thanks for the reference on Viktor Pelevin - he also apparently wrote one titled “Buddha’s Little Finger”, which would make sense. I believe we’re both making allusions to variations on the Diamond Sutra:

As a lamp, a cataract, a star in space
an illusion, a dewdrop, a bubble
a dream, a cloud, a flash of lightning
view all created things like this
-Red Pine (Bill Porter) Diamond Sutra Chapter 32

The facets of a jewel come up in other texts. All lovely fingers pointing at the same moon. :point_right: :new_moon_with_face: Aldous Huxley’s book “The Perennial Philosophy” is a fairly early (by Western standards) discussion addressing these universals.

And regarding ik_llama.cpp and quants, I appreciate all your energy and efforts computing and graphing perplexities! It is cool to see so many quants available now and get a better feel for the trade-offs of model size, perplexity, and inference speeds!

@pure_syn

There are a number of projects out there with attempted solutions. For example, some inference engines (like ktransformers) and some modified llama.cpp forks I’ve tried essentially load the weights into RAM N times, once for each of the N NUMA nodes (usually 2). The idea is to keep a copy close to the CPU cores that will use it, but at the cost of requiring double the RAM. I call this approach “data-parallel”.

No, there are no multi-NUMA “tensor-parallel” solutions that I’m aware of yet, at least not with GGUF-style hybrid CPU+GPU inferencing.
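
For completeness, the usual launch-time stop-gaps look something like this (a sketch only; llama.cpp and the ik fork expose a --numa option and numactl is the generic fallback, but the right choice depends on your BIOS NPS setting and whether the model fits in one node’s RAM):

    # Let the runtime spread threads/memory across NUMA nodes
    ./llama-server --model /path/to/model.gguf --numa distribute

    # Or interleave allocations across nodes with numactl
    numactl --interleave=all ./llama-server --model /path/to/model.gguf

    # Or pin everything to a single node if the model fits in that node's RAM
    numactl --cpunodebind=0 --membind=0 ./llama-server --model /path/to/model.gguf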

You can go full-VRAM, GPU-only with vLLM or sglang and use tensor parallelism, though, for very fast speeds on small models or on big multi-GPU nodes, etc…

I haven’t done a comparison of single-socket Epyc vs Threadripper Pro systems to see which one can pack the most RAM bandwidth into a single NUMA node at the lowest price, though.

What size models are you interested in running and what kind of prompt processing and token generation speeds would your application need?

2 Likes

TR Pro will never win on price, that’s for sure. That said, if you’re using it as a workstation, it’s an infinitely nicer experience compared to Epyc: motherboards are available with sound, USB4 (and plenty of USB in general), overclocking options, M.2 slots, etc., everything you’d like from a desktop platform, all while having 8 DDR5 channels and 6 x16 slots. On the other hand, you can get Epyc much cheaper on a single-socket system with 12 channels.

For reference, my TR Pro 7995WX shows ~400 GB/s memory bandwidth with 8x DDR5-6800.
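
If anyone wants to compare numbers across platforms, a rough way to check is something like the following (a sketch; sysbench’s memory test is only indicative, and STREAM or Intel MLC give more rigorous figures):

    # Rough aggregate read-bandwidth estimate across all cores
    sysbench memory --threads=$(nproc) --memory-block-size=1M \
      --memory-total-size=100G --memory-oper=read run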

1 Like

Oh, there’s some interesting discussion on multi-NUMA and Epyc systems and Zen5 benefits going on with ik over here in a new discussion: