Using that cache-flush method, I get 2-4 t/s with 64 threads on Rome at an 8k context, which is very usable. Now I’m curious whether my faster CPU will make any difference, or whether I’m memory-bandwidth-capped at this point.
./build/bin/llama-cli \
--model ./models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
--threads 64 \
--numa distribute \
--interactive \
--color \
--ctx-size 8192
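
For the memory-bandwidth question, a rough back-of-the-envelope check is to divide memory bandwidth by the bytes of weights read per generated token. The sketch below does that; every number in it is an assumption for illustration, not a measurement from this machine: 204.8 GB/s is the theoretical peak of 8 channels of DDR4-3200 on one socket, 37 is the approximate count of active parameters (in billions) per token for DeepSeek-R1's MoE, and 2.5 bits per weight is a guess at the average for this quant. `est_tps` is a hypothetical helper name, and the estimate ignores KV-cache reads, NUMA effects, and compute time.

```shell
#!/bin/sh
# Rough upper bound on decode speed, assuming each active weight is read
# from RAM once per token. All inputs are illustrative assumptions.

# est_tps BANDWIDTH_GB_S ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
est_tps() {
    awk -v bw="$1" -v params="$2" -v bits="$3" 'BEGIN {
        gb_per_token = params * bits / 8    # GB of weights read per token
        printf "%.1f\n", bw / gb_per_token  # tokens per second upper bound
    }'
}

# Theoretical peak for one socket (assumed 8 x DDR4-3200)
est_tps 204.8 37 2.5   # prints 17.7

# A lower, assumed "achieved" bandwidth for comparison
est_tps 100 37 2.5     # prints 8.6
```

If the measured t/s is close to the bound computed from bandwidth you can actually achieve, a faster CPU probably won't help much; if it is far below, something else (NUMA placement, thread scaling, compute) is more likely the limit.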