DeepSeek Deep Dive R1 at Home!

Intro

There’s a lot of interest in running a version of the full DeepSeek R1 671B LLM at home on high-end gaming rigs, workstations, and moderate home lab setups. This thread is a deep dive dump of a couple weeks of experimentation, including benchmarks, test metrics, and some thoughts on current limitations and potential future improvements.

Key Findings

The key findings here imo are as follows.

Unless you have enough GPUs to fit the entire model in VRAM (e.g. 256 to 768GB lol), the best bang for your buck today is enough fast RAM to hold the entire model.

I originally had hoped that a quad PCIe Gen 5 NVMe RAID0 striped array could provide enough IOPS to feed the beast. However in lab testing, there seems to be a bottleneck given how the current llama.cpp@90e4dba4 uses buffered disk access.

Also, I would advise against setting up a large Linux swap file. In my experiments it technically works, but it was much slower, and because it writes page-outs to your NVMe drive it will eat into the drive’s write-cycle lifetime. The default llama.cpp mmap() implementation is READ ONLY so it doesn’t kill your SSDs (it may warm them up though, hah).
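If you want to double check that swap isn’t silently in play during a run, the standard util-linux tools will show and disable it:

$ swapon --show    # list any active swap devices/files
$ sudo swapoff -a  # temporarily disable all swap (swapon -a to restore)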

In my lab tests, a disk array with fast read IOPS does not provide significantly better performance than a single Gen 5 x4 NVMe drive today.

While O_DIRECT unbuffered reads are blazing fast on this kind of disk setup, that doesn’t yet translate into actual speed-ups for prompt processing or token generation, unfortunately.

I spent a lot of time on this and have metrics and stats below covering my attempts at tuning the Linux kernel page cache, Transparent Huge Pages, block readahead, etc. Possibly in the future some kind of unbuffered i/o software implementation could unlock this potential speed.
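For reference, these are the kinds of knobs involved. A quick sketch (the values are illustrative, not recommendations):

# check / set transparent huge pages
$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# check / bump block device readahead (value is in 512-byte sectors)
$ sudo blockdev --getra /dev/md0
$ sudo blockdev --setra 8192 /dev/md0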

With a big MoE model like DeepSeek-R1, adding a second GPU might not buy you as much speed as you’d hope.

Sure, VRAM can have over 1TB/s of bandwidth, but an extra 24GB or even 48GB is not enough to significantly move the needle given the size of this model. Your extra GPU will sit at under ~20% utilization while it waits on the slower CPU RAM computations.

That said,

A single GPU will likely speed up prompt processing depending on context length, but even a no-GPU setup may be viable.

Benchmarks

Okay, you made it this far, so here are some pretty graphs!

I ran some llama-bench tests in different configurations on two different rigs. Big thanks to @wendell and the whole l1t crew for all the support and hardware wizardry to run many of these tests!

pp512

pp512 is how many tokens/second are processed on a 512 token prompt. For English, one word averages very roughly 1.5 tokens or so.

tg128

tg128 is how many tokens/second are generated on a 128 token output. This is a very short length, so these numbers are likely higher than what you’d see in real world use, especially on a reasoning model which emits a lot of <think></think> tokens before actually answering.

Methodology

If you want to run this on your rig, here is a quick guide.

Note

If this is your very first time running an LLM at home, check out the recent Level1Techs YT video discussing LM Studio, ollama, etc. I’d recommend trying an 8B model first just to get a feel for loading models, selecting context lengths, and figuring out GPU layer offloads; inferencing speeds are also much faster on small models.
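For example, with ollama installed, a one-liner like this pulls and runs a small model to practice on (the model tag is just an example):

$ ollama run llama3.1:8b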

1. Download Model

All the above benchmarks were run on unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL. The UD-Q2_K_XL model is a 2.51bpw (bits per weight) quantization. Basically, GGUF is a file format and quantization scheme that allows both CPUs and GPUs to process the weights. A 671B parameter model @ 2.51bpw comes out to about a 212GB file. Daniel and Michael (the unsloth brothers) released this interesting version of R1 which selectively quantizes different weights depending on how important they are. Check out their blog for more info, and their Hugging Face repo has other quant sizes if you have a big dual Epyc with 768GB+ RAM haha…
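One way to grab just that quant is with the huggingface_hub CLI. A sketch (point --local-dir wherever you keep models):

$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli download unsloth/DeepSeek-R1-GGUF \
      --include "DeepSeek-R1-UD-Q2_K_XL/*" \
      --local-dir /mnt/ai/models/unsloth/DeepSeek-R1-GGUF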

2. Compile llama.cpp

The bleeding edge patches tend to land on the upstream llama.cpp repo first, then trickle down into other projects. If you do have enough GPUs to hold the entire model in VRAM, there are other, likely better, options than llama.cpp and GGUF quants, but that is outside the scope of this post.

I assume you are running on Linux or possibly WSL.

# Download the latest llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Option 1: configure build for CPU backend only
cmake -B build -DGGML_CUDA=OFF

# *OR*

# Option 2: configure build with CUDA backend enabled
# (GGML_CUDA_FA_ALL_QUANTS compiles flash attention kernels for all kv-cache quant types)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

# note: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is a *runtime* environment variable,
# set it when launching to allow spilling from VRAM into system RAM

# Compile it (you may have to `apt-get install build-essential` etc...)
cmake --build build --config Release -j $(nproc)

# Confirm the binaries built properly
./build/bin/llama-bench --help
...

3. Run llama-bench

The sweet spot for the number of threads seems to be equal to the number of physical CPU cores.

As the bottleneck is likely RAM (or disk) i/o, using more threads (running on SMT/hyperthreads) does not seem to improve performance and may actually slightly degrade it with extra context switching.

The story is more confusing on dual Epyc systems with potentially multiple NUMA nodes and configurations, which is out of scope for this discussion.
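To find your physical core count (as opposed to SMT threads), lscpu does the trick:

$ lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket'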

# CPU only example with no layers offloaded to GPU
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --threads 24

It will run the pp512 benchmark first, then tg128, and output a table something like this.

| model                        |       size |   params | backend | ngl | threads | type_k |  test |          t/s |
| ---------------------------- | ---------: | -------: | ------- | --: | ------: | -----: | ----: | -----------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |   0 |      24 |   q4_0 | pp512 | 27.57 ± 0.05 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |   0 |      24 |   q4_0 | tg128 |  8.44 ± 0.01 |

To test with GPU offload, you can experimentally increase -ngl or --n-gpu-layers until it OOMs and crashes. Then back off by one and that is your max offload. Keep in mind max offload will likely be slightly less for actual usage with 8k+ context lengths.
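A quick and dirty sweep might look like this (a sketch; the layer counts are arbitrary examples, adjust for your VRAM):

# bump -ngl until llama-bench OOMs, then back off by one
for ngl in 8 12 16 20 24 28; do
    ./build/bin/llama-bench \
        --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
        --n-gpu-layers $ngl \
        --threads 24 || break
done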

Also, as of today, llama.cpp does not support flash attention (-fa) for R1 models. After that gets merged, it will be possible to get a more compressed kv-cache for longer context in less VRAM.
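Once it lands, I’d expect the invocation to look something like this (hypothetical until support is actually merged):

# hypothetical: flash attention plus a fully quantized kv-cache
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    -fa 1 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --threads 24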

Metrics

Okay, some screenshots of nvitop showing what it looks like offloading 23 layers on the 24-core Threadripper with 2x RTX 6000s, loading up almost 96GB of VRAM.

pp512

Notice how for prompt processing only a single GPU is utilized, but all the VRAM is holding weights.

tg128

Notice how both GPUs sit at rather low (<15%) utilization, as they are mostly waiting on the CPU layers to finish given the massive size of this model.

Conclusion

This is the end of the higher-level stuff. I’ll dump some of my lower-level findings and more technical metrics in another post to this thread below, including fio and Linux kernel probe testing of i/o size distributions, page cache graphs, and other fun geeky things.


This is awesome info, thanks! I can’t remember from your other thread whether you tried to run the version of the 671b model from the ollama model library, or if it was too big to attempt to run from NVMe.


video soon! :smiley:

The “CPU Only” 2-processor system mentioned in the video ran in NPS0 or NPS1 mode (depending on what we were doing):

AMD EPYC 9684X 96-Core Processor

The monster cache on these CPUs seemed to help, but 96 cores didn’t do much for us. I think 16-32 cores per socket would be plenty for this type of experimentation. You do want all the memory channels and bandwidth. Fortunately AMD has certain Epyc CPUs where each chiplet has dual connections to the I/O die; those work great for this kind of AI experiment and are quite cost effective.

Links

Intel Memory Latency Checker output from the monster 2x EPYC Genoa system

Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0       1
       0         110.3   204.2
       1         200.3   110.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      732468.7
3:1 Reads-Writes :      696304.6
2:1 Reads-Writes :      657870.2
1:1 Reads-Writes :      669829.8
Stream-triad like:      716756.2

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0        371543.3        156375.0
       1        157367.2        367363.0

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  852.46   731760.8
 00002  852.36   731662.8
 00008  845.55   732118.1
 00015  827.45   735158.6
 00050  815.27   737682.1
 00100  815.39   737547.1
 00200  217.25   644552.3
 00300  173.79   448863.6
 00400  166.18   336215.9
 00500  160.49   273733.7
 00700  158.85   199471.9
 01000  154.20   141884.6
 01300  153.17   110138.3
 01700  151.85    84840.7
 02500  142.95    58221.3
 03500  143.28    41865.4
 05000  143.03    29518.8
 09000  143.58    16642.8
 20000  143.47     7749.8


Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        26.1
Local Socket L2->L2 HITM latency        26.3
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   207.8
            1    206.8       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   209.5
            1    208.4       -

As this system had the “minimum” RAM to fit the model (768GB), it was not necessary to experiment with disk load times or swap-to-disk times.


Nitty Gritty

Okay the rest of this is a hodge podge of notes, benchmarks, and some graphs collected during this testing.

$ cowsay -s "this is exciting"
 __________________
< this is exciting >
 ------------------
        \   ^__^
         \  (**)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Hardware Configuration

I did most of the testing on an AMD Ryzen Threadripper PRO 7965WX 24-Core with 8x32GB DDR5 Kingston KF560R32-32 DIMMs for a total of 256GB RAM. It also has 4x Crucial CT4000T705SSD3 4TB PCIe Gen 5 NVMe drives configured with mdadm into a striped RAID0 array for maximum read performance.

Baseline Benchmarks

I used some publicly available benchmarking tools to get a feel for performance of the various hardware components.

RAM

Grab Intel’s Memory Latency Checker (mlc), basically a Linux version of AIDA64.

This test shows about 225-250GB/s aggregate RAM bandwidth and ~95ns idle latency.

$ echo 4000 | sudo tee /proc/sys/vm/nr_hugepages
$ sudo ./Linux/mlc | tee -a output.log

Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          95.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      224438.0
3:1 Reads-Writes :      254061.5
2:1 Reads-Writes :      248663.7
1:1 Reads-Writes :      227839.2
Stream-triad like:      250344.9

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        221419.9

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  412.79   222680.1
 00002  417.48   222670.9
 00008  429.14   222643.9
 00015  429.00   222613.8
 00050  430.61   222542.4
 00100  426.57   222426.4
 00200  113.76   138084.6
 00300  111.73    92931.0
 00400  110.78    68318.5
 00500  110.33    55330.5
 00700  109.71    39991.5
 01000  109.29    28329.2
 01300  105.82    22024.3
 01700  104.64    17036.3
 02500  104.34    11811.4
 03500  104.20     8626.4
 05000  104.12     6229.8
 09000  103.99     3739.3
 20000  103.91     2023.0

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        18.3
Local Socket L2->L2 HITM latency        18.3

Next I grabbed Google’s multichase and compiled it for another look at RAM latency, which suggests about 90ns.

$ ./multichase -m 1g -n 20
89.694

NVMe RAID0 Array

Here is how I set up the RAID0 array and ran some basic fio tests to get a feel for out-of-the-box unoptimized performance on a normal ext4 filesystem.

There may be some limited gains to be had using other filesystems, changing the chunk size, and tuning other knobs. But without an O_DIRECT unbuffered i/o implementation in llama.cpp, the flood of page faults incurred when running an LLM too big to fit into RAM will still be the bottleneck. This is why, for now, you should get more RAM not more SSDs…
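For reference, here is how to issue unbuffered vs buffered reads from the shell with dd (a rough sketch, and sequential only, so it won’t reproduce the random-access mmap() pattern):

# unbuffered O_DIRECT sequential read straight off the array
$ sudo dd if=/dev/md0 of=/dev/null bs=1M count=16384 iflag=direct

# buffered read through the page cache for comparison (drop caches first)
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ sudo dd if=/dev/md0 of=/dev/null bs=1M count=16384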

$ mdadm --version
mdadm - v4.3 - 2024-02-15 - Ubuntu 4.3-1ubuntu2.1

# Good reference guide
# https://www.jeffgeerling.com/blog/2021/htgwa-create-raid-array-linux-mdadm

$ sudo wipefs --all --force /dev/nvme0n1
$ sudo wipefs --all --force /dev/nvme1n1
$ sudo wipefs --all --force /dev/nvme2n1
$ sudo wipefs --all --force /dev/nvme3n1

# partition_number:partition_start:partition_end
$ sudo sgdisk -n 1:0:0 /dev/nvme0n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme1n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme2n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme3n1
Creating new GPT entries in memory.
The operation has completed successfully.

$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      15627540480 blocks super 1.2 512k chunks

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Feb  7 16:13:27 2025
        Raid Level : raid0
        Array Size : 15627540480 (14.55 TiB 16.00 TB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Feb  7 16:13:27 2025
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : wTR24:0  (local to host wTR24)
              UUID : 5416209d:671789ee:296e46b5:bfa09730
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme0n1p1
       1     259        9        1      active sync   /dev/nvme1n1p1
       2     259       10        2      active sync   /dev/nvme2n1p1
       3     259       11        3      active sync   /dev/nvme3n1p1

$ sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md0
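Then mount it; the mount point below matches the model path used later in this post, and noatime is just my habit for read-heavy workloads:

$ sudo mkdir -p /mnt/raid
$ sudo mount /dev/md0 /mnt/raid

# optionally persist across reboots
$ echo '/dev/md0 /mnt/raid ext4 defaults,noatime 0 0' | sudo tee -a /etc/fstab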

fio random read IOPS

Almost 18GB/sec

# non destructive unbuffered random read 4k IOPS benchmark
$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=randread \
      --direct=1 \
      --bs=4k \
      --ioengine=libaio \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=IOPS_4k \
      --iodepth=32

IOPS_4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [r(24)][100.0%][r=16.7GiB/s][r=4374k IOPS][eta 00m:00s]
IOPS_4k: (groupid=0, jobs=24): err= 0: pid=559472: Sun Feb  9 18:36:39 2025
  read: IOPS=4352k, BW=16.6GiB/s (17.8GB/s)(996GiB/60001msec)
    slat (usec): min=2, max=3964, avg= 3.54, stdev= 2.95
    clat (nsec): min=580, max=4484.0k, avg=172308.91, stdev=108153.11
     lat (usec): min=10, max=4488, avg=175.85, stdev=108.20
    clat percentiles (usec):
     |  1.00th=[   21],  5.00th=[   41], 10.00th=[   59], 20.00th=[   83],
     | 30.00th=[  101], 40.00th=[  122], 50.00th=[  147], 60.00th=[  176],
     | 70.00th=[  212], 80.00th=[  262], 90.00th=[  330], 95.00th=[  375],
     | 99.00th=[  461], 99.50th=[  523], 99.90th=[  725], 99.95th=[  848],
     | 99.99th=[ 1045]
   bw (  MiB/s): min=15657, max=17387, per=100.00%, avg=17025.63, stdev= 8.68, samples=2856
   iops        : min=4008240, max=4451242, avg=4358560.48, stdev=2220.89, samples=2856
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.85%, 50=6.81%
  lat (usec)   : 100=21.64%, 250=48.59%, 500=21.47%, 750=0.54%, 1000=0.08%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=18.33%, sys=74.16%, ctx=16667145, majf=0, minf=1075
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=261127526,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=16.6GiB/s (17.8GB/s), 16.6GiB/s-16.6GiB/s (17.8GB/s-17.8GB/s), io=996GiB (1070GB), run=60001-60001msec

Disk stats (read/write):
    md0: ios=261123443/0, sectors=2088987544/0, merge=0/0, ticks=39912514/0, in_queue=39912514, util=99.40%, aggrios=65281881/20, aggsectors=522255054/0, aggrmerge=0/0, aggrticks=9920376/11, aggrin_queue=9920399, aggrutil=90.17%
  nvme3n1: ios=65290024/20, sectors=522320192/0, merge=0/0, ticks=10342006/11, in_queue=10342028, util=89.91%
  nvme0n1: ios=65278033/20, sectors=522224264/0, merge=0/0, ticks=8944765/12, in_queue=8944789, util=89.17%
  nvme1n1: ios=65275278/20, sectors=522202224/0, merge=0/0, ticks=9868045/12, in_queue=9868068, util=90.17%
  nvme2n1: ios=65284192/20, sectors=522273536/0, merge=0/0, ticks=10526690/11, in_queue=10526713, util=89.80%

fio sequential read

Over 40GB/s

# non destructive unbuffered sequential read 1M block size benchmark
$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=read \
      --direct=1 \
      --bs=1M \
      --ioengine=libaio \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=SEQREAD_1M \
      --iodepth=32

SEQREAD_1M: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [R(24)][100.0%][r=36.2GiB/s][r=37.1k IOPS][eta 00m:00s]
SEQREAD_1M: (groupid=0, jobs=24): err= 0: pid=573183: Sun Feb  9 19:02:22 2025
  read: IOPS=38.7k, BW=37.8GiB/s (40.6GB/s)(2268GiB/60036msec)
    slat (usec): min=56, max=4832, avg=67.55, stdev=18.88
    clat (usec): min=115, max=150892, avg=19779.55, stdev=16087.13
     lat (usec): min=178, max=150957, avg=19847.10, stdev=16087.35
    clat percentiles (usec):
     |  1.00th=[   676],  5.00th=[  1532], 10.00th=[  1991], 20.00th=[  3359],
     | 30.00th=[  6456], 40.00th=[ 11207], 50.00th=[ 16581], 60.00th=[ 22414],
     | 70.00th=[ 30016], 80.00th=[ 35914], 90.00th=[ 40633], 95.00th=[ 45351],
     | 99.00th=[ 62653], 99.50th=[ 72877], 99.90th=[ 93848], 99.95th=[100140],
     | 99.99th=[122160]
   bw (  MiB/s): min=27217, max=48975, per=100.00%, avg=38728.43, stdev=161.84, samples=2856
   iops        : min=27215, max=48975, avg=38727.16, stdev=161.86, samples=2856
  lat (usec)   : 250=0.05%, 500=0.45%, 750=0.69%, 1000=0.77%
  lat (msec)   : 2=8.10%, 4=12.60%, 10=14.89%, 20=18.32%, 50=41.29%
  lat (msec)   : 100=2.80%, 250=0.05%
  cpu          : usr=0.16%, sys=12.75%, ctx=2147099, majf=0, minf=196995
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2321962,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=37.8GiB/s (40.6GB/s), 37.8GiB/s-37.8GiB/s (40.6GB/s-40.6GB/s), io=2268GiB (2435GB), run=60036-60036msec

Disk stats (read/write):
    md0: ios=18531889/0, sectors=4744163584/0, merge=0/0, ticks=256192726/0, in_queue=256192726, util=100.00%, aggrios=4643924/2, aggsectors=1188844546/0, aggrmerge=0/0, aggrticks=64180507/2, aggrin_queue=64180511, aggrutil=72.41%
  nvme3n1: ios=4643904/2, sectors=1188839424/0, merge=0/0, ticks=134953567/2, in_queue=134953572, util=70.58%
  nvme0n1: ios=4643945/2, sectors=1188849672/0, merge=0/0, ticks=25059323/2, in_queue=25059327, util=72.41%
  nvme1n1: ios=4643944/2, sectors=1188849664/0, merge=0/0, ticks=27630208/2, in_queue=27630212, util=71.25%
  nvme2n1: ios=4643904/2, sectors=1188839424/0, merge=0/0, ticks=69078931/2, in_queue=69078936, util=70.02%

fio buffered llama.cpp mmap() simulation

About 1.5GB/s

This benchmark simulates running llama.cpp when the model is too big to fit into your RAM + VRAM. The magic of mmap() is what makes this possible at all, by juggling data through the Linux kernel page cache.

This buffered i/o bottleneck is why for now you want to keep as much of the model in RAM + VRAM.

I chose 64k blocks to roughly simulate the actual block sizes measured during llama.cpp inferencing. Actual llama.cpp does a little better than this given its access pattern isn’t 100% random.

Notes on how I measured that further down the rabbit hole…

$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=randread \
      --direct=0 \
      --bs=64k \
      --ioengine=mmap \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=IOPS_64k_BUFFERED_MMAP \
      --iodepth=1
IOPS_64k_BUFFERED_MMAP: (g=0): rw=randread, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=mmap, iodepth=1
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [r(24)][100.0%][r=1461MiB/s][r=23.4k IOPS][eta 00m:00s]
IOPS_64k_BUFFERED_MMAP: (groupid=0, jobs=24): err= 0: pid=624772: Thu Feb 13 12:37:01 2025
  read: IOPS=21.5k, BW=1346MiB/s (1411MB/s)(78.9GiB/60001msec)
    clat (usec): min=7, max=166280, avg=1113.75, stdev=1271.13
     lat (usec): min=7, max=166280, avg=1113.78, stdev=1271.13
    clat percentiles (usec):
     |  1.00th=[  285],  5.00th=[  306], 10.00th=[  318], 20.00th=[  334],
     | 30.00th=[  375], 40.00th=[  570], 50.00th=[  783], 60.00th=[  979],
     | 70.00th=[ 1254], 80.00th=[ 1663], 90.00th=[ 2409], 95.00th=[ 3195],
     | 99.00th=[ 5080], 99.50th=[ 5735], 99.90th=[ 7046], 99.95th=[ 7898],
     | 99.99th=[30278]
   bw (  MiB/s): min=  619, max= 3834, per=100.00%, avg=1346.99, stdev=22.40, samples=2856
   iops        : min= 9914, max=61352, avg=21551.55, stdev=358.47, samples=2856
  lat (usec)   : 10=0.01%, 20=0.11%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=36.44%, 750=11.93%, 1000=12.21%
  lat (msec)   : 2=24.91%, 4=11.76%, 10=2.60%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.86%, sys=80.94%, ctx=20656649, majf=20647760, minf=4009
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1292171,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1346MiB/s (1411MB/s), 1346MiB/s-1346MiB/s (1411MB/s-1411MB/s), io=78.9GiB (84.7GB), run=60001-60001msec

Disk stats (read/write):
    md0: ios=20593564/0, sectors=164748512/0, merge=0/0, ticks=311541/0, in_queue=311541, util=74.42%, aggrios=5161940/24, aggsectors=41295522/0, aggrmerge=0/0, aggrticks=67192/23, aggrin_queue=67239, aggrutil=28.65%
  nvme3n1: ios=5165040/24, sectors=41320320/0, merge=0/0, ticks=64937/24, in_queue=64985, util=27.60%
  nvme0n1: ios=5160824/24, sectors=41286592/0, merge=0/0, ticks=74396/23, in_queue=74442, util=28.65%
  nvme1n1: ios=5155084/24, sectors=41240672/0, merge=0/0, ticks=64624/23, in_queue=64669, util=27.54%
  nvme2n1: ios=5166813/24, sectors=41334504/0, merge=0/0, ticks=64814/24, in_queue=64862, util=27.52%

sar

The Linux system activity report tool sar is a powerful tool for BBS sysops :wink: . You can use sar not only to print out useful stats, but also to capture logs to a data file. Then you can use sadf to convert the data file into an interactive .svg image with charts and graphs!

This graph is a time series chart. On the left is the beginning where llama.cpp starts up. As time goes on I’ve labeled the time periods for loading the model, prompt processing, and token generation until it stops.

This test uses a larger Q4_K_M quant weighing in at 377GiB, which is too big to fit into the 256GB of system RAM. So this test was designed to stress the buffered disk reads and figure out the performance bottleneck there.

After updating the Linux kernel from 6.8.0 to a fresh 6.13.0 and enabling transparent huge pages, performance improved a bit, as shown in this side-by-side. The left side is the same chart as above on 6.8.0, and the right side is the newer 6.13.0. While slightly better, it isn’t a significant improvement for actual token generation.

There are two .svg files as well if you want more metrics:

Here is how you could generate graphs like this for your own tests:

# install needed packages
$ sudo apt-get install sysstat || pacman -Sy extra/sysstat

# drop linux kernel page cache, cached entries, and inodes
# might take a minute if you have a lot of RAM and stuff cached
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# begin capturing 
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 1 -o sarlog.dat

# do whatever test in another terminal e.g. a llama-server inference test
# hit <ctrl>+c to stop capturing when done

# generate svg file using same arguments as sar above
$ sadf -g sarlog.dat -O showtoc -- -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL > output.svg

You don’t have to use sadf and generate an .svg; sar will print some useful stuff to the console while it is logging. The next section has some examples.

BPF Performance Tools

One of my sysadmin role models is Brendan Gregg, who popularized perf Flame Graphs as well as advanced eBPF Linux kernel probe tools. He has a couple of fat books available and a useful website as well. The tools are a bit confusing to install, but they can be a great way to observe how a userland application interacts with the Linux kernel subsystems.

You can measure things like block i/o size, disk latency, sequential vs random i/o, Off-CPU Analysis, cache stats, etc. The advantage of using BPF probes over normal perf is that they often have less overhead.

The rest of this mega post is a bunch of these tools measuring various things during llama-bench, for both prompt processing and token generation. You can see that each phase of inferencing has a somewhat different profile.

All these measurements were taken with Linux 6.13.0 on the 24 core Threadripper Pro running llama-bench on the Q4_K_M to focus on that buffered read i/o bottleneck.

./build/bin/llama-bench \
    --model /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --threads 24
| model                        |       size |   params | backend | threads | type_k |  test |         t/s |
| ---------------------------- | ---------: | -------: | ------- | ------: | -----: | ----: | ----------: |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU     |      24 |   q4_0 | pp512 | 2.32 ± 0.20 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU     |      24 |   q4_0 | tg128 | 1.12 ± 0.16 |

build: 90e4dba4 (4689)

Install libbpf-tools

# The package situation on Ubuntu is confusing
# https://github.com/iovisor/bcc/blob/master/INSTALL.md#ubuntu---binary
# https://github.com/iovisor/bcc/pull/4940#issuecomment-2645861929
sudo apt-get install libbpf-tools

# I'm using the plain libbpf-tools versions, e.g. /usr/sbin/biosnoop
$ apt-file search /usr/sbin/biosnoop
bpfcc-tools: /usr/sbin/biosnoop-bpfcc
bpftrace: /usr/sbin/biosnoop.bt
libbpf-tools: /usr/sbin/biosnoop

Methodology

Wait for the disk cache to fill up and everything to settle down, then run the tests twice, once during prompt processing and again during token generation. Confirm that CPU usage is around 50% with btop or similar. Then take each measurement and log it.

Repeat as necessary to collect all of the data.

Baseline

Get baseline disk i/o and page cache stats using a more common tool like sar.

Prompt Processing
# sample the disks/raid backing the model for 180 seconds 1 time
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 180 1

Linux 6.13.0-061300-generic (wTR24)     02/12/2025      _x86_64_        (48 CPU)

01:43:56 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
01:46:56 PM 1336677.96   3894.84  21891.61   2635.86 2371208.87 367155.06  22370.98 667337.86    171.32

01:43:56 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
01:46:56 PM        24      1835     24.72     17.01     11.06         0

01:43:56 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
01:46:56 PM   nvme5n1    283.17   4181.93   3894.84      0.00     28.52      2.31      8.15      2.05
01:46:56 PM   nvme1n1  19100.67 333147.36      0.00      0.00     17.44      0.72      0.04     14.20
01:46:56 PM   nvme0n1  18782.46 333234.49      0.00      0.00     17.74      0.99      0.05     14.97
01:46:56 PM   nvme2n1  19364.39 332987.67      0.00      0.00     17.20      0.74      0.04     14.07
01:46:56 PM   nvme3n1  19495.42 332798.36      0.00      0.00     17.07      0.75      0.04     13.96
01:46:56 PM       md0  76849.94 1332167.78      0.00      0.00     17.33      3.05      0.04     73.67

01:43:56 PM  %scpu-10  %scpu-60 %scpu-300     %scpu
01:46:56 PM      0.00      0.00      0.00      0.02

01:43:56 PM   %sio-10   %sio-60  %sio-300      %sio   %fio-10   %fio-60  %fio-300      %fio
01:46:56 PM      0.00      0.17      0.12      0.65      0.00      0.14      0.10      0.64

01:43:56 PM  %smem-10  %smem-60 %smem-300     %smem  %fmem-10  %fmem-60 %fmem-300     %fmem
01:46:56 PM      1.79      2.57      4.34      3.29      1.79      2.57      4.34      3.29
Token Generation
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 180 1

Linux 6.13.0-061300-generic (wTR24)     02/12/2025      _x86_64_        (48 CPU)

03:13:58 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:16:58 PM 1595231.85    188.96  24359.67   5846.29 1818449.61 412689.57  39633.53 798180.86    176.46

03:13:58 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
03:16:58 PM         1      1837     24.42     24.44     23.35        19

03:13:58 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
03:16:58 PM   nvme1n1  22941.29 398430.58      0.00      0.00     17.37      1.06      0.05     13.77
03:16:58 PM   nvme0n1  22536.70 398376.89      0.00      0.00     17.68      1.56      0.07     15.28
03:16:58 PM   nvme2n1  22890.26 398587.04      0.00      0.00     17.41      1.06      0.05     13.76
03:16:58 PM   nvme3n1  22838.09 398251.42      0.00      0.00     17.44      1.06      0.05     13.79
03:16:58 PM       md0  91373.23 1593645.27      0.00      0.00     17.44      4.50      0.05     65.69

03:13:58 PM  %scpu-10  %scpu-60 %scpu-300     %scpu
03:16:58 PM      0.00      0.00      0.00      0.05

03:13:58 PM   %sio-10   %sio-60  %sio-300      %sio   %fio-10   %fio-60  %fio-300      %fio
03:16:58 PM      0.00      0.17      0.54      0.90      0.00      0.17      0.54      0.89

03:13:58 PM  %smem-10  %smem-60 %smem-300     %smem  %fmem-10  %fmem-60 %fmem-300     %fmem
03:16:58 PM      4.69      3.63      3.45      4.79      4.69      3.63      3.45      4.79

Block i/o size

Summarize block device I/O size as a histogram.

Prompt Processing
$ sudo bitesize -c llama-bench 180 1

Process Name = llama-bench
     Kbytes              : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 6019517  |****************************************|
         8 -> 15         : 3953364  |**************************              |
        16 -> 31         : 1914791  |************                            |
        32 -> 63         : 1011460  |******                                  |
        64 -> 127        : 604708   |****                                    |
       128 -> 255        : 378287   |**                                      |
Token Generation
$ sudo bitesize -c llama-bench 180 1

Process Name = llama-bench
     Kbytes              : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 7742863  |****************************************|
         8 -> 15         : 4567127  |***********************                 |
        16 -> 31         : 1938310  |**********                              |
        32 -> 63         : 869145   |****                                    |
        64 -> 127        : 639664   |***                                     |
       128 -> 255        : 626360   |***                                     |

Block i/o latency

Summarize block device I/O latency as a histogram.

Prompt Processing
$ sudo biolatency -T -D 180 1

disk = nvme0n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 870744   |***************************             |
         8 -> 15         : 1250919  |****************************************|
        16 -> 31         : 192358   |******                                  |
        32 -> 63         : 144834   |****                                    |
        64 -> 127        : 527912   |****************                        |
       128 -> 255        : 328101   |**********                              |
       256 -> 511        : 77128    |**                                      |
       512 -> 1023       : 4002     |                                        |
      1024 -> 2047       : 14       |                                        |
      2048 -> 4095       : 5        |                                        |

disk = nvme1n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 909798   |***************************             |
         8 -> 15         : 1340334  |****************************************|
        16 -> 31         : 212264   |******                                  |
        32 -> 63         : 337110   |**********                              |
        64 -> 127        : 475752   |**************                          |
       128 -> 255        : 146569   |****                                    |
       256 -> 511        : 27103    |                                        |
       512 -> 1023       : 708      |                                        |
      1024 -> 2047       : 3        |                                        |
      2048 -> 4095       : 4        |                                        |

disk = nvme2n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 923536   |***************************             |
         8 -> 15         : 1366403  |****************************************|
        16 -> 31         : 214847   |******                                  |
        32 -> 63         : 349386   |**********                              |
        64 -> 127        : 481452   |**************                          |
       128 -> 255        : 147123   |****                                    |
       256 -> 511        : 27168    |                                        |
       512 -> 1023       : 591      |                                        |
      1024 -> 2047       : 9        |                                        |
      2048 -> 4095       : 1        |                                        |

disk = nvme3n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 920670   |***************************             |
         8 -> 15         : 1361679  |****************************************|
        16 -> 31         : 217317   |******                                  |
        32 -> 63         : 356585   |**********                              |
        64 -> 127        : 488626   |**************                          |
       128 -> 255        : 149823   |****                                    |
       256 -> 511        : 27308    |                                        |
       512 -> 1023       : 694      |                                        |
      1024 -> 2047       : 5        |                                        |
      2048 -> 4095       : 1        |                                        |
Token Generation
$ sudo biolatency -T -D 180 1

disk = nvme0n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1770055  |****************************************|
         8 -> 15         : 588660   |*************                           |
        16 -> 31         : 107722   |**                                      |
        32 -> 63         : 173490   |***                                     |
        64 -> 127        : 550676   |************                            |
       128 -> 255        : 488273   |***********                             |
       256 -> 511        : 156297   |***                                     |
       512 -> 1023       : 11933    |                                        |
      1024 -> 2047       : 161      |                                        |
      2048 -> 4095       : 39       |                                        |
      4096 -> 8191       : 49       |                                        |

disk = nvme1n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1918463  |****************************************|
         8 -> 15         : 668260   |*************                           |
        16 -> 31         : 121909   |**                                      |
        32 -> 63         : 358798   |*******                                 |
        64 -> 127        : 494699   |**********                              |
       128 -> 255        : 246599   |*****                                   |
       256 -> 511        : 55995    |*                                       |
       512 -> 1023       : 1480     |                                        |
      1024 -> 2047       : 160      |                                        |
      2048 -> 4095       : 60       |                                        |
      4096 -> 8191       : 19       |                                        |

disk = nvme2n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1920661  |****************************************|
         8 -> 15         : 669311   |*************                           |
        16 -> 31         : 122003   |**                                      |
        32 -> 63         : 356457   |*******                                 |
        64 -> 127        : 492837   |**********                              |
       128 -> 255        : 246685   |*****                                   |
       256 -> 511        : 54411    |*                                       |
       512 -> 1023       : 1674     |                                        |
      1024 -> 2047       : 175      |                                        |
      2048 -> 4095       : 47       |                                        |
      4096 -> 8191       : 47       |                                        |

disk = nvme3n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1910195  |****************************************|
         8 -> 15         : 673350   |**************                          |
        16 -> 31         : 122117   |**                                      |
        32 -> 63         : 355674   |*******                                 |
        64 -> 127        : 496290   |**********                              |
       128 -> 255        : 247926   |*****                                   |
       256 -> 511        : 55412    |*                                       |
       512 -> 1023       : 1825     |                                        |
      1024 -> 2047       : 115      |                                        |
      2048 -> 4095       : 58       |                                        |
      4096 -> 8191       : 24       |                                        |

Block i/o Top

Summarize block device I/O by process, top-style.

Prompt Processing
$ sudo biotop --pid=$(pidof llama-bench) 180 1

PID     COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
463986  llama-bench      R 259 6   nvme0n1  3323356 61527016   0.05
463986  llama-bench      R 259 4   nvme1n1  3390545 61579280   0.03
463986  llama-bench      R 259 7   nvme2n1  3446108 61477652   0.03
463986  llama-bench      R 259 10  nvme3n1  3464197 61390536   0.03
Token Generation
$ sudo biotop --pid=$(pidof llama-bench) 180 1

PID     COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
483569  llama-bench      R 259 6   nvme0n1  3828846 64984816   0.06
483569  llama-bench      R 259 4   nvme1n1  3849134 64995000   0.04
483569  llama-bench      R 259 7   nvme2n1  3846462 64923580   0.04
483569  llama-bench      R 259 10  nvme3n1  3846248 64986828   0.04

Block i/o Pattern

Show block device I/O pattern.

Prompt Processing
$ sudo biopattern -T 180 1

TIME      DISK     %RND  %SEQ    COUNT     KBYTES
14:01:53  nvme0n1    35    64  3345173   62683820
14:01:53  nvme1n1    32    67  3404623   62661548
14:01:53  nvme2n1    33    66  3470880   62643204
14:01:53  nvme3n1    33    66  3496122   62571880
Token Generation
$ sudo biopattern -T 180 1

Tracing block device I/O requested seeks... Hit Ctrl-C to end.
TIME      DISK     %RND  %SEQ    COUNT     KBYTES
15:53:13  nvme0n1    58    41  4297223   55141108
15:53:13  nvme1n1    54    45  4393544   55156308
15:53:13  nvme2n1    54    45  4382473   55182708
15:53:13  nvme3n1    54    45  4385386   55126844

Block i/o Snoop

Trace block I/O.

Prompt Processing
# do this a few times to get a feel for exact size reads and latencies
$ sudo biosnoop | head
TIME(s)     COMM           PID     DISK    T    SECTOR     BYTES   LAT(ms)
0.000000    llama-b        472645  nvme1n1 RA   171795752  4096      0.056
0.000751    llama-b        472645  nvme2n1 RA   171796208  65536     0.044
0.000756    llama-b        472645  nvme2n1 RA   171796376  20480     0.030
0.000769    llama-b        472645  nvme2n1 RA   171796440  20480     0.010
0.001025    llama-b        472645  nvme3n1 RA   171795456  16384     0.268
0.001027    llama-b        472645  nvme3n1 RA   171795584  8192      0.253
0.001028    llama-b        472645  nvme3n1 RA   171795664  36864     0.220
0.001341    llama-b        472645  nvme0n1 RA   171796488  8192      0.131
0.002048    llama-b        472645  nvme1n1 RA   171796520  8192      0.122
Token Generation
$ sudo biosnoop 1 | head
TIME(s)     COMM           PID     DISK    T    SECTOR     BYTES   LAT(ms)
0.000000    llama-b        483569  nvme1n1 RA   262911728  32768     0.064
0.000091    llama-b        483569  nvme1n1 RA   262911912  4096      0.042
0.000141    llama-b        483569  nvme1n1 RA   262911792  61440     0.088
0.000166    llama-b        483569  nvme1n1 RA   262911920  16384     0.079
0.000216    llama-b        483569  nvme2n1 RA   262911000  4096      0.047
0.000230    llama-b        483569  nvme1n1 RA   262911952  24576     0.056
0.000250    llama-b        483569  nvme2n1 RA   262911008  12288     0.062
0.000259    llama-b        483569  nvme2n1 RA   262910976  12288     0.088
0.000272    llama-b        483569  nvme2n1 RA   262911032  4096      0.071

CPU Cache Stats

Summarize cache references and misses by PID.

Prompt Processing
# sort by cpu (the processes likely bounce between cores/threads)
$ sudo llcstat 180 | grep -P '(llama-bench|Total|PID)' | sort -V -k 3,3

PID      NAME             CPU     REFERENCE         MISS    HIT%
472645   llama-bench      0       260084200    105109100  59.59%
472645   llama-bench      1       271595900    107856300  60.29%
472645   llama-bench      2       122634100     48118400  60.76%
472645   llama-bench      3       103657900     38727000  62.64%
472645   llama-bench      4       152044500     59005700  61.19%
472645   llama-bench      5       155044000     58195700  62.47%
472645   llama-bench      6       292201400    117024400  59.95%
472645   llama-bench      7       174555900     67375600  61.40%
472645   llama-bench      8       348132500    140949600  59.51%
472645   llama-bench      9       261152300    102766400  60.65%
472645   llama-bench      10      408874400    163455700  60.02%
472645   llama-bench      11      451004300    179642900  60.17%
472645   llama-bench      12      387897500    158195400  59.22%
472645   llama-bench      13      502233700    207044400  58.78%
472645   llama-bench      14      186695700     70750300  62.10%
472645   llama-bench      15      406424500    160824600  60.43%
472645   llama-bench      16      346197800    134519400  61.14%
472645   llama-bench      17      204177200     85408700  58.17%
472645   llama-bench      18      165621000     66280800  59.98%
472645   llama-bench      19      326486500    135545100  58.48%
472645   llama-bench      20      368738300    139175500  62.26%
472645   llama-bench      21      261058700    102228400  60.84%
472645   llama-bench      22      424314500    166226400  60.82%
472645   llama-bench      23      252195500    102542900  59.34%
472645   llama-bench      24     1189916100    498972700  58.07%
472645   llama-bench      25     1160928000    478048200  58.82%
472645   llama-bench      26     1314916900    535932500  59.24%
472645   llama-bench      27     1342106800    549897700  59.03%
472645   llama-bench      28     1278900000    521841200  59.20%
472645   llama-bench      29     1290325600    527823400  59.09%
472645   llama-bench      30     1060096500    428899700  59.54%
472645   llama-bench      31     1280229000    524769100  59.01%
472645   llama-bench      32     1052018800    428794900  59.24%
472645   llama-bench      33      976827500    394650300  59.60%
472645   llama-bench      34     1025104000    419006300  59.13%
472645   llama-bench      35      815274000    336547500  58.72%
472645   llama-bench      36     1054887300    428246800  59.40%
472645   llama-bench      37      917551200    373496700  59.29%
472645   llama-bench      38     1247426700    513614100  58.83%
472645   llama-bench      39     1023180200    424158000  58.55%
472645   llama-bench      40     1087901900    448935100  58.73%
472645   llama-bench      41     1222746800    499064500  59.18%
472645   llama-bench      42     1290558800    526509700  59.20%
472645   llama-bench      43     1064135400    431187300  59.48%
472645   llama-bench      44      809287700    340242200  57.96%
472645   llama-bench      45     1191512400    491595900  58.74%
472645   llama-bench      46     1006870000    416382800  58.65%
472645   llama-bench      47     1191455700    490035800  58.87%
Total References: 35036149900 Total Misses: 14365309500 Hit Rate: 59.00%
Token Generation
$ sudo llcstat 180 | grep -P '(llama-bench|Total|PID)' | sort -V -k 3,3

PID      NAME             CPU     REFERENCE         MISS    HIT%
493469   llama-bench      0       178739900     92973100  47.98%
493469   llama-bench      1       143521000     71367200  50.27%
493469   llama-bench      2       102070400     49472700  51.53%
493469   llama-bench      3        93535400     45344500  51.52%
493469   llama-bench      4       122853300     63089400  48.65%
493469   llama-bench      5       100126500     50860700  49.20%
493469   llama-bench      6       135180700     68212700  49.54%
493469   llama-bench      7       221127800    114417400  48.26%
493469   llama-bench      8       353029000    192503900  45.47%
493469   llama-bench      9       201506200    107204200  46.80%
493469   llama-bench      10      213137400    111369200  47.75%
493469   llama-bench      11      211342600    107674700  49.05%
493469   llama-bench      12      171033500     92188700  46.10%
493469   llama-bench      13      165303900     87134700  47.29%
493469   llama-bench      14      265153200    138694800  47.69%
493469   llama-bench      15      172611600     84571800  51.00%
493469   llama-bench      16      255942900    131531100  48.61%
493469   llama-bench      17      151376500     75140300  50.36%
493469   llama-bench      18      248108600    130249100  47.50%
493469   llama-bench      19      268105300    140291000  47.67%
493469   llama-bench      20      155939500     81268600  47.88%
493469   llama-bench      21      125014800     66195500  47.05%
493469   llama-bench      22      196849600    103162900  47.59%
493469   llama-bench      23      217617100    104683100  51.90%
493469   llama-bench      24     1010395500    536548800  46.90%
493469   llama-bench      25     1037633200    546470200  47.33%
493469   llama-bench      26      945782800    485664500  48.65%
493469   llama-bench      27     1060634700    554281500  47.74%
493469   llama-bench      28      992441000    524587200  47.14%
493469   llama-bench      29     1041995900    550117700  47.21%
493469   llama-bench      30     1011749000    538214700  46.80%
493469   llama-bench      31      962790100    517939200  46.20%
493469   llama-bench      32      839163500    438704700  47.72%
493469   llama-bench      33      960316800    507777500  47.12%
493469   llama-bench      34      957871700    508651900  46.90%
493469   llama-bench      35      927833600    492939700  46.87%
493469   llama-bench      36      871497400    453762300  47.93%
493469   llama-bench      37      910214300    480818400  47.18%
493469   llama-bench      38      842718900    442357800  47.51%
493469   llama-bench      39     1007880700    536754700  46.74%
493469   llama-bench      40      899776500    473773500  47.35%
493469   llama-bench      41      909038300    477876200  47.43%
493469   llama-bench      42      920419100    485129300  47.29%
493469   llama-bench      43      885914700    468553600  47.11%
493469   llama-bench      44      960370900    510075500  46.89%
493469   llama-bench      45      907635400    484657100  46.60%
493469   llama-bench      46      850517900    453385200  46.69%
493469   llama-bench      47      909464200    491268200  45.98%
Total References: 29114831100 Total Misses: 15202374500 Hit Rate: 47.78%

Cache Stat

Count cache kernel function calls.

Prompt Processing
$ sudo cachestat -T 180 1
TIME         HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
14:25:46        0        0     4848    0.00%           14     249539
Token Generation
$ sudo cachestat -T 180 1

TIME         HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
15:58:00        0        0     4865    0.00%           13     250911

Scheduler Run Queue Latency

Summarize run queue (scheduler) latency as a histogram.

Prompt Processing
$ sudo runqlat -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 82496    |****************************************|
         2 -> 3          : 49371    |***********************                 |
         4 -> 7          : 14323    |******                                  |
         8 -> 15         : 3843     |*                                       |
        16 -> 31         : 978      |                                        |
        32 -> 63         : 189      |                                        |
        64 -> 127        : 208      |                                        |
       128 -> 255        : 39       |                                        |
       256 -> 511        : 30       |                                        |
       512 -> 1023       : 43       |                                        |
      1024 -> 2047       : 41       |                                        |
      2048 -> 4095       : 45       |                                        |

Token Generation
$ sudo runqlat -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 90843    |****************************************|
         2 -> 3          : 79882    |***********************************     |
         4 -> 7          : 39149    |*****************                       |
         8 -> 15         : 5292     |**                                      |
        16 -> 31         : 1515     |                                        |
        32 -> 63         : 282      |                                        |
        64 -> 127        : 183      |                                        |
       128 -> 255        : 16       |                                        |
       256 -> 511        : 17       |                                        |
       512 -> 1023       : 42       |                                        |
      1024 -> 2047       : 55       |                                        |
      2048 -> 4095       : 34       |                                        |

On-CPU Time

Summarize on-CPU time per task as a histogram.

Prompt Processing
$ sudo cpudist -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 5        |                                        |
         8 -> 15         : 17       |                                        |
        16 -> 31         : 49       |                                        |
        32 -> 63         : 58       |                                        |
        64 -> 127        : 56       |                                        |
       128 -> 255        : 100      |                                        |
       256 -> 511        : 147      |*                                       |
       512 -> 1023       : 598      |*****                                   |
      1024 -> 2047       : 291      |**                                      |
      2048 -> 4095       : 619      |******                                  |
      4096 -> 8191       : 1425     |**************                          |
      8192 -> 16383      : 1931     |*******************                     |
     16384 -> 32767      : 2872     |****************************            |
     32768 -> 65535      : 4011     |****************************************|
     65536 -> 131071     : 3888     |**************************************  |
    131072 -> 262143     : 2650     |**************************              |
    262144 -> 524287     : 893      |********                                |
    524288 -> 1048575    : 76       |                                        |

Token Generation
$ sudo cpudist -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 1        |                                        |
         4 -> 7          : 8        |                                        |
         8 -> 15         : 11       |                                        |
        16 -> 31         : 27       |                                        |
        32 -> 63         : 69       |                                        |
        64 -> 127        : 84       |                                        |
       128 -> 255        : 132      |*                                       |
       256 -> 511        : 282      |**                                      |
       512 -> 1023       : 846      |********                                |
      1024 -> 2047       : 535      |*****                                   |
      2048 -> 4095       : 1060     |**********                              |
      4096 -> 8191       : 2314     |***********************                 |
      8192 -> 16383      : 2792     |****************************            |
     16384 -> 32767      : 3419     |***********************************     |
     32768 -> 65535      : 3878     |****************************************|
     65536 -> 131071     : 2915     |******************************          |
    131072 -> 262143     : 1850     |*******************                     |
    262144 -> 524287     : 519      |*****                                   |
    524288 -> 1048575    : 44       |                                        |

Off-CPU Time

Summarize off-CPU time per task as a histogram.

Prompt Processing
$ sudo cpudist --offcpu -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 1        |                                        |
         2 -> 3          : 10007    |************                            |
         4 -> 7          : 30993    |****************************************|
         8 -> 15         : 17615    |**********************                  |
        16 -> 31         : 21114    |***************************             |
        32 -> 63         : 13342    |*****************                       |
        64 -> 127        : 13039    |****************                        |
       128 -> 255        : 11525    |**************                          |
       256 -> 511        : 8151     |**********                              |
       512 -> 1023       : 5882     |*******                                 |
      1024 -> 2047       : 5374     |******                                  |
      2048 -> 4095       : 3981     |*****                                   |
      4096 -> 8191       : 1264     |*                                       |
      8192 -> 16383      : 104      |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 6        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 23       |                                        |

Token Generation
$ sudo cpudist --offcpu -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 5        |                                        |
         2 -> 3          : 5250     |*********                               |
         4 -> 7          : 17459    |******************************          |
         8 -> 15         : 14778    |*************************               |
        16 -> 31         : 14864    |*************************               |
        32 -> 63         : 20889    |************************************    |
        64 -> 127        : 18418    |*******************************         |
       128 -> 255        : 17902    |******************************          |
       256 -> 511        : 15147    |**************************              |
       512 -> 1023       : 21717    |*************************************   |
      1024 -> 2047       : 23123    |****************************************|
      2048 -> 4095       : 21891    |*************************************   |
      4096 -> 8191       : 8738     |***************                         |
      8192 -> 16383      : 1500     |**                                      |
     16384 -> 32767      : 303      |                                        |
     32768 -> 65535      : 90       |                                        |
     65536 -> 131071     : 71       |                                        |
    131072 -> 262143     : 5        |                                        |

Off-CPU Time Stack Traces

Summarize off-CPU time by stack trace. Note that both the prompt processing and token generation traces below bottom out in io_schedule via filemap_fault: worker threads are sleeping on page-cache faults while the mmap()'d model weights are read in from disk.

Prompt Processing
# run a few times to see what is happening
$ sudo offcputime -p $(pidof llama-bench) 1

    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_trace_run4
    __bpf_trace_sched_switch
    __schedule
    schedule
    io_schedule
    folio_wait_bit_common
    filemap_fault
    __do_fault
    do_read_fault
    do_fault
    handle_pte_fault
    __handle_mm_fault
    handle_mm_fault
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    ggml_vec_dot_q4_K_q8_K
    ggml_compute_forward_mul_mat_id
    -                llama-bench (473106)
        4645

Token Generation
$ sudo offcputime -p $(pidof llama-bench) 1

    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_trace_run4
    __bpf_trace_sched_switch
    __traceiter_sched_switch
    __schedule
    schedule
    io_schedule
    folio_wait_bit_common
    filemap_fault
    __do_fault
    do_read_fault
    do_fault
    handle_pte_fault
    __handle_mm_fault
    handle_mm_fault
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    ggml_vec_dot_q4_K_q8_K
    ggml_compute_forward_mul_mat_id
    -                llama-bench (473122)
        167

Hey, is there a solution for running this across multiple machines? What if I had a couple of machines and 100Gbps NICs? I know llama.cpp has an RPC backend, but it seems pretty early and experimental.
Edit: It apparently uses TCP, which probably doesn’t really scale.


Nice video Wendell. I’ve got the unsloth 1.58-bit dynamic quant on my AI server and have been thinking about upgrading it so I can get usable tokens per sec. Right now I get about 1.5 tokens per sec with a 4096 context size, which is really unusable for anything meaningful. Part of that may be that I’m running it on Windows, and LM Studio does not have anything special in terms of llama.cpp optimizations.

I would like to get up to 10 tokens per sec with as much context loaded as possible. I’d like 128k context if I can get it, but 64k would be acceptable. I’ve been looking at Threadripper Pro 5000 and Threadripper 7000, but maybe a dual-CPU EPYC system is the best bet, even if it’s 3rd-gen EPYC. It’s just that DDR5 RDIMMs still cost a fortune. What would be the best system for the price to get at least 10 tokens per sec with a big context size?

Current system specs: AMD Ryzen 5900X, 128GB DDR4 RAM, 2x RTX 3090s (running without a bridge, though I have an NVLink bridge).



Wendell got the full Q8_0 running at usable speeds on the big dual EPYC w/ 768GB RAM.

While I technically did get both the full Q8_0 and the half-sized Q4_K_M unsloth quants running on the 256GB RAM, 24-core Threadripper, it is not something I’d recommend: less than 1 tok/sec at any usable context length, e.g. 8k+.

If a big chunk of the model weights doesn't fit in RAM, it will constantly page in from the NVMe drives, which kills performance (a quick way to spot this is sketched below).

This is true regardless of Optane NVMe, a Gen 5 T705 RAID0 array, etc. The current software implementation can’t take full advantage of the O_DIRECT performance of these drives yet.

But you can get smaller quantized versions of the real R1 671b that fit in less than 256GB RAM, and those work okay!
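Here's the quick check I mean, run while the model is generating:

# watch the 'bi' column (KiB/s read in from block devices); sustained
# large values during token generation mean weights are being
# re-faulted in from NVMe rather than served from the page cache
$ vmstat 1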

Yeah I saw some wild commands on an experimental branch. Basically slicing up the MoE and placing the most used experts on the fastest RAM or VRAM across multiple machines.

But yeah, sounds like latency becomes an issue, though folks are experimenting.
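For the stock llama.cpp RPC backend mentioned above, the basic flow is roughly like this. This is a sketch from memory, so treat the flag names as assumptions and check the llama.cpp RPC README against your build; the IP addresses are placeholders:

# on each worker box (llama.cpp built with -DGGML_RPC=ON):
$ ./build/bin/rpc-server -H 0.0.0.0 -p 50052

# on the head node, offload layers to the workers over TCP:
$ ./build/bin/llama-cli -m model.gguf -ngl 99 \
    --rpc 192.168.1.10:50052,192.168.1.11:50052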

I think it would be very interesting to have a performance comparison between single-socket and dual-socket EPYC, on both SP3 and SP5 platforms.

I don't have the resources to test it right now.

I found something similar: for the 671b model to fit within my RAM envelope (512GB), I could use a maximum of 5k context, and the model would simply think itself in circles until ollama hit an EOF error and crashed. I’m downloading the unsloth model you linked and I’ll give it a try. Sadly, running the full-fat model locally appears to be out of my reach unless I 1) upgrade to 1TB of RAM or 2) completely replace my current EPYC server with a 9xx4 or 9xx5 with 768GB of DDR5. Both of those options are way too expensive for something that is currently for hobby/education purposes.

Once I get the unsloth model downloaded and running, I can benchmark a 7532 and a 7C13 with 512GB of DDR4-3200 (rough theoretical peaks below). I’ll have to leave it to wendell (or someone else) to provide SP5 numbers, but I think the additional memory bandwidth from 12 channels of DDR5 will stomp my SP3 machine.
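For rough context, theoretical peak bandwidth is channels × MT/s × 8 bytes, assuming 8-channel DDR4-3200 on SP3 versus 12-channel DDR5-4800 on SP5 (measured numbers, like the MLC results further down, land well below these):

$ echo "8 * 3200 * 8 / 1000" | bc     # SP3, DDR4-3200 x 8ch: GB/s
204
$ echo "12 * 4800 * 8 / 1000" | bc    # SP5, DDR5-4800 x 12ch: GB/s
460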
Here’s the llama-bench (using the same parameters as ubergarm) on my 32-core EPYC Rome with 512GB RAM:

| model                        |       size |   params | backend | threads | type_k |  test |          t/s |
| ---------------------------- | ---------: | -------: | ------- | ------: | -----: | ----: | -----------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | pp512 | 15.41 ± 0.22 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | tg128 |  1.96 ± 0.02 |

Yikes to that tg128 number. I’d hoped the smaller model would result in a faster t/s than I was seeing using the 671b model from the ollama library, but this is only modestly faster…


I did some calculations and you could build a system with

  • Gigabyte MZ73-LM1
  • 2x AMD EPYC Genoa SP5 Zen 4 9334 QS
  • 24x 64GB DDR5-5600
  • 4x 2TB NVMe

for around 5000 Euros, if Alibaba isn’t lying about those memory prices. Welp, it will actually cost above 10k; DDR5 memory has gotten very expensive, so no cheap AI machine for me :frowning:

Maybe I can save 1k by going to a single-socket system. The only thing I still need to know is the rough performance one can expect.

Liked the video. Just wanted to add I’ve been playing with Msty. Found it more manageable than using ollama with extensions. It has knowledge stacks that are very cool. Running fairly well on relatively pleb hardware: 7950X, 64GB RAM, RX 6800 16GB.

I’m in the camp of being hopeful to use NVMe, CPU, and GPU together to accelerate R1 at q4.

Found another example of loading weights direct to the GPU via PCIe, so it could be fast to pull from a RAID of NVMe drives.


I am also running benches for all the DeepSeek versions with various core counts and -ngl settings to see what the different models do well at. Can add my findings if you guys want. I’m about halfway through.


Wow, great, thanks for the numbers, and yeah hrmmm… If you could run that Intel memory checker shown above, that’d give us some numbers for aggregate RAM bandwidth.

Also, iirc single-socket EPYC Rome still presents NUMA nodes depending on how it’s configured, right? numactl --hardware will show it.

If so, maybe try adding --numa distribute to the llama-bench command to see if it makes a difference? Quick sketch of the check below.
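Something like this (node count depends on your BIOS NPS setting):

# show NUMA node count, CPU assignment, and per-node memory
$ numactl --hardware
$ lscpu | grep -i numa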

Yeah, Rome has NUMA nodes. Here are the memory checker results. Looks like ~160-173 GB/s or so of memory bandwidth:

Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
		Numa node
Numa node	     0	     1	     2	     3	
       0	 107.6	 112.5	 120.2	 128.5	
       1	 112.9	 107.5	 127.3	 120.9	
       2	 120.8	 131.2	 107.3	 113.0	
       3	 127.4	 121.2	 112.9	 107.6	

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	173978.0	
3:1 Reads-Writes :	161623.1	
2:1 Reads-Writes :	159760.8	
1:1 Reads-Writes :	159933.1	
Stream-triad like:	164774.6	

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Numa node
Numa node	     0	     1	     2	     3	
       0	43328.3	42146.1	40687.7	39666.0	
       1	41983.4	43520.2	39840.1	40340.3	
       2	40356.6	39902.4	43583.3	41970.6	
       3	39679.5	40214.1	42238.0	42956.3	

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	227.22	 173611.8
 00002	226.90	 173911.9
 00008	224.54	 173796.0
 00015	215.57	 173829.0
 00050	177.41	 173550.8
 00100	153.73	 171025.4
 00200	135.96	 112950.5
 00300	126.31	  79863.2
 00400	126.62	  61199.2
 00500	124.44	  49708.7
 00700	123.60	  36151.1
 01000	122.82	  25755.8
 01300	122.25	  20046.7
 01700	121.81	  15509.1
 02500	123.19	  10769.5
 03500	121.06	   7870.4
 05000	120.83	   5691.3
 09000	121.11	   3399.8
 20000	121.62	   1819.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	22.3
Local Socket L2->L2 HITM latency	22.7

I’ll try with the --numa distribute flag and see what I get.

Alright. That definitely helped. New llama-bench results:

| model                        |       size |   params | backend | threads | type_k |  test |          t/s |
| ---------------------------- | ---------: | -------: | ------- | ------: | -----: | ----: | -----------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | pp512 | 15.38 ± 0.80 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | tg128 |  4.69 ± 0.15 |

I presume your additional performance comes in part from the higher memory bandwidth. Does your Threadripper have AVX-512 also?


Oh great that sped it up!

Yeah, the flags for the TR PRO 7965WX include these AVX variants:

avx avx2 avx512f avx512dq avx512ifma avx512cd avx512bw avx512vl avx512_bf16 avx512vbmi avx512_vbmi2 avx512_vnni avx512_bitalg avx512_vpopcntdq

My 9950X has the same, plus avx512_vp2intersect.
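If anyone else wants to compare, a quick way to dump the AVX-family flags on Linux:

# list the unique AVX feature flags reported by the kernel
$ grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u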

How many of your 32 cores are you using? Did you try varying that to see if it has any effect? Assuming you have SMT enabled, you could search for the best thread count like so:

# $(seq -s, 16 4 64) = 16,20,24,28,32,36, ...
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --numa distribute \
    --threads $(seq -s, 16 4 64)

It would be cool to establish some kind of linear relationship between RAM bandwidth and generation speed across different systems, to show the bottleneck is primarily RAM bandwidth. But if it isn’t linear, there may be other limitations on some systems, e.g. avx512 or possibly memory latency. Rough napkin math below.
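As a hedged back-of-the-envelope (my assumptions: R1 activates roughly 37B parameters per token, and this quant averages ~2.51 bits per weight), each generated token has to stream about 37e9 × 2.51 / 8 ≈ 11.6GB of weights from RAM, so at the ~173GB/s measured above the ceiling would be:

# tokens/sec ceiling = RAM bandwidth / bytes streamed per token
# (assumes ~37B active params/token at ~2.51 bpw; both are estimates)
$ echo "scale=1; 173 / (37 * 2.51 / 8)" | bc
14.9

The measured 4.69 tok/sec is well under that, which is expected since real systems only sustain a fraction of peak across NUMA nodes, but it suggests RAM bandwidth sets the right order of magnitude for tg.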

I’ve disabled SMT based on the suggestion of the YouTuber Digital Spaceport, so that bench is with 32 threads, using all of my cores. Sometime this weekend I should be able to put my Milan CPU in this board and then I’ll be able to test if there’s any performance uplift moving from Zen 2 to Zen 3 and 32 to 64 cores.

Currently running a subset of your llama-bench suggestions, 16 through 28 cores (since I’ve already tested 32). I’ll edit this post with the results when it finishes.

EDIT: I ended up testing through 32 cores. Interesting results: prompt processing scales fairly linearly with cores, but there’s little to nothing going on with token generation. AND the full 32-core run came in 33% lower than when I ran it last night. Weird.

./build/bin/llama-bench \
    --model ./models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --threads $(seq -s, 16 4 32) \
    --numa distribute

| model                        |       size |   params | backend | threads | type_k |  test |          t/s |
| ---------------------------- | ---------: | -------: | ------- | ------: | -----: | ----: | -----------: |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      16 |   q4_0 | pp512 | 10.54 ± 0.02 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      16 |   q4_0 | tg128 |  3.06 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      20 |   q4_0 | pp512 | 12.37 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      20 |   q4_0 | tg128 |  3.22 ± 0.02 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      24 |   q4_0 | pp512 | 13.99 ± 0.05 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      24 |   q4_0 | tg128 |  3.21 ± 0.01 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      28 |   q4_0 | pp512 | 15.48 ± 0.05 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      28 |   q4_0 | tg128 |  3.32 ± 0.03 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | pp512 | 16.43 ± 0.35 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU     |      32 |   q4_0 | tg128 |  3.18 ± 0.04 |

Weird indeed! I saw one GitHub issue talking about how their EPYC Turin rig needed generation stuff cached in a certain order or something?

Might be easy enough to try: basically flush the caches and run only the pp benchmark? A sketch is below.
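A minimal sketch, assuming llama-bench skips the generation tests when you pass -n 0 (double-check against --help on your build):

# drop the page cache so prompt processing starts cold
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# run only the pp512 test, no token generation
$ ./build/bin/llama-bench \
    --model ./models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --numa distribute -p 512 -n 0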

Unrelated, but supposedly ktransformers can run faster than llama.cpp right now, though much of the discussion is in Chinese, so it’s getting interesting :sweat_smile:

I’ll keep ya posted haha…