DeepSeek Deep Dive R1 at Home!

Nitty Gritty

Okay, the rest of this is a hodgepodge of notes, benchmarks, and some graphs collected during this testing.

$ cowsay -s "this is exciting"
 __________________
< this is exciting >
 ------------------
        \   ^__^
         \  (**)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Hardware Configuration

I did most of the testing on an AMD Ryzen Threadripper PRO 7965WX 24-Core with 8x 32GB DDR5 Kingston KF560R32-32 DIMMs for a total of 256GB of RAM. It also has 4x Crucial CT4000T705SSD3 4TB PCIe Gen 5 NVMe drives configured with mdadm into a striped RAID0 array for maximum read performance.
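If you want to sanity check similar details on your own box, something like this should do it (a rough sketch; exact flags and output vary by distro):

# CPU model
$ lscpu | grep -i 'model name'
# installed RAM
$ free -h
# NVMe drives
$ lsblk -d -o NAME,MODEL,SIZE /dev/nvme?n1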

Baseline Benchmarks

I used some publicly available benchmarking tools to get a feel for the performance of the various hardware components.

RAM

Grab Intel's Memory Latency Checker (mlc), which is more or less a Linux version of AIDA64.
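Getting mlc onto the box is basically download-and-extract; roughly (the exact tarball name depends on the version you grab from Intel's site):

# extract the tarball downloaded from Intel; it contains the ./Linux/mlc binary used below
$ tar xf mlc_v3.11b.tgz
$ chmod +x ./Linux/mlc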

This test shows about 225-250GB/s aggregate RAM i/o bandwidth and ~95ns latency.

$ echo 4000 | sudo tee /proc/sys/vm/nr_hugepages
$ sudo ./Linux/mlc | tee -a output.log

Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          95.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      224438.0
3:1 Reads-Writes :      254061.5
2:1 Reads-Writes :      248663.7
1:1 Reads-Writes :      227839.2
Stream-triad like:      250344.9

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        221419.9

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  412.79   222680.1
 00002  417.48   222670.9
 00008  429.14   222643.9
 00015  429.00   222613.8
 00050  430.61   222542.4
 00100  426.57   222426.4
 00200  113.76   138084.6
 00300  111.73    92931.0
 00400  110.78    68318.5
 00500  110.33    55330.5
 00700  109.71    39991.5
 01000  109.29    28329.2
 01300  105.82    22024.3
 01700  104.64    17036.3
 02500  104.34    11811.4
 03500  104.20     8626.4
 05000  104.12     6229.8
 09000  103.99     3739.3
 20000  103.91     2023.0

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        18.3
Local Socket L2->L2 HITM latency        18.3

Next I grabbed Google's multichase and compiled it for another look at RAM latency, which suggests about 90ns.
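A quick sketch of grabbing and building it, assuming git and a normal build toolchain are already installed:

$ git clone https://github.com/google/multichase
$ cd multichase && make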

$ ./multichase -m 1g -n 20
89.694

NVMe RAID0 Array

Here is how I set up the RAID0 array and ran some basic fio tests to get a feel for out-of-the-box, unoptimized performance on a plain ext4 filesystem.

There may be some limited gains to be had using other filesystems, changing the chunk size, and tuning other knobs. But without an O_DIRECT unbuffered llama.cpp implementation, the flood of page faults incurred when running an LLM too big to fit into RAM will still be the bottleneck. This is why, for now, you should get more RAM rather than more SSDs…
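As one example of the kind of knob I mean, the array's readahead can be inspected and bumped with blockdev (values here are purely illustrative):

# current readahead, in 512-byte sectors
$ sudo blockdev --getra /dev/md0
# bump it to 4MiB
$ sudo blockdev --setra 8192 /dev/md0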

$ mdadm --version
mdadm - v4.3 - 2024-02-15 - Ubuntu 4.3-1ubuntu2.1

# Good reference guide
# https://www.jeffgeerling.com/blog/2021/htgwa-create-raid-array-linux-mdadm

$ sudo wipefs --all --force /dev/nvme0n1
$ sudo wipefs --all --force /dev/nvme1n1
$ sudo wipefs --all --force /dev/nvme2n1
$ sudo wipefs --all --force /dev/nvme3n1

# partition_number:partition_start:partition_end
$ sudo sgdisk -n 1:0:0 /dev/nvme0n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme1n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme2n1
Creating new GPT entries in memory.
The operation has completed successfully.
$ sudo sgdisk -n 1:0:0 /dev/nvme3n1
Creating new GPT entries in memory.
The operation has completed successfully.

$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      15627540480 blocks super 1.2 512k chunks

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Feb  7 16:13:27 2025
        Raid Level : raid0
        Array Size : 15627540480 (14.55 TiB 16.00 TB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Feb  7 16:13:27 2025
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : wTR24:0  (local to host wTR24)
              UUID : 5416209d:671789ee:296e46b5:bfa09730
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme0n1p1
       1     259        9        1      active sync   /dev/nvme1n1p1
       2     259       10        2      active sync   /dev/nvme2n1p1
       3     259       11        3      active sync   /dev/nvme3n1p1

$ sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md0
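Not shown above: to actually use the array, mount it somewhere (the /mnt/raid path below matches the model paths used later) and optionally persist the array config so it assembles on boot. Roughly:

$ sudo mkdir -p /mnt/raid
$ sudo mount /dev/md0 /mnt/raid
# optional: persist the array so it comes back after a reboot
$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u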

fio random read IOPS

Almost 18GB/sec

# non destructive unbuffered random read 4k IOPS benchmark
$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=randread \
      --direct=1 \
      --bs=4k \
      --ioengine=libaio \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=IOPS_4k \
      --iodepth=32

IOPS_4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [r(24)][100.0%][r=16.7GiB/s][r=4374k IOPS][eta 00m:00s]
IOPS_4k: (groupid=0, jobs=24): err= 0: pid=559472: Sun Feb  9 18:36:39 2025
  read: IOPS=4352k, BW=16.6GiB/s (17.8GB/s)(996GiB/60001msec)
    slat (usec): min=2, max=3964, avg= 3.54, stdev= 2.95
    clat (nsec): min=580, max=4484.0k, avg=172308.91, stdev=108153.11
     lat (usec): min=10, max=4488, avg=175.85, stdev=108.20
    clat percentiles (usec):
     |  1.00th=[   21],  5.00th=[   41], 10.00th=[   59], 20.00th=[   83],
     | 30.00th=[  101], 40.00th=[  122], 50.00th=[  147], 60.00th=[  176],
     | 70.00th=[  212], 80.00th=[  262], 90.00th=[  330], 95.00th=[  375],
     | 99.00th=[  461], 99.50th=[  523], 99.90th=[  725], 99.95th=[  848],
     | 99.99th=[ 1045]
   bw (  MiB/s): min=15657, max=17387, per=100.00%, avg=17025.63, stdev= 8.68, samples=2856
   iops        : min=4008240, max=4451242, avg=4358560.48, stdev=2220.89, samples=2856
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.85%, 50=6.81%
  lat (usec)   : 100=21.64%, 250=48.59%, 500=21.47%, 750=0.54%, 1000=0.08%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=18.33%, sys=74.16%, ctx=16667145, majf=0, minf=1075
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=261127526,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=16.6GiB/s (17.8GB/s), 16.6GiB/s-16.6GiB/s (17.8GB/s-17.8GB/s), io=996GiB (1070GB), run=60001-60001msec

Disk stats (read/write):
    md0: ios=261123443/0, sectors=2088987544/0, merge=0/0, ticks=39912514/0, in_queue=39912514, util=99.40%, aggrios=65281881/20, aggsectors=522255054/0, aggrmerge=0/0, aggrticks=9920376/11, aggrin_queue=9920399, aggrutil=90.17%
  nvme3n1: ios=65290024/20, sectors=522320192/0, merge=0/0, ticks=10342006/11, in_queue=10342028, util=89.91%
  nvme0n1: ios=65278033/20, sectors=522224264/0, merge=0/0, ticks=8944765/12, in_queue=8944789, util=89.17%
  nvme1n1: ios=65275278/20, sectors=522202224/0, merge=0/0, ticks=9868045/12, in_queue=9868068, util=90.17%
  nvme2n1: ios=65284192/20, sectors=522273536/0, merge=0/0, ticks=10526690/11, in_queue=10526713, util=89.80%

fio sequential read

Over 40GB/s

# non destructive unbuffered sequential read 1M block size benchmark
$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=read \
      --direct=1 \
      --bs=1M \
      --ioengine=libaio \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=SEQREAD_1M \
      --iodepth=32

SEQREAD_1M: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [R(24)][100.0%][r=36.2GiB/s][r=37.1k IOPS][eta 00m:00s]
SEQREAD_1M: (groupid=0, jobs=24): err= 0: pid=573183: Sun Feb  9 19:02:22 2025
  read: IOPS=38.7k, BW=37.8GiB/s (40.6GB/s)(2268GiB/60036msec)
    slat (usec): min=56, max=4832, avg=67.55, stdev=18.88
    clat (usec): min=115, max=150892, avg=19779.55, stdev=16087.13
     lat (usec): min=178, max=150957, avg=19847.10, stdev=16087.35
    clat percentiles (usec):
     |  1.00th=[   676],  5.00th=[  1532], 10.00th=[  1991], 20.00th=[  3359],
     | 30.00th=[  6456], 40.00th=[ 11207], 50.00th=[ 16581], 60.00th=[ 22414],
     | 70.00th=[ 30016], 80.00th=[ 35914], 90.00th=[ 40633], 95.00th=[ 45351],
     | 99.00th=[ 62653], 99.50th=[ 72877], 99.90th=[ 93848], 99.95th=[100140],
     | 99.99th=[122160]
   bw (  MiB/s): min=27217, max=48975, per=100.00%, avg=38728.43, stdev=161.84, samples=2856
   iops        : min=27215, max=48975, avg=38727.16, stdev=161.86, samples=2856
  lat (usec)   : 250=0.05%, 500=0.45%, 750=0.69%, 1000=0.77%
  lat (msec)   : 2=8.10%, 4=12.60%, 10=14.89%, 20=18.32%, 50=41.29%
  lat (msec)   : 100=2.80%, 250=0.05%
  cpu          : usr=0.16%, sys=12.75%, ctx=2147099, majf=0, minf=196995
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2321962,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=37.8GiB/s (40.6GB/s), 37.8GiB/s-37.8GiB/s (40.6GB/s-40.6GB/s), io=2268GiB (2435GB), run=60036-60036msec

Disk stats (read/write):
    md0: ios=18531889/0, sectors=4744163584/0, merge=0/0, ticks=256192726/0, in_queue=256192726, util=100.00%, aggrios=4643924/2, aggsectors=1188844546/0, aggrmerge=0/0, aggrticks=64180507/2, aggrin_queue=64180511, aggrutil=72.41%
  nvme3n1: ios=4643904/2, sectors=1188839424/0, merge=0/0, ticks=134953567/2, in_queue=134953572, util=70.58%
  nvme0n1: ios=4643945/2, sectors=1188849672/0, merge=0/0, ticks=25059323/2, in_queue=25059327, util=72.41%
  nvme1n1: ios=4643944/2, sectors=1188849664/0, merge=0/0, ticks=27630208/2, in_queue=27630212, util=71.25%
  nvme2n1: ios=4643904/2, sectors=1188839424/0, merge=0/0, ticks=69078931/2, in_queue=69078936, util=70.02%

fio buffered llama.cpp mmap() simulation

About 1.5GB/s

This benchmark simulates running llama.cpp when the model is too big to fit into your RAM + VRAM. The magic of mmap() is what makes this possible at all, by juggling data through the Linux kernel page cache.

This buffered i/o bottleneck is why, for now, you want to keep as much of the model as possible in RAM + VRAM.

I chose 64k to roughly simulate the actual block sizes measured during llama.cpp inferencing. Actual llama.cpp is a little better than this, given its access pattern isn't 100% random.

Notes on how I measured that further down the rabbit hole…

$ sudo fio \
      --filename=/dev/md0 \
      --readonly \
      --rw=randread \
      --direct=0 \
      --bs=64k \
      --ioengine=mmap \
      --runtime=60 \
      --numjobs=24 \
      --time_based=1 \
      --group_reporting \
      --name=IOPS_64k_BUFFERED_MMAP \
      --iodepth=1
IOPS_64k_BUFFERED_MMAP: (g=0): rw=randread, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=mmap, iodepth=1
...
fio-3.36
Starting 24 processes
Jobs: 24 (f=24): [r(24)][100.0%][r=1461MiB/s][r=23.4k IOPS][eta 00m:00s]
IOPS_64k_BUFFERED_MMAP: (groupid=0, jobs=24): err= 0: pid=624772: Thu Feb 13 12:37:01 2025
  read: IOPS=21.5k, BW=1346MiB/s (1411MB/s)(78.9GiB/60001msec)
    clat (usec): min=7, max=166280, avg=1113.75, stdev=1271.13
     lat (usec): min=7, max=166280, avg=1113.78, stdev=1271.13
    clat percentiles (usec):
     |  1.00th=[  285],  5.00th=[  306], 10.00th=[  318], 20.00th=[  334],
     | 30.00th=[  375], 40.00th=[  570], 50.00th=[  783], 60.00th=[  979],
     | 70.00th=[ 1254], 80.00th=[ 1663], 90.00th=[ 2409], 95.00th=[ 3195],
     | 99.00th=[ 5080], 99.50th=[ 5735], 99.90th=[ 7046], 99.95th=[ 7898],
     | 99.99th=[30278]
   bw (  MiB/s): min=  619, max= 3834, per=100.00%, avg=1346.99, stdev=22.40, samples=2856
   iops        : min= 9914, max=61352, avg=21551.55, stdev=358.47, samples=2856
  lat (usec)   : 10=0.01%, 20=0.11%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=36.44%, 750=11.93%, 1000=12.21%
  lat (msec)   : 2=24.91%, 4=11.76%, 10=2.60%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.86%, sys=80.94%, ctx=20656649, majf=20647760, minf=4009
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1292171,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1346MiB/s (1411MB/s), 1346MiB/s-1346MiB/s (1411MB/s-1411MB/s), io=78.9GiB (84.7GB), run=60001-60001msec

Disk stats (read/write):
    md0: ios=20593564/0, sectors=164748512/0, merge=0/0, ticks=311541/0, in_queue=311541, util=74.42%, aggrios=5161940/24, aggsectors=41295522/0, aggrmerge=0/0, aggrticks=67192/23, aggrin_queue=67239, aggrutil=28.65%
  nvme3n1: ios=5165040/24, sectors=41320320/0, merge=0/0, ticks=64937/24, in_queue=64985, util=27.60%
  nvme0n1: ios=5160824/24, sectors=41286592/0, merge=0/0, ticks=74396/23, in_queue=74442, util=28.65%
  nvme1n1: ios=5155084/24, sectors=41240672/0, merge=0/0, ticks=64624/23, in_queue=64669, util=27.54%
  nvme2n1: ios=5166813/24, sectors=41334504/0, merge=0/0, ticks=64814/24, in_queue=64862, util=27.52%

sar

The Linux System Activity Reporter, sar, is a powerful tool for BBS sysops :wink: . You can use sar not only to print useful stats, but also to capture logs to a data file. Then you can use sadf to convert the data file into an interactive .svg image with charts and graphs!

This graph is a time series chart. On the left is the beginning, where llama.cpp starts up. Moving right, I've labeled the time periods for loading the model, prompt processing, and token generation until it stops.

This test uses the larger Q4_K_M quant weighing in at 377GiB, which is too big to fit into the 256GB of system RAM. So this test was designed to stress the buffered disk reads and figure out where the performance bottleneck is.

After updating the Linux kernel from 6.8.0 to a fresh 6.13.0 and enabling transparent huge pages, performance improved a bit, as shown in this side-by-side comparison. The left side is the same chart as above on 6.8.0, and the right side is the newer 6.13.0. While slightly better, it isn't a significant improvement for actual token generation.
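If you want to check or flip transparent huge pages yourself, it's a simple sysfs knob (whether always or madvise is the better policy for your workload is a separate question):

# current THP policy, the bracketed value is active
$ cat /sys/kernel/mm/transparent_hugepage/enabled
# enable system wide
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled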

There are two .svg files as well if you want more metrics:

Here is how you could generate graphs like this for your own tests:

# install needed packages
$ sudo apt-get install sysstat || sudo pacman -Sy extra/sysstat

# drop linux kernel page cache, cached entries, and inodes
# might take a minute if you have a lot of RAM and stuff cached
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# begin capturing 
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 1 -o sarlog.dat

# do whatever test in another terminal e.g. a llama-server inference test
# hit <ctrl>+c to stop capturing when done

# generate svg file using same arguments as sar above
$ sadf -g sarlog.dat -O showtoc -- -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL > output.svg

You don’t have to use sadf and generate an .svg, as sar will print some useful stuff to the console while it is logging. The next section has some examples.
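If you'd rather crunch the numbers yourself, sadf can also dump the same data as semicolon-separated values instead of an .svg, using the same trailing sar arguments:

$ sadf -d sarlog.dat -- -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL > output.csv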

BPF Performance Tools

One of my sysadmin role models is Brendan Gregg, who popularized perf Flame Graphs as well as advanced eBPF Linux kernel probe tools. He has a couple of fat books available and a useful website as well. The tools are a bit confusing to install, but they can be a great way to observe how a userland application interacts with the Linux kernel subsystems.

You can measure things like block i/o size, disk latency, sequential vs random i/o, Off-CPU Analysis, cache stats, etc. The advantage of using BPF probes over normal perf is they often have less overhead.

The rest of this mega post is a bunch of these tools measuring various things during llama-bench, for both prompt processing and token generation. You can see that each phase of inferencing has a somewhat different profile.

All these measurements were taken with Linux 6.13.0 on the 24-core Threadripper Pro running llama-bench on the Q4_K_M quant to focus on that buffered read i/o bottleneck.

$ ./build/bin/llama-bench \
    --model /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --threads 24
| model                        |       size |   params | backend | threads | type_k |  test |         t/s |
| ---------------------------- | ---------: | -------: | ------- | ------: | -----: | ----: | ----------: |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU     |      24 |   q4_0 | pp512 | 2.32 ± 0.20 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU     |      24 |   q4_0 | tg128 | 1.12 ± 0.16 |

build: 90e4dba4 (4689)
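For reference, a stock CPU-only cmake build of llama.cpp along these lines produces the llama-bench binary used here (a sketch; adjust flags to taste):

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ cmake -B build
$ cmake --build build --config Release -j $(nproc)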

Install libbpf-tools

# The package situation on Ubuntu is confusing
# https://github.com/iovisor/bcc/blob/master/INSTALL.md#ubuntu---binary
# https://github.com/iovisor/bcc/pull/4940#issuecomment-2645861929
$ sudo apt-get install libbpf-tools

# I'm using the libbpf-tools versions, e.g. /usr/sbin/biosnoop
$ apt-file search /usr/sbin/biosnoop
bpfcc-tools: /usr/sbin/biosnoop-bpfcc
bpftrace: /usr/sbin/biosnoop.bt
libbpf-tools: /usr/sbin/biosnoop

Methodology

Wait for the disk cache to fill up and everything to settle down, then run the tests twice, once during prompt processing and again for token generation. Confirm that CPU usage is around 50% with btop or similar. Then take each measurement and log it.

Repeat as necessary to collect all of the data.
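If you don't want a full btop, mpstat from the same sysstat package is enough to sanity check that CPU utilization number:

# average CPU utilization over one 5 second interval
$ mpstat 5 1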

Baseline

Get baseline disk i/o and page cache stats using a more common tool like sar.

Prompt Processing
# sample the disks/raid backing the model for 180 seconds 1 time
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 180 1

Linux 6.13.0-061300-generic (wTR24)     02/12/2025      _x86_64_        (48 CPU)

01:43:56 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
01:46:56 PM 1336677.96   3894.84  21891.61   2635.86 2371208.87 367155.06  22370.98 667337.86    171.32

01:43:56 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
01:46:56 PM        24      1835     24.72     17.01     11.06         0

01:43:56 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
01:46:56 PM   nvme5n1    283.17   4181.93   3894.84      0.00     28.52      2.31      8.15      2.05
01:46:56 PM   nvme1n1  19100.67 333147.36      0.00      0.00     17.44      0.72      0.04     14.20
01:46:56 PM   nvme0n1  18782.46 333234.49      0.00      0.00     17.74      0.99      0.05     14.97
01:46:56 PM   nvme2n1  19364.39 332987.67      0.00      0.00     17.20      0.74      0.04     14.07
01:46:56 PM   nvme3n1  19495.42 332798.36      0.00      0.00     17.07      0.75      0.04     13.96
01:46:56 PM       md0  76849.94 1332167.78      0.00      0.00     17.33      3.05      0.04     73.67

01:43:56 PM  %scpu-10  %scpu-60 %scpu-300     %scpu
01:46:56 PM      0.00      0.00      0.00      0.02

01:43:56 PM   %sio-10   %sio-60  %sio-300      %sio   %fio-10   %fio-60  %fio-300      %fio
01:46:56 PM      0.00      0.17      0.12      0.65      0.00      0.14      0.10      0.64

01:43:56 PM  %smem-10  %smem-60 %smem-300     %smem  %fmem-10  %fmem-60 %fmem-300     %fmem
01:46:56 PM      1.79      2.57      4.34      3.29      1.79      2.57      4.34      3.29
Token Generation
$ sar -d --dev=md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1 -B -q ALL 180 1

Linux 6.13.0-061300-generic (wTR24)     02/12/2025      _x86_64_        (48 CPU)

03:13:58 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:16:58 PM 1595231.85    188.96  24359.67   5846.29 1818449.61 412689.57  39633.53 798180.86    176.46

03:13:58 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
03:16:58 PM         1      1837     24.42     24.44     23.35        19

03:13:58 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
03:16:58 PM   nvme1n1  22941.29 398430.58      0.00      0.00     17.37      1.06      0.05     13.77
03:16:58 PM   nvme0n1  22536.70 398376.89      0.00      0.00     17.68      1.56      0.07     15.28
03:16:58 PM   nvme2n1  22890.26 398587.04      0.00      0.00     17.41      1.06      0.05     13.76
03:16:58 PM   nvme3n1  22838.09 398251.42      0.00      0.00     17.44      1.06      0.05     13.79
03:16:58 PM       md0  91373.23 1593645.27      0.00      0.00     17.44      4.50      0.05     65.69

03:13:58 PM  %scpu-10  %scpu-60 %scpu-300     %scpu
03:16:58 PM      0.00      0.00      0.00      0.05

03:13:58 PM   %sio-10   %sio-60  %sio-300      %sio   %fio-10   %fio-60  %fio-300      %fio
03:16:58 PM      0.00      0.17      0.54      0.90      0.00      0.17      0.54      0.89

03:13:58 PM  %smem-10  %smem-60 %smem-300     %smem  %fmem-10  %fmem-60 %fmem-300     %fmem
03:16:58 PM      4.69      3.63      3.45      4.79      4.69      3.63      3.45      4.79

Block i/o size

Summarize block device I/O size as a histogram.

Prompt Processing
$ sudo bitesize -c llama-bench 180 1

Process Name = llama-bench
     Kbytes              : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 6019517  |****************************************|
         8 -> 15         : 3953364  |**************************              |
        16 -> 31         : 1914791  |************                            |
        32 -> 63         : 1011460  |******                                  |
        64 -> 127        : 604708   |****                                    |
       128 -> 255        : 378287   |**                                      |
Token Generation
$ sudo bitesize -c llama-bench 180 1

Process Name = llama-bench
     Kbytes              : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 7742863  |****************************************|
         8 -> 15         : 4567127  |***********************                 |
        16 -> 31         : 1938310  |**********                              |
        32 -> 63         : 869145   |****                                    |
        64 -> 127        : 639664   |***                                     |
       128 -> 255        : 626360   |***                                     |

Block i/o latency

Summarize block device I/O latency as a histogram.

Prompt Processing
$ sudo biolatency -T -D 180 1

disk = nvme0n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 870744   |***************************             |
         8 -> 15         : 1250919  |****************************************|
        16 -> 31         : 192358   |******                                  |
        32 -> 63         : 144834   |****                                    |
        64 -> 127        : 527912   |****************                        |
       128 -> 255        : 328101   |**********                              |
       256 -> 511        : 77128    |**                                      |
       512 -> 1023       : 4002     |                                        |
      1024 -> 2047       : 14       |                                        |
      2048 -> 4095       : 5        |                                        |

disk = nvme1n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 909798   |***************************             |
         8 -> 15         : 1340334  |****************************************|
        16 -> 31         : 212264   |******                                  |
        32 -> 63         : 337110   |**********                              |
        64 -> 127        : 475752   |**************                          |
       128 -> 255        : 146569   |****                                    |
       256 -> 511        : 27103    |                                        |
       512 -> 1023       : 708      |                                        |
      1024 -> 2047       : 3        |                                        |
      2048 -> 4095       : 4        |                                        |

disk = nvme2n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 923536   |***************************             |
         8 -> 15         : 1366403  |****************************************|
        16 -> 31         : 214847   |******                                  |
        32 -> 63         : 349386   |**********                              |
        64 -> 127        : 481452   |**************                          |
       128 -> 255        : 147123   |****                                    |
       256 -> 511        : 27168    |                                        |
       512 -> 1023       : 591      |                                        |
      1024 -> 2047       : 9        |                                        |
      2048 -> 4095       : 1        |                                        |

disk = nvme3n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 920670   |***************************             |
         8 -> 15         : 1361679  |****************************************|
        16 -> 31         : 217317   |******                                  |
        32 -> 63         : 356585   |**********                              |
        64 -> 127        : 488626   |**************                          |
       128 -> 255        : 149823   |****                                    |
       256 -> 511        : 27308    |                                        |
       512 -> 1023       : 694      |                                        |
      1024 -> 2047       : 5        |                                        |
      2048 -> 4095       : 1        |                                        |
Token Generation
$ sudo biolatency -T -D 180 1

disk = nvme0n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1770055  |****************************************|
         8 -> 15         : 588660   |*************                           |
        16 -> 31         : 107722   |**                                      |
        32 -> 63         : 173490   |***                                     |
        64 -> 127        : 550676   |************                            |
       128 -> 255        : 488273   |***********                             |
       256 -> 511        : 156297   |***                                     |
       512 -> 1023       : 11933    |                                        |
      1024 -> 2047       : 161      |                                        |
      2048 -> 4095       : 39       |                                        |
      4096 -> 8191       : 49       |                                        |

disk = nvme1n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1918463  |****************************************|
         8 -> 15         : 668260   |*************                           |
        16 -> 31         : 121909   |**                                      |
        32 -> 63         : 358798   |*******                                 |
        64 -> 127        : 494699   |**********                              |
       128 -> 255        : 246599   |*****                                   |
       256 -> 511        : 55995    |*                                       |
       512 -> 1023       : 1480     |                                        |
      1024 -> 2047       : 160      |                                        |
      2048 -> 4095       : 60       |                                        |
      4096 -> 8191       : 19       |                                        |

disk = nvme2n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1920661  |****************************************|
         8 -> 15         : 669311   |*************                           |
        16 -> 31         : 122003   |**                                      |
        32 -> 63         : 356457   |*******                                 |
        64 -> 127        : 492837   |**********                              |
       128 -> 255        : 246685   |*****                                   |
       256 -> 511        : 54411    |*                                       |
       512 -> 1023       : 1674     |                                        |
      1024 -> 2047       : 175      |                                        |
      2048 -> 4095       : 47       |                                        |
      4096 -> 8191       : 47       |                                        |

disk = nvme3n1
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1910195  |****************************************|
         8 -> 15         : 673350   |**************                          |
        16 -> 31         : 122117   |**                                      |
        32 -> 63         : 355674   |*******                                 |
        64 -> 127        : 496290   |**********                              |
       128 -> 255        : 247926   |*****                                   |
       256 -> 511        : 55412    |*                                       |
       512 -> 1023       : 1825     |                                        |
      1024 -> 2047       : 115      |                                        |
      2048 -> 4095       : 58       |                                        |
      4096 -> 8191       : 24       |                                        |

Block i/o Top

Summarize block device I/O by process.

Prompt Processing
$ sudo biotop --pid=$(pidof llama-bench) 180 1

PID     COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
463986  llama-bench      R 259 6   nvme0n1  3323356 61527016   0.05
463986  llama-bench      R 259 4   nvme1n1  3390545 61579280   0.03
463986  llama-bench      R 259 7   nvme2n1  3446108 61477652   0.03
463986  llama-bench      R 259 10  nvme3n1  3464197 61390536   0.03
Token Generation
$ sudo biotop --pid=$(pidof llama-bench) 180 1

PID     COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
483569  llama-bench      R 259 6   nvme0n1  3828846 64984816   0.06
483569  llama-bench      R 259 4   nvme1n1  3849134 64995000   0.04
483569  llama-bench      R 259 7   nvme2n1  3846462 64923580   0.04
483569  llama-bench      R 259 10  nvme3n1  3846248 64986828   0.04

Block i/o Pattern

Show block device I/O pattern.

Prompt Processing
$ sudo biopattern -T 180 1

TIME      DISK     %RND  %SEQ    COUNT     KBYTES
14:01:53  nvme0n1    35    64  3345173   62683820
14:01:53  nvme1n1    32    67  3404623   62661548
14:01:53  nvme2n1    33    66  3470880   62643204
14:01:53  nvme3n1    33    66  3496122   62571880
Token Generation
$ sudo biopattern -T 180 1

Tracing block device I/O requested seeks... Hit Ctrl-C to end.
TIME      DISK     %RND  %SEQ    COUNT     KBYTES
15:53:13  nvme0n1    58    41  4297223   55141108
15:53:13  nvme1n1    54    45  4393544   55156308
15:53:13  nvme2n1    54    45  4382473   55182708
15:53:13  nvme3n1    54    45  4385386   55126844

Block i/o Snoop

Trace block I/O.

Prompt Processing
# do this a few times to get a feel for exact size reads and latencies
$ sudo biosnoop | head
TIME(s)     COMM           PID     DISK    T    SECTOR     BYTES   LAT(ms)
0.000000    llama-b        472645  nvme1n1 RA   171795752  4096      0.056
0.000751    llama-b        472645  nvme2n1 RA   171796208  65536     0.044
0.000756    llama-b        472645  nvme2n1 RA   171796376  20480     0.030
0.000769    llama-b        472645  nvme2n1 RA   171796440  20480     0.010
0.001025    llama-b        472645  nvme3n1 RA   171795456  16384     0.268
0.001027    llama-b        472645  nvme3n1 RA   171795584  8192      0.253
0.001028    llama-b        472645  nvme3n1 RA   171795664  36864     0.220
0.001341    llama-b        472645  nvme0n1 RA   171796488  8192      0.131
0.002048    llama-b        472645  nvme1n1 RA   171796520  8192      0.122
Token Generation
$ sudo biosnoop 1 | head
TIME(s)     COMM           PID     DISK    T    SECTOR     BYTES   LAT(ms)
0.000000    llama-b        483569  nvme1n1 RA   262911728  32768     0.064
0.000091    llama-b        483569  nvme1n1 RA   262911912  4096      0.042
0.000141    llama-b        483569  nvme1n1 RA   262911792  61440     0.088
0.000166    llama-b        483569  nvme1n1 RA   262911920  16384     0.079
0.000216    llama-b        483569  nvme2n1 RA   262911000  4096      0.047
0.000230    llama-b        483569  nvme1n1 RA   262911952  24576     0.056
0.000250    llama-b        483569  nvme2n1 RA   262911008  12288     0.062
0.000259    llama-b        483569  nvme2n1 RA   262910976  12288     0.088
0.000272    llama-b        483569  nvme2n1 RA   262911032  4096      0.071

CPU Cache Stats

Summarize cache references and misses by PID.

Prompt Processing
# sort by cpu (the processes likely bounce between cores/threads)
$ sudo llcstat 180 | grep -P '(llama-bench|Total|PID)' | sort -V -k 3,3

PID      NAME             CPU     REFERENCE         MISS    HIT%
472645   llama-bench      0       260084200    105109100  59.59%
472645   llama-bench      1       271595900    107856300  60.29%
472645   llama-bench      2       122634100     48118400  60.76%
472645   llama-bench      3       103657900     38727000  62.64%
472645   llama-bench      4       152044500     59005700  61.19%
472645   llama-bench      5       155044000     58195700  62.47%
472645   llama-bench      6       292201400    117024400  59.95%
472645   llama-bench      7       174555900     67375600  61.40%
472645   llama-bench      8       348132500    140949600  59.51%
472645   llama-bench      9       261152300    102766400  60.65%
472645   llama-bench      10      408874400    163455700  60.02%
472645   llama-bench      11      451004300    179642900  60.17%
472645   llama-bench      12      387897500    158195400  59.22%
472645   llama-bench      13      502233700    207044400  58.78%
472645   llama-bench      14      186695700     70750300  62.10%
472645   llama-bench      15      406424500    160824600  60.43%
472645   llama-bench      16      346197800    134519400  61.14%
472645   llama-bench      17      204177200     85408700  58.17%
472645   llama-bench      18      165621000     66280800  59.98%
472645   llama-bench      19      326486500    135545100  58.48%
472645   llama-bench      20      368738300    139175500  62.26%
472645   llama-bench      21      261058700    102228400  60.84%
472645   llama-bench      22      424314500    166226400  60.82%
472645   llama-bench      23      252195500    102542900  59.34%
472645   llama-bench      24     1189916100    498972700  58.07%
472645   llama-bench      25     1160928000    478048200  58.82%
472645   llama-bench      26     1314916900    535932500  59.24%
472645   llama-bench      27     1342106800    549897700  59.03%
472645   llama-bench      28     1278900000    521841200  59.20%
472645   llama-bench      29     1290325600    527823400  59.09%
472645   llama-bench      30     1060096500    428899700  59.54%
472645   llama-bench      31     1280229000    524769100  59.01%
472645   llama-bench      32     1052018800    428794900  59.24%
472645   llama-bench      33      976827500    394650300  59.60%
472645   llama-bench      34     1025104000    419006300  59.13%
472645   llama-bench      35      815274000    336547500  58.72%
472645   llama-bench      36     1054887300    428246800  59.40%
472645   llama-bench      37      917551200    373496700  59.29%
472645   llama-bench      38     1247426700    513614100  58.83%
472645   llama-bench      39     1023180200    424158000  58.55%
472645   llama-bench      40     1087901900    448935100  58.73%
472645   llama-bench      41     1222746800    499064500  59.18%
472645   llama-bench      42     1290558800    526509700  59.20%
472645   llama-bench      43     1064135400    431187300  59.48%
472645   llama-bench      44      809287700    340242200  57.96%
472645   llama-bench      45     1191512400    491595900  58.74%
472645   llama-bench      46     1006870000    416382800  58.65%
472645   llama-bench      47     1191455700    490035800  58.87%
Total References: 35036149900 Total Misses: 14365309500 Hit Rate: 59.00%
Token Generation
$ sudo llcstat 180 | grep -P '(llama-bench|Total|PID)' | sort -V -k 3,3

PID      NAME             CPU     REFERENCE         MISS    HIT%
493469   llama-bench      0       178739900     92973100  47.98%
493469   llama-bench      1       143521000     71367200  50.27%
493469   llama-bench      2       102070400     49472700  51.53%
493469   llama-bench      3        93535400     45344500  51.52%
493469   llama-bench      4       122853300     63089400  48.65%
493469   llama-bench      5       100126500     50860700  49.20%
493469   llama-bench      6       135180700     68212700  49.54%
493469   llama-bench      7       221127800    114417400  48.26%
493469   llama-bench      8       353029000    192503900  45.47%
493469   llama-bench      9       201506200    107204200  46.80%
493469   llama-bench      10      213137400    111369200  47.75%
493469   llama-bench      11      211342600    107674700  49.05%
493469   llama-bench      12      171033500     92188700  46.10%
493469   llama-bench      13      165303900     87134700  47.29%
493469   llama-bench      14      265153200    138694800  47.69%
493469   llama-bench      15      172611600     84571800  51.00%
493469   llama-bench      16      255942900    131531100  48.61%
493469   llama-bench      17      151376500     75140300  50.36%
493469   llama-bench      18      248108600    130249100  47.50%
493469   llama-bench      19      268105300    140291000  47.67%
493469   llama-bench      20      155939500     81268600  47.88%
493469   llama-bench      21      125014800     66195500  47.05%
493469   llama-bench      22      196849600    103162900  47.59%
493469   llama-bench      23      217617100    104683100  51.90%
493469   llama-bench      24     1010395500    536548800  46.90%
493469   llama-bench      25     1037633200    546470200  47.33%
493469   llama-bench      26      945782800    485664500  48.65%
493469   llama-bench      27     1060634700    554281500  47.74%
493469   llama-bench      28      992441000    524587200  47.14%
493469   llama-bench      29     1041995900    550117700  47.21%
493469   llama-bench      30     1011749000    538214700  46.80%
493469   llama-bench      31      962790100    517939200  46.20%
493469   llama-bench      32      839163500    438704700  47.72%
493469   llama-bench      33      960316800    507777500  47.12%
493469   llama-bench      34      957871700    508651900  46.90%
493469   llama-bench      35      927833600    492939700  46.87%
493469   llama-bench      36      871497400    453762300  47.93%
493469   llama-bench      37      910214300    480818400  47.18%
493469   llama-bench      38      842718900    442357800  47.51%
493469   llama-bench      39     1007880700    536754700  46.74%
493469   llama-bench      40      899776500    473773500  47.35%
493469   llama-bench      41      909038300    477876200  47.43%
493469   llama-bench      42      920419100    485129300  47.29%
493469   llama-bench      43      885914700    468553600  47.11%
493469   llama-bench      44      960370900    510075500  46.89%
493469   llama-bench      45      907635400    484657100  46.60%
493469   llama-bench      46      850517900    453385200  46.69%
493469   llama-bench      47      909464200    491268200  45.98%
Total References: 29114831100 Total Misses: 15202374500 Hit Rate: 47.78%

Cache Stat

Count cache kernel function calls.

Prompt Processing
$ sudo cachestat -T 180 1
TIME         HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
14:25:46        0        0     4848    0.00%           14     249539
Token Generation
$ sudo cachestat -T 180 1

TIME         HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
15:58:00        0        0     4865    0.00%           13     250911

Scheduler Run Queue Latency

Summarize run queue (scheduler) latency as a histogram.

Prompt Processing
$ sudo runqlat -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 82496    |****************************************|
         2 -> 3          : 49371    |***********************                 |
         4 -> 7          : 14323    |******                                  |
         8 -> 15         : 3843     |*                                       |
        16 -> 31         : 978      |                                        |
        32 -> 63         : 189      |                                        |
        64 -> 127        : 208      |                                        |
       128 -> 255        : 39       |                                        |
       256 -> 511        : 30       |                                        |
       512 -> 1023       : 43       |                                        |
      1024 -> 2047       : 41       |                                        |
      2048 -> 4095       : 45       |                                        |

Token Generation
$ sudo runqlat -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 90843    |****************************************|
         2 -> 3          : 79882    |***********************************     |
         4 -> 7          : 39149    |*****************                       |
         8 -> 15         : 5292     |**                                      |
        16 -> 31         : 1515     |                                        |
        32 -> 63         : 282      |                                        |
        64 -> 127        : 183      |                                        |
       128 -> 255        : 16       |                                        |
       256 -> 511        : 17       |                                        |
       512 -> 1023       : 42       |                                        |
      1024 -> 2047       : 55       |                                        |
      2048 -> 4095       : 34       |                                        |

On-CPU Time

Summarize on-CPU time per task as a histogram.

Prompt Processing
$ sudo cpudist -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 5        |                                        |
         8 -> 15         : 17       |                                        |
        16 -> 31         : 49       |                                        |
        32 -> 63         : 58       |                                        |
        64 -> 127        : 56       |                                        |
       128 -> 255        : 100      |                                        |
       256 -> 511        : 147      |*                                       |
       512 -> 1023       : 598      |*****                                   |
      1024 -> 2047       : 291      |**                                      |
      2048 -> 4095       : 619      |******                                  |
      4096 -> 8191       : 1425     |**************                          |
      8192 -> 16383      : 1931     |*******************                     |
     16384 -> 32767      : 2872     |****************************            |
     32768 -> 65535      : 4011     |****************************************|
     65536 -> 131071     : 3888     |**************************************  |
    131072 -> 262143     : 2650     |**************************              |
    262144 -> 524287     : 893      |********                                |
    524288 -> 1048575    : 76       |                                        |
Token Generation
$ sudo cpudist -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 1        |                                        |
         4 -> 7          : 8        |                                        |
         8 -> 15         : 11       |                                        |
        16 -> 31         : 27       |                                        |
        32 -> 63         : 69       |                                        |
        64 -> 127        : 84       |                                        |
       128 -> 255        : 132      |*                                       |
       256 -> 511        : 282      |**                                      |
       512 -> 1023       : 846      |********                                |
      1024 -> 2047       : 535      |*****                                   |
      2048 -> 4095       : 1060     |**********                              |
      4096 -> 8191       : 2314     |***********************                 |
      8192 -> 16383      : 2792     |****************************            |
     16384 -> 32767      : 3419     |***********************************     |
     32768 -> 65535      : 3878     |****************************************|
     65536 -> 131071     : 2915     |******************************          |
    131072 -> 262143     : 1850     |*******************                     |
    262144 -> 524287     : 519      |*****                                   |
    524288 -> 1048575    : 44       |                                        |

Off-CPU Time

Summarize off-CPU time per task as a histogram.

Prompt Processing
$ sudo cpudist --offcpu -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 1        |                                        |
         2 -> 3          : 10007    |************                            |
         4 -> 7          : 30993    |****************************************|
         8 -> 15         : 17615    |**********************                  |
        16 -> 31         : 21114    |***************************             |
        32 -> 63         : 13342    |*****************                       |
        64 -> 127        : 13039    |****************                        |
       128 -> 255        : 11525    |**************                          |
       256 -> 511        : 8151     |**********                              |
       512 -> 1023       : 5882     |*******                                 |
      1024 -> 2047       : 5374     |******                                  |
      2048 -> 4095       : 3981     |*****                                   |
      4096 -> 8191       : 1264     |*                                       |
      8192 -> 16383      : 104      |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 6        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 23       |                                        |
Token Generation
$ sudo cpudist --offcpu -p $(pidof llama-bench) 180 1

     usecs               : count    distribution
         0 -> 1          : 5        |                                        |
         2 -> 3          : 5250     |*********                               |
         4 -> 7          : 17459    |******************************          |
         8 -> 15         : 14778    |*************************               |
        16 -> 31         : 14864    |*************************               |
        32 -> 63         : 20889    |************************************    |
        64 -> 127        : 18418    |*******************************         |
       128 -> 255        : 17902    |******************************          |
       256 -> 511        : 15147    |**************************              |
       512 -> 1023       : 21717    |*************************************   |
      1024 -> 2047       : 23123    |****************************************|
      2048 -> 4095       : 21891    |*************************************   |
      4096 -> 8191       : 8738     |***************                         |
      8192 -> 16383      : 1500     |**                                      |
     16384 -> 32767      : 303      |                                        |
     32768 -> 65535      : 90       |                                        |
     65536 -> 131071     : 71       |                                        |
    131072 -> 262143     : 5        |                                        |

Off-CPU Time Stack Traces

Summarize off-CPU time by stack trace.

Prompt Processing
# run a few times to see what is happening
$ sudo offcputime -p $(pidof llama-bench) 1

    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_trace_run4
    __bpf_trace_sched_switch
    __schedule
    schedule
    io_schedule
    folio_wait_bit_common
    filemap_fault
    __do_fault
    do_read_fault
    do_fault
    handle_pte_fault
    __handle_mm_fault
    handle_mm_fault
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    ggml_vec_dot_q4_K_q8_K
    ggml_compute_forward_mul_mat_id
    -                llama-bench (473106)
        4645

Token Generation
$ sudo offcputime -p $(pidof llama-bench) 1

    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_prog_721f6802fa0b7ba5_sched_switch
    bpf_trace_run4
    __bpf_trace_sched_switch
    __traceiter_sched_switch
    __schedule
    schedule
    io_schedule
    folio_wait_bit_common
    filemap_fault
    __do_fault
    do_read_fault
    do_fault
    handle_pte_fault
    __handle_mm_fault
    handle_mm_fault
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    ggml_vec_dot_q4_K_q8_K
    ggml_compute_forward_mul_mat_id
    -                llama-bench (473122)
        167