Has anybody been able to hit the advertised read speeds from an NVMe SSD when loading a slightly bigger LLM?


OP, if your LLM loader is anything like the loader in the Stable Diffusion ecosystem, disk reads are choppy, bursty, small block, and out-of-order.

Read this saga.

We’ve been trying to get disk reads deshittified for 2+ years.

Edit: See this comment, where strace was used to characterize disk I/O. You might consider a similar exercise to help determine whether there’s a throat to choke somewhere on GitHub.

Edit2: Interesting ideas in this thread. biosnoop output is easier to read vs. strace:

sudo /usr/share/bcc/tools/biosnoop -d nvme0n1

TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
6.756196    python3        51553   nvme0n1   R 1634512288 16384     0.24
6.756386    python3        51553   nvme0n1   R 1634513632 16384     0.15
6.756646    python3        51553   nvme0n1   R 2596332352 16384     0.21
6.756818    python3        51553   nvme0n1   R 1635159584 16384     0.12
6.756997    python3        51553   nvme0n1   R 1635159648 16384     0.15
6.757176    python3        51553   nvme0n1   R 2596333312 16384     0.15
6.784098    python3        51553   nvme0n1   R 2594759872 16384     0.16
6.784149    python3        51553   nvme0n1   R 2594759904 16384     0.03
6.784695    python3        51553   nvme0n1   R 1635161152 16384     0.12
6.784861    python3        51553   nvme0n1   R 1635161184 16384     0.12
6.785112    python3        51553   nvme0n1   R 1635161536 16384     0.22
6.785442    python3        51553   nvme0n1   R 3140679440 131072    0.30
6.785539    python3        51553   nvme0n1   R 3140679184 131072    0.40
6.785591    python3        51553   nvme0n1   R 3140680976 131072    0.40
6.785628    python3        51553   nvme0n1   R 3140679952 131072    0.48
6.785697    python3        51553   nvme0n1   R 3140678928 131072    0.56
6.785705    python3        51553   nvme0n1   R 3140679696 131072    0.56
6.785735    python3        51553   nvme0n1   R 3140680208 131072    0.58
6.785795    python3        51553   nvme0n1   R 3140681488 131072    0.59
6.785801    python3        51553   nvme0n1   R 3140680464 131072    0.64
6.785896    python3        51553   nvme0n1   R 3140682000 131072    0.69
6.785924    python3        51553   nvme0n1   R 3140683024 131072    0.67
6.785961    python3        51553   nvme0n1   R 3140682512 131072    0.74
6.786033    python3        51553   nvme0n1   R 3140684048 131072    0.76
6.786041    python3        51553   nvme0n1   R 3140687056 32768     0.71
6.786083    python3        51553   nvme0n1   R 3140680720 131072    0.89
6.786084    python3        51553   nvme0n1   R 3140685072 98304     0.77
6.786111    python3        51553   nvme0n1   R 3140686032 131072    0.79
6.786115    python3        51553   nvme0n1   R 3140683536 131072    0.85
6.786157    python3        51553   nvme0n1   R 3140685520 131072    0.84
6.786179    python3        51553   nvme0n1   R 3140681232 131072    0.98
6.786190    python3        51553   nvme0n1   R 3140684560 131072    0.91
6.786222    python3        51553   nvme0n1   R 3140686544 131072    0.89
6.786281    python3        51553   nvme0n1   R 3140682256 131072    1.07
6.786308    python3        51553   nvme0n1   R 3140681744 131072    1.10
6.786352    python3        51553   nvme0n1   R 3140683280 131072    1.09
6.786359    python3        51553   nvme0n1   R 3140682768 131072    1.11
6.786363    python3        51553   nvme0n1   R 3140684304 131072    1.09

Note the SECTOR column. ComfyUI is jumping around while reading this freshly defragmented .safetensors checkpoint. At its worst I’ve seen I/Os capped at 16k, so this is actually somewhat improved. Regardless, this is capping my reads at ~2GB/s on a device that benches ~3.1GB/s.
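If you'd rather not eyeball the SECTOR column, here is a rough Python sketch that tallies read sizes and counts how many reads start exactly where the previous one ended. The function name `summarize_biosnoop` is made up for illustration. It assumes the eight-column layout shown above, a COMM value without spaces, and 512-byte sector units (which matches the trace: 16384-byte reads advance SECTOR by 32). As far as I understand it, biosnoop prints one line per completed I/O, so parallel requests that finish out of order get counted as non-sequential here even when they cover one contiguous range. Treat the number as a lower bound.

```python
from collections import Counter

SECTOR_BYTES = 512  # assumption: biosnoop SECTOR values are in 512-byte units


def summarize_biosnoop(text):
    """Summarize read sizes and strict back-to-back sequentiality.

    `text` is raw biosnoop output in the 8-column layout:
    TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
    """
    sizes = Counter()
    reads = sequential = total_bytes = 0
    next_sector = None  # first sector after the previous read
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, blank lines, writes, and anything malformed.
        if len(fields) != 8 or fields[4] != "R" or not fields[5].isdigit():
            continue
        sector, nbytes = int(fields[5]), int(fields[6])
        reads += 1
        total_bytes += nbytes
        sizes[nbytes] += 1
        if sector == next_sector:
            sequential += 1  # this read starts where the last one ended
        next_sector = sector + nbytes // SECTOR_BYTES
    return {
        "reads": reads,
        "sequential": sequential,
        "total_bytes": total_bytes,
        "sizes": dict(sizes),
    }
```

Redirect the biosnoop output to a file, pass the file contents to the function, and compare `sequential` against `reads`. A loader doing clean sequential I/O should have the two nearly equal, with `sizes` dominated by one large block size.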

The safetensors read technique, once deshittified, is a thing of beauty. Perfectly back-to-back and fully sequential 1M I/Os yielding fully saturated disk subsystems. We had this until the first version of the comfyui-faster-loading plug-in quit working after the Comfy folk re-engineered the model loader.
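For a baseline to compare your loader against, something like the sketch below reads a file front to back in 1 MiB chunks and times it. This is not the comfyui-faster-loading code, just a minimal illustration of the access pattern; `sequential_read` is a name I made up. It assumes a cold page cache (drop caches first, or you are benchmarking RAM), and a 1 MiB read() at the syscall level does not guarantee 1M I/Os at the block layer, since the kernel may split or merge requests. Check with biosnoop.

```python
import time


def sequential_read(path, block_size=1 << 20):
    """Read `path` start to finish in fixed-size blocks.

    Returns (bytes_read, elapsed_seconds).
    """
    buf = bytearray(block_size)  # one reusable buffer, no per-read allocation
    view = memoryview(buf)
    total = 0
    start = time.perf_counter()
    # buffering=0 returns a raw FileIO, so each readinto() is a single
    # read() syscall of up to block_size bytes at the current offset.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view)
            if not n:  # 0 at end of file
                break
            total += n
    return total, time.perf_counter() - start
```

Divide the returned byte count by the elapsed seconds to get throughput. If this gets close to your benchmark number and the model loader does not, the gap is in the loader's read pattern and not in the drive.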

Anyway, see if your LLM model loader exhibits a similar pathology.
