You’re mixing up page size with block size / LBAF. Page size doesn’t have any relevance to the physical write operations.
I know those drives report a 512B block size. I've read claims that it's actually 4K but advertised as 512B for compatibility reasons. As far as I know, Samsung hasn't publicly clarified that.
The default "inbox" Debian rdma/rdma-core packages, as far as I can tell, don't support NFSoRDMA out of the box. Therefore, to get that, I would need to install the MLNX_OFED driver. But that's just for the NFSoRDMA functionality, and many years ago, Mellanox (pre-acquisition by Nvidia) actually removed NFSoRDMA from their MLNX_OFED driver.
As a result, I try to stay away from using the Mellanox driver as much as possible, because they can disable features at will, whereas the CentOS inbox driver at the time didn't have this feature disabled.
(I called them out for false advertising in their marketing materials. Mellanox re-enabled said NFSoRDMA feature in their MLNX_OFED driver.)
Sounds close. 10 GB/s = 80 Gbps (give or take).
What’s the syntax for this command that you’re using, if you don’t mind me asking?
Varies.
Depends on how you have things set up. (You can use something like Gluster, or maybe Ceph, and then export the distributed volume over the NFSoRDMA network. I've done that with RAM drives and Gluster v3.7. It works okay.) But if you are using distributed storage like Gluster and/or Ceph, then unless you have a second port for backend syncing, it will use your primary/only port for both sync and active data transfers, which eats into the total bandwidth available for the data transfers themselves.
(All of my ConnectX-4 100 Gbps IB NICs are MCX456A-ECAT, so they’re all dual port NICs, which will saturate the PCIe 3.0 x16 bus.)
Additionally, if I have the NVMe slots/space to put the storage inside the inference rig, then I can power down my entire IB network, and it will still work. (i.e. I don’t need my IB network to be up and running, for my inferencing system to work.)
This suggests that you’re running multiple NVMe SSDs.
That doesn’t tell me whether you’re able to hit the advertised read speeds from each of your NVMe SSDs that you have in the stripe (or how close/far you are from being able to hit said advertised speeds).
This one only shows a write bandwidth of 23.3 GB/s and a read bandwidth of 39.6 GB/s.
And from your initial commentary (about your system/setup):
Can’t tell if you’re referring to this video here or not, when you said “screenshot”.
Don’t know how well Windows Task Manager is able to accurately report ZFS stripe speeds.
I know, for example, that on 12th gen and newer Intel CPUs, it can't accurately report CPU clock speeds as a result of the mix of P- and E-cores. So, unless you're exporting the ZFS stripe of 4x Gen5 NVMe SSDs as an iSCSI target (which would also suggest that you're using faster than 100 Gbps IB for that), which you've said:
so getting 50.7 GB/s read via Windows Task Manager for a ZFS stripe seems unlikely.
On the other hand, if the VROC m2 stripe across (presumably) the same 4 Gen5 NVMe SSDs is able to get you 50.7 GB/s out of a possible 56 GB/s, then that would suggest that loading the LLM from a Windows system would be better than trying to load it from a ZFS stripe.
And at that point, your 100 Gbps IB would be your bottleneck, and therefore; you’d be better off putting the GPUs on the same system as your storage, which would be the opposite to what you said earlier:
Seems like VROC m2 stripe on the same system as your GPUs would be the way to go, based on the data that you’ve shared.
But it also sounds like you need a system which, as you've said, "has wayyyy more cores" to be able to pull it off, which also implies that this would be unlikely to be achievable with consumer-grade hardware.
Actually, no. Page size, in the context of SSDs (more specifically, NAND flash), is the smallest physical chunk that can be written.
It’s confusing as fuck.
But it is what it is.
It's different from the sector size seen by the OS, because the OS cannot put two files in a single sector (although it can store a small file inline with its metadata; these are called resident files).
However, the SSD can put 2+ files on the same page. That just leads to write amplification, since it has to discard the page and write a new one. Not erase, mind you; there's even more hidden complexity, because NAND can be written at the page level but only erased at the (NAND) block level (a block being multiple NAND pages).
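A toy model might make the page/block asymmetry concrete. The geometry below is purely illustrative (not from any datasheet; real parts vary):

```python
# Toy model of NAND write amplification; sizes are illustrative only.
PAGE = 16 * 1024     # assumed NAND page size (smallest programmable unit)
BLOCK = 128 * PAGE   # assumed erase block (smallest erasable unit)

def write_amplification(update_size: int) -> float:
    """Physical-to-logical write ratio when updating existing data.

    NAND is programmed per page but erased only per block, so changing
    even 512 B of a live page means writing a whole fresh page elsewhere
    and garbage-collecting the stale copy (a whole-BLOCK erase) later.
    """
    pages_touched = -(-update_size // PAGE)  # ceiling division
    return (pages_touched * PAGE) / update_size

print(write_amplification(512))        # 32.0: a 512 B update costs a full 16 KiB page
print(write_amplification(16 * 1024))  # 1.0: page-aligned writes carry no penalty
```

This only models the program path; the deferred block-erase cost makes real write amplification workload-dependent.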
Please don’t ask why these terms are so overloaded.
I'm a software person, if you didn't guess. A combination of misunderstanding what I'd previously read about the logical block size + ignoring the fact that the flash controller exists + confusing terms from OS virtual memory management == confidently asserting wrongness.
Anyway, thank you for the correction and the explanation - it really made me think about it. Fun exercise.
I think I’ll stick with software, though. Especially when “correcting” someone.
As far as overloading of terms goes, it seems like a poor compromise: borrowing terms from traditional virtual memory management while at the same time using terms from traditional non-volatile storage systems (block/LBA).
That’s not even quite right, but it’s the best I can do
You are mixing up two completely different setups.
My workstation is a Xeon 2495X. It doesn't have any E-cores, and it doesn't run ZFS or mount remote storage. I am using the same 4x T705s in ASUS Hyper Gen5 cards in multiple places.
The VROC setup can come very close to the theoretical max. You could achieve the same on a Linux box with high-clock-speed cores and some tuning.
For the inference rig, 10 GB/s is fine. Running local storage inside it isn't worth the hardware.
Also, these T705s thermal throttle if you don't keep them cool, so keep that in mind. You could run the same setup with slightly worse cooling and find wildly slower results.
We’ve been trying to get disk reads deshittified for 2+ years.
Edit: See this comment where strace was used to characterize disk I/O. You might consider a similar exercise to help determine if there’s a throat to choke somewhere on github.
Edit2: Interesting ideas in this thread. biosnoop output is easier to read vs. strace:
sudo /usr/share/bcc/tools/biosnoop -d nvme0n1
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
6.756196 python3 51553 nvme0n1 R 1634512288 16384 0.24
6.756386 python3 51553 nvme0n1 R 1634513632 16384 0.15
6.756646 python3 51553 nvme0n1 R 2596332352 16384 0.21
6.756818 python3 51553 nvme0n1 R 1635159584 16384 0.12
6.756997 python3 51553 nvme0n1 R 1635159648 16384 0.15
6.757176 python3 51553 nvme0n1 R 2596333312 16384 0.15
6.784098 python3 51553 nvme0n1 R 2594759872 16384 0.16
6.784149 python3 51553 nvme0n1 R 2594759904 16384 0.03
6.784695 python3 51553 nvme0n1 R 1635161152 16384 0.12
6.784861 python3 51553 nvme0n1 R 1635161184 16384 0.12
6.785112 python3 51553 nvme0n1 R 1635161536 16384 0.22
6.785442 python3 51553 nvme0n1 R 3140679440 131072 0.30
6.785539 python3 51553 nvme0n1 R 3140679184 131072 0.40
6.785591 python3 51553 nvme0n1 R 3140680976 131072 0.40
6.785628 python3 51553 nvme0n1 R 3140679952 131072 0.48
6.785697 python3 51553 nvme0n1 R 3140678928 131072 0.56
6.785705 python3 51553 nvme0n1 R 3140679696 131072 0.56
6.785735 python3 51553 nvme0n1 R 3140680208 131072 0.58
6.785795 python3 51553 nvme0n1 R 3140681488 131072 0.59
6.785801 python3 51553 nvme0n1 R 3140680464 131072 0.64
6.785896 python3 51553 nvme0n1 R 3140682000 131072 0.69
6.785924 python3 51553 nvme0n1 R 3140683024 131072 0.67
6.785961 python3 51553 nvme0n1 R 3140682512 131072 0.74
6.786033 python3 51553 nvme0n1 R 3140684048 131072 0.76
6.786041 python3 51553 nvme0n1 R 3140687056 32768 0.71
6.786083 python3 51553 nvme0n1 R 3140680720 131072 0.89
6.786084 python3 51553 nvme0n1 R 3140685072 98304 0.77
6.786111 python3 51553 nvme0n1 R 3140686032 131072 0.79
6.786115 python3 51553 nvme0n1 R 3140683536 131072 0.85
6.786157 python3 51553 nvme0n1 R 3140685520 131072 0.84
6.786179 python3 51553 nvme0n1 R 3140681232 131072 0.98
6.786190 python3 51553 nvme0n1 R 3140684560 131072 0.91
6.786222 python3 51553 nvme0n1 R 3140686544 131072 0.89
6.786281 python3 51553 nvme0n1 R 3140682256 131072 1.07
6.786308 python3 51553 nvme0n1 R 3140681744 131072 1.10
6.786352 python3 51553 nvme0n1 R 3140683280 131072 1.09
6.786359 python3 51553 nvme0n1 R 3140682768 131072 1.11
6.786363 python3 51553 nvme0n1 R 3140684304 131072 1.09
Note the SECTOR column. ComfyUI is jumping around reading this freshly-defragmented .safetensors checkpoint. At its worst I've seen I/Os capped at 16k; this is actually somewhat improved. Regardless, this is capping my reads near ~2 GB/s on a device that benches ~3.1 GB/s.
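For anyone who wants to quantify this, here's a rough sketch of parsing biosnoop output to get aggregate throughput, average I/O size, and how sequential the access pattern is. The column layout is assumed to match the paste above:

```python
def summarize(lines):
    """Summarize biosnoop read events.

    Columns assumed: TIME COMM PID DISK T SECTOR BYTES LAT(ms).
    """
    events = []
    for line in lines:
        parts = line.split()
        if len(parts) != 8 or parts[4] != "R":
            continue  # skip the header and any non-read events
        events.append((float(parts[0]), int(parts[5]), int(parts[6])))
    if len(events) < 2:
        return {}
    events.sort()
    total = sum(b for _, _, b in events)
    span = events[-1][0] - events[0][0] or 1e-9
    # Call an I/O "sequential" if it starts at the sector where the
    # previous one ended (biosnoop sectors are 512 B units).
    seq = sum(1 for (_, s0, b0), (_, s1, _) in zip(events, events[1:])
              if s1 == s0 + b0 // 512)
    return {"read_MBps": total / span / 1e6,
            "avg_io_KiB": total / len(events) / 1024,
            "sequential_frac": seq / (len(events) - 1)}
```

Feed it a captured log, e.g. `summarize(open("biosnoop.log"))` (hypothetical filename). A low `sequential_frac` with a small average I/O size would match the jumping-around pattern shown above.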
The safetensors read technique, once deshittified, is a thing of beauty. Perfectly back-to-back and fully sequential 1M I/Os yielding fully saturated disk subsystems. We had this until the first version of the comfyui-faster-loading plug-in quit working after the Comfy folk re-engineered the model loader.
Anyway, see if your LLM model loader exhibits a similar pathology.
I did code a benchmark tool for LM Studio. Maybe I'll finish it soon and make it public.
Using the Python API, the SSD loading speed tops out at ~2 GB/s. However, models vary widely in loading speed. Larger models seem to load slower.
A repeat load is 3.2 GB/s (probably cached)
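One way to separate page-cache effects from actual disk speed is a plain chunked-read timing. A minimal sketch (the model path is a placeholder):

```python
import os
import time

def read_throughput_gbps(path: str, chunk: int = 1 << 20) -> float:
    """Sequential read speed in GB/s using large 1 MiB unbuffered reads."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(chunk):
            pass
    return size / (time.perf_counter() - start) / 1e9

# A first run after dropping caches measures the disk; a repeat run
# mostly measures the page cache, which would explain the faster
# repeat-load number. On Linux (as root):
#   echo 3 > /proc/sys/vm/drop_caches
# print(read_throughput_gbps("model.safetensors"))  # hypothetical path
```

If this tops out well above what the model loader achieves on the same file, the bottleneck is in the loader, not the disk.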
You post screenshots showing results without clearly stating what the system specs were for each result.
You might know that information, but when you post two screenshots one after another without clearly identifying that we are now looking at results from an entirely different system, I can't read your mind.
I surmise that you’re running RAID0.
But again, the fact that it takes four Gen5 NVMe SSDs to achieve the speeds you're reporting would suggest that each drive is contributing 1/4 of said total speed, when really a single Gen5 NVMe SSD is supposed to be able to deliver that speed by itself. (The spec sheet for the 1 TB variant claims sequential read speeds of up to 13.6 GB/s.)
In other words, with respect to my OP, you’re not getting the rated speeds when you’re reading a LLM from your storage array.
Again, according to the spec sheet for the 1 TB variant, a single T705 Gen5 NVMe SSD should be able to achieve this 10 GB/s that you're referring to.
Varies/depends.
As far as I can tell, it doesn’t look like that.
It varies a little bit between 1.5 GB/s to 2 GB/s, but it pretty much stays there until the LLM is loaded.
I’m not really sure.
Again, it will take someone who knows how to read and understand the code for how vLLM loads models in.
I don't know anything about coding/programming, so I can't investigate how vLLM loads the LLMs myself.
I still can't say I understand why, when LLM tools are loading said LLMs, they don't appear to take full advantage of what the SSDs are capable of when it comes to sequential reads.
(Unless it’s not strictly a sequential read, but I don’t know why random reads would be a part of loading said LLM into said LLM tool(s).)
See if you can get your hands on biosnoop for your distro. If your LLM reads are like the snip I posted above then we’ll know exactly what the problem is.
I am not certain whether running biosnoop helped you get closer to your 3.1 GB/s benchmark read speeds versus what you were actually achieving (~2 GB/s).
My thinking is that even without running said biosnoop, if you were still only getting ~2 GB/s out of ~3.1 GB/s, that's still "better" (i.e. ~65% of what the drive can do, but presumably below the theoretical bandwidth of the interface, whether it's NVMe 3.0 x4, 4.0 x4, or 5.0 x4).
In other words, given that we still really aren't hitting what the drives can do, even with a single drive, at least for AI workloads it would probably make more sense for me to invest in capacity over sheer speed, since the model loaders can't fully saturate the read bandwidth of any single NVMe SSD, pretty much regardless of whether it's 3.0 or 4.0, and most definitely not 5.0.
(Besides, if you're running this at home on consumer hardware, you're going to be limited by a bunch of other factors as well, e.g. the total VRAM available for the model to be loaded into (or main system RAM if you're using CPU-only inferencing), which then determines how much of a model you're going to bother downloading. If you have an RTX A2000 6 GB, chances are you're not going to bother downloading the deepseek-r1:671b model, which is 404 GB. It would therefore be pointless to see how fast your NVMe drive can load said 404 GB model, because your actual usage of that model, especially CPU-only, is going to be too slow to be effective or efficient anyway.)
Conversely, I'm mostly using the deepseek-r1:70b model (43 GB), which means in theory, even at 3 GB/s read, it'd still take about 14 seconds to load. At 2 GB/s, that increases to almost 22 seconds (when I'm using ollama). When I'm using/testing vLLM, I have to go with smaller models like deepseek-r1:32b because of how it loads the higher-precision safetensors from HuggingFace.
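That arithmetic can be sanity-checked in a couple of lines. Note this is a pure lower bound; real loaders add deserialization and copy overhead on top:

```python
def load_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Lower bound on model load time: pure sequential read, no CPU-side work."""
    return model_gb / read_gb_per_s

print(round(load_seconds(43, 3), 1))  # 14.3 s for a 43 GB model at 3 GB/s
print(round(load_seconds(43, 2), 1))  # 21.5 s at 2 GB/s
```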
But the point of my question still stands: given the data, it is better for me to invest in capacity over sheer speed. (For most NVMe SSDs much faster than 3.0 x4, the extra speed can't be effectively or efficiently utilised for the price. Any gains in read speed/reduction in model load times would be marginal, at best.)
I'm seeing loading speeds anywhere from 1.5 to 2.3 GB/s on Windows using LM Studio. Models were 22-85 GB in size. The largest loaded the quickest; the median would be around 1.67 GB/s if ignoring the 2.3 GB/s result.
As others have already said, it varies by model. You are assuming it's a disk-bound process, but I'm fairly sure it is not. You are entirely ignoring that LLMs are also concurrently loaded into GPU VRAM for hardware acceleration. The CPU has to read the disk I/O, unpack and convert it, copy it into RAM, and process out select layers of the data to feed into VRAM.
How much of the LLM is loaded into your GPU's VRAM depends on your GPU, and especially on your settings and/or each model's defaults. If you overfill your VRAM due to configured settings, then some data is pushed back out into the GPU's system-RAM cache; this in turn will greatly slow down the loading process for your model, since you effectively just ouroborosed the data. For example, if I underfill the VRAM, the same models will load faster, but I'd rather overfill slightly for maximum acceleration and let 500 MB up to 1 GB of mostly VGA data be evicted back into system RAM, which I have plenty of to spare.
So ultimately, your loading times are going to vary significantly based on how much VRAM you're filling, and on whether you're tuning your LLM settings to load exactly the right number of layers and context window size relative to your GPU. I can brute force it and overfill the VRAM by several gigabytes, but the slowdown on model loading becomes extreme, and the resulting model performs like molasses anyway.
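The layer-budgeting tradeoff described above can be sketched roughly. All numbers here are hypothetical: real per-layer sizes aren't uniform, and KV-cache size depends on context length, batch size, and precision:

```python
def layers_on_gpu(vram_gb: float, n_layers: int, model_gb: float,
                  kv_cache_gb: float, overhead_gb: float = 1.0) -> int:
    """Roughly how many transformer layers fit in VRAM.

    Assumes uniform per-layer weight size (model_gb / n_layers); the
    KV-cache and overhead figures are placeholders, not measurements.
    """
    per_layer_gb = model_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    return max(0, min(n_layers, int(budget_gb / per_layer_gb)))

# e.g. a 43 GB, 80-layer model on a 24 GB card with 3 GB of KV cache:
print(layers_on_gpu(24, 80, 43, 3))  # 37 layers on-GPU; the rest spill to CPU/RAM
```

Anything that doesn't fit the budget gets offloaded (or, worse, silently evicted), which is exactly where the load-time and inference slowdowns come from.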
As I mentioned, regarding the question of how the specific tools actually read/load the models: that analysis will depend on someone else who has the capability and capacity to read and understand the code for how those tools load LLMs from disk.
I don’t program, so I am going to be useless for this.
On the other hand, to my point: it is quite useless, especially if you're running AI/LLM on consumer-grade hardware at home or in your homelab, to buy an NVMe 5.0 x4 SSD thinking that you'll be able to load a 22 GB model in under 2 seconds just because the sequential read speed of said drive is typically at least 12 GB/s.
Yeah, I don’t know how much “processing” the CPU is really doing with this stream.
There are a lot of individual tools where you can bench individual pieces/portions of that data flow, and you can find both data for and against this, so it really depends.
(i.e. even if you just stored data into a 7-zip file, with NO compression, decompressing the archive still has its speed limits)
Which is why I've always argued that comparing different GPUs as a function of VRAM bandwidth is somewhat misleading, because VRAM bandwidth would rarely be the limiting factor in how fast you can load an LLM into said VRAM.
I agree and I understand this.
This is why, for the time being, I am only dealing/working with models that can fit relatively comfortably within the total system VRAM limit (i.e. I run a pair of 3090s, so I have a total 48 GB of VRAM available, in that system). (There is a proposal for me to split that out and bring up two additional nodes with another pair of 3090s, making four total, and a total VRAM limit of 96 GB, but I’d have to use vLLM’s distributed inferencing over 100 Gbps IB to really be able to make “real” use of it. Haven’t quite gotten that far. Yet.)
The biggest model that I am loading via vLLM are 32b parameter models.
But if I am using ollama, then I can run 70b parameter models, comfortably, within said 48 GB VRAM limit.
But again, my OP still stands: investing in even NVMe 4.0 x4 SSDs would be overkill, because:
1) You can't actually predict what the read speeds are going to be, given the entire model read/load chain,
and 2) from the feedback from others, pushing past ~2-ish GB/s (even with your setup, your system, your settings) is quite a challenge.
And that’s with you understanding the model read/loading sequence/chain of events that needs to take place before you can use said model(s).
re: overstuffing your VRAM
There is also a chance that the slowdown you are experiencing has less to do with "overstuffing" your VRAM, and more to do with the fact that you're now using the CPU for some of the inferencing work rather than leaving it pretty much entirely up to the GPU(s).
CPU inferencing is slow, pretty much no matter what you do to/with it.
Again, based on the results that people are reporting here from their own systems and experiences, many don't seem to push much past 2 GB/s during model read/load. So unless you have a system with MCIO connectors, you might be better off just having a plethora of SATA 6 Gbps SSDs (in RAID0, since you might not care about losing a drive or two given that you can always just re-download the models), as many motherboards tend to have more SATA ports than they do NVMe sockets.
And whilst you can add NVMe sockets via AICs, eventually you'll just run out of practical PCIe lanes, whereas even with consumer-grade motherboards, you can find boards with 8 SATA ports without really trying.
And whilst yes, there are some SSDs where the NVMe version is only a few bucks more than the SATA version, I can easily fit 8 SATA SSDs in my systems, whereas trying to jam 8 NVMe SSDs into my 5950X system would be quite difficult given that I don't have enough PCIe lanes off the CPU for them.
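The back-of-envelope math on the SATA option, assuming ideal RAID0 scaling and a typical ~550 MB/s real-world ceiling per SATA 6 Gbps SSD (both assumptions, not measurements):

```python
def raid0_read_gbps(n_drives: int, per_drive_gbps: float) -> float:
    """Ideal striped read bandwidth; real arrays fall somewhat short."""
    return n_drives * per_drive_gbps

# 8 SATA SSDs at ~0.55 GB/s each:
print(raid0_read_gbps(8, 0.55))  # 4.4 GB/s, well past the ~2 GB/s loaders seem to hit
```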
Hard to tell if loading LLMs is off topic, since it seems like OP's goal, but…
I've been running ollama + Open WebUI for a long time, and was just using a 4 TB NVMe drive, but then grabbed a 4x PCIe 4.0 sled on XFS "RAID 5." It's been pretty quick to load models in both cases, but once they are loaded, there's not a lot of swapping going on during a session in my use cases. I might have a max of 3 models running (autocomplete, etc.).
On a "production" server, vLLM, once loaded, never shuts down, so the load time there is not relevant; it just serves one model continuously forever.
As an experiment I've just set up a two-tier NFS share with cachefilesd: spinning-rust NAS → x4 sled KVM host → VM. This lets me store terabytes of models on my cheap storage, then cache several larger ones on my x4 array. I use a half dozen at any given time, but I don't have to re-download the less commonly used ones when I want to play with them, and they load transparently.
All of this is much slower than the speed of one nvme until cached locally (even the k8s persistent volumes serving my workloads are caching some) but there are trade-offs at least in my use cases.
I guess my question is, do you need to load models that quickly?
Would it be better to invest in capacity over speed, given that even large LLMs still don’t use a NVMe SSD to its fullest read bandwidth potential?
That is the nature of my question here.
If said NVMe 4.0 x4 SSDs were able to actually achieve and realise their full bandwidth potential, then this wouldn't really be much of a discussion, because the answer to my question would be "No, invest in speed with a balance for capacity."
But that's not the case here, as evidenced by myself and others who have tested (due in part to the fact that I bothered to ask the question). It doesn't seem like loading an LLM can even really take full advantage of an NVMe 3.0 x4 SSD, at which point it would seem quite pointless to buy an NVMe 4.0 x4 or newer/faster SSD when you could instead spend that money on a larger-capacity NVMe 3.0 x4 drive, no?
4x4’s been same or lower cost than 3x4 here for close to three years. ¯\_(ツ)_/¯
5x4 offers a few percent over 4x4 in most I/O-oriented workloads, so it won't be price-performant for some time, outside of niches where the I/O code is optimized enough and the workload has a compute shape enabling something like 35-75% perf gains. With useful, rather than detrimental, LLM use being niche, it's hard to see why to spend money on them. So the main reasons for putting LLMs on 5x4 are probably performance-review optics and enthusiast braggadocio.
Like, if it’s good office politics to say you saved money on DDR by using 5x4 it may be money well spent even if it’s an engineering waste.
I do wonder what inference vendors like https://baseten.co, who advertise low latencies (and, I would imagine, cold-load models from disk, but maybe not), know about this sort of thing. I know they actively submit code to vllm and sglang. Optimizing loading times seems like a nice improvement to target.
I imagine that their platform deploys models and then makes endpoints available, so maybe not an issue for them either.
I have a mixture of different storage types, would be cool to do some testing in that regard.