I know it may sound like a very silly question but… How do I copy files in Linux so that it fully utilizes NVMe speeds? We’ve got all those fancy 7 GB/s Gen 4 hyper-fast SSDs but when I’m actually doing stuff with them… they hardly exceed 3 GB/s. And it’s probably not a CPU limitation because it’s a TR PRO 5965WX workstation. In benchmarks it’s all cool and stuff but in the real world when I’m doing cp -a on files it’s actually not all that fast.
I also tried dd if=/dev/nvme2n1 of=/dev/null bs=1G status=progress and it reaches 3.2 GB/s at best on a Samsung 990 PRO 2TB and 3.1 GB/s at best on a FireCuda 530. I thought maybe it’s some PCIe downgrade issue but it doesn’t seem so. lstopo -v lists an 8 GB/s PCIe link speed to each drive so it looks legit.
As far as I understand, modern SSDs can only reach their advertised speeds when they’re used in a multithreaded fashion but well… cp -a certainly doesn’t do that.
So how do I copy files in Linux in a way that would let me copy files at 7 GB/s speeds? (I’m asking for terminal options, there’s no GUI on this machine)
This isn’t an answer for you, sorry, but your question is an interesting one! Just wondering: what size files are you trying to copy, individually or in a folder or archive? That should make a difference, right?
A chain is as strong as its weakest link. Likewise, a connection is as fast as its slowest participant. That includes internal connections; Linux uses sockets abundantly. This tool may be useful for finding the bottleneck: strace (<- man-page link!)
Regardless of CPU or SSD speeds, your system will be limited by the bus on your mainboard.
One of the purposes of cache RAM on drives is to temporarily store data until it can be fed onto the bus.
Imagine a funnel: you can fill its bowl fast, but the outflow rate will always be the same.
Careful research into the system’s capabilities is a must when building a performance-oriented system.
Most sites where you can order a board have package deals where you can pre-configure the board with your choice of processor, memory, etc., and have it installed with burn-in time.
A bottleneck in your current system is a possibility. Do you have a Windows install on that machine to test and compare results? People often quote super high speeds for fancy new drives, but those are often benchmark numbers, and Windows benchmarks at that. The easiest way would be to make a separate Windows install on the machine and run similar benchmarks with the same or close-to-similar settings.
Tried this myself, and immediately hit a single-thread CPU bottleneck around 2.2 GB/s on a Gen 4 drive.
Doing a btrfs scrub sees rates in excess of 5 GB/s, right in line with the rated read speed, so it’s clear to me that dd is not multithreaded and hits a bottleneck on a single thread.
Mass file copy also goes pretty fast for large files, close to full speed, but performance dips when copying many small files.
kdiskmark can also show high numbers if that’s what you’re after.
If you want to actually copy data between two drives, well… If they use the same filesystem, just copy the files via rsync or a file manager or something?
Yeah, I had 3 GB/s with a single dd, then I started another dd with a 1 TB offset on the drive and it was also running at 3 GB/s, so in total it was reaching 6 GB/s. Not really what I wanted to achieve though, so it kinda sucks xD
I’m usually copying VM images which are hundreds of GB in a single file (300 GB+), hence it’s kinda annoying that I’m only getting 3 GB/s while my disks can achieve much more.
I mean, I probably didn’t describe it clearly enough: when I use e.g. parallel to run multiple cp processes over a list of files, it’s considerably faster, but that doesn’t work for a single giant file (like VM images). So I’m actually asking for a tool that copies a SINGLE file using multiple threads. I know it may at first sound “impossible” but it really isn’t: with big files you can just read from multiple cursors (e.g. for a 100 GB file, spawn 4 processes, one reading from 0, another from 25 GB, another from 50 GB and so on), writing into a file that has been preallocated with truncate to the right size beforehand. It would introduce some fragmentation, but sometimes you don’t really care, especially if it’s only 4 contiguous chunks. In fact they most likely wouldn’t be contiguous in the first place anyway. Like I said, I can achieve those speeds when using multiple dd’s manually on a single big file, but well… I don’t wanna calculate the offsets every time for each file. I could most likely hack some dd script together in bash, but come on, someone surely made it properly in C, right?.. right?..
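For what it’s worth, the multi-cursor idea can be sketched in plain shell with dd. This is just a hacky sketch, not a proper tool; the file names, the 4 MiB block size, the worker count and the demo-input line are all my own placeholder assumptions:

```shell
# Sketch of a multi-cursor single-file copy: N dd workers, each reading and
# writing its own slice of the file at matching offsets.
SRC=bigfile.img
DST=bigfile.copy
head -c $((32 * 1024 * 1024)) /dev/urandom > "$SRC"   # demo input; drop for real use

BS=$((4 * 1024 * 1024))                      # 4 MiB per block
JOBS=4
SIZE=$(stat -c %s "$SRC")
BLOCKS=$(( (SIZE + BS - 1) / BS ))           # total blocks, rounded up
PER_JOB=$(( (BLOCKS + JOBS - 1) / JOBS ))    # blocks per worker

truncate -s "$SIZE" "$DST"                   # preallocate so workers can seek into place
i=0
while [ "$i" -lt "$JOBS" ]; do
    dd if="$SRC" of="$DST" bs="$BS" \
       skip=$((i * PER_JOB)) seek=$((i * PER_JOB)) count="$PER_JOB" \
       conv=notrunc status=none &
    i=$((i + 1))
done
wait
```

Each worker seeks both its input and output cursor to the same block offset, and conv=notrunc keeps the workers from truncating each other’s slices.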
Check out the reviews for your two consumer-model M.2 drives on Tom’s Hardware.
I specifically refer to these reviews because they routinely test transfer rates (found on the second page of the review) and sustained write performance. Either can become a bottleneck in your testing.
In the Samsung 990 review you can see that copy transfer rates are below 2 GB/s for all contenders, with the Samsung 990 on top.
I think the performance you’re experiencing is quite normal and to be expected.
All those drives should deliver around 6.5 GB/s for the first minute. If you multiply 6 GB by 60 seconds you’ll come to the conclusion that the whole 300 GB image should be served within the initial burst speed. And even if not, the FireCuda 530 is the sustained-write champ, offering 4 GB/s for over 5 minutes, which would be something like a 2 TB continuous write.
Yet I can’t even copy a single 300 GB file faster than 3 GB/s, and I’m not talking about copying to the same drive but between drives (in fact, even to /dev/null).
I’ve performed a few more tests and found out that with 3 dd processes I’m saturating the SSD at around 6.8 GB/s:
Anything below that bottlenecks on the CPU, just as @alkafrazin noticed. More dd processes don’t increase bandwidth, while 2 dd processes already fall short of 6.8 GB/s. So it’d be great to have a cp that could use 3 or 4 threads. This behavior is still there after issuing drop_caches or after a fresh boot, so it’s not coming from the RAM cache.
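For anyone wanting to reproduce that, the 3-reader setup looks roughly like this; SRC and the sizes are placeholders (point SRC at the real file, or read-only at a device like /dev/nvme2n1, and drop the demo-input line):

```shell
# Parallel read-saturation sketch: 3 dd readers at different offsets of the
# same source, each dumping to /dev/null.
SRC=bigfile.img
head -c $((12 * 1024 * 1024)) /dev/zero > "$SRC"   # demo input; drop for real use

SIZE_MB=$(( $(stat -c %s "$SRC") / (1024 * 1024) ))
READERS=3
CHUNK=$(( SIZE_MB / READERS ))                     # MiB per reader
i=0
while [ "$i" -lt "$READERS" ]; do
    dd if="$SRC" of=/dev/null bs=1M skip=$((i * CHUNK)) count="$CHUNK" status=none &
    i=$((i + 1))
done
wait
```

With status=progress instead of status=none on each dd you can watch the per-reader rates and sum them.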
That said, another thing I noticed is that after a few seconds kswapd0 kicks in and performance drops again to around 4 GB/s. I believe it may be related to the fact that I’m writing to /dev/null, which may not be the best test ever; maybe there are some performance limitations in the implementation of /dev/null, so I wouldn’t really be concerned about that.
I’m 99% sure it’s related to caching, since free -g shows a growing amount of buff/cache until it reaches 119 GB (the machine has 256 GB RAM, 128 GB of hugepages for VMs):
But the point is: cp should be multithreaded, with at least 3 threads in my case (or in fact 6, because I was testing on a single drive and I use 2 drives in btrfs RAID1, which round-robins PIDs between drives, so with 6 threads it could assign 3 to each drive).
Quick sanity check: are you copying data locally, i.e. from within the same physical device? All write benchmarks assume that the data source is external to the measured device! Almost no one outright measures the mixed scenario.
If yes, then the observed performance matches the real-world maximum. You are using some of that maximum bandwidth just to get the data in the first place. On my own machine the split is almost always 1:1, with 3 GB/s being the effective sequential write maximum as a result.
Other than that, have you tried the good old tar pipe trick?
If sequential reads/writes are the most efficient, then let’s move one big sequential tar stream around and extract it afterwards.
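The tar pipe in its classic form, sketched with throwaway demo directories (substitute your real source and destination paths):

```shell
# Classic tar pipe: stream the source tree as one sequential archive and
# unpack it on the fly at the destination. src/dst here are temp demo dirs.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello" > "$src/a.txt"                  # demo payload

tar -cf - -C "$src" . | tar -xf - -C "$dst"
```

The reader and the writer run as separate processes connected by a pipe, so reads and writes overlap instead of alternating like cp does.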
Regarding perf drop:
What fs is the target? If it’s something complex like ZFS or btrfs, that must be taken into account.
If write performance regularly drops after a specific time of continuous writing, pSLC cache exhaustion is the likely culprit.
pSLC cache size depends on the drive’s size (upper limit)
and also on the currently available free space (real limit)
and finally on firmware policies (unknown); refer to the TechPowerUp SSD database for details
A fio benchmark is a much better test than just dd to anywhere.
Are your SSDs cooled properly? The controllers might be throttling; worth checking just in case.
Also this good old reference article might be of interest. Some system tuning might be fun
As in the previous post with the test chart: those Samsung and Seagate drives look fast only because benchmarks measure the RAM cache built into the NVMe drive. Behind the cache, the actual storage chips aren’t much different from a standard USB stick. That’s why it drops after a minute at the top. If you want full sustained speed, go with the Intel D7 or Kioxia series, as those will give full throughput at top speed. The price difference isn’t that high, and you’ll wear out those Samsungs about 1000x sooner than the enterprise-grade ones will die. But for gaming and photo editing… a consumer drive is made for that, not for high load.
I have performed a fio benchmark and it easily runs at a 7 GB/s rate with this command: fio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --readonly --name=seqread
That said, this thread asks how to copy FILES at 7 GB/s, not how to run synthetic, optimized benchmarks that sidestep the single-core load on my CPU xD
When I use fio in a way that works like Linux cp, so: fio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=1 --time_based --readonly --name=seqread
then I’m again capped at 3 GB/s or even below that. So as much as it may be nice to shrug off the performance issues and blame me not buying datacenter-grade NVMe storage for thousands of dollars, it’s just not the case here. It’s a software problem.
I tried to perform actual cp benchmarks and it’s even worse, much worse: I’m getting 1.3 GB/s with cp with a tmpfs target directory. It’s more or less the same as the 1.4 GB/s from dd if=./vmdisk.img of=/dev/null. That said, AGAIN, when I run the dd command 4 times with various file offsets I’m getting 4x 1.3 GB/s. There are multiple reasons why I’m only getting 1.3 GB/s: it’s btrfs on LUKS with lzo compression, and a somewhat fragmented CoW image. You know, a REAL WORLD case where performance is significantly worse than a synthetic fio benchmark.
And just to point out how utterly pointless fio is in the context of my question: I ran fio --filename=./vmdisk.img on the actual filesystem (btrfs RAID1 with lzo compression on LUKS) and it spawned a billion threads, read from BOTH SSDs, gave me a result of 13 GB/s and actually made proper use of my CPU.
Wow, lmfao, I literally never came even remotely close to a figure like this in any real-world file copying scenario…
The performance is there. cp just sucks. That’s why I’m looking for alternatives. I knew the performance was there because I used to md5sum hundreds of files on this RAID using parallel on 64 threads and it was slaying it. But I want this performance while copying a single big file as well, since that’s 90% of my use cases.
That said, thanks for the fio suggestion; this exercise just showed me how sh*t cp actually is at doing its only job.
That’s an extremely out-of-context stance given that the cp command achieves 1.3 GB/s while multithreaded fio reaches more than 13 GB/s reading the same (whole) file… I can assure you that if I were to get that Kioxia I would still get 1.3 GB/s with GNU cp on a single file. Even though these are consumer drives, they’re far from being your average dollar-store laptop drives.
My problem is NOT that I’m getting “only” 13 GB/s or “only” 4 GB/s per drive, but that I’m getting utterly sh*t performance that is 10x worse than a synthetic benchmark with an optimized program ON THE SAME FILE.
So in case it wasn’t clear from the beginning, the essence of this thread is the question:
“What program should I use to copy files if I want to get 14 GB/s sequential read on a single file?” And no, I’m not copying to the same drive.
Sounds similar to my performance, +/- some losses due to encryption, fragmentation etc. The main issue with cp is still that it’s single-threaded, and that has LOTS of implications, starting with a btrfs-related one: btrfs RAID1 balances reads between disks based on the PID of the process, which means a multithreaded copy app will read from both disks simultaneously (like fio), while a single-threaded app like cp will only read from one disk. That’s just one aspect, but in general fio properly hammers my CPU and achieves good read speeds while cp barely puts any load on anything and performance is quite sh*t.
I’m using dd specifically because it’s easy to script dd so that it works in a multi-threaded fashion: you can easily hack a script that reads a file at 4 offsets using 4 processes and writes the data in parallel, and it DOES scale. But you know, using hacky bash scripts to copy files in 4 chunks doesn’t really sound like a proper solution to the problem XD
I’ll be experimenting with some of the multithreaded cp forks that are out there on GitHub etc., but I hoped there was something more… “official” and widely adopted for this task.
Yes, I did that and it works fine, even with cp. But it only works if you have many files (it will then spawn multiple cp processes for different files and copy them simultaneously, which works great). I’m more interested in a solution that implements a similar concept for a single file, with multiple read and write heads on multiple threads.
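For the many-files case, the same effect as parallel + cp can be had with plain xargs -P; a small sketch with throwaway demo directories (substitute your real paths, and note this still leaves each individual file single-threaded):

```shell
# Per-file parallelism: one cp per file, up to 4 running at a time.
src=$(mktemp -d)
dst=$(mktemp -d)
for i in 1 2 3 4; do echo "file $i" > "$src/f$i"; done   # demo files

# -0/-print0 handle odd filenames; -P4 caps the number of concurrent cp's
find "$src" -type f -print0 | xargs -0 -P4 -I{} cp {} "$dst/"
```

On btrfs RAID1 this also spreads the reader PIDs across both mirrors, which is exactly what a single cp can’t do.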