How do I copy files in Linux on an NVMe SSD?

I know it may sound like a very silly question but… How do I copy files in Linux so that it fully utilizes NVMe speeds? We’ve got all those fancy 7 GB/s Gen 4 hyper-fast SSDs, but when I’m actually doing stuff with them they hardly exceed 3 GB/s. And it’s probably not a CPU limitation, because this is a Threadripper PRO 5965WX workstation. In benchmarks it’s all cool and stuff, but in the real world, when I’m doing cp -a on files, it’s actually not all that fast.

I also tried dd if=/dev/nvme2n1 of=/dev/null bs=1G status=progress and it reaches 3.2 GB/s at best on a Samsung 990 PRO 2TB and 3.1 GB/s at best on a FireCuda 530. I thought it was maybe some PCIe link downgrade issue, but it doesn’t seem so: lstopo -v lists an 8 GB/s PCIe link to each drive, so that looks legit.

As far as I understand, modern SSDs can only reach their advertised speeds when they’re used in a multithreaded fashion, but cp -a certainly doesn’t do that.
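For context, a single-threaded copy is just one sequential read-then-write loop over a single stream; dd makes that model explicit (temp files stand in for real data here, sizes are arbitrary):

```shell
# A single-threaded file copy boils down to one sequential
# read/write stream -- exactly what cp and dd do by default.
SRC=$(mktemp); DST=$(mktemp)
head -c 1048576 /dev/urandom > "$SRC"     # 1 MiB of sample data
dd if="$SRC" of="$DST" bs=64K status=none # one reader, one writer
```

No matter how fast the drive is, that one stream is bounded by what a single core can push through the kernel.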

So how do I copy files in Linux in a way that lets me copy at 7 GB/s? (I’m asking for terminal options; there’s no GUI on this machine.)

This isn’t an answer for you, sorry, but your question is an interesting one! Just wondering: what size files, individually or in a folder or archive, are you trying to copy? That should make a difference, right?

A chain is as strong as its weakest link. Likewise, a connection is as fast as its slowest participant, and that includes internal connections; Linux uses sockets abundantly. This tool may be useful for finding the bottleneck: strace (see its man page).

HTH!

Regardless of CPU or SSD speeds, your system will be limited by the bus on your mainboard.
One of the purposes of cache RAM on drives is to temporarily store data until it can be fed onto the bus.

Imagine a funnel: you can fill its bowl fast, but the outflow rate will always be the same.

Careful research into the system’s capabilities is a must when building a performance-oriented system.
Most sites where you can order a board have package deals that let you pre-configure the board with your choice of processor, memory, etc., and they install it with burn-in time.

So it’s not a silly question.

A bottleneck in your current system could be a possibility. Do you have a Windows install on that machine to test and compare results? People often quote super high speeds for fancy new drives, but those are often benchmark numbers, and Windows benchmarks at that. The easiest way would be to make a separate Windows install on the machine and run similar benchmarks with the same, or close to the same, settings.

Tried this myself, and immediately a single-thread CPU bottleneck occurred around 2.2 GB/s on a Gen 4 drive.
Doing a btrfs scrub sees rates in excess of 5 GB/s, right in line with the rated read speed, so it’s clear to me that dd is not multithreaded and hits a bottleneck on a single thread.
Mass file copy also goes pretty fast for large files, close to full speed, but performance dips when copying many small files.
kdiskmark can also show high numbers if that’s what you’re after.
If you want to actually copy data between two drives, well… If they use the same filesystem, just copy the files via rsync or a file manager or something?

Tested cat too, same result as dd, same speed.

Yeah, I got 3 GB/s with a single dd, and then I started another dd with a 1 TB offset on the drive and it also ran at 3 GB/s, so in total it was reaching 6 GB/s. But that’s not really what I wanted to achieve, so it kinda sucks xD

I’m usually copying VM images which are hundreds of GB in a single file, 300 GB+, hence it’s kinda annoying that I’m only getting 3 GB/s while my disks can achieve much more.

I mean, I probably didn’t describe it clearly enough: when I use e.g. parallel to run multiple cp processes on a file list, it’s considerably faster, but that doesn’t work for a single giant file (like VM images). So I’m actually asking for a tool that copies a SINGLE file using multiple threads. I know it may at first sound “impossible”, but it really isn’t: with big files you can just read from multiple cursors (e.g. for a 100 GB file, spawn 4 processes, one reading from 0, another from 25 GB, another from 50 GB and so on) into a file that has been preallocated to the right size with truncate beforehand. It would introduce some fragmentation, but sometimes you don’t really care, especially if it’s only 4 contiguous chunks; in fact they most likely wouldn’t be contiguous in the first place anyway. Like I said, I can achieve those speeds when running multiple dd’s manually on a single big file, but I don’t wanna calculate the offsets every time for each file. I could most likely hack up some dd script in bash, but come on, someone surely made this properly in C, right?.. right?..
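For what it’s worth, that multi-cursor idea can be sketched in a few lines of bash around GNU dd. The function name and chunking policy here are my own invention, not an existing tool:

```shell
# pcopy SRC DST NJOBS -- copy one file in NJOBS parallel chunks.
# Sketch only; relies on GNU dd's skip_bytes/count_bytes/seek_bytes
# flags so all offsets are in bytes, not blocks.
pcopy() {
  local src=$1 dst=$2 n=${3:-4}
  local size chunk i
  size=$(stat -c%s "$src")
  chunk=$(( (size + n - 1) / n ))       # ceil(size / n)
  truncate -s "$size" "$dst"            # preallocate target at full size
  for ((i = 0; i < n; i++)); do
    dd if="$src" of="$dst" bs=4M \
       skip=$((i * chunk)) seek=$((i * chunk)) count="$chunk" \
       iflag=skip_bytes,count_bytes oflag=seek_bytes \
       conv=notrunc status=none &       # one reader/writer per chunk
  done
  wait                                  # block until every chunk lands
}
```

Each worker reads and writes only its own byte range, so the target has to be preallocated and dd has to be told not to truncate it (conv=notrunc); the last chunk simply reads short at EOF.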

Check out the reviews for your two consumer-model M.2 drives on Tom’s Hardware.

I specifically refer to these reviews because they routinely test transfer rates (found on the second page of the review) and sustained write performance. Either can become a bottleneck in your testing.

In the Samsung 990 review you can see that file-copy transfer rates are below 2 GB/s for all contenders, with the Samsung 990 on top.

I think the performance you experience is quite normal and to be expected.

Even with that in mind, please note that in the context of NVMe speeds, a 300 GB file is not all that much of a “sustained write”. According to this chart:


all those drives should deliver around 6.5 GB/s for the first minute. If you multiply 6.5 GB/s by 60 seconds, you’ll conclude that the whole 300 GB image should be served entirely by the initial burst speed. And even if not, the FireCuda 530 is the sustained-write champ, offering 4 GB/s for over 5 minutes, which would be like 2 TB of continuous writes.

Yet I can’t even copy a single 300 GB file faster than 3 GB/s, and I’m not talking about copying to the same drive but between drives (in fact, even to /dev/null).

I’ve performed a few more tests and found out that with 3 dd processes I’m saturating the SSD at around 6.8 GB/s:


Anything below that bottlenecks on the CPU, just as @alkafrazin noticed. More dd processes don’t increase bandwidth, but 2 dd processes already don’t reach 6.8 GB/s. So it’d be great to have a cp that could use 3 or 4 threads. This behavior persists after issuing drop_caches and after a fresh boot, so it’s not served from the RAM cache.

That said, another thing I noticed is that after a few seconds kswapd0 kicks in and performance drops to around 4 GB/s. I believe that may be related to the fact that I’m writing to /dev/null, which may not be the best test ever, and maybe there are performance limitations in how /dev/null is implemented, so I wouldn’t really be concerned about that.

I’m 99% sure it’s caching-related, since free -g shows a growing amount of buff/cache until it reaches 119 GB (the machine has 256 GB RAM, with 128 GB in hugepages for VMs):

And then it drops to 4 GB/s:

But the point is: cp should be multithreaded, with at least 3 threads in my case. In fact 6 threads, because I was testing on a single drive, and I use 2 drives in btrfs RAID1, which round-robins reads between drives by PID, so with 6 threads it could assign 3 to each drive.

Quick sanity check: are you copying data locally, i.e. from within the same physical device? All write benchmarks assume the data source is external to the measured device! Almost no one outright measures the mixed scenario :slight_smile:

If yes, then the observed performance matches the real-world maximum. You are using some of that maximum bandwidth just to get the data in the first place. On my own machine the split is almost always 1:1, with 3 GB/s being the effective sequential write maximum as a result.

Other than that, have you tried the good old tar-pipe trick?

If sequential reads and writes are the most efficient, then let’s move one big sequential tar stream around and extract it afterwards.
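For reference, the tar pipe looks like this; a minimal self-contained sketch where throwaway temp directories stand in for the real source and destination mount points:

```shell
# Tar pipe: serialize the source tree into one sequential stream
# and unpack it on the fly at the destination.
SRC=$(mktemp -d); DST=$(mktemp -d)        # demo dirs; use real paths
echo "hello" > "$SRC/file.txt"            # sample payload
tar -C "$SRC" -cf - . | tar -C "$DST" -xf -
```

The reader and writer run as separate processes, which already buys a little pipelining over a plain single-process copy.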

Regarding perf drop:

  • What fs is the target? If it’s something complex like ZFS or btrfs, that must be taken into account.
  • If write performance regularly drops after a specific time of continuous writing, pSLC cache exhaustion is the likely culprit:
    • pSLC cache size depends on the drive’s size (upper limit)
    • and also on the currently available free space (real limit)
    • and finally on firmware policies (unknown); refer to the TechPowerUp SSD database for details
  • A fio benchmark is a much better test than just dd to anywhere
  • Are your SSDs cooled properly? The controllers might be throttling; worth checking just in case

Also, this good old reference article might be of interest. Some system tuning might be fun.


As in the previous post with the test chart: all those Samsung and Seagate drives are fast, but those tests only measure the RAM/pSLC cache built onto the NVMe. The actual storage chips are much the same as in a standard USB stick, which is why it drops off after a minute at most. If you want full sustained throughput, you go with the Intel D7 or Kioxia enterprise series, as those will give full throughput at top speed. The price difference is not high, and you will kill those Samsungs about 1000 times over before the enterprise-grade one dies. But for gaming and photo editing, a consumer drive is made for that, not for high load.


I have performed a fio benchmark and it easily hits 7 GB/s with this command:
fio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --readonly --name=seqread

That said, this thread asks how to copy FILES at 7 GB/s, not how to run synthetic, optimized benchmarks that avoid bottlenecking on a single core of my CPU xD

When I use fio configured to behave like Linux cp:
fio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=1 --time_based --readonly --name=seqread

then I’m again capped at 3 GB/s or even below. So as nice as it may be to shrug off the performance issues and blame my not-datacenter-grade NVMe storage, which would cost thousands of dollars to replace, that’s just not the case here. It’s a software problem.

I tried to perform actual cp benchmarks and it’s even worse, much worse: I’m getting 1.3 GB/s with cp to a tmpfs target directory. It’s more or less the same as the 1.4 GB/s from dd if=./vmdisk.img of=/dev/null. That said, AGAIN, when I run the dd command 4 times with different file offsets I get 4x 1.3 GB/s. There are multiple reasons why I’m only getting 1.3 GB/s per stream: it’s btrfs on LUKS with lzo compression, and a somewhat fragmented CoW image; you know, a REAL WORLD case where performance is significantly worse than a synthetic fio benchmark.

And just to point out how utterly pointless fio is in the context of my question: I ran fio --filename=./vmdisk.img on the actual filesystem (btrfs RAID1 with lzo compression on LUKS) and it spawned a billion threads, read from BOTH SSDs, gave me a result of 13 GB/s, and actually used my CPU properly.

Wow, lmfao, I literally never came even remotely close to a figure like this in any real-world file copying scenario…

The performance is there. Just cp sucks balls. That’s why I’m looking for alternatives. I knew the performance was there because I used to md5sum hundreds of files on this RAID using parallel with 64 threads and it was slaying it. But I want this performance while copying a single big file as well, since that’s 90% of my use cases.
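For the many-files case, the same trick works without GNU parallel too; a small sketch using xargs -P (the temp files below are only demo scaffolding):

```shell
# Hash files concurrently: xargs -P runs up to 4 md5sum
# processes at once, one file per invocation.
DIR=$(mktemp -d)
for i in 1 2 3 4; do head -c 1024 /dev/urandom > "$DIR/f$i"; done
find "$DIR" -type f -name 'f*' -print0 \
  | xargs -0 -P4 -n1 md5sum > "$DIR/sums.txt"
```

With -P64 and real data this is the same per-file parallelism as the parallel invocation; it just can’t help with one giant file.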

That said, thanks for the fio suggestion; this exercise just showed me how sh*t cp actually is at doing its only job.

That’s an extremely out-of-context stance, given that the cp command achieves 1.3 GB/s while multithreaded fio reaches more than 13 GB/s while reading the same file (the whole file). I can assure you that if I got that Kioxia, I would still get 1.3 GB/s with GNU cp on a single file. Even though these are consumer drives, they’re far from being your average dollar-store laptop drives.

My problem is NOT that I’m getting “only” 13 GB/s, or “only” 4 GB/s per drive, but that I’m getting utterly sh*t performance that is 10x worse than a synthetic benchmark with an optimized program ON THE SAME FILE.

So in case it wasn’t clear from the beginning, the essence of this thread is the question:
“What program should I use to copy files if I want to get 14 GB/s sequential read on a single file?” And no, I’m not copying to the same drive.

Well, fio is exactly as precise as you choose for it to be? It’s, after all, a Swiss Army knife.

In this case

fio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --readonly --name=seqread

will likely completely skew your test scenario. If you want to independently simulate real-world intra-device copy workloads, you have to at the very least:

  • change the rw mode to mixed 50/50 sequential read/write
  • lower your iodepth to something like a desktop scenario: 64 → 1
  • limit numjobs to 1

But on the other hand, you mention btrfs. fio alone will not tell you how much of the performance loss is incurred by the hardware and how much by the filesystem.

Have you done testing on any other run-of-the-mill fs? XFS, ext4, anything well-known and performant.

I remember btrfs having really crappy performance on ssds without manual tuning.

Otherwise, on my system the behaviour is (Fedora 39 and a Samsung 990 PRO):

  • scenario: dumping a 40 GB qcow image to null (pure read op)
    • plain dd
      dd if=desktop-lpj44ij.qcow2 of=/dev/null
      67490785+0 records in
      67490784+0 records out
      34555281408 bytes (35 GB, 32 GiB) copied, 50.525 s, 684 MB/s
    • pv desktop-lpj44ij.qcow2 > /dev/null
      39.8GiB 0:00:32 [1.21GiB/s]
    • plain copy: cp desktop-lpj44ij.qcow2 /dev/null ~~~ 1.7 GB/s

yet the synthetics are as follows:


– notice the perf gap between SEQ Q8T1 vs SEQ Q1T1. Guess which is closer to real-world performance scenarios?

Preliminary conclusion → the dd utility is a spectacularly poor performer with its default configuration; cp is surprisingly decent.

We are still seeing just 50% expected performance. Where is the bottleneck?

What he means is: if there is a longer-running write operation going to the physical device, it will eventually run out of fast pSLC cache space.

Enterprise drives take a more conservative approach in both design and firmware, where long-term consistency is preferred over high bursty performance.

You can hammer an enterprise drive with writes all day long and get the same performance until you fill it up.

Even with the excellent Samsung, that perf window is only 60-90 GB of writes; then it falls down to the lower performance of the native flash.

Have you considered rsync?
(Credit: barafu on Reddit.)

Create a file list:

find /source -type f > files.txt

Multiprocess copy:

parallel -a files.txt rsync -a {} /dest

Taken from this thread on Reddit:
https://www.reddit.com/r/linuxquestions/comments/pb7lbi/multithread_rsync/


Sounds similar to my performance, +/- some losses due to encryption, fragmentation, etc. Still, the main issue with cp is that it’s single-threaded, and that has LOTS of implications, starting with a btrfs-related one: btrfs RAID1 balances reads between disks based on the PID of the process. That means a multithreaded copy app will read from both disks simultaneously (like fio), while a single-threaded app like cp will only read from one disk. That’s just one aspect, but in general fio properly hammers my CPU and achieves good read speeds, while cp barely puts any load on anything and performance is quite sh*t.

I’m using dd mostly because it’s easy to script in such a way that it works in a multithreaded fashion: you can easily hack a script that reads a file at 4 offsets using 4 processes and writes the data in parallel, and it DOES scale. But you know, using hacky bash scripts to copy files in 4 chunks doesn’t really sound like a proper solution to the problem XD

I will be experimenting with some of the multithreaded cp forks that are out there on GitHub etc., but I hoped there’s something more… “official” and widely adopted for this task.

Yes, I did that and it works fine, even with cp. But it only works if you have many files (then it spawns multiple cp processes for different files, copies them simultaneously, and it works great). I’m more interested in a solution that implements a similar concept for a single file, with multiple read heads and multiple write heads on multiple threads.
