I currently have data stored on a network share drive (142.2 GB: about 2.444 million 50 KB JSON files) that I need to move to Azure Data Lake. There is no connectivity between the network share and Azure for security reasons.
To make matters worse, if I copy the data to my laptop, Symantec will scan every file that is copied.
I have tried TeraCopy, and it estimates the copy will take 1,000 years. I have tried zipping the files: same problem again.
How fast the files transfer will depend on your drive's formatted cluster size.
With 142 GB of 50 KB files, you'll be looking at 1-3 MB/s on some drives.
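As a back-of-envelope check on those numbers (the 2 MB/s effective rate is an assumption picked from the 1-3 MB/s range above):

```shell
# Rough math on the thread's numbers: 142 GB of 50 KB files.
total_gb=142
file_kb=50
files=$(( total_gb * 1024 * 1024 / file_kb ))   # file count
echo "$files files"                             # roughly 3 million
# At an assumed 2 MB/s effective small-file rate:
hours=$(( total_gb * 1024 / 2 / 3600 ))
echo "~$hours hours at 2 MB/s"
```

That is the optimistic floor for one-file-at-a-time copying; per-file latency is what pushes real-world estimates far past it.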
so yeah, you already hit on the solution:
create about 100 archives of about 1.44 GB each.
Yes, it will take time, but less than transferring the files one at a time.
The good news is you don't have to use compression, so it's basically just read/write time you have to deal with.
Just stick as many files into each archive as will fit, as is,
turning many thousands of small files into ~100 larger ones,
which will then transfer at 100+ MB/s.
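A minimal sketch of that batching, assuming GNU tar and made-up directory names; on the real share you'd raise the batch size to ~30,000 files to get ~1.4 GB per archive at 50 KB per file:

```shell
# Demo data standing in for the real share (names are made up).
src=demo_json; out=demo_archives
mkdir -p "$src" "$out"
for i in $(seq 1 10); do echo "{\"id\": $i}" > "$src/file_$i.json"; done

# Split the file list into fixed-size batches (5 per archive here).
find "$src" -type f -name '*.json' | split -l 5 - "$out/batch_"

# One uncompressed tar per batch: plain `cf`, no -z, so it is pure read/write.
n=0
for list in "$out"/batch_*; do
  tar cf "$out/archive_$n.tar" -T "$list"
  n=$(( n + 1 ))
done
ls "$out"/archive_*.tar
```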
Network/disk latency will kill performance if the file protocol between the laptop and the network storage only does one file at a time; you need to do many requests/operations in parallel.
Depending on your directory structure, that's simple enough on Linux (ls | parallel -j64 rsync {} /path/to/local/copy/), but no idea on Windows.
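A portable approximation of that one-liner without GNU parallel is xargs -P; the paths below are demo stand-ins, and `cp -t` assumes GNU coreutils:

```shell
# Demo stand-ins for the share and the local staging directory.
src=demo_src; dst=demo_dst
mkdir -p "$src" "$dst"
for i in $(seq 1 20); do echo "$i" > "$src/f_$i.json"; done

# 8 workers, 4 files per cp call; tune -P up toward 64 against a real share.
find "$src" -type f -print0 | xargs -0 -P 8 -n 4 cp -t "$dst"
ls "$dst" | wc -l
```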
Much simpler/faster would be if you can get a shell on the network storage machine and just make the archive there, then copy that across the network.
As a test, I made a shell script to create the same number of files as you said, and it took my machine 1m35s to tar them into a 125 GB .tar file, which you can then transfer as you wish, at the speed of your network.
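The streaming variant of that idea looks like the commented line below (hostname and path are assumptions), followed by a local demo of the same pipe; tar writes to stdout, so nothing is staged on disk in between:

```shell
# On a real setup (hostname/path are assumptions):
#   ssh user@storage-host 'tar cf - /data/json' > share_dump.tar

# Local demo of the same stream: archive to stdout, list it from stdin.
src=demo_stream
mkdir -p "$src"
for i in 1 2 3; do echo "$i" > "$src/f_$i.json"; done
tar cf - "$src" | tar tf -
```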
When dealing with lots of tiny files, getting local access will always be better; the latency of "open file", "read file" starts to add up with millions of them. And as you've found, a copier that copies lots of files at once can improve the copy speed. Alternatively, archiving them helps, even with no compression: one big file = fast copy.
In my tests I've found tar to be the best archiver for this situation, since it can stream the archive over the network without storing it locally. However, that's not really an option on Windows.
Are all the files in a single directory?
If you're stuck with Windows, give robocopy a try with around 30 threads (the /MT flag).
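A hedged example of that invocation (the source and destination paths are placeholders; /E copies subdirectories, /MT:32 sets the thread count, and /R and /W cut the default retry stalls on locked files):

```bat
robocopy \\share\json-files D:\staging /E /MT:32 /R:1 /W:1 /NP /LOG:copy.log
```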