Big Data Little File Problem

Hi All,

Looking for advice on a problem I have at work.

I currently have data stored on a network share drive (142.2 GB: 2.444 million 50 KB JSON files) that I need to move to Azure Data Lake. There is no connectivity between the network share drive and Azure for security reasons.

To make matters worse, if I were to copy the data to my laptop, Symantec will scan every file that is copied :stuck_out_tongue:

I have tried TeraCopy; it estimates that it will take 1,000 years. I have also tried zipping the files, with the same problem again.

Any Ideas?

How fast the files will transfer depends on your formatted cluster size.
At 142 gigs in ~50 KB files you will be looking at 1-3 MB per second on some drives.
So yeah, you already hit on the solution…
create about 100 archives of about 1.44 GB each.
Yes, it will take time, but less than transferring all the files one at a time.

The good news is you don't have to use compression, so it will basically just be read/write times you have to deal with.
Just stick as many files into each archive as will fit, as-is,
turning many thousands of small files into ~100 larger ones,
which will then transfer at 100+ MB/s.
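
If 7-Zip is available on the machine doing the copying (an assumption on my part), its store mode plus split volumes does exactly this; the share path below is a placeholder:

  # 7-Zip: store only (-mx=0, no compression), split into 1440 MB volumes (-v1440m).
  # Produces archive.7z.001, archive.7z.002, ... roughly 100 pieces for ~142 GB.
  7z a -mx=0 -v1440m archive.7z "\\server\share\json-data"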

3 Likes

Thanks for the reply.

How would you go about creating the archives?

What protocols/formats does Azure Data Lake support for importing?

  • Avro format.
  • Binary format.
  • Delimited text format.
  • Excel format.
  • JSON format.
  • ORC format.
  • Parquet format.
  • XML format.

Over REST API? Or do they offer a file protocol like CIFS/SCP/NFS?

I might have the ability to go over REST; however, I am limited on internal-to-external connectivity.
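
If REST does end up being allowed, I'm assuming something like AzCopy (which talks to Azure Storage over its REST API) could push the archives up; the account, container and SAS token below are placeholders:

  # Upload a local folder of archives with AzCopy v10.
  # For an ADLS Gen2 account the dfs endpoint (https://<account>.dfs.core.windows.net/<filesystem>?<SAS-token>) also works.
  azcopy copy "D:\staging\archives" "https://<account>.blob.core.windows.net/<container>?<SAS-token>" --recursive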

Well, what protocol are you using with TeraCopy on the Azure side currently?

I was trying to copy directly to my laptop

1 Like

Network/disk latency will kill performance if the file protocol between the laptop and the network storage is only doing one file at a time; you need to do many requests/operations in parallel.

Depending on your directory structure, that’s simple enough on Linux (ls | parallel -j64 rsync {} /path/to/local/copy/), but no idea on Windows.

Much simpler/faster would be if you can get a shell on the network storage machine and just make an archive there, then copy that across the network.

As a test, I made a shell script to create the same number of files as you said, and it took my machine 1m35s to tar them into a 125 GB .tar file, which you can then transfer as you wish, at the speed of your network.
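
Something along these lines on the storage machine would do it, assuming a POSIX shell is available there; the paths are placeholders:

  # Create one uncompressed tar of the whole JSON directory on the storage box.
  # -C changes into the directory first so the archive stores relative paths.
  tar -cf /tmp/json-data.tar -C /path/to/json-data .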

1 Like

Great! I’ll use Robocopy as it’s the closest thing to rsync for Windows. I might be able to get RDP access; I’ll find out on Monday.

When dealing with lots of tiny files, getting local access will always be better; the latency of “open file”, “read file” starts to add up with millions of them. And as you’ve found, a copier that copies lots of files at once can improve the copy speed. Alternatively, archiving them helps, even with no compression. One big file = fast copy.

In my tests I’ve found tar to be the best archiver for this situation; it can stream the archive over the network without storing it locally. However, that’s not really an option on Windows.
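
For example, from a Linux box (or WSL) that can SSH to the storage machine, something like this streams the archive straight onto the destination disk without staging it on the source; the hostname and paths are placeholders:

  # tar runs remotely and writes to stdout; ssh carries the stream; it lands locally as one big file.
  ssh storage-box 'tar -cf - -C /path/to/json-data .' > /mnt/local/json-data.tar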

Are all the files in a single directory?

If you’re stuck with Windows, give robocopy a try with 30-ish threads.
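
Something like this, with placeholder source/destination paths:

  # 32 copy threads (/MT:32); /E includes subfolders, /R and /W keep retries short,
  # /NFL /NDL /NP suppress per-file console output that slows down huge jobs.
  robocopy "\\server\share\json-data" "D:\staging\json-data" /E /MT:32 /R:1 /W:1 /NFL /NDL /NP /LOG:copy.log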

Just send them one of your backups?

You have backups, right? :smiley:

I appreciate everyone’s help with this; I’m uploading the final copy now.