Seeking advice on rsyncing 50TB to a remote SMB share

Hey all,

So, I have a Debian-based system (actually a terrifying frankenstein of OMV, Debian, Proxmox, OpenZFS…) that was supposed to be a temporary patch over a gaping wound of a failed primary fileserver, and I'm sure you all know the story. The primary server was rebuilt as a Storage Spaces Direct cluster on used hardware (definitely not my recommendation or choice), and it has fallen over, so now this backup monstrosity is delivering the primary workload. In the past when Storage Spaces has fallen over, Resilio Connect has managed to sync corrupted files to other NASes, so we're holding on to a ZFS snapshot from a couple of weeks ago with both hands, just in case. The decision has come down to sync this snapshot (~50TB) to an Amazon FSx share, and the only option we have in the Sydney region is the "Windows File Server" type. So SMB it is.

We're also an ISP, so we have a genuine 10G Direct Connect link to Amazon's datacentre in Sydney; another tech says he can push 1GB/s across the link from a Windows workstation. The Debian fileserver has a Mellanox 40G card.

I've mounted the FSx share on this Debian machine and I'm running fpsync with 8 threads, but it's only managing about 25MB/s on average, with rare spikes up to ~200MB/s. (By MB/s I mean megabytes per second, for the avoidance of doubt.)
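For reference, the mount and fpsync invocations look roughly like this. The share path, mount point, snapshot name, and credentials file below are all placeholders, and the mount options assume a reasonably recent mount.cifs:

```shell
# Placeholder share/paths; options assume a modern kernel's CIFS client.
# Large rsize/wsize and relaxed attribute caching help bulk SMB3 throughput.
sudo mount -t cifs //fsx.example.internal/share /mnt/fsx \
  -o credentials=/root/.fsx-creds,vers=3.1.1,actimeo=60,rsize=4194304,wsize=4194304

# fpsync partitions the tree and runs 8 rsync workers over the parts:
fpsync -n 8 /tank/.zfs/snapshot/pre-disaster/ /mnt/fsx/
```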

Last week we were using msrsync with 12 threads and it was doing better, about 100MB/s average. But, and I don't know whether msrsync was the cause, the machine's smbd service broke somehow and I had to recover parts of the file share config manually over the weekend, so I'm a little gun-shy about going straight back to msrsync.

Does anybody have useful suggestions on how to make this sync go faster? All I have on the other end is an SMB share to mount, since it's an Amazon FSx thing: no host, no ports, no daemons, nothing but a share to mount.

I also invite more general discussion about the various options out there for multi-threaded copy jobs like this; I like the idea behind fpsync and it's available in the Debian repo, so I'm using it. But rn it go slo hngg

Is the destination blank and you're copying everything fresh, or is it partially populated and you want to complete it with rsync?

If it's a blank share, I'd just let cp rip through the transfer, then run rsync with -n (and -c to compare checksums) later to confirm the data integrity.
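Something like this, with throwaway local paths standing in for the snapshot and the SMB mount (the rsync checksum dry-run prints nothing when the trees match):

```shell
# Throwaway demo paths; in practice src is the snapshot, dst the SMB mount.
src=/tmp/demo-src; dst=/tmp/demo-dst
mkdir -p "$src/sub" "$dst"
echo "payload" > "$src/sub/file.txt"

cp -a "$src/." "$dst/"                        # fast initial copy, preserves attributes
rsync -rcn --itemize-changes "$src/" "$dst/"  # -c checksums, -n dry run: silence = match
```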

With 50TB it may be easier to load it onto a hard drive and physically take it to the backup location. Get yourself a 100TB ExaDrive, or more than one, and arrange a visit to that Amazon data center.

I second the suggestion of copying over with cp, then running rsync to keep the snapshot updated.

Are you going to use the data once uploaded, or are you just looking to back it up? I'm assuming the latter.

You could try rclone on top of S3. It should work great for a tar archive or for storing filesystem snapshots (use chunking), but I'm not sure about storing a bunch of small files.
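A sketch of the chunking idea (the remote and bucket names are hypothetical, and an s3-type remote is assumed to exist already):

```shell
# Wrap an existing S3 remote ("s3remote", hypothetical) in rclone's chunker
# backend so large archives are split into 1 GiB parts transparently:
rclone config create chunked chunker remote s3remote:backup-bucket chunk_size 1G
rclone copy /tank/archive.tar chunked: --transfers 8
```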

On the other hand, you have 3 HDDs' worth of data; maybe you could get a 4-5 drive USB enclosure, hook it up to the server, and siphon out the data that way… or, if you're limited in what you can connect to the server, hook the enclosure up to a laptop and the laptop into the network.

OK, so magically, just as I asked this question, some other operation seems to have ceased and the transfer has since continued at around ~200MB/s. I also restarted my fpsync job with less verbosity (and turned off the --progress I had set on each of the rsyncs), although I doubt that made a real impact.

Yeah, the transfer has already been running for some time. If I understand correctly I can use cp -u to skip files that already exist, right? Maybe I can just stop the fpsync process and continue using cp. Having now done more reading, perhaps this could be the trick (from a link I can't post, sorry): find * -type f | parallel --will-cite -j 6 cp -u {} /mountpoint/ &
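One caveat I noticed with that one-liner: cp -u {} /mountpoint/ drops the directory structure, landing every file in the destination root, and find * also misses dotfiles. A variant that keeps the layout, using GNU cp's --parents and xargs instead of parallel (throwaway demo paths):

```shell
# Throwaway demo tree; in practice src is the snapshot, dst the SMB mount.
src=/tmp/psrc; dst=/tmp/pdst
mkdir -p "$src/a/b" "$dst"
echo "hi" > "$src/a/b/f.txt"

# find . (not find *) includes dotfiles; --parents recreates ./a/b under $dst.
cd "$src" && find . -type f -print0 \
  | xargs -0 -P 6 -I{} cp -u --parents {} "$dst/"
```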

Sure, this could work, but then it has to copy twice: first to the drive, then to the datacentre. The machine I'm running this on is old and has miserable IO except via the network, and I already have a 10G link to the datacentre. I don't know a ton about ExaDrives, but I'd be surprised if a single drive can do more than 10G, aye. Plus I'm pulling from a busy filesystem in the first place, so I'm not sure I could saturate it even if it could. I don't underestimate the speed of a station wagon, but I'm hoping to get this done without spending $40,000.

I can't really tell what the architect here is aiming at, but I think he's convinced the FSx solution is the right one, so I don't think rclone would be better or faster than the other options. I think he imagines we'll actually use it later for something, possibly as a new Resilio Connect agent node for realtime bidirectional sync jobs between Amazon regions for artists in the cloud.

Was ZFS send/recv not an option? Because that's the only way to get transfers even close to the hard drives' theoretical throughput.
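Between two hosts that both run ZFS it's just something like this (dataset, snapshot, and host names are hypothetical):

```shell
# Hypothetical dataset/snapshot/host names; both ends must run ZFS.
zfs send -R tank/data@backup-2023 | ssh backup-host zfs receive -F backuppool/data
# Later deltas are incremental:
#   zfs send -i @backup-2023 tank/data@backup-2024 | ssh backup-host zfs receive backuppool/data
```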

rsync -aAXrvh --no-perms --progress --delete-before

On a new copy --delete-before is not necessary, and --progress only matters if you want to see progress. This set also copies file attributes, and it is about as fast as rsync will go. A cp first, or some other plain copy, may give you a somewhat faster initial pass.

My lord, why SMB? For local purposes, it’s fine, but over the internet (even through a VPN), it’s going to be slow.

You're already using AWS: just make a bucket, secure it, then install rclone on your NAS and send the snapshot over to the bucket.
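Roughly like this (bucket, remote name, and snapshot path are hypothetical; assumes AWS credentials are already available in the environment):

```shell
# Hypothetical remote/bucket names; env_auth picks up AWS credentials
# from the environment or instance profile.
rclone config create s3backup s3 provider AWS region ap-southeast-2 env_auth true
rclone sync /tank/.zfs/snapshot/pre-disaster s3backup:company-backup-bucket \
  --transfers 16 --checkers 16 --fast-list
```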

--progress shouldn't impact the job, but verbosity, if it spits out enough output, can slow things down.

Depends on how slow the read speeds are on the ZFS pool.

Think of it in terms of how much money the business is going to lose if those 50 TB are lost. If it's financial data, or even worse, customers' uploaded data, the disaster can be tremendous, not just in terms of damages paid and HDD recovery service bills, but also in terms of your reputation as a company, with people less likely to do business with you.

Though I highly doubt you need $40,000 to make the backup on a different NAS. Besides, the HDDs you'd be buying can be repurposed in a better NAS after the backup is done, so you aren't losing any money; you're investing.

-a already includes -r. -v and -h aren’t necessary, but nice to have if you want to see the output.

Not an option, nope, but that was my first thought too. I hoped so with FSx's "OpenZFS" option, but A) it's not available in my geographic region and B) snapshot send/recv isn't a thing with FSx, unfortunately… maybe some day, it's still a new product.

Well, why FSx SMB indeed; it's not what I would have chosen, I don't think. But I think it's mostly because FSx, and partly because Resilio Connect. I suspect the architect, although he won't openly admit it, intends to repurpose this SMB share "in the cloud" for artists who work on workstations in AWS. Once we get the primary filer operational again and confirm all the data looks OK, we'll re-sync it to this FSx share to bring it up to date and start mounting it from virtual workstations. If we then also map it on a machine running the Resilio Connect agent, storing jobs on the share, SMB becomes important because it's the only file sharing protocol that can theoretically advise the mounting machines of filesystem events in real time, allowing the sync agent to avoid relying exclusively on scans to detect file updates. The whole thing is a big stretch to me.

I'm interested to know more about rclone; I've used it in the past for fast syncing to/from gdrive, but in that case the bottleneck was definitely the internet connection and Google's tooling limitations; here I have a 10G link and no such limitations. I'll see if I can understand why rclone would be faster than parallel cps, rsyncs, etc. in this case; feel free to fill me in or drop hints if you can.

The majority of the data on this system is also synced to at least one NAS in another country, which is that site's primary fileserver, so this isn't the last copy of most of the data. The $40,000 figure just came from the price of the 100TB ExaDrive the other user recommended. But your points are well taken; there is real potential for data loss if this pool fails. In any case, it's not just about the capital spend but also whether it would actually be faster to copy the data twice instead of once, and regardless of the medium my original question stands: what can I do to make it go faster?

Just some intermediate feedback: the command I listed above doesn't appear to be any faster :confused:

Not sure what happened, but I returned to check on the system and it was moving no data to FSx! The static route through the newfangled production network had dropped, not sure why. I made the route persistent, had to stop my fpsync job, and decided to try the parallel cp instead. It's back to its old ways of about 35-40MB/s. Womp.
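For anyone curious, I persisted the route with something along these lines (subnet, gateway, and interface name here are made up):

```shell
# Apply now (hypothetical subnet/gateway/interface):
ip route replace 10.1.0.0/16 via 192.168.0.254 dev ens1
# Persist on Debian by adding to the relevant iface stanza
# in /etc/network/interfaces:
#   post-up ip route add 10.1.0.0/16 via 192.168.0.254
```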