Dedupe files with hardlinks from a given list?

I hope I am in the right place to ask this. I am looking for a way to dedupe a 64TB pool (Windows system) with hardlinks or symlinks. Using a program like jdupes is undesirable because it needs to scan the whole pool on its own. I use Snapraid and Everything (from voidtools), which already has all files indexed and can super quickly generate a duplicate file list.

So I am looking for a program that can accept a third-party-generated list and do its own verification on said list before deduping.
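For what it's worth, the core of such a tool is small. Below is a minimal Python sketch of the idea, with some assumptions: the exported list has one duplicate group per line with paths separated by tabs (the actual Everything export format may differ), and the `.dedupe-tmp` suffix is just an illustrative temp name. Each candidate is verified byte-for-byte against the first path in its group before it is swapped for a hardlink.

```python
import filecmp
import os


def load_groups(list_path):
    """Parse a duplicate list: one group per line, paths separated by
    tabs. (The exact export format is an assumption here.)"""
    with open(list_path, encoding="utf-8") as fh:
        for line in fh:
            paths = [p for p in line.rstrip("\n").split("\t") if p]
            if len(paths) >= 2:
                yield paths


def dedupe_from_list(groups):
    """Verify each candidate byte-for-byte against the group's first
    path, then replace verified copies with hardlinks. Returns the
    number of bytes reclaimed."""
    reclaimed = 0
    for keeper, *candidates in groups:
        for dup in candidates:
            # Hardlinks only work within one filesystem/volume.
            if os.stat(dup).st_dev != os.stat(keeper).st_dev:
                continue
            # Trust the list, but verify: full byte-by-byte compare.
            if not filecmp.cmp(keeper, dup, shallow=False):
                continue
            size = os.path.getsize(dup)
            tmp = dup + ".dedupe-tmp"  # illustrative temp name
            os.link(keeper, tmp)       # create the hardlink first...
            os.replace(tmp, dup)       # ...then swap it into place
            reclaimed += size
    return reclaimed
```

Running `dedupe_from_list(load_groups("dupes.tsv"))` would then only touch the files on the list, never scanning the rest of the pool.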

Is this data all read only? What’s the plan for if anything is changed?


The host system has full access/permission. What are you referring to in your second question? Are you talking about the possibility that data has changed between generating the duplicate list and the deduplication process? Or is this a Snapraid question?

You’re talking about filesystem-level deduplication.

ZFS can do it, but dedup needs roughly 4GB of RAM per TB (in addition to the normal 1GB) to hold the hash tables it builds during such an operation; for a 64TB pool that's on the order of 256GB of RAM.

ZFS will do it better in theory, as it does block-level dedupe. But… not on Windows.

ReFS on Windows supposedly supports deduplication, though. Assuming you’re on Server.

Dunno if I trust it though.

Why?

Also this:

Are you planning on keeping the duplicate files?

@TryTwiceMedia and @thro unfortunately ZFS isn’t flexible enough for my use case. Every few months I add one or two drives to the pool. Duplicate files can already be found very easily; I just need a program like the one described above.

@ThisMightBeAFish Have you tried scanning 64 terabytes of data? A full scan on a 16TB disk takes almost a day.

Takes about 4 days. I did it last Friday to Tuesday.

But unless you plan on writing something for the host (don’t recommend in prod), you’ll need another tool that can do so.

File hashes have to be kept for all files on the device and transparently swapped so clients know what is where.

This is a block level function.

We run deduplication on data sets regularly after recoveries against known good data sets, but that’s a one time convenience service.

Kinda sketch, but really interesting, since your failures would be staggered. However, a drive failure while adding a new drive could kill the array if the underlying filesystem can’t recover.

Mine is more archival based and not in raid. So if a disk fails, only data on that disk is lost if I can’t recover. Right now, I can sustain up to 2 disk failures before data is lost on those failed drives.

Back on topic though. Using one of the programs I mentioned above, I know 25.6 million files are duplicates and are taking up 16.5TB of space. The set of known duplicates might shrink if a byte-by-byte comparison is run, using this list as a guideline.

What do you think would be a faster alternative?
Copying 64TB to a new FS that uses dedupe?

A tool like that doesn’t exist in that exact form, but it doesn’t sound particularly hard to do yourself in a small Python script.

Do you use a file system like XFS or Btrfs that can do reflinks? If yes, I suggest not using a hardlink but copying the files with the cp --reflink=always flag. This ensures that both files are deduped and only new metadata is written (CoW style). It also prevents you from shooting yourself in the foot by editing one copy while expecting the other one to stay the same.
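As a hedged sketch of that approach (assuming GNU cp is available on PATH; the function name and temp-file suffix are my own): replace a verified duplicate with a reflink copy of the keeper, failing cleanly on filesystems without reflink support (e.g. ext4, NTFS).

```python
import os
import subprocess


def reflink_replace(keeper, dup):
    """Replace `dup` with a reflink (CoW clone) of `keeper`.
    Returns True on success, False if the filesystem (or the cp
    build) does not support reflinks."""
    tmp = dup + ".reflink-tmp"  # illustrative temp name
    result = subprocess.run(
        ["cp", "--reflink=always", keeper, tmp],
        capture_output=True,
    )
    if result.returncode != 0:
        # No reflink support: clean up and leave `dup` untouched.
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    os.replace(tmp, dup)  # swap the clone into place
    return True
```

Unlike a hardlink, editing either file afterwards leaves the other unchanged, since the shared extents are copied on write.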

This kind of “offline” dedupe on btrfs can be done with something like beesd or similar. I don’t know if anything similar exists for xfs. It will need to scan the whole filesystem once though. Any subsequent changes will not require a full scan anymore.


You say Windows, so I assume we’re talking NTFS.

I guess I’m not the only one who had this idea.

I have a big gallery of images and I wrote myself a python script a while back to scan a folder and list or hardlink duplicate images.

Here’s your problem though: hardlinks only work within the same filesystem. Symlinks do work across filesystems, but some programs don’t handle them well.
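To make that concrete, here's a small sketch (the helper name is my own): it compares `st_dev` to decide whether a hardlink is even possible, and falls back to a symlink when the paths live on different volumes. Note that creating symlinks on Windows may require elevated rights or Developer Mode.

```python
import os


def link_dup(keeper, dup):
    """Replace `dup` with a hardlink to `keeper` if both are on the
    same filesystem, otherwise with a symlink. Returns the link type
    used ("hardlink" or "symlink")."""
    tmp = dup + ".link-tmp"  # illustrative temp name
    if os.stat(keeper).st_dev == os.stat(dup).st_dev:
        # Same device/volume: a hardlink is possible.
        os.link(keeper, tmp)
        kind = "hardlink"
    else:
        # Different volume: only a symlink can bridge filesystems.
        os.symlink(os.path.abspath(keeper), tmp)
        kind = "symlink"
    os.replace(tmp, dup)  # swap the link into place
    return kind
```

Whether the symlink fallback is acceptable depends entirely on how well the programs reading the pool tolerate symlinks, as noted above.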

You say you use Snapraid, which I can only assume means that your pool isn’t really a pool in any functional sense, but only a collection of disks (aka a JBOD), each with its own FS.

So how well this idea you’re proposing could work all depends on your data and what you do with it. Creating the actual sym/hardlinks is trivially simple. But if your dupes span filesystems, you will need a way to decide which disk will actually hold the data.

Basically, any solution will depend on what kind of data it is, what you intend to do with it and how it is stored.
