I appreciate this.
A LOT of the documentation online about hashing doesn’t actually spell out what is ACTUALLY being hashed (like you said: things ABOUT the file, e.g. filenames, vs. file metadata vs. the actual data of the file itself).
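To make that distinction concrete, here’s a minimal Python sketch of the three different things you could be hashing. The path is made up for illustration, and the metadata choice (size + mtime) is just one common example:

```python
import hashlib
import os

path = "photos/IMG_0001.jpg"  # hypothetical file

# 1. Hashing the filename: two byte-identical files with different
#    names produce DIFFERENT hashes -- useless for finding duplicates.
name_hash = hashlib.sha256(os.path.basename(path).encode()).hexdigest()

# 2. Hashing metadata (size + mtime here): cheap, but copies and
#    renames can change the mtime, and unrelated files can share a size.
st = os.stat(path)
meta_hash = hashlib.sha256(f"{st.st_size}:{st.st_mtime}".encode()).hexdigest()

# 3. Hashing the actual contents: the only one that shows two files
#    hold the same bytes, regardless of name or timestamps.
h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
content_hash = h.hexdigest()
```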
re: diff
Yeah, I am not using that, because with czkawka I’ve already found multiple copies of the same and/or similar files spread across different directories.
(Sidebar: Interestingly enough, czkawka already does exactly what I am talking about here: it first compares partial hashes to quickly rule out files that AREN’T duplicates of each other, and then, out of the remaining candidates, compares the full hashes against each other. VERY cool program.)
So…I actually first ran it on a directory (an iPhone backup of my wife’s egregious picture-taking of our kids), which had, I think, something like just shy of 175k files. The program itself didn’t have any issues running. (It didn’t take that much memory.)
Where I ran into a problem was that it found something like a grand total of 45,000 duplicate files, in I forget how many thousands of groups. It became a bit unwieldy to go through that many groups of duplicates (or try to) at once, so I had to break the task down into smaller chunks, reviewing maybe a few hundred groups of duplicate files at a time. That way it isn’t quite as daunting as trying to review thousands of groups covering tens of thousands of duplicate files.
The computer and the program ran fine.
I am the limitation (trying to make decisions about which copy to keep and which copies to delete).
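For what it’s worth, the batching itself is trivial. Assuming you have the duplicate groups in a list (the names here are hypothetical), it’s just slicing:

```python
# Hypothetical helper: slice the full list of duplicate groups into
# review-sized sessions of a few hundred groups each.
def batches(groups, per_session=200):
    for i in range(0, len(groups), per_session):
        yield groups[i:i + per_session]

# for n, batch in enumerate(batches(all_groups), start=1):
#     print(f"session {n}: {len(batch)} groups to review")
```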
So yeah…
No worries.
Some of the files are exactly identical (i.e. multiple backups of my wife’s iPhone pictures of our kids).
(I am deduping the folder(s) because I am prepping them to be written to LTO-8 tape, and writing a lot of JPGs to tape is really slow, so the fewer duplicates I write to said tape, the better.)
Um…I think that running it once is probably good enough.
What I would REALLY wish for is a personally hosted “iCloud-like” service where my wife can upload her pictures, and it would be smart enough to know which pictures from her phone have already been uploaded and which ones are new, instead of it automatically creating duplicate copies and my having to run a dedup after the fact.
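The core trick such a service would need is content-addressed uploads: hash the photo first, store it only if that hash hasn’t been seen. A toy Python sketch of just that logic (every name here is invented for illustration; a real service would keep the hash index in a database, not an in-memory set):

```python
import hashlib
from pathlib import Path

known_hashes: set[str] = set()   # stand-in for a persistent hash index
STORE = Path("photo_store")      # hypothetical storage directory

def content_hash(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def upload(path: Path) -> str:
    """Store the photo only if its contents haven't been seen before."""
    digest = content_hash(path)
    if digest in known_hashes:
        return f"skipped (duplicate): {path.name}"
    STORE.mkdir(exist_ok=True)
    # Content-addressed storage: the filename IS the hash, so re-uploads
    # of the same bytes can never create a second copy.
    (STORE / f"{digest}{path.suffix}").write_bytes(path.read_bytes())
    known_hashes.add(digest)
    return f"stored: {path.name}"
```

Because the stored name is derived from the contents, duplicates become impossible by construction rather than something you clean up afterwards.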
I’m a GROSSLY underqualified sysadmin, so a Proxmox turnkey solution would be GREAT for something like that.
(I used to use QNAP’s Qphoto app, but she has so many pictures on her phone that the app just falls to its knees, so it doesn’t really work anymore.)