Hashing file contents without hashing the file name

I am trying to deduplicate some files on my system, and I was wondering if there is a way to use, for example, openssl to calculate the checksum of just the contents of a file, without taking the name of the file as part of said checksum?

My thought and concern (and my primitive understanding of checksumming) is that if I have a file called “image.jpg” and a duplicate copy of the same picture called “image - Copy.jpg”, a “normal” checksum algorithm like MD5 and/or SHA256 would use the file name as part of the checksum, which would result in a different checksum, despite the fact that one is just a copy of the other.

If there are more intelligent ways to do this, I would love to learn about them.

Thank you.

1 Like

I just tried an md5 of a copied (and renamed) file, and it came up the same

what command are you thinking of?

Just tried sha256sum on the same two files, and the result is identical

4 Likes

As far as I know changing the file name won’t change the hash. A quick test supports this theory.

echo "some random data" >> random-data.txt
cp random-data.txt random-data-copy.txt 
shasum -a 256 random-data* >> sum.txt
cat sum.txt  

7093aca75401ab10381af6e71dc92e81ef68bdb7b0f2a77d54f1e65d0c5f5b36 random-data-copy.txt
7093aca75401ab10381af6e71dc92e81ef68bdb7b0f2a77d54f1e65d0c5f5b36 random-data.txt
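Since you mentioned openssl specifically: its dgst command also only reads the bytes you feed it, so with the same two test files something like this should print identical digests for both:

openssl dgst -sha256 random-data.txt random-data-copy.txt

The file name openssl prints next to each digest is just a label on the output line, not an input to the hash.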

I’ve used fdupes on Linux and Mac, but never to deduplicate large amounts of data, just to delete duplicates.

4 Likes

The way I do it, which is independent of the file’s name or metadata, is by just reading the file into md5sum:
cat file-to-check | md5sum

I can even inject pv in there and it won’t affect the result:
cat file-to-check | pv --size file-size | md5sum
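For what it’s worth, passing the file straight to md5sum should give the exact same digest; the name it prints after the hash is only a label in the output, it never goes into the hash:

md5sum file-to-check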

2 Likes

The file name is part of the file system managed by the OS. It’s not included within the metadata that’s part of the file itself.

If that were not the case, then it’d be difficult to verify integrity of a download without knowing the name of the file on the hosting system.

So yeah, I don’t see any problem with what you’re trying to do. MD5 is terrible for security these days, but for de-duping files it’ll work fine.

2 Likes

There are several utilities around for this, but here are a few

1 Like

Another duplicate finder (GUI and CLI) that’s pretty good is the impossible-for-Americans-to-pronounce-or-remember Czkawka. It’s quite fast, and lets you quickly compare not only duplicate files, but similar images as well.

1 Like

Implementing duplicate-file finding used to be a popular junior-engineer coding interview question years ago.

It’s kind of fun to try and do - there are so many potential optimizations, and it can be a good exercise when trying to learn a new language.

for example, you probably want to traverse a filesystem to find files with the same size, … and then, maybe it’s enough to be pessimistic and compare only a couple of blocks before reading everything.

maybe check inode numbers before reading files (if you’re feeling lucky).

For large filesystems with lots of files, walking the tree in parallel might make things faster.
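for a one-off job, even a crude shell pipeline gives you the “group by size, then only hash the collisions” behaviour. rough sketch, untested, GNU find/awk/xargs/uniq assumed, and file names containing tabs or newlines will confuse it:

# 1. list size<TAB>path for every regular file
find /some/dir -type f -printf '%s\t%p\n' > all-files.txt
# 2. keep only files whose size occurs more than once
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1' all-files.txt all-files.txt > same-size.txt
# 3. hash just those candidates and show the groups that share a digest
cut -f2- same-size.txt | xargs -d '\n' sha256sum | sort > hashes.txt
uniq -w64 --all-repeated=separate hashes.txt

anything uniq prints at the end is a group of (almost certainly) byte-identical files.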

1 Like

Thank you all, for the great replies.

Stupid question then – would MD5 be faster than SHA256 with modern processors (or relatively modern processors, i.e. an Intel Xeon E5-2697A v4 or an AMD Ryzen 9 5900HX), or would they be about the same in terms of speed?

(For the record, on my system right now, there are just shy of 14.5 million files that it needs to chew through.)

I’m currently testing out czkawka to see how it will do with that.

I might end up having to break up the task if attempting to chew through 14.5 million files in one go ends up being too much.

MD5 would probably be faster. You probably wouldn’t notice the difference in one-off usage, but across 14 million files it might add up.
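If you want real numbers for your particular CPUs, openssl has a built-in benchmark mode (it only measures in-memory hashing, so with 14.5 million files on disk the I/O will probably dominate either way):

openssl speed md5 sha256

On CPUs with SHA extensions, sha256 can actually come out ahead of md5, so it’s worth measuring rather than assuming.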

As @redocbew said, the file name is not part of the contents of the file; neither are the creation, modification, or access dates that you see in your file browser. No comparison or hash/checksum tool I know of will use that in its calculation.

Some file formats can store very similar information inside the file itself. For example, PDFs can store a separate Title and JPEGs store an EXIF date for when the photo was taken.

What may be confusing is that both the data stored outside the file (date modified, file name) and what is inside the content of the file (JPEG EXIF date, PDF Title) will frequently be called “metadata”. You can refer to them as external/filesystem metadata and JPEG/PDF/<name-of-filetype> metadata respectively to be specific, but in general everyone just relies on context to figure out what someone is referring to.

Edit: I do not know how to write this without it sounding patronising, but this really was a “good question” — in the sense that understanding file names as filesystem metadata and not part of the file itself is a part of the computer learning curve that is rarely explicitly explained.

Alternative to checksums for comparing two specific files

Also, if you are simply comparing two files, and do not need a hash/checksum, secure (SHA2) or otherwise (MD5), you can also use the diff -qs command just to tell you whether two files are identical or not.

-q or alternatively --brief is to only show whether files differ, without doing diff’s default text comparison
-s or --report-identical-files tells diff to explicitly say that files are identical, rather than its default of staying silent when there are no differences.
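Using your earlier example file names, that would be something like:

diff -qs image.jpg "image - Copy.jpg"

which should report “Files image.jpg and image - Copy.jpg are identical” (or “… differ” if they aren’t).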

Caveats to file data

Although increasingly rarely used, filesystems can support resource forks (Mac OS, OSX, macOS) or alternate data streams (Windows) which may store additional data outside of what we normally think of as the file. Neither diff nor shasum will look at this data.

The same is true of extended attributes, which on Macs are probably what stores tags/colours and Spotlight comments.
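If you ever want to check whether a file is carrying any of that extra data, you can list it (the file name below is just a placeholder; xattr ships with macOS, getfattr comes from the attr package on most Linux distros):

xattr -l somefile.jpg
getfattr -d somefile.jpg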

1 Like

I’m also curious how well it’ll handle millions of files, as that may consume all memory. Not sure how it handles that situation.

Oh sorry, in my mind he has mismatching files already. My bad.

how often do you need this deduping to happen?

do you want to write code, learn rust/python/go/zig/c++?

depending on how often it’s needed, I’d do it in a couple of steps.

step 1. pick up a list of flags from the command line, enumerate recursively all files incl. filename, (filesystem and inode) number, and size
step 2. group by filesize, remove everything that has a unique size
step 3. for each size group, group the remaining files by (fsize, filesystem, inode)
step 4. for each (fsize, fsid, inode) get a hash of the first 4k of the file, the middle 4k, and the last 4k (sketch below)

I think what remains has a very very very strong probability of being duplicates.

step 5. for each group compare actual file contents byte-for-byte.

optimizing step 5 only makes sense if there are lots of large groups of large files
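untested sketch of step 4, assuming GNU stat/dd and bash (the helper name is made up):

# hash the first, middle and last 4k of a file (partial fingerprint, not a full content hash)
partial_hash() {
  f=$1
  size=$(stat -c %s "$f")
  mid=$(( size / 2 / 4096 ))     # 4k block nearest the middle
  last=$(( (size - 1) / 4096 ))  # last (possibly partial) 4k block
  {
    dd if="$f" bs=4096 count=1 2>/dev/null
    dd if="$f" bs=4096 skip="$mid" count=1 2>/dev/null
    dd if="$f" bs=4096 skip="$last" count=1 2>/dev/null
  } | md5sum | cut -d' ' -f1
}

two files with the same (size, partial_hash) are very likely duplicates, and step 5 only has to confirm the few that survive.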


for step 5 we can probably assume i/o is expensive, … doing this accurately without hashing, without reading any data twice, without too many seeks on spinning rust, and without too much ram usage (or network io) is really interesting.

you start by assuming all the contents are the same, and then you start to verify it

you do one round across all files, reading a block from each (e.g. 4k or 1M or 8M); if a file’s block is the same as the previous one’s, you group the two files into the same list (vector / linked list / dynamic array)

you’ll have one such vector of file references per distinct latest-block contents.

it’s important to keep a reasonable limit on how much memory you assign to these blocks and how many you keep around. there’s no reason you can’t have a dynamic block size, such that you start with 4k assuming all files are different, and then if half of them are the same … you move to larger blocks … and then split the group further later (tracking more blocks, which can be smaller).

if you have 1000 files in a group, 4k per file only gets you 4M of ram used, and if all of them are different, you’re done.

if all of them are the same, congrats, you’re now using 4k of ram; can you reasonably assume the next 4k are going to look the same?

there are some interesting iops/time/ram tradeoffs one can make here … or you could just have a max block size and max group size on the command line … and dump large groups of same-sized files into some intermediary temp file as a checkpoint for secondary processing later on.

How long this step takes is heavily dependent on the machine all this is running on.
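that said, for modest group sizes step 5 can just be pairwise cmp, which already reads block by block and stops at the first difference (file names here are placeholders):

cmp -s candidateA candidateB && echo "identical"

it only gets painful when big files have to be re-read once per other member of their group.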

1 Like

I appreciate this.

A LOT of documentation online about hashing doesn’t actually talk about what is ACTUALLY being hashed (like you said - things ABOUT the file (e.g. filenames) vs. file metadata vs. the actual data of the file itself).

re: diff
Yeah, I am not using that because, using czkawka, I’ve already found multiple copies of the same and/or similar files spread across different directories.

(Sidebar: Interestingly enough, czkawka does exactly what I am talking about here already - it will first compare partial hashes to quickly rule out files that AREN’T duplicates of each other, and then, out of the remaining results, it will compare the full hashes against each other. VERY cool program.)

So…I actually first ran it on a directory (iPhone backup of wife’s egregious taking of pictures of our kids) which had, I think something like just shy of 175k files. The program itself didn’t have any issues running. (It didn’t take that much memory).

Where I ran into a problem was that it found something like a grand total of 45,000 duplicate files, in I forget how many thousands of groups, so it became a bit unwieldy for me to go through that many groups of duplicates (or try to) at once. I had to break the task down into smaller chunks so that I would only be reviewing maybe a few hundred groups of duplicate files at a time, which isn’t quite as daunting as trying to review thousands of groups for tens of thousands of duplicate files.

The computer and the program ran fine.

I am the limitation. (trying to make decisions about which copy to keep and which copies to delete)

So yeah…

No worries.

Some of the files are pretty much exactly identical. (i.e. multiple backups of wife’s iPhone pictures of our kids).

(I am deduping the folder(s) because I am prepping the folder to be written to LTO-8 tape, and writing a lot of JPGs is really slow on tape, so the fewer duplicates I am writing to said tape, the better it will be.)

Um…I think that running it once is probably good enough.

What I would REALLY wish for is a personally hosted “iCloud-like” service where my wife can upload her pictures and it is smart enough to know which pictures from her phone have already been uploaded and which ones are new, vs. it automatically creating a duplicate copy and my having to run a dedup after the fact.

I’m a GROSSLY underqualified sysadmin, so a Proxmox turnkey solution would be GREAT for something like that.

(I used to use Qnap’s Qphoto app, but she has so many pictures on her phone that the app just falls to its knees, so it doesn’t really work anymore.)

1 Like