Guard my documents against OCR

In my particular work environment, I need to send out my written work to other people. Very often this leads to others capitalizing on my efforts by blatantly copy-pasting, so I choose to send PDFs printed “as an image”. With OCR integrated into many PDF readers nowadays, this is easily defeated.

I wonder if you have any other ideas on how to make a document readable for humans but difficult for third parties to extract text from. I faintly remember some PDFs whose extracted text was gibberish (even though the characters were extractable), but I forget how that was done (maybe by adding a layer of gibberish text on top of a PDF printed “as an image”?). Maybe there’s a way to add optical noise to the final PDF to fool OCR?

Share your ideas

What’s to stop them from simply retyping the contents of the PDF? If you have colleagues who steal your work like that, this is probably not the ideal solution…

Retyping is much more cumbersome, so if that would be the only way to copy my work, I am okay with that. I specifically want to guard against automated methods.

1 Like

I feel like if this is an issue at work the right person to address is your manager, not a random IT forum.

Surely there are guidelines in place for this.
Also messing with work material can get you in trouble too.

And if your manager is unable or unwilling to help you, get the hell out of there. It’s a toxic environment.

4 Likes

I need to send out my written work to other people.

You cannot prevent OCR without going to extremes that are entirely self-defeating, like unusual paper, very low contrast etc.

You are also trying to solve a social/management problem by technical means, which is usually doomed from the start.

Submit your work in digital form, electronically signed. Your authorship is then undeniable, with a verifiable timestamp to prove your work is the source material.

Otherwise get management involved or, if that is not feasible, play things defensively. Only people you trust and people who need it get handed your work.

4 Likes

Thanks for all the answers, maybe some of the ideas are implementable in my case.

However, for future reference, I used a simple command to produce an “image .pdf” and lower the resolution as necessary (making OCR more prone to errors). Just something to keep handy in case somebody needs it. I use Debian Linux. Adjust the value (“100”) as needed.

convert -density 100 input.pdf -depth 8 -strip -background white -alpha off output.pdf
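
For anyone who wants to try several resolutions in one go, here is a minimal Python sketch that wraps the command above. The function names (`build_convert_command`, `rasterize`) are hypothetical; the sketch assumes ImageMagick's `convert` is on the PATH and uses only the flags from the command above.

```python
import subprocess


def build_convert_command(src: str, dst: str, density: int = 100) -> list[str]:
    """Build the argument list for the ImageMagick command shown above."""
    return [
        "convert",
        "-density", str(density),  # the value to adjust ("100" in the post)
        src,
        "-depth", "8",
        "-strip",
        "-background", "white",
        "-alpha", "off",
        dst,
    ]


def rasterize(src: str, dst: str, density: int = 100) -> None:
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(src, dst, density), check=True)
```

Calling `rasterize("input.pdf", "output-72.pdf", 72)` and then repeating with 100 and 150 gives a few candidates to compare by eye and to run through an OCR tool.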

1 Like

Just make all your documents look like CAPTCHA text:


Handwriting/cursive is very difficult for OCR.

A continuous gradient background would make things more difficult as well.

These will make your work look less professional…

You can secure PDFs against printing and against copying text out of them (but nothing prevents screenshots):

4 Likes

Just a quick sanity check: will your boss(es?) accept low-quality output? I would expect to get some flak for this.

That’s why I stressed that OCR-defeat methods end up entirely self-defeating, by making the output poorly readable for the actual intended recipients.

1 Like

This may or may not be helpful, but here is an entirely different way of tracking text.

Keep the product fully digital but randomly add extra space characters here and there. Then SHA256 the documents (per page even) and log which hashes go to which clients. Upon a ‘leak’, just check the hash and bill the client accordingly. The spaces go unnoticed by a reader, and the odds of them finding the extra spaces are very low, but the hash would make it obvious when tracking down the perp.

As an example, I added several extra spaces above.
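
A minimal Python sketch of that idea, assuming plain-text input in which words are separated by single spaces. The function names, the recipient labels and the choice to seed the random positions from the recipient's name are all hypothetical, not something described above.

```python
import hashlib
import random


def fingerprint_copy(text: str, recipient: str, extra_spaces: int = 5) -> str:
    """Return a copy of `text` in which a few single spaces are doubled.

    The positions come from a PRNG seeded with the recipient label, so the
    same recipient always receives the same copy.
    """
    space_positions = [i for i, ch in enumerate(text) if ch == " "]
    rng = random.Random(recipient)
    count = min(extra_spaces, len(space_positions))
    chosen = set(rng.sample(space_positions, count))
    return "".join(ch + " " if i in chosen else ch for i, ch in enumerate(text))


def sha256_hex(text: str) -> str:
    """SHA-256 of the UTF-8 encoded text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_log(text: str, recipients: list[str]) -> dict[str, str]:
    """Map the hash of each recipient's copy to that recipient."""
    return {sha256_hex(fingerprint_copy(text, r)): r for r in recipients}
```

To trace a leak, hash the leaked text with `sha256_hex` and look it up in the log. The lookup only succeeds if the leaked copy is byte-for-byte identical to what was sent; any retyping or whitespace normalisation changes the hash, in which case one would have to compare the positions of the doubled spaces instead. For per-page tracking, the same functions can be applied to each page's text separately.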

2 Likes

My idea is that you should be fired for being petty and using technology to actively make your coworkers/clients/whomever’s life more difficult.

Seriously, text as image?

Whatever problems you have with perceived plagiarism, you can solve through the usual, not-so-hard-to-imagine channels.

Well that might be a bit harsh. One could imagine a situation where the people the original poster is forwarding work to are more clients than coworkers. One could also imagine a freelancer asking such questions…

1 Like

so that all the pdfs look like ransom notes? :rofl:

4 Likes

Huh? I will be judged by the person/organization paying me; your opinion on the matter is thankfully of zero importance. I asked for ideas on the technical implementation.

I like learning new stuff every day though. Delivering a text PDF printed as an image is a reason to get fired :rofl:
Oh man, I wish I could know you better

Whilst this may work in some cases, this definitely doesn’t work here. HTML will skip extra spaces and even the forum backend seemingly has truncated them :wink:

It’s more that any effective anti-OCR measure would make the output document actively unpleasant to use for its intended purpose. As in turning in defective work for little to no reason.

Some bosses would find it hilarious on first submission, then politely tell me/you to shove it up my ass and deliver reports as usual (like my boss, but he would share it with everyone else first for laughs). Another boss, one I don’t report to, would likely use it against me.

On the technical side it’s a fascinating task, but it seems doomed from the start: you are trying to implement physical DRM in something that:
(a) has a physical form
(b) is inherently readable
(c) circulates outside your control.

That’s a mission-impossible level of task complexity.

As posters above hinted, hidden steganographic identifiers in the distributed versions might help detect who is stealing from you, but that seems like too much.

Would you elaborate on the practical details: who did what, where, how you found out, and what the relationship is between you, the bossman and the pirate in question?

EDIT: I went on a research side trip, since the idea is actually fascinating. No dice, however; I found only a single reference, where everyone says pretty much the same:

Basically, what is easy for humans to read can be perfectly OCR-ed. What is difficult to OCR is difficult for people too. In the worst case, an attacker may hire an Indian company to retype the text manually; this is actually not that expensive.

1 Like

1 Like

The goal is to find the balance between it being cumbersome to copy-paste and it being readable. People here focused on the extreme end of the spectrum (making OCR impossible with over-the-top CAPTCHA fonts or handwriting), probably because this is a technical forum. But this is not the goal.

A fair example is an actual printed hardcopy which is then scanned again. This is mostly accomplished by my line of code above; by tweaking the DPI value, one can move through the spectrum and reach the balance I mentioned.

I am actually content with the convert command and a balanced DPI resolution. No bosses will get mad.

Then a password-protected .pdf file with permission restrictions is your best approach.

Reasonably hard to circumvent, and it prevents casual users from doing anything with it beyond eyeballing it and making screenshots.

Even printing can be restricted.
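
One way to apply such restrictions from a script is the qpdf command-line tool. That choice is an assumption, since no tool is named here. A minimal Python sketch that builds the call; the function names and the passwords are placeholders:

```python
import subprocess


def build_qpdf_command(
    src: str, dst: str, owner_password: str, user_password: str = ""
) -> list[str]:
    """Build a qpdf call that encrypts `src` and restricts permissions."""
    return [
        "qpdf",
        "--encrypt", user_password, owner_password, "256",  # 256-bit key
        "--print=none",    # disallow printing
        "--modify=none",   # disallow modification
        "--extract=n",     # disallow text/graphics extraction
        "--",
        src,
        dst,
    ]


def restrict_pdf(src: str, dst: str, owner_password: str) -> None:
    """Run qpdf; raises CalledProcessError if it fails."""
    subprocess.run(build_qpdf_command(src, dst, owner_password), check=True)
```

With an empty user password, anyone can open the file, while the owner password guards the permission flags. These flags are honoured by the PDF viewer rather than enforced by the file itself, so screenshots (and OCR of them) remain possible.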

Still curious what the fuck is happening in your workplace. This kind of weird office politics is completely outside my personal experience, hence the curiosity.

In the event that you are doing contract work and clients are stealing that work, you might consider retaining a legal professional to help you draw up a contract that stipulates monetary penalty for such behaviour. Give them enough rope to hang themselves.

It is possible to add spaces in custom HTML. Really, though, unless you are willing to share your process, we are all just throwing mud at the wall.