Do you digitise your paper documents?

Hey. I’m wondering if anyone here digitises all their paper documents?

I have a bunch and I’m considering how best to do it. Does anyone here do that? If so, what’s your method for organising it all?

Genius Scan for iOS. I should spring for the Plus version, which is a one-time fee of $8 and offers OCR along with other things. But #yolo. As for organizing, I just manually put the scans in a folder.

In America, there was this scanner that was always shown on infomercials, but I forgot the name. It did everything for you: it auto-scanned the documents in the tray, ran OCR, and based on the OCR it was able to tell what the document was (a receipt, a business card, a bill) and auto-organized it into folders for you.

Edit: it’s called the neatscan. Here’s how it compares to other scanners:

I’ll take a look at Genius Scan. I was thinking of just having a folder structure and good file naming, but I did wonder if there was something that could tag things better for search. I know that iOS/OSX has tagging now. I’ve not played around with it too much yet, but that might actually do the trick as a native solution.

I know Readdle has a scanner app as well, and there’s Adobe of course. Readdle is a one-time purchase; Adobe is free, with a subscription add-on for extra features. It defaults to saving to the Adobe Document Cloud rather than iCloud, though, and you can’t tag documents in the Adobe apps.

I try to, with one of those as-seen-on-TV scanners (it works on Linux). Based on what it reads while scanning, it turns the document into a searchable PDF, which is handy for receipts and appointment cards. It lets me keep the clutter down … if I ever get back to scanning.


My mom uses it more, for her stuff. These scanners are really good for the amount of time they save on data entry.

Another fun thing to do with document scanning is telling what device printed the document. You need a higher-resolution scanner than most to see the MIC (machine identification code) yellow dots on pages (you’ll see these a lot on documents from offices and doctors). The yellow dots are a unique ID for the one printer that produced the page. This was mainly pushed by tech companies because of a non-scare over the Unabomber (though he never used a printer, only a typewriter, which is just as traceable) and was fully implemented in the early 2000s. You can trace each document down to the serial number of the exact printer that printed it.

How does it compare to using a phone? The technology seems to have gotten really good these days to the point that it’ll find a document, find the edges, take the shot and then flatten it all out for you.

For mass scanning I can’t just take a pic of everything. I can scan faster than I can set up pics.

Like, if I’ve got 30 docs to scan, I’ll use the scanner rather than take pics, but if I’ve just got an appointment card I can snap that and forget about the card.

Some time back when I was living within the Microsoft ecosystem I used OneNote and back then OneNote had a built in scan feature. All new snail-mail that came in was scanned and stored in OneNote.

Since then the scan feature was removed from OneNote bolloxing that system. Plus I’m now 100% Linux so that has killed off that idea.

I have yet to implement something new for my current set up. What made me think it was a good idea was in part because it was so brain dead simple and worked so well with minimal effort. I guess like most people I get some mail and I think it’s important then it ends up in a pile on a shelf. Stays there till it falls off the shelf 6 months later and I can’t remember why I thought it was important and I shred it! At least when it was all on OneNote I could find it if I needed it if the time came.

I haven’t really done this, but it is interesting and I should probably do it as well, as it could save me headaches in the future. This is because I use a paper shredder on most of my documents: if any of my family members have their personally identifiable information on a document, receipt, or bill, it goes to the shredder. My spouse’s work documents also have a lot of personally identifiable information, so they go to the shredder too.

The problem with scanning is that it is relatively inconvenient compared with any modern mobile phone. But if you want the page flat and without motion artifacts, a scanner is the way to go. You should probably go for a lower resolution, something that will fit on your screen at most, so about 150 pixels per inch should be sufficient. 300 may be overkill unless there is fine detail in the receipt that you require.
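To put rough numbers on those resolutions, here is a small sketch. It assumes an 8.5x11 inch page and uncompressed 24-bit color, neither of which comes from the post above; a compressed PDF will be far smaller than the raw figure.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_mb(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size in MB, assuming 3 bytes (24-bit color) per pixel."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000


# Assumed page size: 8.5 x 11 inches.
for dpi in (150, 300, 600):
    w, h = scan_pixels(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, ~{raw_size_mb(8.5, 11, dpi):.1f} MB uncompressed")
```

Doubling the dpi quadruples the pixel count, which is why 300 or 600 only pays off when you actually need the detail.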

Which then brings us to an even more fun part: organization. Right off the bat I am thinking of something local, which rules out the convenience of Adobe Cloud. They are likely using telemetry to snoop on whatever you are doing. This makes me think of something like digiKam, which has the ability to sort and tag images. There are likely OCR features in Linux/FOSS, but I really have not explored them yet.

EDIT:
I guess you can sort documents in a directory like:
/year/month/company
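A minimal sketch of building that path in Python. The `/scans` root, the function name, and the lowercase-with-hyphens company folder are all assumptions made for illustration, not part of the suggestion above.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_dir(root, doc_date, company):
    """Build root/year/month/company for a document dated doc_date."""
    # Normalising the company name is an assumption; it just keeps folders consistent.
    folder = company.strip().lower().replace(" ", "-")
    return PurePosixPath(root) / f"{doc_date.year:04d}" / f"{doc_date.month:02d}" / folder


# Hypothetical example document.
print(archive_dir("/scans", date(2019, 3, 14), "Acme Power"))
```

Zero-padding the month keeps the folders in date order when sorted by name.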

We have Canon imageRUNNER copy machines at work with OK-ish scanners on them. I scan the document and then use the copy machine’s email function to send the scanned .pdf to my work email. If the document is something I need for work, like a doctor’s note for sick leave, I keep it in my email or save it on my work laptop.

If the document is something not work related, like important paper bills and so on, I email the document to my private email address and save it to OneDrive.

I don’t really have that many paper documents I need to digitize, so I’ve never tried to find out what is the best or fastest way to do it. Regular scanners have worked well enough for me, but I know my friends use different mobile phone apps to scan paper documents.

If you can use multiple tags, then that would be the best way to organize the files in my opinion, since a file can have multiple properties, like being both a doctor’s note and a receipt. Folders only allow you to store a file in one place at a time.
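As a rough illustration of why tags are more flexible than folders, here is a tiny in-memory tag index. The class, tag names, and filenames are all made up for the example; real tagging would live in the OS or in a document manager.

```python
from collections import defaultdict


class TagIndex:
    """Map each tag to the set of files that carry it."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)

    def tag(self, filename, *tags):
        for t in tags:
            self._files_by_tag[t.lower()].add(filename)

    def find(self, *tags):
        """Return the files that carry every one of the given tags."""
        sets = [self._files_by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*sets)) if sets else []


# Hypothetical files: one document is both a doctor's note and a receipt.
index = TagIndex()
index.tag("clinic-visit.pdf", "doctors-note", "receipt")
index.tag("hardware-store.pdf", "receipt")
print(index.find("receipt"))
print(index.find("receipt", "doctors-note"))
```

The same file shows up under both tags without being copied, which a plain folder tree cannot do.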

I feel like I had a bunch more paper documents a few years ago. I’d just scan, OCR with fingers crossed, and upload to Google Drive for indexing.

In particular, Drive allows for a file to be in multiple folders, similar to hard links I guess.

I wouldn’t be of much help, but there are FOSS, OpenCV-based apps on Android.

I guess that’s not really a thing for Apple. Needless to say, I still scan with a dedicated multifunction printer at home and use Linux to organize and tag my stuff. OCR I do in LibreOffice on occasion, when I need it.

I’m not. But I am thinking about it. Currently I’m going through my NAS and organizing all my digital stuff into “needs to be backed up” and “doesn’t really matter”. When that’s done I’ll finalize my Backblaze setup, and with that I’ll probably take a look at what is out there.

So …

MONITORING!

@wendell mentioned he had some scanner setup with OCR and filing in that podcast-y episode with selfhosted. Kind of made me tempted to set something similar up.

At the moment everything just goes straight in the trash [which is going to come and bite me in the ass when I inevitably get audited by the tax-man].

So I’ve done this on a massive scale, starting out with no budget and then scaling slightly as prices dropped. This is our current crop of scanners:
Brother mfc-8860dn x3 (1 out of service)
Brother mfc-8460n x2 (1 out of service)
Brother ads-2500w x2
Brother ads-3600w
Brother mfc-6490cw
Brother mfc-7860dw
HP 8500fn1 x4 (1 out of service)

So the first thing I would look at is what your overall goal is, based on your source material. Our goal was to reduce the space taken up by years of business records and still have the ability to print an almost exact replica of the scanned document. This defined one thing for us, scan resolution and format, which is 600dpi full-color PDF.

How we wanted to organize the source material was pretty easy. It was already in large manila envelopes, so basically each envelope was one scan or series of scans that would be organized in one place. We decided that individual scan names weren’t important, since you can open multiple PDFs quickly to find what you need, just like thumbing through a stack of papers. The envelopes each held one business day of records, so we named directories for the year/month/day. These were under each company name, so the full format was company/year/month/day/individual scan files.

Speed wasn’t of consequence until we started scaling up. Machines that did the speeds we wanted weren’t even close to our price range (back when we started this in 2012), so we initially paralleled. The Brother MFC machines were as cheap as $50 used on craigslist and we already had the 8460N (leftover from one of my business ventures). At 600dpi full color, these machines scanned a single side of a single page in 3 minutes. Not fast at all, but they did the job.

The key feature on the MFC machines was presets and the ability to scan to FTP. This took a computer completely out of the process, along with all the inherent issues that come with computers, including UI training. Scanning files straight to a NAS and then having someone ‘file’ the scans as they came in was the simplest workflow, and it allowed as much scaling as we had resources for. We could have scaled to 100 scanners and 20 people filing everything if we wanted to, and luckily we never had to do that as prices came down and we purchased faster equipment. The faster equipment just improved the speed. The HP 8500fn1 can scan at over 20 pages per minute, 40 ppm if duplex, and is gigabit, so sending to the FTP server is faster than over fast ethernet.

So that’s scanning and workflow as we originally started. But as we saw the benefit of a paperless workflow–quick access from anywhere on our network, digital zooming to see things we couldn’t, digital collaboration at a distance–we wanted to digitize more. This is where challenges came in.

We wanted to digitize every drop of paper coming in: business paperwork, the mail, paper we created that wasn’t already in the computer, documents given to us, everything. This moved us away from neat manila envelopes that held a single day of 8.5x11 simplex sheets of paper to varied sizes and lengths of paper.

The mail was the biggest challenge. Mail comes in all sort of sizes at times, from the normal #10 envelopes to sizes larger than 8.5x11. How could we do that? Well, that’s a flatbed scanner’s job, and the Brother MFCs do that well. Initially speed was an issue even in parallel, so getting behind was a problem. The HPs were $4k back in the day so while they were the ideal solution, they weren’t purchased until just last year when the price dropped below $1000. We picked up a few more for under $300 as the new 8500fn2 came out and people dumped the 8500fn1, which is still a very, very capable scanner.

Formatting for mail was a bit more challenging, but after a few years the full workflow we use today emerged. We have presets for every company (or personal for personal stuff) on each scanner. Each scanned incoming document initially goes to the company’s ‘incoming’ directory, so company/incoming/scan name. Again, we don’t care about the scan filename as long as it is unique.

Someone reviews the incoming scans (primarily me) and as they are worked on, they are moved to the relevant company our company is doing business with. Everyone is generally a company, so our isp bills go under company/ispname/year-month/scan name. We have found that grouping scans within the same month range is narrow enough that even if you have to review a full month of pdfs, it literally takes a few seconds to pull them all up and close the ones you don’t need. This is a much smaller workload impact than having to try to rename all the files.
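The filing step described above could be sketched like this. It is only an illustration of the directory layout: the function name, the `root` argument, and the refusal to overwrite an existing file are assumptions, and the actual filing here is done by a person.

```python
import shutil
from datetime import date
from pathlib import Path


def file_scan(root, company, counterparty, scan_date, scan_name):
    """Move company/incoming/scan_name to company/counterparty/YYYY-MM/scan_name."""
    src = Path(root) / company / "incoming" / scan_name
    dest_dir = Path(root) / company / counterparty / f"{scan_date:%Y-%m}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / scan_name
    if dest.exists():
        # Assumption: never silently overwrite a filed scan.
        raise FileExistsError(dest)
    shutil.move(str(src), str(dest))
    return dest
```

For example, `file_scan("/nas", "ourco", "ispname", date(2019, 3, 14), "scan0001.pdf")` would file the scan under `/nas/ourco/ispname/2019-03/`, where all the names are placeholders.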

Credit card receipts get treated a bit uniquely and are filed with the credit card company vs each company on the receipt. We primarily do this because the receipt entries into the accounting system allow us to find a particular receipt by name, etc, and then we simply look up the scans for that particular month if we need the original scan. This is analogous to the paper system used by many individuals and companies for cc receipts. We also did this because filing so many cc receipts would take a lot of time.

Speaking of filing, it is important to file stuff in a way that makes sense to you, and if you already have a system in paper, just automate it. Don’t waste time adding new ‘features’ you don’t need, as they can add significant time to the whole process (like naming individual files).

The filing system we use also works for things which are not scanned, like prints to PDF of important data that needs to be retained, such as online confirmations, emails, etc. We simply file these in the same structure of company/company/year-month/filename (our company, then the company we are dealing with), with the filename being whatever the browser defaults to, as we don’t need to worry about the filenames as long as they are unique. If there are dupes, we simply add a number or letter to the end of the newer file’s name as it is being saved to make it unique.
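The ‘add a number to the newer file’ rule is easy to automate. A minimal sketch follows, assuming a `-2`, `-3`, ... suffix placed before the extension; the exact suffix style is an assumption, since any unique name would do.

```python
from pathlib import Path


def unique_path(directory, filename):
    """Return a path in directory that does not clash with an existing file."""
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 2
    while candidate.exists():
        # Assumed naming style: confirmation.pdf, confirmation-2.pdf, ...
        candidate = directory / f"{stem}-{counter}{suffix}"
        counter += 1
    return candidate
```

Saving to `unique_path(folder, name)` instead of `folder / name` leaves the older file untouched and gives the newer one the suffix.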

So once we had the mail down and the ‘daily packets’ taken care of, we thought about moving some of the scanning workload to the businesses that generated the daily packets–and that’s where we placed brother mfc machines at the businesses themselves and connected those machines to our local network via ipsec vpn tunnels. The mfc machines had no idea that they were hundreds of miles away because they just sent their files to the ftp server like usual, just a bit slower. This required more networking infrastructure as we needed vpn routers to create a wide area network (wan), and we happened to have some equipment that was decommissioned that fit the bill. Initially things were slow because Internet speeds topped out at 40Mbps and vpn routers weren’t very powerful, but as speeds came up and equipment became cheap enough, we upgraded this infrastructure as well (CDW Outlet is a great place to pick up cheap enterprise networking gear).

Using the machines at the businesses, employees started to directly scan documents that went into the ‘daily packets’ in real-time. So we not only had the scans faster, the workload on our end reduced to just verifying that all the pages were properly scanned.

This worked great for businesses that had standard 8.5x11 pages, but we have one business where the primary paper source is 4" thermal paper in various lengths up to 16 feet long. For this, we acquired the ads-2500w, which can scan up to 3ft long, and also the mfc-6490cw, which has a flatbed area of 11x17. We used the ads-2500w to scan these items at our office and set up the mfc-6490cw for employees to scan certain documents, typically under 3ft long, many of them just signed credit card receipts and other smaller documents.

We recently purchased the ads-3600w, which can scan up to 8ft long and at 2x the speed of the ads-2500w. It also can send files to the FTP server much faster than the ads-2500w. That slow sending is why we had bought a second ads-2500w to alternate between the two, as you cannot scan another document on the Brother machines while they are sending the file to the FTP server. Not only can you not start scanning another document, but the sending speed is limited to 200K/sec, which means a 3MB file takes at least 15 seconds if not more. Multipage documents would take minutes, and the ads-2500w’s larger file size made this worse even though it doubled the sending speed to 400K/sec. The ads-3600w fixed the sending speed problem.

The HP machines send in the background, so you can scan continuously while they process and send. This is another reason why the HP machines have become our go-to machines.
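The sending-time figures above are simple arithmetic. A quick sketch, taking 1 MB as 1000 KB (which is what the 15-second figure implies) and ignoring protocol overhead:

```python
def send_seconds(file_mb, rate_kb_per_s):
    """Seconds needed to send a file, ignoring FTP and network overhead."""
    return file_mb * 1000 / rate_kb_per_s


print(send_seconds(3, 200))  # the 3MB file at 200K/sec from the post
print(send_seconds(3, 400))  # the same file at the doubled 400K/sec rate
```

Because the Brother machines cannot scan while sending, this time is added to every document, which is where alternating between two scanners helps.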

And so that brings me to the end of everything I can remember as of right now. The system works super great and has reduced our paper storage requirements at least 10x if not 100x, as all the paper records for multiple businesses fit on a single 4TB drive. The distributed computing model allows us to scan in documents and instantly collaborate with each other no matter where we are by just logging into the VPN network, all while being secure. I know this is probably a much, much larger scale than one needs in the home, but I actually use the exact same system for my personal documents now. I have an HP 8500fn1 at home and scan all our incoming mail and then file it in the exact same manner. It’s ridiculously fast and allows quick retrieval of almost anything.

The full closed loop on this system is to verify everything as being scanned and filed correctly before destruction. We’ve not yet started on that, as we haven’t processed everything in all the incomings as of yet (there’s a lot of work to do). But that process will be pretty simple, as the source material is simply checked against its scanned and filed copy. If they match, the original can be destroyed.

Because all our records are now digital, it is also easier to keep safe copies of it for disaster recovery. We can put copies in safe deposit boxes and even replicate to other NAS units in different states in real-time over the vpn network. Aside from the small amount of time it takes to do the scanning and the initial cost of the equipment, I can’t see a single disadvantage to digitizing documents.

Anyways, I need to get back to scanning some documents on the ads-3600w, so if you have any questions feel free to ask, and I hope you enjoyed reading about our experience!

That just sounds like a bad idea.

Sometimes I do. I have a crappy Pixma MG2410 that does the job.

Thanks for the detailed answer. That’s also ridiculous :smile: in a good way, I think. Do you also try to stop people sending paper in the first place?

For me, since it’s only personal documents, I’m really looking to digitise and back up a small number of documents: occasional important letters and other documents frequently used for applications etc. So it just needs to be good enough rather than perfect. I’m also wanting to minimise the clutter, so testing out some phone apps will be ideal for me, I think.

Most papers here are scanned with my printer/scanner (an ancient Canon Pixma MP630). Bills, legal paperwork, receipts that involve warranty etc.
For the latter I set it to automatically detect the document size.

Unfortunately one of the stores I regularly visit has receipts that are about 50cm long, which is quite a bit longer than the 30cm that the scanner bed allows. I solved that by storing those in paper form still.
I do need to keep certain things on paper anyway (like the originals of my diplomas and certificates), so I already had organizers.

I keep a spreadsheet with every item that has some sort of warranty. That sheet contains the item description, the price, the date purchased, where I purchased it, the warranty duration, the warranty expiration date, and which receipt number or filename it is listed on. That makes finding anything a breeze.
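As a rough sketch of what such a sheet lets you compute, the snippet below derives the expiration date from the purchase date and the warranty duration. The rows and field names are hypothetical, and storing the duration in whole months is an assumption.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping to the last day of short months."""
    years, month_index = divmod(d.month - 1 + months, 12)
    year, month = d.year + years, month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def still_covered(rows, today):
    """Return (item, expiry) pairs whose warranty has not yet expired."""
    covered = []
    for row in rows:
        expiry = add_months(row["purchased"], row["warranty_months"])
        if expiry >= today:
            covered.append((row["item"], expiry))
    return covered


# Hypothetical rows; a real sheet would also hold price, store and receipt filename.
rows = [
    {"item": "kettle", "purchased": date(2019, 11, 30), "warranty_months": 3},
    {"item": "monitor", "purchased": date(2020, 1, 15), "warranty_months": 24},
]
print(still_covered(rows, today=date(2020, 6, 1)))
```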
For the rest of the scanned papers a good naming scheme and folder structure works miracles.

Yeah, it’s super thorough and already saved us thousands of dollars when we needed an exact document asap. Things that used to take 2 days can take 30 minutes now.

As far as stopping people sending paper: to a certain extent. Even though it’s not apparent, the world still officially works on paper, with most legal and official documents, notices, etc. all on paper. You can ‘opt out’ of these and ‘go paperless’, but there’s always a warning that if you do so and you don’t get the electronic version of whatever it is, it’s your fault. So we generally still let them send stuff to us to scan.

For a lot of the longer receipts that we run across, we use the ‘scan and slide’ method where you scan the first part of the receipt as a page, and then slide the receipt with an overlap point and then scan the second portion of the receipt and repeat until it’s fully scanned. This results in a multi-page pdf that has the whole receipt in pieces without actually cutting it into pieces. :smiley:
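If you want to know how many passes ‘scan and slide’ needs, the count follows from the bed length and the overlap you leave between passes. A small sketch; the 2cm overlap in the example is an assumed value, while the 50cm receipt and 30cm bed are the figures mentioned earlier in the thread.

```python
import math


def scan_passes(receipt_cm, bed_cm, overlap_cm):
    """Number of flatbed passes needed to cover a long receipt."""
    if not 0 <= overlap_cm < bed_cm:
        raise ValueError("overlap must be smaller than the bed length")
    if receipt_cm <= bed_cm:
        return 1
    # The first pass covers a full bed; each later pass adds bed minus overlap.
    return 1 + math.ceil((receipt_cm - bed_cm) / (bed_cm - overlap_cm))


print(scan_passes(50, 30, 2))  # 2
```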
