Digitize your documents

We have tips on scanners, OCR software, Web OCR, and converting your books to e-books.

The space required to store paper documents can be a problem. Digitizing your documents renders them exquisitely portable--you can store an entire library on your e-book reader with ease. And because paper documents can be turned into editable computer documents, they become searchable. Compare typing "Roosevelt" in a search field with spending all day scanning microfiche and old newspapers by eye to research the Square Deal or the New Deal. The digital document is a boon to researchers the world over.

You can store documents digitally in one of two ways--as images or as text files. Images require far more space, but retain the character and flavor of the original document. Converting a scanned image to a text or word processing file involves what's called optical character recognition, or OCR. It's a bit of a misnomer, since you're actually processing digital information, but the term has stuck.

If the original document was written by hand or is art, storing it as an image is generally more desirable--the style of the handwriting can be as meaningful as the words themselves. The other reason for storing handwritten documents as images is that there are no commercially available handwriting recognition packages that can interpret handwritten characters from scans. So far, it's a technology stuck in the PDA and tablet world. Anne-Sophie Bellaud of Vision Objects (a purveyor of handwriting recognition software) explains that with tablets the software knows the order in which hand-printed or handwritten characters were entered. This provides huge clues for the software. Without an entry timeline, handwriting is not nearly as easy to recognize.

Scanners

No matter which way you'll be storing your documents--as images or as text files--you'll need a scanner to digitize them. If you have relatively few documents to process, a multifunction printer or a dedicated flatbed scanner such as those discussed in "Digitize Your Pictures" will suffice. They're relatively slow, however, and only the more expensive models have automatic document feeders to handle multipage documents.

Though pricey, sheet-fed scanners are just the ticket if you need to process a lot of documents. Units such as Fujitsu's US$495 ScanSnap S1500 and HP's $450 ScanJet Professional 3000 scan both sides of a document at once and average 20 pages per minute or better. I'll give the HP props for slightly more reliable paper feeding with mixed document types, but the Fujitsu has the superior, better-integrated software.

OCR Software

Most scanners ship with OCR software that you can install on your PC, but if yours lacks it, you can buy the software separately. ABBYY's $50 FineReader 9 Express ($400 for Pro 10), Nuance's $150 OmniPage 17 Standard (the Pro version is $500), and Adobe's $299 Acrobat X Standard (Pro is $449) are all good choices. Nuance's $100 PaperPort 12 Standard (Pro is $200) also scans, does OCR, and adds document management features that make it easier to keep track of your documents. Less expensive versions exist for most of these programs, so slow your heart rate.

In my hands-on tests with clean 300-dpi scans, Acrobat did the best job of converting documents, followed closely by FineReader, and not so closely by OmniPage and PaperPort. But the latter three products did better with the three low-quality, 150-dpi scans that I included among my test documents.
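
One way to put a number on comparisons like these is to score each program's output against a hand-corrected transcript of the same page. The sketch below is not the method used for the tests above; it assumes a simple character-level similarity from Python's standard difflib module, and the function name and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference: str, ocr_output: str) -> float:
    """Return a 0.0-1.0 similarity score between a trusted transcript and OCR text."""
    # Collapse runs of whitespace so differing line breaks do not count as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    if not ref and not out:
        return 1.0
    # autojunk=False stops the long-input heuristic from ignoring common characters.
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


if __name__ == "__main__":
    truth = "Roosevelt proposed the New Deal."
    guess = "Roosevclt proposed the New Dea1."
    print(f"similarity: {ocr_similarity(truth, guess):.3f}")
```

A score of 1.0 means the OCR text matches the transcript exactly; running the same page through several programs and comparing scores gives a rough ranking.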

For documents stored as images, 150 to 200 dpi is usually fine, but OCR software works much better with 300 dpi scans. Much depends on your needs. If you just want to retain legibility, you may be able to drop the dpi and reduce your storage requirements.
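
To see how much resolution affects storage, it helps to work out the pixel count. The following is a minimal sketch, assuming a letter-sized page (8.5 by 11 inches) and 24-bit color with no compression; the function names are illustrative, and real files saved as JPEG or PDF will be much smaller than these raw figures.

```python
def scan_dimensions(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Pixel width and height of a page scanned at the given resolution."""
    return round(dpi * width_in), round(dpi * height_in)


def uncompressed_bytes(dpi: int, bits_per_pixel: int = 24) -> int:
    """Raw size of a letter-sized scan before any compression is applied."""
    width_px, height_px = scan_dimensions(dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (150, 200, 300):
        w, h = scan_dimensions(dpi)
        mb = uncompressed_bytes(dpi) / 1_000_000
        print(f"{dpi} dpi: {w} x {h} pixels, about {mb:.1f} MB uncompressed")
```

Because pixel count grows with the square of the resolution, dropping from 300 dpi to 150 dpi cuts the raw data to a quarter.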

Web OCR

Several online services--such as www.free-ocr.com, www.newocr.com, and www.ocronline.com--are good for small-scale projects or one-offs. First you scan the original to your PC, then upload the document to the Website.

The services have limitations: My tests yielded results that weren't very accurate. Also, only text is recognized, not lines and other page elements.

The first service mentioned above, www.free-ocr.com, is free, but files can be no larger than 2MB (about 150 dpi for a letter-sized page) and no wider or higher than 5000 pixels, and you can do no more than 10 uploads per hour.

Another service, www.newocr.com, is also free, but the interface is primitive. It does a much better job, though, of pulling text than free-ocr.com, and it allows documents up to 5MB in size.

Finally, www.ocronline.com requires creating a free account, but allows 4MB images (about 200 dpi per page) and up to 15 uploads per hour. You get 10 free credits, one per page, but after that you must pay for them. The site sells credits in varying quantities, from 50 for $3.95 (about 8 cents per page) up to 5000 for $49.95 (about 1 cent per page). I got good results with this service, which handles graphic elements as well as text, though it wasn't up to the standards of Acrobat X or FineReader 10.
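
The limits quoted above can be checked before you upload anything. This is a rough sketch using only the figures given in this article, which may have changed since; it assumes a megabyte means 1,048,576 bytes, and the class and function names are hypothetical rather than part of any service's API.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the services count a megabyte as 1,048,576 bytes


@dataclass(frozen=True)
class ServiceLimit:
    name: str
    max_bytes: int
    max_side_px: Optional[int] = None  # None means no pixel limit is quoted


# Limits as quoted in the article; they may no longer be current.
SERVICES = (
    ServiceLimit("www.free-ocr.com", 2 * MB, 5000),
    ServiceLimit("www.newocr.com", 5 * MB),
    ServiceLimit("www.ocronline.com", 4 * MB),
)


def services_accepting(size_bytes: int, width_px: int, height_px: int) -> list[str]:
    """Names of the services whose quoted limits the scan fits within."""
    accepted = []
    for service in SERVICES:
        if size_bytes > service.max_bytes:
            continue
        if service.max_side_px is not None and max(width_px, height_px) > service.max_side_px:
            continue
        accepted.append(service.name)
    return accepted


def cost_per_credit(price_dollars: float, credits: int) -> float:
    """Price of a single credit, in dollars, for a given bundle."""
    return price_dollars / credits
```

For example, `services_accepting(3 * MB, 2550, 3300)` describes a 3MB scan of a letter-sized page at 300 dpi, which is too large for the first service but fits the other two.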

E-Books

There's nothing like the feel, smell, and visual stability of a real book, but more and more people are happily reading virtual books using Kindles, Nooks, iPads, and other devices. You simply can't beat their portability, and the texts are searchable. It's even possible to have a decent reading experience on smartphones and iPods; I use the latter and, no, the frequent page-turning does not bother me, though I'll undoubtedly go for something larger eventually. You can purchase most books from an online store, but you may have some books in your own collection that aren't available in digital format.

Converting a physical book into an e-book requires first scanning it page by page, and then, for lack of a better term, OCR'ing it. This is tedious at best--use a fast scanner. If you are willing to destroy the book, or know how to rebind it, use a sheet-fed scanner (see "Scanners," above). Most of the aforementioned OCR programs have features that help organize the pages.

Once you have the text file (in PDF, Word, or other format) in place, grab Calibre--a very capable and free e-book reader, organizer, editor, and publisher. Convert the file to the format appropriate for your device--EPUB or PDF, say. Once you've created a viewable file, use a reader app such as Stanza to load the e-book onto your device. Your device or app must support side-loading--that is, loading from a PC.
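
If you have many files to convert, the conversion step can be scripted. Calibre installs a command-line tool, ebook-convert, that takes an input file and an output file and picks the formats from their extensions. The sketch below wraps it in Python; the helper names and file names are hypothetical, and it assumes ebook-convert is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: str, target_format: str = "epub") -> list[str]:
    """Build an ebook-convert command that writes the output next to the source file."""
    src = Path(source)
    # Accept "epub", ".epub", or "EPUB" and normalize to a lowercase extension.
    target = src.with_suffix("." + target_format.lower().lstrip("."))
    return ["ebook-convert", str(src), str(target)]


def convert(source: str, target_format: str = "epub") -> Path:
    """Run the conversion and return the path of the new file."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; install Calibre or add it to PATH")
    command = build_convert_command(source, target_format)
    # check=True raises CalledProcessError if Calibre reports a failure.
    subprocess.run(command, check=True)
    return Path(command[2])
```

Calling `convert("mybook.docx")` would, under these assumptions, produce mybook.epub in the same folder, ready to load onto your device.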
