Google now Indexes Scanned Documents via OCR

By Eric Blair
19:25, October 31st 2008

Google has been indexing Adobe Portable Document Format (PDF) files for a long while now, even offering the option of converting the file to HTML. But so far, it’s only been able to do so with PDF files that contain actual text data. Scanned documents are a whole different kettle of fish.

To clarify, a scanned document is a photograph of the entire page, pixel by pixel: the text, but also any images, paper defects, holes, stains, etc. It is a facsimile of the analog page. A text-based digital document, on the other hand, stores the character codes of the text itself.

It may make no difference to us whether a document is text or a picture of text, but a computer can only “read” the binary code making up a character, a word, a sentence. To read a scanned document, a computer has to perform Optical Character Recognition (OCR) on the picture of the text to convert it into something it recognizes. The process is painstaking and error-prone, although its accuracy has been improving over the years.
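To make the difference concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how Google or any real OCR engine works: the 3x5 glyph bitmaps, the fixed glyph width and the names `TEMPLATES`, `recognize_glyph` and `ocr` are all invented for illustration. It shows that a text document can be searched directly, while a "scan" is just pixels until each glyph is matched against known letter shapes.

```python
# Toy illustration only: real OCR engines are far more sophisticated.

# A text-based document stores character codes, so search is a byte comparison.
text_document = "HI".encode("utf-8")
found_in_text = b"HI" in text_document

# Hypothetical 3x5 letter shapes ('#' = ink, '.' = blank paper).
TEMPLATES = {
    "H": ("#.#", "#.#", "###", "#.#", "#.#"),
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def recognize_glyph(glyph):
    """Return the template letter with the fewest mismatched pixels."""
    def mismatches(template):
        return sum(
            a != b
            for glyph_row, template_row in zip(glyph, template)
            for a, b in zip(glyph_row, template_row)
        )
    return min(TEMPLATES, key=lambda letter: mismatches(TEMPLATES[letter]))


def ocr(scan, glyph_width=3, gap=1):
    """Cut a scanned line into fixed-width glyphs and recognize each one."""
    step = glyph_width + gap
    letters = []
    for x in range(0, len(scan[0]), step):
        glyph = tuple(row[x:x + glyph_width] for row in scan)
        letters.append(recognize_glyph(glyph))
    return "".join(letters)


# A "scanned" image of the word HI. The fourth row of the H has one
# stray inked pixel, standing in for a stain on the paper.
scanned_line = (
    "#.#.###",
    "#.#..#.",
    "###..#.",
    "###..#.",
    "#.#.###",
)

# The raw pixels contain no character codes, so a plain search finds nothing.
found_in_scan = "HI" in "".join(scanned_line)

print(found_in_text, found_in_scan, ocr(scanned_line))
```

Picking the closest template rather than demanding an exact match is what lets the sketch tolerate the stain, and it is also why OCR can guess wrong when a page is badly damaged.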

Until now, Google’s PDF indexing feature has really only worked on actual-text documents, because the crawler can read the text directly and show snippets of it in the search results.

Up until now, when Google met a scanned PDF it did its best to index it, but only the title was available. However, Google has recently built OCR technology into its indexing process. The feature can be observed by searching for such a document, as the View as HTML function performs the OCR.

The importance of the change cannot be overstated: scores of print-only books and documents (usually relating to history, academia, government and archives) that had been uploaded but could not be indexed can now be readily searched.

Congrats to Google for implementing this technology quite well despite its flaws, and at such a massive scale.



© 2007 - 2009 - eFluxMedia