Google has been indexing Adobe Portable Document Format
(PDF) files for a long while now, even offering the option of converting the
file to HTML. But so far, it has been able to do so only with PDF files
that contain actual text data. Scanned documents are a different kettle
of fish.
To clarify, a scanned document is a photograph of the entire
page, pixel by pixel: the text itself, but also images, paper defects,
holes, stains, etc. It is a picture of the page, not the text. A text document,
on the other hand, stores the binary codes of the characters themselves.
It may make no difference to us whether a document is text
or a picture of text, but a computer can only “read” the binary code making up
a character, a word, a sentence. To read a scanned document, a computer has to
perform Optical Character Recognition (OCR) on the picture of the text to convert it
into something it recognizes. The process is painstaking and error-prone,
although its accuracy has been improving over the years.
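As a rough illustration of that difference, here is a minimal Python sketch. It shows that a text document already holds character codes, while a scanned page holds only pixels that must be matched against known letter shapes. The 3x3 letter templates and the `recognize` helper are hypothetical, invented for this example; real OCR engines, Google's included, are far more sophisticated.

```python
# Toy illustration only; not how any production OCR engine works.

# A text document stores each character as a numeric code.
text = "HI"
codes = [ord(ch) for ch in text]  # the codes a computer can "read" directly

# A scanned page stores only pixels. Here "#" is ink and "." is paper.
# These 3x3 templates are made up for the example.
GLYPHS = {
    "H": ("#.#", "###", "#.#"),
    "I": ("###", ".#.", "###"),
}


def recognize(bitmap):
    """Return the letter whose template differs from the bitmap in the fewest pixels."""
    def distance(template):
        return sum(
            t_px != b_px
            for t_row, b_row in zip(template, bitmap)
            for t_px, b_px in zip(t_row, b_row)
        )
    return min(GLYPHS, key=lambda letter: distance(GLYPHS[letter]))


# Two scanned letters; the second has one missing pixel, like a hole in the paper.
scanned = [
    ("#.#", "###", "#.#"),
    ("###", ".#.", "#.#"),
]
recovered = "".join(recognize(bitmap) for bitmap in scanned)
print(codes, recovered)
```

The matcher picks the closest template rather than demanding an exact match, which is why a small defect can be tolerated and also why a larger one can produce the wrong letter.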
So far, Google’s PDF indexing feature has really worked only
on actual-text documents, because there the crawler can read the text and show
snippets in the search results.
Up until now, when Google met a scanned PDF it did its best to index it, but only the title was
available. However, Google has recently built OCR technology into its
indexing process. The feature can be observed when such a document is searched for,
as the View as HTML function performs the OCR.
The importance of the change cannot be overstated, as
scores of print-only books and documents (usually relating to history, academics,
government and archives) that had been uploaded but could not be indexed can
now be readily searched.
Congrats to Google for implementing this technology quite
well despite its flaws, and at such a massive scale.
© 2007 - 2009 - eFluxMedia