2015-09-07

As you'd expect...

... there are many ways to search in a scanned PDF for some text.

Let's review: the SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.

1.  How can you transform this document (LINK) into something that you can search within?

2.  Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper?  (This was my original search task--finding interesting papers about how people read multiple documents in the same reading session. That's how I found this paper.)

So this Challenge is really about "tool finding" -- can you figure out how to convert from a scanned document into a readable / findable / searchable one?

As we've talked about before, taking a scanned document and converting the scan into recognizable text is called "Optical Character Recognition," or OCR, so I'm going to use that in my query.

I also remembered that Google Docs had some OCR capability, so my first query was:

[ Google docs OCR ]

which led me to a lovely Help Center article about how to import a PDF file into your Google Drive, then open it with Docs.  And, voila, instant OCR!

Here's what it looks like:



As you can see, I imported the scanned PDF into Drive, and then I Control-Clicked on the document to "Open with" Google Docs.  This automatically runs the OCR process and gives me a new Google Doc that combines the scanned version with the OCR-d text.



As you can see, the OCR process correctly recognized the text.  The scan is above the horizontal line, and the recognized text is below it.

Now I can just use Control-F to find the text  multiple documents and we should be done.  Here's what I found:



As you can see, the Control-F found 2 instances of our target string, that is, multiple documents.

But I'm a bit of a traditionalist--I like to read long papers like this on the printed page, so I printed it out and began to read.  Everything was fine, but then as I read, I saw another instance of the phrase multiple documents that was NOT one of the two I'd found by Control-F!  WHAT?  How was that possible?

I went back to my Google Doc and looked at the first page of the OCR-d PDF:

That's when I noticed that much of the first page of text had NOT been recognized!  Huh.  As you can see in the above image, you can't even Control-F for the title of the document: there are zero hits for the title.  IF the OCR process was accurate, it certainly would have located the title of the paper (which is just a few lines below).

Okay, I know that OCR is a difficult process; many OCR systems have errors, and I just found one here in the Docs OCR.  When there are strange boxes on the page, Docs OCR might skip over a chunk of the text.

But that didn't explain the "extra" instances of the phrase multiple documents I found in the printed-out version of the paper.  What's up with that?

As I scrolled down looking for the "extra" instance I'd found, I discovered that the Google Docs version ended at page 10 (out of 21 pages in the original)--there were no references, and nothing past the mid-point of the paper!  Gack.

I went back to the Help Center for some explanation, and discovered that it very clearly says "... For PDF files, we only look at the first 10 pages when searching for text to extract."

Okay, so it's documented, but it's still a huge surprise.  There should be a notice in the converted doc (in bold, red, flaming letters) that tells you this.  It should say something like "There's more text in your document, but we stopped the OCR after 10 pages."

Argh.  THAT's frustrating.  Time for another approach, one that will do more than 10 pages of OCR.

My next query was:

[ OCR recognition PDF ]

and I learned there are a number of online PDF OCR conversion tools.  But I ALSO learned that Adobe Acrobat has a conversion capability built into it.  (Note that this is for Acrobat Pro, not Acrobat Reader--that just lets you read PDF files, not convert them.)

So, like Teri, I just used the OCR tools for Acrobat to convert.  I used the default settings to OCR the text.  I opened the PDF in Acrobat, opened the "Recognize Text" tool on the right side (see below) and clicked "In This File" to run the OCR.

It took a couple of minutes, but then gave me a nice, Control-F-able document.  In this document I found 6 instances of multiple documents.

And so, that was that.  I'd found all 6 instances.

Or was it?

In the comments on this post, Jon, Aui, and Remmij found it 7 times.  What?  How's that possible?  How could I have missed one?

As Jon pointed out, you could search Google Books for the book that this chapter is in (Handbook of Research on Reading Comprehension) and then do a search for "multiple documents" in that book.  Indeed, you'll find the phrase 7 times (but only 5 of them are from this particular chapter; the other 2 are from other chapters in the book).

But Aui did an interesting thing by doing a search for:

["reading comprehension strategies" "strategies are developmental" ]

which found an already-scanned version of the paper at Academia.edu (a technical paper repository)!  A Control-F search there finds... 6 hits.  But Aui reports finding 7 hits!  What's going on?

I tried to figure out how Aui could have found 7 instances of multiple documents.  What would be a more "basic" way to do this search?  And what was I doing wrong?

To make things as basic as possible, I downloaded the full-text PDF from the Academia.edu site.  I opened that document, then selected all the text (CMD+A on a Mac, or Control+A on a PC), then copied and pasted it into a SimpleText document (an MS Word document would work as well).

The Trick:  When you copy/paste from the PDF file into a SimpleText or MS Word document, the receiving document drops all of the formatting information, including things like the new-line character.  As a consequence, it runs ALL of the text together like this:

This is a real pain if you're trying to copy formatted text from point A to point B, but when you're doing a text-find, it can be an advantage.
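If you'd rather not rely on what a particular editor does when you paste, the same effect can be approximated in a few lines of Python.  This is only a sketch of the idea, not what SimpleText or Word actually do internally; the function name and the sample sentence are made up for illustration.

```python
import re

def flatten_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample: the phrase is split across two lines,
# the way it can be in text extracted from a PDF.
sample = "readers who work with multiple\ndocuments must integrate them"

print("multiple documents" in sample)                      # False
print("multiple documents" in flatten_whitespace(sample))  # True
```

The flattened text is useless for reading, but it is exactly what you want for a phrase search.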

But notice this... When I did a Control-F in this SimpleText document (without any formatting), I found 7 instances of multiple documents.  (See the number 7 on the right side of the search box above?)

Let's look at this same instance in the original PDF.  (We're looking at this instance because it's the one that wasn't found using our normal search methods.)  Here  I've put boxes around the two words:

See that?  They're on separate lines of text.  So THAT'S why doing a search in the PDF or in the Google Docs copy doesn't work--that pair of words is separated by a newline character.  When I copy-pasted it into the SimpleText editor, the paste operation dropped all of the newlines and all of a sudden, Control-F could work.

And so, yes, Aui found the correct answer: there are 7 (seven!) instances of the phrase multiple documents in this paper.

More generally, this is something to be careful of when using Control-F.  Look at the following piece of text (this happens to be in Google Docs, but it can be in almost any text editor):

Notice that the Control-F FIND box (pointed to by the red arrow) shows that there's only 1 instance found.  The Control-F command only found the multiple documents highlighted in green.

I added the orange box to show you that there's actually another multiple documents in the text--this one happens to have a newline character between the two words (they sit on the first and second lines), while the instance in the second paragraph does not.

Control-F does not work across newline boundaries.  That's why the copy-paste without formatting was useful in the previous example--it deleted all of the formatting, including newlines.
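If you have the text in a file, you can also search it in a way that tolerates line breaks, without flattening it first.  Here is a minimal Python sketch, assuming the text is already in a string; the helper name count_phrase and the sample text are hypothetical.  It treats any run of whitespace between the words of the phrase (space, tab, or newline) as a match, and it ignores case.

```python
import re

def count_phrase(text: str, phrase: str) -> int:
    """Count occurrences of phrase, letting any whitespace run separate its words."""
    words = phrase.split()
    # Join the (escaped) words with \s+ so a newline between them still matches.
    pattern = r"\s+".join(re.escape(word) for word in words)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample: the second instance is broken across two lines.
sample = (
    "Reading multiple documents is hard.\n"
    "Students who compare multiple\n"
    "documents learn more."
)

print(sample.count("multiple documents"))          # 1 (plain search misses one)
print(count_phrase(sample, "multiple documents"))  # 2
```

Note that this does not handle a word that has been hyphenated across a line break; that case needs separate treatment.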

Now you know.

Other ways to do this conversion:  There are, of course, other ways to OCR a scanned PDF.  As Rosemary pointed out, Kami is one such tool.  And Remmij pointed out Free Online OCR (http://www.onlineocr.net/) which has a 5 Mb limit, so it doesn't quite work for this example.

Beyond that, there are various paid methods you can use.  These include web services such as CometDocs (which I hear good things about), and there are apps you can buy to do this as well, such as Prizmo and ABBYY FineReader Express (both Mac apps) and Evernote (both platforms).

Search Lessons
There are several lessons in this week's Challenge (not all of which I understood before taking on the Challenge myself).

1.  Be sure you know the limits of your tools.  I was somewhat surprised to find out (the hard way) that the Google Docs OCR process would only convert 10 pages of your text.  I found out the limit accidentally, but then followed up by checking the documentation and doing a bit of testing myself.

2. Always sanity check your results.  When I noticed that the printout of the paper seemed to have more instances of our phrase than the online version, that made a little bell go off in my brain.  That's what started me sanity checking things.  Be aware, be sensitive, and be willing to spend the extra couple of minutes running down funny little anomalies.  (There's a famous book, The Cuckoo's Egg, that tells the story of how Cliff Stoll brought down an international hacking scandal by tracking down a missing 9 seconds of computer time.  Moral: Pay attention to small discrepancies. They can be important.)

3. REALLY understand the limits of your tools.  As we see from Aui's clever result, sometimes even something as simple as Control-F won't work across newline boundaries, and you very well might miss a result that you care about.  This is true for many (all?) text editors, including MS Word and Google Docs.

4. Sometimes searching for text fragments can lead you to another version of the document that's more amenable to search.  Aui's search for a couple of phrases from the original paper led directly to an already-scanned and searchable version of the paper.  I hadn't found that version in any of my searches.  It's another version of the "One more search" aphorism--in this case, searching for the same document in a very different way leads to success.

5.  Control-F does not work across newlines.  As always, pay attention.  If you're looking for just a single word, there's no issue here: a newline can't sneak into the middle of a word (although a smart document editor might hyphenate it on you).  But if you're searching for phrases, be careful--the longer the phrase, the more likely it is that you're going to miss an instance or two.
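Lessons 1 and 2 can be turned into a quick automated check.  The sketch below assumes you have some way of getting the converted text one page at a time (that part is not shown, and depends on your tool); the function name and the sample data are hypothetical.  It simply reports which pages of the original came back with no text at all, which is what would have flagged the 10-page cutoff I ran into.

```python
from typing import List

def pages_without_text(page_texts: List[str], expected_pages: int) -> List[int]:
    """Return the 1-based page numbers that have no extracted text."""
    missing = []
    for page_number in range(1, expected_pages + 1):
        index = page_number - 1
        # Pages past the end of the list were never converted at all.
        text = page_texts[index] if index < len(page_texts) else ""
        if not text.strip():
            missing.append(page_number)
    return missing

# Hypothetical result: a 21-page scan where the converter stopped after page 10.
converted = ["some recognized text"] * 10
print(pages_without_text(converted, expected_pages=21))
# [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
```

An empty list doesn't prove the OCR was accurate (remember the skipped chunk on the first page), but a non-empty one tells you right away that something is missing.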

This week's Challenge certainly taught me a lot.  Now I know when I can use the Google Docs OCR tool, and when NOT to use it.  I also now know how to use Acrobat's OCR feature to convert a scanned PDF of any length.  Handy tools.
