Geekonjava.blogspot.com

Search Text In PDF Using Java (Apache Lucene and PDFBox)

2015-08-19

I recently came across a requirement to find out whether a specific word is present in a PDF file. At first I thought this was a very simple requirement, so I wrote a small Java application that first extracts the text from the PDF files and then does a linear character match along the lines of mystring.contains(mysearchterm).
It did give me the expected output, but linear character matching is suitable only when the content you are searching is very small. Otherwise it gets expensive: in complexity terms it is O(np), where n is the number of words to search through and p is the number of search terms.
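
To make the cost concrete, here is a minimal sketch of that linear approach. It is not my original code: the class name LinearScanSketch and the method termsFoundIn are made up for illustration, and the text is assumed to have been extracted from the PDF already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only: checks every search term
// against the extracted text with a plain substring scan.
class LinearScanSketch {

    // Returns the terms that occur somewhere in the text (case-insensitive).
    static List<String> termsFoundIn(String text, List<String> terms) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String term : terms) {
            // Each contains() call scans the text again from the beginning,
            // so the work grows with (text size) x (number of terms).
            if (haystack.contains(term.toLowerCase(Locale.ROOT))) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for text that would normally come out of a PDF.
        String extracted = "God is good all the time";
        System.out.println(termsFoundIn(extracted, Arrays.asList("good", "bang")));
    }
}
```

Nothing is remembered between calls, so every new search term means another full pass over every document.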

A better solution is to use a simple search engine that first parses all your data into tokens to build an index, and then lets us query that index to retrieve matching results. In other words, the whole content is first broken down into terms, and each term points to the pieces of content that contain it. For example, consider the raw data below (four lines, numbered 1 to 4),

hello world

god is good all the time

all is well

the big bang theory

The search engine will create an index like this,
all -> 2,3
hello -> 1
is -> 2,3
good -> 2
world -> 1
the -> 2,4
god -> 2
big -> 4
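
The idea behind such an index can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not how Lucene implements it: the class name InvertedIndexSketch and the method build are made up, and tokens are split on whitespace only. It indexes every term of the four lines, so it also produces entries that the listing above leaves out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index: term -> numbers of the
// lines that contain it. Real search engines do far more than this.
class InvertedIndexSketch {

    // Maps each lower-cased term to the 1-based numbers of the lines containing it.
    static Map<String, SortedSet<Integer>> build(List<String> lines) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String[] tokens = lines.get(i).toLowerCase(Locale.ROOT).split("\\s+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(i + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(Arrays.asList(
                "hello world",
                "god is good all the time",
                "all is well",
                "the big bang theory"));
        // Print every term with the line numbers it points to.
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
        // Answering "is this word present?" is now a single map lookup.
        System.out.println("contains 'bang': " + index.containsKey("bang"));
    }
}
```

The index is built once; after that, checking a word does not require scanning the text again.
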
What I am referring to here are full-text search engines, which search large volumes of unstructured text quickly and effectively. There are many other things you can do with a search engine, but I am not going to cover any of them in this post. The aim is to show you how to create a simple Java application that searches for a particular keyword in PDF documents and tells you whether each document contains that keyword.

That being said, the open source full-text search engine I am going to use for this purpose is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Apache Lucene cannot extract text from PDF files. All it does is create an index from text and then let us query that index to retrieve the matching results. To extract text from PDF documents, let us use Apache PDFBox, an open source Java library that extracts the content of PDF documents so that it can be fed to Lucene for indexing.

Let's get started by downloading the required libraries. Please stick to the versions that I am using, since later versions may require a different implementation.

Download Apache Lucene 3.6.1 from here. Unzip the archive and find lucene-core-3.6.1.jar.

Download Apache PDFBox 0.7.3 from here. Unzip it and find pdfbox-0.7.3.jar.

Download fontbox-0.1.0.jar from here. The project will fail with a class-not-found exception if this library is missing.

The next step is to create a Java project in Eclipse. Right-click the project in Project Explorer, go to Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.

Create a class and name it SimplePDFSearch.java. This is the main class that performs each action one by one. Copy and paste the code below into this class, and change the package name to the name of the package in which you are creating the class.

We also have to create a class to set and get the items that need to be indexed from a PDF file. Create a class, name it IndexItem.java, and copy and paste the code below into it.
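
As a rough guide, a class like this could look as follows. Treat it as a minimal sketch rather than the exact listing: it assumes three values per PDF (a unique ID, the file name as the title, and the extracted text), and the constant names ID, TITLE and CONTENT are my assumption for the field names used in the index.

```java
// Hypothetical sketch of IndexItem; field and constant names are assumptions.
class IndexItem {

    // Names under which the values could be stored in the index.
    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    private final Long id;        // unique ID of the document
    private final String title;   // file name of the PDF
    private final String content; // text extracted from the PDF

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        // The content is left out because it can be very long.
        return "IndexItem[id=" + id + ", title=" + title + "]";
    }

    public static void main(String[] args) {
        IndexItem item = new IndexItem(1L, "sample.pdf", "hello world");
        System.out.println(item);
    }
}
```

Here the values are set once through the constructor and only read afterwards; adding setters works just as well if you prefer to fill the item step by step.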


By doing this we are instructing the search engine to store and retrieve the following details of each PDF file: a unique ID, the file name, and the contents (text) of the file.

The next step is to create a class that indexes the contents of the PDF documents. Create a new class and name it Indexer.java, since that is the name we have referred to in the code. Copy and paste the code below into Indexer.java,

The last step is to create a class that queries the index built by the Indexer class. Create a class and name it Searcher.java. Copy and paste the code below into it.

That is all we have to do before running the program to find out, in a quicker and more efficient way, whether a word is present in a PDF file. Note that in the main class (SimplePDFSearch.java) I have used a field named INDEX_DIR, which contains the path where the index will be stored. Every time the program is run, the old index is cleared and a new index is created. I have used a sample PDF document that contains the following text,