How To Extract Text From A Scanned PDF Using OCR In Java

Sometimes you need to extract text from a scanned PDF, or from a scanned image that has been saved as a PDF document. So let's look at a simple bit of code to do that. We will set up a simple Eclipse project with the relevant Maven dependencies and show how easily this can be achieved in Java.

Is there a way to extract data from a PDF?

The simple answer is yes, and in Java it's relatively simple to get the text. We will follow these steps:

  • Read in the PDF
  • Use Apache PDFBox to convert the PDF into images
  • Use Tesseract via tess4j to extract the text from those images
  • Print out the text

Let's Code Our Text Extraction From A PDF Using OCR

So let's follow the steps above and code our text extraction. First, let's set up our environment.

Set Up An Eclipse Maven Project

In Eclipse, choose File -> New -> Maven Project and set up your project.

Add Dependencies To The Pom

Let's add PDFBox and Tess4j to the Maven pom and update the dependencies, so we have those libraries available.
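As a rough guide, the two dependencies could be declared in the pom.xml as shown here. The group and artifact IDs are the published ones for PDFBox and Tess4j, but the version numbers are assumptions on my part, so check Maven Central for the releases you actually want.

```xml
<dependencies>
    <!-- Renders PDF pages to images; 2.0.27 is an assumed 2.x version -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.27</version>
    </dependency>
    <!-- Java wrapper around Tesseract; 4.5.4 is an assumed version -->
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.5.4</version>
    </dependency>
</dependencies>
```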

Right-click the project in Eclipse and then choose Maven -> Update Project.

Grab A Sample Scanned PDF

Here's a sample scanned PDF that I grabbed from the internet.

Code To Convert PDF Into Images

Here's the code we use to convert a scanned PDF into image files.
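A minimal sketch of this step, assuming PDFBox 2.x (where `PDDocument.load` is the entry point). The class name `PdfToImages`, the file names, and the 300 DPI setting are my own illustrative choices rather than anything fixed by the libraries.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImages {

    /** Renders every page of the PDF to a PNG in outputDir and returns the files in page order. */
    public static List<File> convert(File pdf, File outputDir) throws IOException {
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Could not create " + outputDir);
        }
        List<File> images = new ArrayList<>();
        // PDDocument.load is the PDFBox 2.x way to open a file (assumed version)
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                // 300 DPI is an assumed value; higher values give larger, slower images
                BufferedImage image = renderer.renderImageWithDPI(page, 300, ImageType.RGB);
                // Zero-padded names keep the files in page order when sorted by name
                File out = new File(outputDir, String.format("page-%03d.png", page + 1));
                ImageIO.write(image, "png", out);
                images.add(out);
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" and "target/pages" are placeholder paths
        for (File image : convert(new File("sample.pdf"), new File("target/pages"))) {
            System.out.println("Wrote " + image.getPath());
        }
    }
}
```

Writing one PNG per page keeps the intermediate results on disk, which is handy for checking what Tesseract will actually see.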

Code To Read Text From Images Using OCR

As in the post on extracting text from a PNG using Tesseract, we will use Tesseract and Tess4j to grab the text from the resulting images. Here's the code.
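A minimal sketch of the OCR step, assuming a Tess4j version where `Tesseract` has a public constructor and that English trained data is installed. The class name `ImageOcr` and the `target/pages` folder are placeholders I picked for illustration.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ImageOcr {

    /** Runs OCR on each image in order and joins the page texts, one block per image. */
    public static String extractText(ITesseract tesseract, List<File> images)
            throws TesseractException {
        StringBuilder text = new StringBuilder();
        for (File image : images) {
            text.append(tesseract.doOCR(image)).append(System.lineSeparator());
        }
        return text.toString();
    }

    public static void main(String[] args) throws TesseractException {
        // "target/pages" is a placeholder for wherever the page images were written
        File[] pngs = new File("target/pages").listFiles((dir, name) -> name.endsWith(".png"));
        if (pngs == null) {
            throw new IllegalStateException("Image directory not found");
        }
        Arrays.sort(pngs); // zero-padded names such as page-001.png sort into page order

        ITesseract tesseract = new Tesseract();
        tesseract.setLanguage("eng"); // assumes eng.traineddata can be found by Tesseract
        System.out.println(extractText(tesseract, Arrays.asList(pngs)));
    }
}
```

Taking an `ITesseract` parameter rather than creating the engine inside the method makes it easy to reuse one configured instance for all pages.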

Putting It All Together

So let's put this all together into one class, run it, and see what we get.
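One way the combined class might look is sketched here, again assuming PDFBox 2.x and Tess4j. Note that this variant hands each rendered page straight to Tesseract as a `BufferedImage` instead of writing image files first, which is a simplification of my own. `ScannedPdfOcr` and `sample.pdf` are placeholder names.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class ScannedPdfOcr {

    /** Renders each page in memory, runs OCR on it, and returns the text of all pages. */
    public static String extractText(File pdf, ITesseract tesseract)
            throws IOException, TesseractException {
        StringBuilder text = new StringBuilder();
        try (PDDocument document = PDDocument.load(pdf)) { // PDFBox 2.x API (assumed)
            PDFRenderer renderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                // 300 DPI is an assumed rendering resolution
                BufferedImage image = renderer.renderImageWithDPI(page, 300, ImageType.RGB);
                text.append(tesseract.doOCR(image)).append(System.lineSeparator());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException, TesseractException {
        ITesseract tesseract = new Tesseract();
        tesseract.setLanguage("eng"); // assumes English trained data is available
        // "sample.pdf" is a placeholder for the scanned PDF
        System.out.println(extractText(new File("sample.pdf"), tesseract));
    }
}
```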

If we run this, we get an error from Tesseract.

We need to tell it where to find the trained data files. Have a look at the PNG extract text post and follow the steps in there where we resolve this issue. You should end up with the tessdata folder in your project, and the TESSDATA_PREFIX environment variable set for your run configuration.
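If the error persists, a quick check like the following can confirm whether the trained data is where you expect. It is a plain JDK sketch; `TessdataLocator` is a hypothetical helper of mine, not part of Tess4j. It accepts TESSDATA_PREFIX pointing either at the tessdata folder itself or at its parent, since as far as I know the expected meaning differs between Tesseract versions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class TessdataLocator {

    /** True if dir exists and contains at least one *.traineddata file. */
    public static boolean hasTrainedData(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.traineddata")) {
            return files.iterator().hasNext();
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * Looks for trained data in the TESSDATA_PREFIX value (may be null), then in
     * its "tessdata" subfolder, then in projectDir/tessdata.
     */
    public static Optional<Path> locate(String prefix, Path projectDir) {
        List<Path> candidates = new ArrayList<>();
        if (prefix != null && !prefix.isEmpty()) {
            Path prefixDir = Paths.get(prefix);
            candidates.add(prefixDir);
            candidates.add(prefixDir.resolve("tessdata"));
        }
        candidates.add(projectDir.resolve("tessdata"));
        return candidates.stream().filter(TessdataLocator::hasTrainedData).findFirst();
    }

    public static void main(String[] args) {
        Optional<Path> dir = locate(System.getenv("TESSDATA_PREFIX"), Paths.get("."));
        System.out.println(dir
                .map(p -> "Trained data found in " + p.toAbsolutePath())
                .orElse("No *.traineddata files found; check TESSDATA_PREFIX and the tessdata folder"));
    }
}
```

The folder it finds could also be passed to Tess4j explicitly with `setDatapath` on the `Tesseract` instance, as an alternative to relying on the environment variable.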

Running The OCR Text Extract

If we now run the code again, we should get the text results. Here's what I get.

Conclusion

So here we have shown that it's fairly straightforward to extract text from a scanned document in a PDF using Java. We used PDFBox, Tess4j and Tesseract to achieve that. This was all done with default settings, which could of course be tweaked to get better results; that may be needed if you are dealing with a poor scan.

If you have a lot of documents to deal with, this could take some time, so you would probably want to look at multithreading the process to work on several files at once. Java streams would also help here, since they make it easy to run the work in parallel with fairly simple code.
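As a rough illustration of the parallel idea, here is a sketch using a parallel stream. The OCR step is passed in as a function so the example runs on its own; `ParallelOcr` and the stand-in lambda are hypothetical. In a real run the function would do the PDF rendering and OCR, and I would create a separate Tesseract instance per task rather than assume one instance is safe to share between threads.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelOcr {

    /**
     * Applies the given OCR function to every PDF on the common fork-join pool
     * and returns the extracted text keyed by file. Paths must be distinct and
     * the function must not return null.
     */
    public static Map<Path, String> extractAll(List<Path> pdfs, Function<Path, String> ocr) {
        return pdfs.parallelStream()
                .collect(Collectors.toConcurrentMap(Function.identity(), ocr));
    }

    public static void main(String[] args) {
        // Placeholder file names
        List<Path> pdfs = Arrays.asList(Paths.get("a.pdf"), Paths.get("b.pdf"));

        // Stand-in OCR function so the sketch runs without Tesseract installed
        Map<Path, String> results = extractAll(pdfs, pdf -> "text of " + pdf.getFileName());

        // Copy into a TreeMap so the files are printed in a stable order
        new TreeMap<>(results).forEach((pdf, text) -> System.out.println(pdf + ": " + text));
    }
}
```

OCR is CPU-heavy, so the speed-up will depend on the number of cores available; an `ExecutorService` with a fixed pool size would be the alternative if you want explicit control over the thread count.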

PDFBox

Tess4j

Tesseract
