How To Extract Text From A Scanned PDF Using OCR In Java

Sometimes you need to extract text from a scanned PDF, or from a scanned image that has been saved as a PDF document. Let's look at a simple bit of code to do that. We will set up a simple Eclipse project with the relevant Maven dependencies and show how easily this can be done in Java.

Is there a way to extract data from a PDF?

The simple answer is yes, and in Java it's relatively simple to get the text. We will follow these steps:

Read in the PDF
Use Apache PDFBox to convert the PDF into images
Use Tesseract via tess4j to extract the text from those images
Print out the text

Let's Code Our Text Extraction From PDF Using OCR

So let's follow the steps above and code our text extraction. First, let's set up our environment.

Set Up Eclipse Maven Project

In Eclipse, choose File–>New–>Maven Project and set up your project.

Add Dependencies To The Pom

Let's add PDFBox and Tess4j to the pom.xml and update the dependencies, so we have those available.

  	<dependency>
  		<groupId>net.sourceforge.tess4j</groupId>
  		<artifactId>tess4j</artifactId>
  		<version>4.3.1</version>
  	</dependency>
  	<dependency>
  		<groupId>org.apache.pdfbox</groupId>
  		<artifactId>pdfbox</artifactId>
  		<version>2.0.21</version>
  	</dependency>

Right-click the project in Eclipse and then choose Maven–>Update Project.

Grab A Sample Scanned PDF

Here's a sample scanned PDF that I grabbed from the internet. The code below expects it to be saved as scansmpl.pdf in the working directory.

Code To Convert PDF Into Images

Here's the code we use to convert a scanned PDF into image files.

    PDDocument document = PDDocument.load(new File("scansmpl.pdf"));
    PDFRenderer pdfRenderer = new PDFRenderer(document);

    // Page indexes are zero-based
    for (int page = 0; page < document.getNumberOfPages(); page++) {
        // Render the page as an RGB image at 300 DPI
        BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

        // Create a temp image file
        File tempFile = File.createTempFile("tempfile_" + page, ".png");
        ImageIO.write(bufferedImage, "png", tempFile);
    }
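The 300 passed to renderImageWithDPI decides how big each page image gets, and that drives both memory use and OCR time. As a rough back-of-the-envelope sketch (not part of PDFBox itself): PDF page sizes are measured in points, with 72 points to the inch, so the pixel size is roughly points × DPI / 72. The class name RenderSizeEstimate and the figure of 4 bytes per pixel are my own assumptions for illustration.

```java
/** Estimates the raster size of a PDF page rendered at a given DPI. */
final class RenderSizeEstimate {

    // PDF page sizes are expressed in points; one point is 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in points to pixels at the given DPI (rounded down). */
    static int toPixels(double points, int dpi) {
        return (int) Math.floor(points * dpi / POINTS_PER_INCH);
    }

    /** Approximate bytes of pixel data, assuming 4 bytes per pixel (an assumption). */
    static long approxBytes(double widthPt, double heightPt, int dpi) {
        return 4L * toPixels(widthPt, dpi) * toPixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        double widthPt = 612;
        double heightPt = 792;
        for (int dpi : new int[] {150, 300, 600}) {
            System.out.printf("%d DPI: %d x %d px, about %d MB%n",
                    dpi,
                    toPixels(widthPt, dpi),
                    toPixels(heightPt, dpi),
                    approxBytes(widthPt, heightPt, dpi) / (1024 * 1024));
        }
    }
}
```

Doubling the DPI quadruples the pixel count, so for long documents it can be worth trying a lower value first and only raising it if the OCR results are poor.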

Code To Read Text From Images Using OCR

Similar to what we did in the post on extracting text from a PNG using Tesseract, we will use Tesseract and Tess4j to grab text from the resulting images. Here's the code.

    ITesseract _tesseract = new Tesseract();
    _tesseract.setDatapath("tessdata");   // folder containing eng.traineddata
    _tesseract.setLanguage("eng");

    // tempFile is the page image written in the previous step
    String result = _tesseract.doOCR(tempFile);

Putting It All Together

So let's put this all together into one class, run it, and see what we get.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class ExtractTextPdf {

    public static void main(String[] args) throws Exception {
        ExtractTextPdf demo = new ExtractTextPdf();
        demo.run();
    }

    private void run() throws Exception {
        // try-with-resources closes the document even if OCR fails
        try (PDDocument document = PDDocument.load(new File("scansmpl.pdf"))) {
            String text = extractTextFromScannedDocument(document);
            System.out.println(text);
        }
    }

    private String extractTextFromScannedDocument(PDDocument document) throws IOException, TesseractException {

        // Render each page to an image, then run OCR on it
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        StringBuilder out = new StringBuilder();

        ITesseract _tesseract = new Tesseract();
        _tesseract.setDatapath("tessdata");
        _tesseract.setLanguage("eng");

        for (int page = 0; page < document.getNumberOfPages(); page++) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

            // Create a temp image file
            File tempFile = File.createTempFile("tempfile_" + page, ".png");
            try {
                ImageIO.write(bufferedImage, "png", tempFile);

                String result = _tesseract.doOCR(tempFile);
                out.append(result);
            } finally {
                // Delete temp file, even if OCR throws
                tempFile.delete();
            }
        }

        return out.toString();
    }
}

If we run this, we get an error from Tesseract:

Error opening data file tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Exception in thread "main" java.lang.Error: Invalid memory access

We need to tell it where to find the trained data files. Have a look at the PNG text extraction post and follow the steps there, where we resolve this issue. You should end up with the tessdata folder in your project, and the TESSDATA_PREFIX environment variable set for your run configuration.
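Because the failure above ends in a native "Invalid memory access" rather than a normal Java exception, it can help to check for the language file yourself before creating the Tesseract instance. Here is a minimal sketch using only the standard library. The class and method names (TessdataCheck, trainedDataFile, isLanguageAvailable) are made up for this example, and it assumes the file is named after the language plus ".traineddata", as in the error message.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Pre-flight check for the Tesseract language data file. */
final class TessdataCheck {

    /** Returns the expected location of <language>.traineddata under the data path. */
    static Path trainedDataFile(String dataPath, String language) {
        return Paths.get(dataPath).resolve(language + ".traineddata");
    }

    /** True if the language file exists and is a regular file. */
    static boolean isLanguageAvailable(String dataPath, String language) {
        return Files.isRegularFile(trainedDataFile(dataPath, language));
    }

    public static void main(String[] args) {
        // Same values that are passed to setDatapath and setLanguage.
        String dataPath = "tessdata";
        String language = "eng";

        Path expected = trainedDataFile(dataPath, language).toAbsolutePath();
        if (isLanguageAvailable(dataPath, language)) {
            System.out.println("Found " + expected);
        } else {
            System.out.println("Missing " + expected);
        }
    }
}
```

Calling isLanguageAvailable("tessdata", "eng") at the start of run() and failing with a clear message is usually easier to debug than the native crash.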

Running The OCR Text Extract

If we now run the code again, we should get the text results. Here's what I get:

. SAPORS LANE - BOOLE - DORSET - BH 25 8ER
TELEPHONE BOOLE (945 13) 51617 - TELEX 123456

Our Ref. 350/PJC/EAC 18th January, 1972.
Dr. P.N. Cundall,
Mining Surveys Ltd.,
Holroyd Road,
Reading,
Berks.
Dear Pete,

Permit me to introduce you to the facility of facsimile
transmission.

In facsimile a photocell is caused to perform a raster scan over

; the subject copy. The variations of print density on the document
cause the photocell to generate an analogous electrical video signal.
This signal is used to modulate a carrier, which is transmitted to a
remote destination over a radio or cable communications link.

At the remote terminal, demodulation reconstructs the video
signal, which is used to modulate the density of print produced by a
printing device. This device is scanning in a raster scan synchronised
with that at the transmitting terminal. As a result, a facsimile
copy of the subject document is produced.

Probably you have uses for this facility in your organisation.

Yours sincerely,
ThA.
P.J. CROSS
Group Leader - Facsimile Research
Registered in England: No. 2038
No. 1 Registered Office: 80 Vicara Lane, Ilford. Eseex,

Conclusion

So here we have shown that it's fairly straightforward to extract text from a scanned document in a PDF using Java. We used PDFBox, Tess4j and Tesseract to achieve that. This was all done with default settings, which could of course be tweaked to get better results; that might be needed if you are dealing with a poor scan. Also, if you have a lot of documents to deal with, this could take some time, so you would probably need to look at multithreading the process to work on multiple files at once. Java parallel streams would also help, making it easy to run multiple tasks in parallel with fairly simple code.
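To give a feel for the parallel streams idea, here is a rough sketch that fans a list of file names out across threads. It is not wired up to PDFBox or Tess4j: the per-file work is passed in as a function, and the names ParallelOcrSketch and ocrAll are hypothetical. In a real version the function would load the PDF and run the extraction shown earlier. I have not verified that a single Tesseract instance can be shared between threads, so the safe assumption is to create one inside each task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Sketch of running a per-file OCR function over many files in parallel. */
final class ParallelOcrSketch {

    /**
     * Applies ocrOneFile to every file name using a parallel stream and
     * returns a map from file name to extracted text.
     * File names must be distinct and the function must not return null.
     */
    static Map<String, String> ocrAll(List<String> pdfFiles, Function<String, String> ocrOneFile) {
        return pdfFiles.parallelStream()
                .collect(Collectors.toConcurrentMap(Function.identity(), ocrOneFile));
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.pdf");

        // Stand-in for the real work: load the PDF, render the pages, run OCR.
        Function<String, String> fakeOcr = name -> "text of " + name;

        Map<String, String> results = ocrAll(files, fakeOcr);
        results.forEach((file, text) -> System.out.println(file + " -> " + text));
    }
}
```

Parallel streams use the common fork-join pool by default, which is sized to the number of CPU cores. Since rendering at 300 DPI is memory-hungry, an ExecutorService with a fixed, smaller thread count may be a better fit for large batches.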
