Sometimes you need to extract text from a scanned PDF, or from a scanned image that has been saved as a PDF document. Here is a simple bit of code to do that. We will set up a simple Eclipse project with the relevant Maven dependencies and show how easily this is achieved in Java.
Is there a way to extract data from a PDF?
The simple answer is yes, and in Java it's relatively simple to get the text. We will follow these steps:
- Read in the PDF
- Use Apache PDFBox to convert the PDF into images
- Use Tesseract via tess4j to extract the text from those images
- Print out the text
Let's Code Our Text Extraction From PDF Using OCR
Let's follow the steps above and code our text extraction. First, let's set up our environment.
Set Up Eclipse Maven Project
In Eclipse, go to File -> New -> Maven Project and set up your project.
Add Dependencies To The Pom
Let's add PDFBox and Tess4J to the pom.xml and update the dependencies so that both are available.
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.21</version>
</dependency>
Right-click the project in Eclipse and choose Maven -> Update Project.
Grab A Sample Scanned PDF
Here's a sample scanned PDF that I grabbed from the internet. The code below loads it as scansmpl.pdf.
Code To Convert PDF Into Images
Here's the code we use to render a page of the scanned PDF to an image file. In the full class this runs once per page inside a loop.
// Page indexes start at 0; here we render only the first page
int page = 0;
PDDocument document = PDDocument.load(new File("scansmpl.pdf"));
PDFRenderer pdfRenderer = new PDFRenderer(document);
// Render the page at 300 DPI as an RGB image
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
// Create a temp image file
File tempFile = File.createTempFile("tempfile_" + page, ".png");
ImageIO.write(bufferedImage, "png", tempFile);
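To get a feel for what the 300 passed to renderImageWithDPI means for image size, here is a small sketch of the arithmetic. PDF page dimensions are measured in points, with 72 points to the inch, so a page renders to roughly points × DPI / 72 pixels along each side. The class name `RenderSize`, the helper `pixelsFor`, and the US Letter page size (612 × 792 points) are assumptions for illustration, and PDFBox's own rounding may differ by a pixel.

```java
class RenderSize {
    // PDF page sizes are expressed in points: 72 points per inch
    static final double POINTS_PER_INCH = 72.0;

    // Approximate pixel length of a page side given in points, at the chosen DPI
    static int pixelsFor(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points (8.5 x 11 inches)
        int width = pixelsFor(612, 300);
        int height = pixelsFor(792, 300);
        System.out.println(width + " x " + height + " px"); // prints 2550 x 3300 px
    }
}
```

Doubling the DPI doubles each side and quadruples the pixel count, so a higher DPI gives the OCR engine more detail at the cost of larger images and slower processing.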
Code To Read Text From Images Using OCR
Similar to what we did in the post on extracting text from a PNG using Tesseract, we will use Tesseract and Tess4J to grab text from the resulting images. Here's the code:
ITesseract _tesseract = new Tesseract();
_tesseract.setDatapath("tessdata");
_tesseract.setLanguage("eng");
String result = _tesseract.doOCR(tempFile);
Putting It All Together
Let's put this all together into one class, run it, and see what we get.
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class ExtractTextPdf {
    public static void main(String[] args) throws Exception {
        ExtractTextPdf demo = new ExtractTextPdf();
        demo.run();
    }

    private void run() throws Exception {
        // try-with-resources closes the document even if OCR fails
        try (PDDocument document = PDDocument.load(new File("scansmpl.pdf"))) {
            String text = extractTextFromScannedDocument(document);
            System.out.println(text);
        }
    }

    private String extractTextFromScannedDocument(PDDocument document) throws IOException, TesseractException {
        // Render each page to an image, then run OCR on it
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        StringBuilder out = new StringBuilder();
        ITesseract _tesseract = new Tesseract();
        _tesseract.setDatapath("tessdata");
        _tesseract.setLanguage("eng");
        for (int page = 0; page < document.getNumberOfPages(); page++) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            // Create a temp image file
            File tempFile = File.createTempFile("tempfile_" + page, ".png");
            try {
                ImageIO.write(bufferedImage, "png", tempFile);
                String result = _tesseract.doOCR(tempFile);
                out.append(result);
            } finally {
                // Delete temp file, even if OCR throws
                tempFile.delete();
            }
        }
        return out.toString();
    }
}
If we run this, we get an error from Tesseract:
Error opening data file tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Exception in thread "main" java.lang.Error: Invalid memory access
We need to tell Tesseract where to find the trained data files. Have a look at the PNG text extraction post and follow the steps there that resolve this issue. You should end up with the tessdata folder in your project and the TESSDATA_PREFIX environment variable set for your run configuration.
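Since a missing data file surfaces as a fairly cryptic "Invalid memory access", one option is to check for the file yourself before calling Tesseract. The following is a minimal sketch using only the JDK. The class `TessdataCheck` and its methods are hypothetical names, and it assumes the `<language>.traineddata` file naming seen in the error message above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TessdataCheck {
    // Builds the expected file location, e.g. tessdata/eng.traineddata
    static Path trainedDataPath(String dataPath, String language) {
        return Paths.get(dataPath, language + ".traineddata");
    }

    // Fails fast with a readable message if the trained data file is missing
    static void requireTrainedData(String dataPath, String language) {
        Path file = trainedDataPath(dataPath, language);
        if (!Files.isRegularFile(file)) {
            throw new IllegalStateException(
                    "Missing trained data file: " + file.toAbsolutePath());
        }
    }

    public static void main(String[] args) {
        // Same values that are passed to setDatapath and setLanguage
        requireTrainedData("tessdata", "eng");
        System.out.println("Found " + trainedDataPath("tessdata", "eng"));
    }
}
```

Calling `requireTrainedData("tessdata", "eng")` just before creating the Tesseract instance would report the absolute path that was checked, which makes a wrong working directory easier to spot.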
Running The OCR Text Extract
If we now run the code again, we should get the text results. Here's what I get:
. SAPORS LANE - BOOLE - DORSET - BH 25 8ER
TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
Our Ref. 350/PJC/EAC 18th January, 1972.
Dr. P.N. Cundall,
Mining Surveys Ltd.,
Holroyd Road,
Reading,
Berks.
Dear Pete,
Permit me to introduce you to the facility of facsimile
transmission.
In facsimile a photocell is caused to perform a raster scan over
; the subject copy. The variations of print density on the document
cause the photocell to generate an analogous electrical video signal.
This signal is used to modulate a carrier, which is transmitted to a
remote destination over a radio or cable communications link.
At the remote terminal, demodulation reconstructs the video
signal, which is used to modulate the density of print produced by a
printing device. This device is scanning in a raster scan synchronised
with that at the transmitting terminal. As a result, a facsimile
copy of the subject document is produced.
Probably you have uses for this facility in your organisation.
Yours sincerely,
ThA.
P.J. CROSS
Group Leader - Facsimile Research
Registered in England: No. 2038
No. 1 Registered Office: 80 Vicara Lane, Ilford. Eseex,
Conclusion
So we have shown that it's fairly straightforward to extract text from a scanned document in a PDF using Java. We used PDFBox, Tess4J and Tesseract to achieve that. This was all done with default settings, which could be tweaked for better results if you are dealing with a poor scan. If you have a lot of documents to process, this could take some time, so you would probably want to multi-thread the process and work on several files at once. Java streams would also help, making it easy to run multiple tasks in parallel with fairly simple code.
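As a rough sketch of the multi-threading idea, the following uses a fixed thread pool to process several files at once. The `ocrFile` function is a hypothetical stand-in for the PDF-to-text code in this post (in `main` it just returns a placeholder string, so the example runs without Tesseract), and the class and method names are assumptions. Whether a single Tesseract instance can safely be shared between threads is not covered here, so a real version would most cautiously create one per task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelOcr {
    // Runs ocrFile for every file name on a pool of worker threads and
    // returns the results in the same order as the input list.
    static List<String> processAll(List<String> fileNames,
                                   Function<String, String> ocrFile,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : fileNames) {
                Callable<String> task = () -> ocrFile.apply(name);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that file is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("OCR task failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder: a real version would load the PDF and run OCR here
        Function<String, String> fakeOcr = name -> "text of " + name;
        List<String> texts = processAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), fakeOcr, 2);
        texts.forEach(System.out::println);
    }
}
```

With streams, the same shape can be written as `fileNames.parallelStream().map(ocrFile).collect(Collectors.toList())`, which is shorter but gives less control over the number of threads.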