How To Extract Text From A PNG Image

In this tutorial, let's see how we can use OCR to extract text from a PNG image using Java. We can do this entirely with free, open-source tools, so there is no need to spend any money.

How We Will Extract The Text From The Image

  • We discuss the technology we will use
  • Create a web application with Spring Boot to drive the process and return results
  • Add the OCR logic to the web application

Introduction

If you have text in an image such as a PNG file, it is likely a graphic of some sort that includes text, a screenshot, or an image that has been generated by a scanner.

However it was created, your image contains text that you want to extract. To extract the text from the image or images, we need software that lets us run OCR on them.

What Is OCR?

OCR (optical character recognition) is technology used to recognise text in images. It is usually used so that the text can be stored or analysed for use in some other process. For instance, you may want to capture data from screenshots and then store that data in your database.

OCR PNG Sample Image

So above we have a PNG file that we can use as a sample image file with text.

Technologies We Will Use

Let's take a look at the tech stack that we will use for this.

Spring Boot

Spring Boot is a framework built on top of Spring that makes getting a Java application up and running, in particular a web application, much simpler. It does this by assuming sensible defaults for many of the items that you would otherwise have to configure yourself. For us, it makes it very simple to create a test environment and a page for visualising the results of OCR on our PNG image.

Tesseract

Tesseract is an OCR engine that was originally developed by Hewlett-Packard, was later developed by Google for many years, and is now maintained as an open-source project. We will use it here to extract the text from the images. Tesseract is still widely regarded as one of the most reliable open-source OCR engines available.

https://github.com/tesseract-ocr/tesseract

Let's Code Our WebApp

We will use Eclipse as our IDE, and we just need to ensure we have the Spring tooling set up within it. If you don't already have Eclipse installed, you can download it from here:

https://www.eclipse.org/downloads/

Once you have Eclipse up and running, go to Help –> Eclipse Marketplace to access the Eclipse Marketplace. Search for Spring Tools in the search box and install it.

Once we have Spring Tools installed, we can create a new Spring project. From the menu, choose File–>New–>Other–>Spring Boot Starter Project.

Eclipse Spring Boot Starter Project

Fill in some details and then click Next.

Spring Boot Starter Project Maven Wizard

Select Thymeleaf as the template engine we will use for the web pages.

Spring Boot Starter Project Wizard Maven Dependencies

Click Finish and Eclipse should set up your Spring Boot project.

Spring Boot Project in Eclipse

Right click the application class, which will have been generated by the Spring Boot Tools plugin, and choose Run As –> Java Application. This starts your Spring Boot container, as this is the Spring Boot main class that fires up the application. Spring has more details and another simple application here: https://spring.io/guides/gs/spring-boot/

The container gives a warning because Thymeleaf can't find the default location for its templates, so let's create that now. Choose File–>New–>Other–>General–>Folder and then create the templates folder under src/main/resources.

Let's also create a template in there. For now we just need an HTML page: File–>New–>Other–>Web–>HTML File. Click Next and call it index.html, then Finish.

<!DOCTYPE html>
<html>
<head>
<meta charset="ISO-8859-1">
<title>Insert title here</title>
</head>
<body>
Hello World
</body>
</html>

Use the run button to run the application again.

When we run the Spring Boot application again, the Thymeleaf warning message should be gone, and we now have a working Spring Boot app.

Let's add the Spring Boot web dependency that we will need to run the web app to our pom.xml. Once you have added it, right click the project and then choose Maven–>Update Project.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

Now let's create a controller in Spring to handle the incoming HTTP requests.

package com.example.demo;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Controller;

@Controller
public class PngOcrController {

	Logger logger = LoggerFactory.getLogger(PngOcrController.class);

}

As it's in the same package as the main application class, Spring Boot will find it with its default settings; otherwise we would need to tell it where to look by adding a @ComponentScan annotation to the main Spring Boot application class. We also need the @Controller annotation to tell Spring Boot that we want this class to act as a controller.

Before we add a form to the homepage, we can run the application again. Spring Boot will start up and wait for our request:

Tomcat initialized with port(s): 8080 (http)
Starting service [Tomcat]
Starting Servlet engine: [Apache Tomcat/9.0.38]
Initializing Spring embedded WebApplicationContext
Root WebApplicationContext: initialization completed in 1896 ms
Initializing ExecutorService 'applicationTaskExecutor'
Adding welcome page template: index
Tomcat started on port(s): 8080 (http) with context path ''
Started PngOcrApplication in 3.419 seconds (JVM running for 4.288)

If we visit our homepage in the browser now at http://localhost:8080, we should see Hello World.

OK, so now we have the welcome page working in a Spring Boot web application. Let's add a form to accept the location of the PNG file and display the resulting text.

<!DOCTYPE html>
<html>
<head>
<meta charset="ISO-8859-1">
<title>Insert title here</title>
</head>
<body>

<form action="/action">
  <label for="directory">directory:</label><br>
  <input type="text" id="directory" name="directory" value="Directory"><br>
  <label for="result">Result:</label><br>
  <textarea id="result" name="result" rows="50" cols="150"></textarea><br><br>
  <input type="submit" value="Submit">
</form> 

</body>
</html>

If we stop and restart our Spring Boot app now and navigate to the homepage, we see the simple form.

If we click the submit button now, we get an error, as we haven't included any code in our controller to process the form yet.

Let's populate the directory field on the homepage with an actual local directory, as we will use that later. If we add a method mapped to the root page, it will run in the controller before the welcome page is shown. In this method, we populate the Spring MVC model with the directory name.

package com.example.demo;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;

@Controller
public class PngOcrController {

	Logger logger = LoggerFactory.getLogger(PngOcrController.class);

	@GetMapping("/")
	public String parameters(Model model) {
		model.addAttribute("directoryName", "C:/dev/test");

		return "index";
	}

}

We need to modify the welcome page, index.html, to use a Thymeleaf expression to read the value and write it out to the form field using:

th:value="${directoryName}"

We also add the Thymeleaf namespace to the top of the web page so that we don't get errors in the editor for the Thymeleaf expressions.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
	xmlns:th="http://www.thymeleaf.org">
<head>
<meta charset="ISO-8859-1">
<title>Insert title here</title>
</head>
<body>

<form action="#" th:action="@{/extractText}" method="post">
  <label for="directory">directory:</label><br>
  <input type="text" id="directory" name="directory" th:value="${directoryName}"/><br>
  <label for="result">Result:</label><br>
  <textarea id="result" name="result" rows="25" cols="75" readonly th:placeholder="${extractedText}"></textarea><br><br>
  <input type="submit" value="Submit">
</form> 

</body>
</html>

If we restart the app and reload the page, the directory field is now populated.

Let's add some logic to the controller to process the form. For now, let's show the directory in the result field to show we have received it.


	@PostMapping("/extractText")
	public String extractText(String directory, Model model) {
		model.addAttribute("extractedText", "Received: " + directory);

		return "index";
	}

Then modify index.html so the form points to our new mapping/method in the Spring MVC controller, and writes out the extractedText variable.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
	xmlns:th="http://www.thymeleaf.org">
<head>
<meta charset="ISO-8859-1">
<title>Insert title here</title>
</head>
<body>

<form action="#" th:action="@{/extractText}" method="post">
  <label for="directory">directory:</label><br>
  <input type="text" id="directory" name="directory" th:value="${directoryName}"/><br>
  <label for="result">Result:</label><br>
  <textarea id="result" name="result" rows="25" cols="75" readonly th:placeholder="${extractedText}"></textarea><br><br>
  <input type="submit" value="Submit">
</form> 

</body>
</html>

So now we have our webapp ready to display the result of our text extraction process. Let's proceed to extract the text from the image with Java source code.

Coding Extracting The Text From The PNG

We can create a new component to extract the text

package com.example.demo;

public class ExtractComponent {

}

and add that to our controller

@Autowired
ExtractComponent extractComponent;

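Now let's get the component to return some text so we know it's working. We also annotate it with @Component so that Spring can find it and inject it into the controller.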

package com.example.demo;

import org.springframework.stereotype.Component;

@Component()
public class ExtractComponent {

	public String extractFromPng() {
		return "Extracted Text";
	}

}

then modify the controller to call the extract component and return it to the welcome page


	@PostMapping("/extractText")
	public String extractText(String directory, Model model) {
		model.addAttribute("directoryName", directory);
		model.addAttribute("extractedText", extractComponent.extractFromPng());

		return "index";
	}

Now if we restart the app, and reload the home page we should see the dummy text.

So now we know that clicking submit on the homepage calls the extractText method in the controller, which then calls our extractFromPng method in the component and returns the dummy text. So let's focus on extracting the text. There are a number of things we need to do.

  • Load the first file we find in the directory
  • Run OCR on the file to extract the text as a string
  • Return the text so it's displayed in the web page

So here's the logic in the component to list the files in the directory and pick the first one:


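package com.example.demo;

import java.io.File;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component()
public class ExtractComponent {

    Logger logger = LoggerFactory.getLogger(ExtractComponent.class);

    public String extractFromPng(String directory) {
        File[] files = getFiles(directory);
        // listFiles returns null if the path is not a readable directory
        if (files == null || files.length == 0) {
            return "No files found in " + directory;
        }
        return processFiles(files[0]);
    }

    private File[] getFiles(String directory) {
        File dir = new File(directory);
        return dir.listFiles();
    }

    // For now, just log and return the path of the file we would process
    private String processFiles(File file) {
        logger.info(file.getAbsolutePath());
        return file.getAbsolutePath();
    }

}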

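The controller now needs to pass the directory through, calling extractComponent.extractFromPng(directory). When we click submit on the welcome page, the component logs the path of the first file in the directory:

com.example.demo.ExtractComponent        : C:\dev\test\sampe (1).png

This matches what we see in Windows File Explorer.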
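Taking the first entry returned by listFiles works for a folder that holds a single image, but it may pick up a non-PNG file, and the order of the returned array is not guaranteed. Here is a minimal sketch of a more defensive lookup, assuming we only care about files ending in .png and that alphabetical order is good enough. The names PngFinder and firstPng are hypothetical and not part of the project above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;
import java.util.Optional;

final class PngFinder {

    /** Returns the alphabetically first .png file in the directory, if there is one. */
    static Optional<File> firstPng(String directory) {
        File dir = new File(directory);
        // listFiles returns null when the path is not a directory or cannot be read
        File[] files = dir.listFiles((d, name) ->
                name.toLowerCase(Locale.ROOT).endsWith(".png"));
        if (files == null) {
            return Optional.empty();
        }
        return Arrays.stream(files)
                .filter(File::isFile) // skip folders that happen to end in .png
                .min(Comparator.comparing(File::getName));
    }
}
```

The component could then call PngFinder.firstPng(directory) and return a friendly message when the Optional is empty, rather than failing on an empty or missing directory.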

How do I use Tesseract to read text from an image?

As mentioned previously, Tesseract is an open-source OCR engine originally developed by HP, and it can function as an accurate image-to-text converter for our purposes. To make it easy to use from Java, we will use the Tess4J library. Tess4J wraps the native Tesseract API using JNA, which allows us to use Tesseract straight from Java without having to write any native calls ourselves.

Let's add the Tess4J dependency to our pom.xml. Spring Boot's dependency management does not cover Tess4J, so you will also need to add a <version> element for the release you want to use.

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
</dependency>

Let's modify our processFiles method in the ExtractComponent to use Tess4J to call Tesseract to do the OCR processing on the image.


    private String processFiles(File file) {
        logger.info(file.getAbsolutePath());

        // Needs imports from net.sourceforge.tess4j: ITesseract, Tesseract, TesseractException
        ITesseract _tesseract = new Tesseract();
        _tesseract.setDatapath("tessdata"); // folder that holds eng.traineddata
        _tesseract.setLanguage("eng"); // choose your language

        String result = "";
        try {
            result = _tesseract.doOCR(file);
        } catch (TesseractException e) {
            // Log the failure and keep the original exception as the cause
            logger.error("OCR failed for " + file.getAbsolutePath(), e);
            throw new RuntimeException("Error occurred running OCR on file", e);
        }

        return result;
    }

Restarting our app and rerunning, we get some errors:

Error opening data file tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Warning: Invalid resolution 0 dpi. Using 70 instead.
2020-09-30 08:57:08.629 ERROR 5404 --- [nio-8080-exec-1] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.Error: Invalid memory access] with root cause

java.lang.Error: Invalid memory access
	at com.sun.jna.Native.invokePointer(Native Method) ~[jna-5.4.0.jar:5.4.0 (b0)]

As the error message says, Tesseract couldn't find its trained data file: Error opening data file tessdata/eng.traineddata. To resolve this, we need to extract that file from the tess4j jar file, put it somewhere, and tell Tesseract where to find it.
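As an aside, the missing file surfaces above as a native "Invalid memory access" error, which is not easy to read. A small pre-flight check can turn it into a clear message before Tesseract is called. This is only a sketch, assuming the data path is the same "tessdata" folder passed to setDatapath and that the file is named after the language code. TessdataCheck and its methods are hypothetical helpers, not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TessdataCheck {

    /** Returns the expected location of the trained data file for a language. */
    static Path trainedDataFile(String dataPath, String language) {
        return Paths.get(dataPath).resolve(language + ".traineddata");
    }

    /** Fails fast with a readable message if the trained data file is missing. */
    static void requireTrainedData(String dataPath, String language) {
        Path file = trainedDataFile(dataPath, language);
        if (!Files.isRegularFile(file)) {
            throw new IllegalStateException("Missing " + file.toAbsolutePath()
                    + "; copy it there before running OCR");
        }
    }
}
```

Calling TessdataCheck.requireTrainedData("tessdata", "eng") at the start of processFiles would report the exact path that is missing.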

First work out where tess4j is in your local Maven repository. In your project, click on Maven Dependencies, scroll down until you find the tess4j jar, then right click it and choose Properties. This should tell you where to find your tess4j jar file.

Navigate to the folder where your jar file is, then use 7-zip to open and work with the contents of the jar file.

Navigate into the tessdata folder, right click the eng.traineddata file, and click Copy To. Copy it to a location of your choice.
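If you would rather not do this by hand, the same extraction can be scripted, since a jar file is just a zip archive. The following is a rough sketch that assumes the entry inside the jar is called tessdata/eng.traineddata, as seen in 7-zip above. TrainedDataExtractor is a hypothetical helper, and the jar path you pass in depends on your own Maven repository.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class TrainedDataExtractor {

    /**
     * Copies one entry (for example "tessdata/eng.traineddata") out of a jar
     * into targetDir, keeping only the file name. Returns the extracted file.
     */
    static Path extract(Path jar, String entryName, Path targetDir) throws IOException {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException(entryName + " not found in " + jar);
            }
            Files.createDirectories(targetDir);
            // Drop the folder part of the entry name, e.g. "tessdata/"
            String fileName = entryName.substring(entryName.lastIndexOf('/') + 1);
            Path target = targetDir.resolve(fileName);
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        }
    }
}
```

Passing the tess4j jar, the entry name and a tessdata folder as the target would leave eng.traineddata in that folder.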

Once you have that, create a tessdata folder in the root of your project, and then copy the eng.traineddata file into the tessdata folder. Then we need to tell Tesseract where that is. Edit the run configuration created to start the Spring Boot app (from the menu, choose Run –> Run Configurations… and select the right configuration). Then switch to the Environment tab.

Create a new variable called TESSDATA_PREFIX. This sets an environment variable that Tesseract can read to determine where the trained data is. If we restart our Spring Boot app, load the web page and click submit, we get the below.

So we see here that the text has been extracted from the original image. This happens here in the code, where Tess4J calls the Tesseract .dll and uses it to extract the text and return it as a string.

	    result = _tesseract.doOCR(file);

Let's have a look at the original image for comparison.

Conclusion

So we have created a simple Spring Boot project that calls a component that performs OCR on a PNG image and returns the text. Hopefully this will help you when you come to do a similar project of your own.
