OCR made easy using tesserocr

2017-01-08

There are numerous OCR libraries for Python. tesserocr is the only library I found that has a decent, humanly-approachable API.

In this blog post we’ll use tesserocr to extract text from a nutrition facts image.

What is it exactly?

tesserocr is a simple, Pillow-friendly wrapper around the tesseract-ocr API.
Pillow is a friendly PIL fork (PIL is the Python Imaging Library).
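
To make "Pillow-friendly" concrete, here is a minimal sketch that opens an image with Pillow and passes it to tesserocr's image_to_text helper. It assumes Pillow, tesserocr, and the English Tesseract language data are installed; the function name ocr_pillow_image and its lang parameter are made up for this illustration.

```python
def ocr_pillow_image(image_path, lang="eng"):
    """Open an image with Pillow and return the text tesserocr recognizes."""
    # Imported inside the function so this module can still be loaded on a
    # machine where the native Tesseract libraries are not installed yet.
    from PIL import Image
    import tesserocr

    # Image.open returns a Pillow image, which tesserocr accepts directly.
    with Image.open(image_path) as image:
        return tesserocr.image_to_text(image, lang=lang)
```

Calling ocr_pillow_image("/path/to/chocolate.jpg") should return the recognized text as a single string, newlines included.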

Demo

We’ll extract text from this image:

chocolate-nutrition-facts

First, install all the requirements:

$ sudo apt install tesseract-ocr \
                   libtesseract-dev \
                   libleptonica-dev
$ pip install Pillow cython tesserocr

Now save the following script as ocr.py:

from tesserocr import PyTessBaseAPI
import sys
import os

# tesserocr -> https://pypi.python.org/pypi/tesserocr
# cython -> https://pypi.python.org/pypi/Cython
# Pillow -> https://pypi.python.org/pypi/Pillow

if len(sys.argv) != 2:
    print("you need to pass the path to the image as first argument")
    sys.exit(1)

path = sys.argv[1]
if not os.path.exists(path):
    print("image doesn't exist at: " + path)
    sys.exit(2)

with PyTessBaseAPI() as api:
    api.SetImageFile(os.path.abspath(path))
    # split the recognized text into lines and drop the empty ones
    lines = [l.strip() for l in api.GetUTF8Text().split("\n")
             if l.strip() != ""]

for l in lines:
    print(l)

And voilà!

$ python ocr.py /path/to/chocolate.jpg

Nutrition Facts  
 Serving Size 1 cup (249g)
 Servings Per Container 8
 Amount Per Sewing
 Calories 210 Calories from Fat 80
 % Daily Value
 Total Fat 8g 13%  
 Saturated Fat 5g 26%  
 Trans Fat 0g  
 Cholesterol 30mg 10%  
 Sodium 200mg 9%  
 Total Carbohydrate 27g 9%  
 Dietary Fiber 1g 5%  
 Sugars 25g  
 Protein 9g  
 Vitamin A 6% Vitamin C 0%  
 Calcium 30% Iron 6%  
 Vitamin D 30%  
 *Percent Daily Values are based on a 2,000 calorie  
 diet.
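
The cleanup step in the script (strip each line, drop the empty ones) can also be pulled into a small function, so it can be tried without Tesseract installed. This is only a sketch: clean_ocr_lines and the sample string are made up for illustration, and it uses str.splitlines() where the script splits on "\n".

```python
def clean_ocr_lines(raw_text):
    """Return the non-empty lines of raw_text, stripped of surrounding whitespace."""
    # str.splitlines() treats "\n" and "\r\n" alike as line breaks.
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# Hypothetical raw text, shaped like what GetUTF8Text() might return.
sample = "Nutrition Facts  \n\n Serving Size 1 cup (249g)\n   \n"

for line in clean_ocr_lines(sample):
    print(line)
```

Keeping this step separate from the OCR call makes it easy to adjust later, for example to filter out very short lines.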