OCR made easy using tesserocr

There are numerous OCR libraries for Python, but tesserocr is the only one I found with a decent, approachable API.

In this blog post we’ll use tesserocr to extract text from a nutrition facts image.

What is it exactly?

tesserocr is a simple, Pillow-friendly wrapper around the tesseract-ocr API.
Pillow is a friendly fork of PIL, the Python Imaging Library.
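
To make "Pillow-friendly" concrete, here is a minimal sketch that hands a Pillow image object (rather than a file path) to tesserocr through SetImage. The function name ocr_pil_image and the grayscale conversion are my own illustrative choices, not something tesserocr requires, and the imports sit inside the function only so that the snippet can be loaded on a machine where Tesseract is not installed yet.

```python
def ocr_pil_image(path):
    """Return the text Tesseract recognises in the image at `path`.

    `ocr_pil_image` is a hypothetical helper name used for illustration.
    """
    # Imported lazily so this snippet still loads where tesserocr is missing.
    from PIL import Image
    from tesserocr import PyTessBaseAPI

    with Image.open(path) as image, PyTessBaseAPI() as api:
        # Any Pillow operation can go here before OCR; converting to
        # grayscale is just an example of such a preprocessing step.
        api.SetImage(image.convert("L"))
        return api.GetUTF8Text()
```

Calling ocr_pil_image("/path/to/chocolate.jpg") should return the recognised text as a single string, assuming the requirements listed in the Demo section are installed.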

Demo

We’ll extract text from this image:

[Image: chocolate-nutrition-facts]

First, install all the requirements:

$ sudo apt install tesseract-ocr \
    libtesseract-dev \
    libleptonica-dev
$ pip install Pillow cython tesserocr

Now save the following gist as ocr.py:

from tesserocr import PyTessBaseAPI
import sys
import os

# tesserocr -> https://pypi.python.org/pypi/tesserocr
# cython -> https://pypi.python.org/pypi/Cython
# Pillow -> https://pypi.python.org/pypi/Pillow

if len(sys.argv) != 2:
    print("you need to pass the path to the image as first argument")
    sys.exit(1)

path = sys.argv[1]
if not os.path.exists(path):
    print("image doesn't exist at: " + path)
    sys.exit(2)

# Run Tesseract on the image and keep only the non-empty, stripped lines.
with PyTessBaseAPI() as api:
    api.SetImageFile(os.path.abspath(path))
    lines = [l.strip() for l in api.GetUTF8Text().split("\n")
             if l.strip() != ""]

for l in lines:
    print(l)
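
The line-cleaning step in the script above is plain string handling, so it can be pulled out and tried without Tesseract at all. The sketch below does that; clean_lines is a name I made up for illustration, and the sample string only mimics the kind of text GetUTF8Text() returns.

```python
def clean_lines(text):
    """Split OCR output into lines, dropping blank ones and outer whitespace.

    `clean_lines` is a hypothetical helper, not part of tesserocr.
    """
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line]


# A made-up sample shaped like raw OCR output, with stray blanks and spaces.
sample = "  Nutrition Facts \n\n Protein 9g\n\n"
print(clean_lines(sample))
```

Keeping this step in its own function makes it easy to check the cleanup on hand-written strings before pointing the script at real images.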

And voilà!

$ python ocr.py /path/to/chocolate.jpg

Nutrition Facts
Serving Size 1 cup (249g)
Servings Per Container 8
Amount Per Sewing
Calories 210 Calories from Fat 80
% Daily Value
Total Fat 8g 13%
Saturated Fat 5g 26%
Trans Fat 0g
Cholesterol 30mg 10%
Sodium 200mg 9%
Total Carbohydrate 27g 9%
Dietary Fiber 1g 5%
Sugars 25g
Protein 9g
Vitamin A 6% Vitamin C 0%
Calcium 30% Iron 6%
Vitamin D 30%
*Percent Daily Values are based on a 2,000 calorie
diet.