OCR that works

OCR (optical character recognition: image to text) with Tesseract.

The title "OCR that works" is a comparison with my first attempt at free OCR, "gocr", which was about as useful as

  rm -f in.xxx
  strings </dev/urandom | fmt | head -n500 >out.txt

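Given an input jpeg in.jpg containing single-column text, the commands are:

  convert in.jpg tmp.ppm
  unpaper tmp.ppm tmp_.ppm
  convert tmp_.ppm tmp.tif
  tesseract tmp.tif out -l eng

Here "convert" is from the ImageMagick package, and "unpaper" cleans up and deskews the scanned page before recognition.

This produces a file "out.txt" of the OCR'd text.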

Do similarly for pdf input, but instead of starting with

  convert in.jpg tmp.ppm

use

  pdftoppm <in.pdf >tmp.ppm

where "pdftoppm" is from the poppler-utils package ("convert" might also work on the pdf directly, given a suitably high resolution setting).
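
As a rough sketch of the resolution point: both tools accept a resolution in DPI. The value 300 and the two function names below are assumptions chosen for illustration, not settings tested for this page, and the "convert" route additionally relies on Ghostscript being installed.

```shell
# Sketch only: two ways to rasterise a single-page PDF at a chosen
# resolution. Function names and the value 300 are illustrative.

# poppler's pdftoppm: -r sets the resolution in DPI
pdf_to_ppm_poppler() {
    pdftoppm -r 300 <"$1" >"$2"
}

# ImageMagick: -density goes before the input file so that it is
# applied while the PDF is being rendered
pdf_to_ppm_convert() {
    convert -density 300 "$1" "$2"
}
```

Either function would be called as e.g. "pdf_to_ppm_poppler in.pdf tmp.ppm", after which the unpaper, convert and tesseract steps continue as before.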

For whole documents, put the commands in a loop, using e.g.

  pdftk in.pdf burst

to separate a multi-page pdf file into single pages.
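
The loop might look something like the sketch below. It assumes pdftk, pdftoppm, unpaper, convert and tesseract are all installed; the function name "ocr_pdf", the temporary directory and the combined output file are inventions for this example rather than part of any of those tools.

```shell
#!/bin/sh
# Sketch: OCR a multi-page PDF one page at a time and collect the text.
# "ocr_pdf" and its arguments are illustrative names.

ocr_pdf() {
    in=$1                          # input PDF
    out=$2                         # combined text output
    work=$(mktemp -d) || return 1  # scratch directory for the pages

    # One PDF per page: pg_0001.pdf, pg_0002.pdf, ...
    # (pdftk may also leave a doc_data.txt report behind)
    pdftk "$in" burst output "$work/pg_%04d.pdf" || return 1

    : >"$out"
    for page in "$work"/pg_*.pdf; do
        base=${page%.pdf}
        # Same steps as for a single page; a page that fails is skipped
        pdftoppm <"$page" >"$base.ppm"        &&
        unpaper "$base.ppm" "${base}_.ppm"    &&
        convert "${base}_.ppm" "$base.tif"    &&
        tesseract "$base.tif" "$base" -l eng  &&
        cat "$base.txt" >>"$out"
    done
    rm -rf "$work"
}
```

It would be called as e.g. "ocr_pdf in.pdf out.txt". The zero-padded page names keep the pages in order when the shell expands pg_*.pdf.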

Page started: 2011-05-31
Last change: 2011-05-31