OCR (optical character recognition: image to text) with Tesseract.
"OCR that works" refers to the comparison with my first attempts at Free OCR -- "gocr" -- which is about as useful as
    rm -f in.xxx
    strings </dev/urandom | fmt | head -n500 >out.txt
Given an input jpeg in.jpg containing single-column text, the commands are:
    convert in.jpg tmp.ppm
    unpaper tmp.ppm tmp_.ppm
    convert tmp_.ppm tmp.tif
    tesseract tmp.tif out -l eng
This produces a file "out.txt" of the OCR'd text.
Do similarly for pdf input, but instead of starting with
    convert in.jpg tmp.ppm
use
    pdftoppm <in.pdf >tmp.ppm
where "pdftoppm" is from the poppler-utils package (convert might also be ok, if set to a suitably high resolution).
Put this in a loop for whole documents, using e.g.
    pdftk in.pdf burst
to separate a multi-page pdf file into single pages.
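A rough sketch of such a loop, for illustration only. The function name "ocr_pdf", the temporary directory and the file names are my own placeholders, not part of any of the tools. It assumes pdftk, pdftoppm, unpaper, convert and tesseract are all installed, and that pdftoppm reads stdin and writes stdout as in the command above.

```shell
# ocr_pdf IN.pdf OUT.txt
# Hypothetical helper: burst a PDF into pages, OCR each page, and
# concatenate the per-page text into OUT.txt in page order.
ocr_pdf() {
    in_pdf=$1
    out_txt=$2
    work=$(mktemp -d) || return 1

    # burst names pages pg_0001.pdf, pg_0002.pdf, ... by default;
    # the output pattern here only moves them into the work directory.
    pdftk "$in_pdf" burst output "$work/pg_%04d.pdf" || { rm -rf "$work"; return 1; }

    : >"$out_txt"
    for page in "$work"/pg_*.pdf; do
        base=${page%.pdf}
        # Same steps as for a single image; a failing page is reported and skipped.
        pdftoppm <"$page" >"$base.ppm" &&
        unpaper "$base.ppm" "${base}_.ppm" &&
        convert "${base}_.ppm" "$base.tif" &&
        tesseract "$base.tif" "$base" -l eng &&
        cat "$base.txt" >>"$out_txt" ||
        echo "warning: OCR failed for $page" >&2
    done

    rm -rf "$work"
}
```

After defining the function, something like "ocr_pdf in.pdf out.txt" should leave the text of all pages in out.txt. Intermediate files go in a temporary directory so that tmp.ppm and friends from different pages cannot overwrite each other.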
Page started: 2011-05-31
Last change: 2011-05-31