|
The purpose of tessboxes is to make training tesseract
less painful. This description summarizes the various
steps as they apply to tesseract v.2.03. More comprehensive
information can be found at:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
In normal usage, tesseract can read the text in an image
as follows:
% tesseract input.tif output [-l lang]
This command writes any recognized text to the file
output.txt. The optional -l switch names the language to
use. The available languages can be found in a directory named
tessdata somewhere on the system. Due to limitations in
tesseract, the input image must be a black and white TIFF
file without an alpha layer etc.; otherwise output.txt will
be blank. The simplest way to ensure the image can be
recognized is to convert it to PBM format, and then convert
it back to TIFF, as follows:
% convert input.tif new_input.pbm
% convert new_input.pbm new_input.tif
tessboxes reads a pbm(5) file (which can optionally be
gzipped) for simplicity, so you will need this conversion anyway.
If you have a directory full of input files, this can be
done in bash(1) as follows:
% for f in *.tif; do convert "$f" "new_${f%.tif}.pbm"; convert "new_${f%.tif}.pbm" "new_$f"; done
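The loop above can also be wrapped in a function that is safe
to run twice. This is only a sketch, assuming bash and the
ImageMagick convert command; the names pbm_name and
normalize_all are made up for illustration and are not part
of tessboxes. Files that already carry the new_ prefix are
skipped, so converted images are not converted again.

```shell
#!/bin/bash
# Hypothetical helpers; the names are not part of tessboxes.

# Print the PBM name used for a given TIFF: scan.tif -> new_scan.pbm
pbm_name() {
    printf 'new_%s.pbm\n' "${1%.tif}"
}

# Convert every *.tif in the current directory to PBM and back to TIFF.
normalize_all() {
    local f pbm
    for f in *.tif; do
        [ -e "$f" ] || continue                 # no TIFF files present
        case "$f" in new_*) continue ;; esac    # already converted
        pbm=$(pbm_name "$f")
        convert "$f" "$pbm" && convert "$pbm" "new_$f"
    done
}
```

Running normalize_all in a directory of scans leaves the
originals untouched and adds a new_*.pbm and new_*.tif pair
for each of them.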
To train a new language, you must compile a set of
language files. Suppose that image.tif is a sample image.
First create a boxfile:
% tesseract image.tif boxes batch.nochop makebox
This creates a file boxes.txt with the coordinates of
boxes surrounding the characters in the image. Due to
limitations in tesseract, it is a good idea to rename the
file to match the image name, with a .box extension:
% mv boxes.txt image.box
You can edit the boxfile with tessboxes as follows:
% tessboxes -e image.pbm image.box
You should check that each box surrounds a character
properly and has the correct label for the character. (This
is tedious.)
Once you have created a few boxfiles, it remains to
compile them into a tesseract language. Here is the first
step in bash(1):
% for f in *.box; do tesseract "${f%.box}.tif" junk nobatch box.train; done
In this command, tesseract expects the TIFF image name,
and will find the corresponding boxfiles itself, which is
why we had to rename them earlier. For each boxfile, if the
command was successful, then you should now have a file with
the same name and a .tr extension (i.e. you now have
image.tif, image.pbm, image.box, image.tr).
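Since a missing .tr file is the clearest sign that a boxfile
was not processed, it can help to list the boxfiles that have
no .tr counterpart. The following is a minimal sketch,
assuming bash; the function name missing_tr is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list every boxfile in the current directory
# that has no .tr file of the same name.
missing_tr() {
    local box
    for box in *.box; do
        [ -e "$box" ] || continue      # no boxfiles present
        [ -e "${box%.box}.tr" ] || printf '%s\n' "$box"
    done
}
```

An empty result means every boxfile produced a .tr file; any
name printed is a boxfile worth reopening in tessboxes.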
You should watch out for error messages which indicate
FAILURE or FATALITY. These messages can occur when boxes
overlap, for example, and may indicate unprocessable data.
In the worst case, tesseract may not create a .tr
file at all. After a FAILURE, a box may be ignored, whereas
FATALITY or REBALANCE REQD occurs when tesseract has fewer
than 3 sample boxes for some character.
The easiest way to fix these types of problems is to
delete a box, or to change its coordinates. The -g switch
can be used to go directly to such problem boxes. Just cut
and paste the coordinates as given in a FAILURE message, for
example:
% tessboxes -e image.pbm image.box -g 1871,1154
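The shortage of samples mentioned above (fewer than 3 boxes
for some character) can be spotted before training by
counting labels. This is a rough sketch, assuming bash and
awk, and assuming that each boxfile line begins with the
character label followed by whitespace and the coordinates.
The function name rare_labels is hypothetical, and the count
is taken over all the files given, which may not match
exactly how tesseract tallies samples.

```shell
#!/bin/bash
# Hypothetical helper: print "label count" for every label that occurs
# fewer than 3 times across the boxfiles given as arguments.
# Assumption: the first whitespace-separated field of each line is
# the character label.
rare_labels() {
    awk 'NF { n[$1]++ }
         END { for (c in n) if (n[c] < 3) print c, n[c] }' "$@" | sort
}
```

For example, "rare_labels *.box" prints one line per
under-sampled character, which suggests where more sample
images are needed.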
When you have enough *.tr files, you can compile the
remaining language files as follows:
% mftraining *.tr
% cntraining *.tr
% unicharset_extractor *.box
It may be a good idea to combine several *.tr files if
they represent the same typeface. In that case, do the
following (the order of the files must be identical in all
commands):
% cat image1.tr image2.tr > combined.tr
% cat image1.box image2.box > combined.box
% mftraining combined.tr
% cntraining combined.tr
% unicharset_extractor combined.box
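One way to keep the file order identical is to build both
combined files from a single list of base names. The
following is a sketch only, assuming bash; the function name
combine is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: for each NAME given, append NAME.tr to
# combined.tr and NAME.box to combined.box. Both outputs are built
# in one pass over the same argument list, so their order matches.
combine() {
    local name
    : > combined.tr
    : > combined.box
    for name in "$@"; do
        cat "$name.tr"  >> combined.tr  || return 1
        cat "$name.box" >> combined.box || return 1
    done
}
```

With this, "combine image1 image2" takes the place of the two
cat commands, and the remaining training commands run on
combined.tr and combined.box as before.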
Now choose a name for your language, e.g.
"mylang". Due to limitations in tesseract, all the
compiled language files must be named mylang.* and must
reside in a directory called tessdata. Therefore:
% mkdir tessdata
% mv inttemp tessdata/mylang.inttemp
% mv normproto tessdata/mylang.normproto
% mv pffmtable tessdata/mylang.pffmtable
% mv unicharset tessdata/mylang.unicharset
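The renaming step can be written as a loop that takes the
language name as a parameter. This is a minimal sketch,
assuming bash; the function name install_lang is
hypothetical. It checks that all four compiled files exist
before moving any of them, so a failed training run does not
leave a half-filled tessdata directory.

```shell
#!/bin/bash
# Hypothetical helper: move the four compiled files from the current
# directory into ./tessdata, prefixed with the given language name.
install_lang() {
    local lang=$1 f
    # Refuse to start unless every compiled file is present.
    for f in inttemp normproto pffmtable unicharset; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    mkdir -p tessdata || return 1
    for f in inttemp normproto pffmtable unicharset; do
        mv "$f" "tessdata/$lang.$f" || return 1
    done
}
```

For the example in this description, the call would be
"install_lang mylang".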
You still need some extra files. If you’re training
a variant of English, then you can simply copy the tesseract
system files. Find your system tessdata directory. For
example:
% cp /usr/share/tessdata/eng.DangAmbigs tessdata/mylang.DangAmbigs
% cp /usr/share/tessdata/eng.freq-dawg tessdata/mylang.freq-dawg
% cp /usr/share/tessdata/eng.word-dawg tessdata/mylang.word-dawg
% cp /usr/share/tessdata/eng.user-words tessdata/mylang.user-words
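At this point the tessdata directory should hold eight files
for the language: the four compiled ones and the four copied
ones. A quick check along these lines can confirm that
nothing was forgotten. It is a sketch, assuming bash; the
function name missing_lang_files is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the language files still missing from a
# tessdata directory. Usage: missing_lang_files LANG [DIR]
# DIR defaults to ./tessdata.
missing_lang_files() {
    local lang=$1 dir=${2:-tessdata} ext
    for ext in inttemp normproto pffmtable unicharset \
               DangAmbigs freq-dawg word-dawg user-words; do
        [ -e "$dir/$lang.$ext" ] || printf '%s\n' "$lang.$ext"
    done
}
```

No output from "missing_lang_files mylang" means all eight
files named in this description are in place.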
You are now done. To read a new image file with the
language "mylang", try this:
% export TESSDATA_PREFIX=./tessdata
% tesseract image.tif output -l mylang
If you don’t want to set TESSDATA_PREFIX, you can
also copy all the files tessdata/mylang.* into the system
tessdata directory you found earlier.
|