tessboxes reads a black-and-white PBM image file and a corresponding box file, of the kind used to train the tesseract(1) OCR tool to recognize new languages and character sets. tessboxes draws the boxes on the image, and can also be used to edit the box file interactively.

When the -e switch is not given, tessboxes writes a colour ppm(5) file to STDOUT, with the boxes overlaid on the original image. It is intended as a simple tool that can be used as a component of a more comprehensive training process, so the input and output formats are deliberately as simple as possible.

% tessboxes image.pbm boxes | pnmtopng > image_with_boxes.png

When the -e switch is used, tessboxes becomes an interactive editor for the BOXFILE. The terminal shows a list of labelled boxes, while the corresponding bitmap is shown in a separate X11 window. Typing one or more ordinary keys replaces the label of the current box. Typing Ctrl+F cycles through faster editing modes for bulk processing, where the cursor moves automatically to the next box, ENTER/SPACE moves forward in some modes, and BACKSPACE moves backward.

The following special keys are recognized, and do not change the current label.

Ctrl+x             quit editor and save BOXFILE
ESC                quit editor but do NOT save BOXFILE
arrow keys         select a new box in the list
Ins                insert a new (blank) box just before the current box
Del                delete the current box
Ctrl+arrow keys    grow or shrink the currently selected box
Alt+arrow keys     move the currently selected box keeping its size
Alt+s              shrink the currently selected box
Alt+c              crop the currently selected box (can use repeatedly)
Alt+a              crop ALL the boxes in the image at once
Ctrl+F             cycle through fast(er) editing modes
Ctrl+A             toggle append/overwrite mode (default is overwrite)
Ctrl+Z             cycle through magnification factors up to MAGNIFY
F1-F8              annotate symbol with a predefined string

EXIT STATUS

tessboxes returns zero on success, nonzero if an error occurs.

TRAINING LANGUAGES

The purpose of tessboxes is to make training tesseract less painful. The following description summarizes the various steps as they apply to tesseract v.2.03. More comprehensive information can be found here:

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

In normal usage, tesseract can read the text in an image as follows:

% tesseract input.tif output [-l lang]

This command writes any recognized text to the file output.txt. The optional -l switch names the language to use; all available languages can be found in a directory named tessdata somewhere on the system. Due to limitations in tesseract, the input image must be a black-and-white TIFF file with no alpha layer or other extras, otherwise output.txt will be blank. The simplest way to ensure that the image can be recognized is to convert it to PBM format and then back to TIFF, as follows:

% convert input.tif new_input.pbm
% convert new_input.pbm new_input.tif

For simplicity, tessboxes reads a pbm(5) file (optionally gzipped), so you will need the PBM conversion anyway. If you have a directory full of input files, the conversion can be done in bash(1) as follows:

% for f in *.tif; do
    convert "$f" "new_${f%.tif}.pbm"
    convert "new_${f%.tif}.pbm" "new_$f"
done

To train a new language, you must compile a set of language files. Suppose that image.tif is a sample image. First create a boxfile:

% tesseract image.tif boxes batch.nochop makebox

This creates a file boxes.txt with the coordinates of boxes surrounding the characters in the image. Due to limitations in tesseract, it is a good idea to rename the file to match the image name, with a .box extension:

% mv boxes.txt image.box

You can edit the boxfile with tessboxes as follows:

% tessboxes -e image.pbm image.box

You should check that each box surrounds a character properly and has the correct label for the character. (This is tedious.)

Once you have created a few boxfiles, it remains to compile them into a tesseract language. Here is the first step in bash(1):

% for f in *.box; do
    tesseract "${f%.box}.tif" junk nobatch box.train
done

In this command, tesseract expects the TIFF image name and finds the corresponding boxfile itself, which is why we had to rename the boxfiles earlier. For each boxfile, if the command was successful, you should now have a file with the same name and a .tr extension (i.e. you now have image.tif, image.pbm, image.box, image.tr).
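
Since a failed run may leave no .tr file behind, it can be useful to list the boxfiles that still lack one. The following is only a sketch, not part of tessboxes or tesseract; the function name missing_tr is made up for illustration, and it assumes the naming convention described above (image.box next to image.tr).

```shell
# Hypothetical helper (not part of tessboxes or tesseract).
# Prints every *.box file in directory DIR (default: current directory)
# that has no .tr file with the same base name.
missing_tr() {
    dir=${1:-.}
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue            # the glob matched nothing
        tr_file="${box%.box}.tr"
        [ -f "$tr_file" ] || printf '%s\n' "$box"
    done
}
```

After pasting the function into the shell, run it in the training directory:

% missing_tr .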

You should watch out for error messages which indicate FAILURE or FATALITY. These messages can occur when boxes overlap, for example, and may indicate unprocessable data. In the worst case, tesseract may not create a .tr file at all. In a FAILURE, a box may be ignored, whereas FATALITY or REBALANCE REQD occur when tesseract has fewer than 3 sample boxes for some character.
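
Because FATALITY and REBALANCE REQD are tied to characters with fewer than 3 sample boxes, counting the samples per label before training can save a round trip. The following is a minimal sketch, not part of tessboxes; it assumes that the label is the first whitespace-separated field on each line of the boxfile, and the function name low_sample_labels is hypothetical.

```shell
# Hypothetical helper (not part of tessboxes or tesseract).
# Usage: low_sample_labels MIN BOXFILE...
# Prints "LABEL COUNT" for every label that occurs fewer than MIN times.
# Assumes the label is the first whitespace-separated field of each line.
low_sample_labels() {
    min=$1
    shift
    awk -v min="$min" '
        NF > 0 { count[$1]++ }               # tally each label, skip blank lines
        END {
            for (label in count)
                if (count[label] < min)
                    print label, count[label]
        }
    ' "$@" | sort
}
```

After pasting the function into the shell, it can be applied to one or more boxfiles:

% low_sample_labels 3 image.box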

The easiest way to fix these types of problems is to delete a box, or to change its coordinates. The -g switch can be used to go directly to such problem boxes. Just copy and paste the coordinates as given in a FAILURE message, for example:

% tessboxes -e image.pbm image.box -g 1871,1154

When you have enough *.tr files, you can compile the remaining language files as follows:

% mftraining *.tr
% cntraining *.tr
% unicharset_extractor *.box

It may be a good idea to combine several *.tr files if they represent the same typeface. In that case, do the following (the order of the files must be identical in all commands):

% cat image1.tr image2.tr > combined.tr
% cat image1.box image2.box > combined.box
% mftraining combined.tr
% cntraining combined.tr
% unicharset_extractor combined.box
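
One way to guarantee that the .tr and .box files are concatenated in the same order is to build both from a single list of base names. This is only a sketch under that assumption; the function name combine_training is made up for illustration and is not part of tesseract.

```shell
# Hypothetical helper (not part of tessboxes or tesseract).
# Usage: combine_training NAME BASE...
# Concatenates BASE.tr into NAME.tr and BASE.box into NAME.box,
# using the same order of BASE arguments for both outputs.
combine_training() {
    name=$1
    shift
    # Refuse to start if any input is missing, so no partial output is written.
    for base in "$@"; do
        if [ ! -f "$base.tr" ] || [ ! -f "$base.box" ]; then
            echo "missing $base.tr or $base.box" >&2
            return 1
        fi
    done
    : > "$name.tr"
    : > "$name.box"
    for base in "$@"; do
        cat "$base.tr"  >> "$name.tr"
        cat "$base.box" >> "$name.box"
    done
}
```

After pasting the function into the shell, the two cat commands above become:

% combine_training combined image1 image2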

Now choose a name for your language, e.g. "mylang". Due to limitations in tesseract, all the compiled language files must be named mylang.* and must reside in a directory called tessdata. Therefore:

% mkdir tessdata
% mv inttemp tessdata/mylang.inttemp
% mv normproto tessdata/mylang.normproto
% mv pffmtable tessdata/mylang.pffmtable
% mv unicharset tessdata/mylang.unicharset
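
If you train several languages, the four mv commands can be wrapped in a small loop. The sketch below assumes that the four files named above are in the current directory; the function name install_lang is hypothetical.

```shell
# Hypothetical helper (not part of tessboxes or tesseract).
# Usage: install_lang LANG
# Moves the four compiled files from the current directory to tessdata/LANG.*
install_lang() {
    lang=$1
    mkdir -p tessdata || return 1
    for part in inttemp normproto pffmtable unicharset; do
        mv "$part" "tessdata/$lang.$part" || return 1
    done
}
```

After pasting the function into the shell:

% install_lang mylang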

You still need some extra files. If you’re training a variant of English, then you can simply copy the tesseract system files. Find your system tessdata directory (/usr/share/tessdata in this example) and copy the files:

% cp /usr/share/tessdata/eng.DangAmbigs tessdata/mylang.DangAmbigs
% cp /usr/share/tessdata/eng.freq-dawg tessdata/mylang.freq-dawg
% cp /usr/share/tessdata/eng.word-dawg tessdata/mylang.word-dawg
% cp /usr/share/tessdata/eng.user-words tessdata/mylang.user-words

You are now done. To read a new image file with the language "mylang", try this:

% export TESSDATA_PREFIX=./tessdata/
% tesseract image.tif output -l mylang

If you don’t want to set TESSDATA_PREFIX (never forget the trailing /), you can also copy all the files tessdata/mylang.* into the system tessdata directory you found earlier.
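
Since the trailing / is easy to forget, a script can add it defensively before exporting the variable. The function below is a small sketch; its name with_trailing_slash is made up for illustration.

```shell
# Hypothetical helper (not part of tessboxes or tesseract).
# Prints its argument, appending a "/" only if it does not already end in one.
with_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```

After pasting the function into the shell:

% export TESSDATA_PREFIX=$(with_trailing_slash ./tessdata)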

OPTIONS

	-e		Edit the BOXFILE. This consists of an interactive editor in the current terminal, and a graphical window showing the boxes surrounding the letters. The window can be resized as convenient. In the editor, the highlight can be moved with the cursor keys, and anything typed will replace the box label. To change the box dimensions, use ALT or CTRL and the cursor keys.
	-g XY		After loading the BOXFILE, go directly to the first box whose corner coordinates are XY. The string XY can be either "NUMBER,NUMBER" or "(NUMBER,NUMBER)". If the coordinates are not found, tessboxes exits immediately. This switch turns on -e automatically.
	-s FACTOR	In conjunction with the -e switch, shift the highlighted box horizontally in the graphical display by a FACTOR in the range [0.01, 0.99].
	-2		Save box files in Tesseract v2 legacy format. The default is to save in the current v3 format, which has an extra page number column at the end of each line.
	-m MAGNIFY	In conjunction with the -e switch, magnify the image by the integer factor MAGNIFY. Use Ctrl+Z to cycle through the magnification factors, including the original size.
	-a ANNOT	Redefine annotation phrases. Expects a string ANNOT of up to 8 phrases delimited by semicolons, e.g. ";bold;italic" associates the empty phrase with F1, "bold" with F2, "italic" with F3 and leaves F4-F8 at their defaults.
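
To check an ANNOT string before passing it to -a, the mapping described above can be previewed in the shell. This sketch does not call tessboxes; it only splits the string on semicolons in the way the -a description suggests, and the function name show_annotations is hypothetical. Keys that the string does not mention are not printed, since they keep their defaults.

```shell
# Illustration only (not part of tessboxes).
# Usage: show_annotations ANNOT
# Prints one "Fn=PHRASE" line for each of the first 8 semicolon-delimited
# phrases in ANNOT, in the order they would be assigned to keys F1-F8.
show_annotations() {
    printf '%s\n' "$1" | awk -F';' '{
        for (i = 1; i <= NF && i <= 8; i++)
            printf "F%d=%s\n", i, $i
    }'
}
```

After pasting the function into the shell:

% show_annotations ";bold;italic"
F1=
F2=bold
F3=italic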

NOTES

The annotations (keys F1-F8) are saved as comments at the end of each line of the box file. This shouldn’t cause problems with tesseract(1), since (at least in v3.x of tesseract) the extra information is ignored. If the file format is ever changed, this will become a bug.

BUGS

tessboxes uses too much CPU when idle.

SOURCE

The source code for the latest version of this program is available at the following location:

http://www.lbreyer.com/gpl.html

AUTHOR

Laird A. Breyer <laird@lbreyer.com>