Indexing Postscript and PDF documents containing mathematical equations

Users with large personal document collections invariably want to make them easily searchable at some point. This can be accomplished on UNIX with free tools such as ht://Dig, SWISH-E and glimpse. If some of the documents are in Postscript or PDF format, they must first be converted to plain text using a tool such as pstotext(1).

For mathematical documents, conversion with pstotext(1) results in text interspersed with many lines of random characters, because displayed equations aren't handled properly. In these cases, dbacl(1) can act as a filter to remove the noise lines, by passing through only those lines which appear to be mostly English text. This keeps most of the noise out of the list of indexed terms, although the filtering is not perfect.

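The following shell command converts a Postscript or PDF document and filters out the noisy lines. Here -c shake classifies the input against the "shake" category, -R adds a built-in category for purely random text as the competitor, and -f shake prints only the lines that are classified as "shake":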

% pstotext Diestel-Graph_Theory.pdf | dbacl -c shake -Rf shake > output.txt

For this to work, both pstotext(1) and dbacl(1) must be in your path, and an "English text" category named "shake" must already exist. If necessary, create it by learning from a sample of English text as follows:

% zcat Shakespeare-Complete_Works.txt.gz | dbacl -l shake 

The sample English text can be freely downloaded, e.g. from Project Gutenberg.
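
For a whole collection, the check and the conversion described above can be wrapped in two small shell functions. This is only a sketch under stated assumptions: the function names require_tools and convert_all, and the choice of writing the output to NAME.pdf.txt next to each document, are illustrative and not part of pstotext(1) or dbacl(1). It also assumes the "shake" category has already been learned and can be found by dbacl(1) from wherever the function is run.

```shell
#!/bin/sh
# Sketch only: require_tools, convert_all and the ".txt" output naming
# are illustrative assumptions, not features of pstotext or dbacl.

# Succeed only if every named command can be found by the shell.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done
}

# Convert each .ps/.pdf file in directory $1 to filtered plain text,
# writing the result to FILE.txt beside the original FILE.
# Assumes the "shake" category already exists.
convert_all() {
    dir=$1
    require_tools pstotext dbacl || return 1
    for doc in "$dir"/*.ps "$dir"/*.pdf; do
        [ -f "$doc" ] || continue    # skip patterns that matched nothing
        pstotext "$doc" | dbacl -c shake -Rf shake > "$doc.txt"
    done
}
```

After loading these definitions into a shell, a call such as convert_all "$HOME/papers" (a hypothetical directory) would leave one filtered text file beside each document, ready to be handed to the indexer of your choice.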