Language models, classification and dbacl

Laird A. Breyer

Introduction

This is a non-mathematical tutorial on how to use the dbacl Bayesian text classifier. The mathematical details can be read here.

dbacl is a UNIX command line tool, so you will need to work at the shell prompt (here written %; the examples assume Bash semantics). The program comes with six sample text documents and a few scripts. Look for them in the same directory as this tutorial, or you can use any other plain text documents instead. Make sure the sample documents you will use are in the current working directory. You need all the *.txt, *.pl and *.risk files.

The tutorial below is geared towards generic text classification. If you intend to use dbacl for email classification, please read this.

For the impatient

dbacl has two major modes of operation. The first is learning mode, where one or more text documents are analysed to find out what makes them look the way they do. At the shell prompt, type (without the leading %)

% dbacl -l one sample1.txt
% dbacl -l two sample2.txt

This creates two files named one and two, which contain the important features of each sample document.

The second major mode is classification mode. Let's say that you want to see if sample3.txt is closer to sample1.txt or sample2.txt; type

% dbacl -c one -c two sample3.txt -v
one

and dbacl should tell you it thinks sample3.txt is less like two (which is the category learned from sample2.txt) and more like one (which is the category learned from sample1.txt). That's it.

You can create as many categories as you want, one, two, three, good, bad, important, jokes, but remember that each one must be learned from a representative collection of plain text documents.

dbacl is designed to be easy to use within a script, so you can make it part of your own projects, perhaps a spam detection script, or an agent which automatically downloads the latest newspaper articles on your favourite topic...
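
For example, here is a minimal sketch of such a script. The file name classify.sh, the category names and the destination directories are assumptions made for illustration; the only dbacl behaviour it relies on is the one shown above, namely that -v prints the name of the winning category.

#!/bin/bash
# classify.sh (hypothetical): file a document under the directory named
# after the category that dbacl -v reports as the best match.
# Assumes the categories "one" and "two" were learned as above and that
# the directories ./one and ./two already exist.
best=$(dbacl -c one -c two "$1" -v)
case "$best" in
    one) cp "$1" one/ ;;
    two) cp "$1" two/ ;;
    *)   echo "unexpected answer: $best" >&2 ;;
esac

You would call it as ./classify.sh sample3.txt.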

Tip: The category files are created in the current directory, or whichever path you indicate. If you want, you can keep all your category files in a single common directory. For example, if you use the bash shell, type

% mkdir $HOME/.dbacl
% echo "DBACL_PATH=$HOME/.dbacl" >> $HOME/.bashrc
% source $HOME/.bashrc
From now on, all your categories will be kept in $HOME/.dbacl, and searched for there.
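
For example, assuming your shell has picked up the new setting, a freshly learned category now lands in that directory instead of the current one:

% dbacl -l one sample1.txt
% ls $HOME/.dbacl
one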

Language models

dbacl works by scanning the text it learns for features, which can be nearly anything you like. For example, unless you tell it otherwise, the standard features are all alphabetic single words in the document. dbacl builds a statistical model, ie a probability distribution, based only on those features, so anything that is not a feature will be ignored both during learning, and during classification.

This dependence on features is a double-edged sword. It helps dbacl focus on the things that matter (single alphabetic words by default), but anything else that matters more will be ignored unless you tell dbacl that it, too, is a feature. Choosing the features is the hard part, and it's up to you.

When telling dbacl what kind of features to look out for, you must use the language of regular expressions. For example, if you think the only interesting features are words which contain the letter 'q', then you would type

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
  -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt

The rule is that dbacl always takes as a feature whatever it finds within round brackets. Reading this can be painful if you don't know regular expressions, however.

In English, the first expression after the -g option above reads: take as a feature any string which looks like: "start of the line" (written ^) followed by "zero or more characters within the range a-z or A-Z" (written [a-zA-Z]*), followed by "the character q" (written q), followed by "zero or more characters within the range a-z or A-Z" (written [a-zA-Z]*). The second expression is nearly identical: "a single character which is not in the range a-zA-Z" (written [^a-zA-Z]), followed by "zero or more characters within the range a-z or A-Z" (can you guess?), followed by "the character q", followed by "zero or more characters within the range a-z or A-Z". The single quote marks are used to keep the whole expression together.

A regular expression is a simultaneous superposition of many text strings. Just like a word, you read and write it one character at a time.

Symbol          What it means
.               any character except newline
*               zero or more copies of preceding character or parenthesized expression
+               one or more copies of preceding character or parenthesized expression
?               zero or one copies of preceding character or parenthesized expression
^               beginning of line
$               end of line
a|b             a or b
[abc]           one character equal to a, b or c
[^abc]          one character not equal to a, b or c
\*, \?, or \.   the actual character *, ? or .

To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
 -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | head -10

This lists the first few matches, one per line, which exist in the sample2.txt document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model.
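
If you just want a quick sanity check of a candidate expression before learning anything, you can also preview what it matches with a generic tool such as GNU grep. This is only a rough approximation, since grep knows nothing about dbacl's capture conventions, so here we simply search for the core pattern without the surrounding parentheses and anchors:

% grep -Eo '[a-zA-Z]*q[a-zA-Z]*' sample2.txt | head -10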

Sometimes, it's convenient to use parentheses purely for grouping, without wanting their contents taken as features. dbacl understands the special notation ||xyz, which you can place at the end of a regular expression, where x, y, z should be digits identifying the parentheses whose contents you want to keep. Here is an example for mixed Japanese and English documents, which matches alphabetic words and single ideograms:

% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \
 -g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'

Note that you need a multilingual terminal and Japanese fonts to view this, and your computer must have a Japanese locale available.

In the table below, you will find a list of some simple regular expressions to get you started:

If you want to match...                     Then you need this expression...                                 Examples
alphabetic words                            (^|[^[:alpha:]])([[:alpha:]]+)||2                                hello, kitty
words in capitals                           (^|[^A-Z])([A-Z]+)||2                                            MAKE, MONEY, FAST
strings of characters separated by spaces   (^|[ ])([^ ]+)||2                                                w$%&tf9(, amazing!, :-)
time of day                                 (^|[^0-9])([0-9]?[0-9]:[0-9][0-9](am|pm))||2                     9:17am, 12:30pm
words which end in a number                 (^|[^a-zA-Z0-9])([a-zA-Z]+[0-9]+)[^a-zA-Z]||2                    borg17234, A1
alphanumeric word pairs                     (^|[^[:alnum:]])([[:alnum:]]+)[^[:alnum:]]+([[:alnum:]]+)||23    good morning, how are

The last entry in the table above shows how to take word pairs as features. Such models are called bigram models, as opposed to the unigram models whose features are only single words, and they are used to capture extra information.

For example, in a unigram model the pair of words "well done" and "done well" have the same probability. A bigram model can learn that "well done" is more common in food related documents (provided this combination of words was actually found within the learning corpus).

However, there is a big statistical problem: because there exist many more meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain meaningful statistics. One way around this is a technique called smoothing, which predicts unseen bigrams from already seen unigrams. To obtain such a combined unigram/bigram alphabetic word model, type

% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \
 -g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt

If all you want are alphabetic bigrams, trigrams, etc, there is a special switch -w you can use. The command

% dbacl -l slick -w 2 sample1.txt

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can).

Obviously, all this typing is getting tedious, and you will eventually want to automate the learning stage in a shell script. Use regular expressions sparingly, as they can quickly degrade the performance (speed and memory) of dbacl. See Appendix A for ways around this.
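
For example, a learning run over several categories could be automated with a small script along these lines. The script name, the category names and the corpus directory layout are assumptions for illustration, not anything dbacl requires:

#!/bin/bash
# relearn.sh (hypothetical): rebuild one category per corpus directory.
# Each directory corpus/<name>/ is assumed to hold plain text files
# representative of the category <name>.
for cat in one two three; do
    dbacl -l "$cat" corpus/"$cat"/*.txt
done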

Evaluating the models

Now that you have a grasp of the variety of language models which dbacl can generate, the important question is what set of features should you use?

There is no easy answer to this problem. Intuitively, a larger set of features always seems preferable, since it takes more information into account. However, there is a tradeoff. Comparing more features requires extra memory, but much more importantly, too many features can overfit the data. This results in a model which is so good at predicting the learned documents that virtually no other documents are considered even remotely similar.

It is beyond the scope of this tutorial to describe the variety of statistical methods which can help decide what features are meaningful. However, to get a rough idea of the quality of the model, we can look at the cross entropy reported by dbacl.

The cross entropy is measured in bits and has the following meaning: If we use our probabilistic model to construct an optimal compression algorithm, then the cross entropy of a text string is the predicted number of bits which is needed on average, after compression, for each separate feature. This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself.

To compute the cross entropy of category one, type

% dbacl -c one sample1.txt -vn
cross_entropy 7.60 bits complexity 678

The cross entropy is the first value returned. The second value essentially measures how many features describe the document. Now suppose we try other models trained on the same document:

% dbacl -c slick sample1.txt -vn
cross_entropy 4.74 bits complexity 677
% dbacl -c smooth sample1.txt -vn
cross_entropy 5.27 bits complexity 603

According to these estimates, both bigram models fit sample1.txt better. This is easy to see for slick, since the complexity (essentially the number of features) is nearly the same as for one. But smooth looks at fewer features, and actually compresses them just slightly better in this case. Let's ask dbacl which category fits better:

% dbacl -c one -c slick -c smooth sample1.txt -v
smooth
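
To see why, multiply each cross entropy by its complexity to recover the total score for the whole document (this is the decomposition explained in Appendix B); the displayed figures are rounded, so the products are only approximate:

  one:    7.60 x 678 ≈ 5153 bits
  slick:  4.74 x 677 ≈ 3209 bits
  smooth: 5.27 x 603 ≈ 3178 bits

smooth needs the fewest bits overall for sample1.txt, so it comes out as the best fit, even though slick has the lower per-feature cross entropy.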

WARNING: dbacl doesn't yet cope well with widely different feature sets. Don't try to compare categories built on completely different feature specifications unless you fully understand the statistical implications.

Decision Theory

If you've read this far, then you probably intend to use dbacl to automatically classify text documents, and possibly execute certain actions depending on the outcome. The bad news is that dbacl isn't designed for this. The good news is that there is a companion program, bayesol, which is. To use it, you just need to learn some Bayesian Decision Theory.

We'll suppose that the document sample4.txt must be classified in one of the categories one, two and three. To make optimal decisions, you'll need three ingredients: a prior distribution, a set of conditional probabilities and a measure of risk. We'll get to these in turn.

The prior distribution is a set of weights, which you must choose yourself, representing your beliefs about the categories before you even look at sample4.txt. For example, you might know from experience that category one is twice as likely as two or three, which you would express as the weights one:2, two:1, three:1. If you have no idea what to choose, give each category an equal weight (one:1, two:1, three:1).

Next, we need conditional probabilities. This is what dbacl is for. Type

% dbacl -l three sample3.txt
% dbacl -c one -c two -c three sample4.txt -N
one  0.00% two 100.00% three  0.00%

As you can see, dbacl is 100% sure that sample4.txt resembles category two. Such extreme answers are typical of the kinds of models used by dbacl. In reality, the probabilities for one and three are very, very small, and the probability for two is very close to, but not equal to, 1. See Appendix B for a rough explanation.

We combine the prior (which represents your own beliefs and experience) with the conditionals (which represent what dbacl thinks about sample4.txt) to obtain a set of posterior probabilities: the posterior weight of each category is proportional to its prior weight multiplied by its conditional probability.
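
In our example, with the prior one:2, two:1, three:1 and the conditional probabilities reported above, the unnormalised posterior weights are roughly

  one:    2 x 0.00 ≈ 0
  two:    1 x 1.00 ≈ 1
  three:  1 x 0.00 ≈ 0

so after normalisation the posterior is still approximately 0% for one, 100% for two and 0% for three (remember the conditionals are not exactly 0 and 1, so neither are the posteriors).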

Okay, so here the prior doesn't have much of an effect. But it's there if you need it.

Now comes the tedious part. What you really want to do is take this posterior distribution under advisement, and make an informed decision.

To decide which category best suits your own plans, you need to work out the costs of misclassification. Only you can decide these numbers, and there is one for every possible mistake. For example, you might decide that mistaking a document of category one for a two costs you 1, mistaking a one for a three costs 2, mistaking a two for a one costs 3, mistaking a two for a three costs 5, and mistaking a three for either of the others costs 1, while correct classifications cost nothing. Once you have covered every case, you have worked out your risk.

These numbers are often placed in a table called the loss matrix (this way, you can't forget a case), like so:

                   misclassified as
correct category   one   two   three
one                0     1     2
two                3     0     5
three              1     1     0

We are now ready to combine all these numbers to obtain the True Bayesian Decision. For every possible category, we simply weigh the risk with the posterior probabilities of obtaining each of the possible misclassifications. Then we choose the category with least expected posterior risk.
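
In our toy example, the posterior is approximately 0% for one, 100% for two and 0% for three, so the expected risks work out as roughly

  choosing one:    0.00 x 0 + 1.00 x 3 + 0.00 x 1 = 3
  choosing two:    0.00 x 1 + 1.00 x 0 + 0.00 x 1 = 0
  choosing three:  0.00 x 2 + 1.00 x 5 + 0.00 x 0 = 5

where each sum runs over the true categories one, two, three, with the costs taken from the corresponding column of the loss matrix.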

The lowest expected risk is for category two, so that's the category we choose to represent sample4.txt. Done!

Of course, the loss matrix above doesn't really have an effect on the probability calculations, because the conditional probabilities strongly point to category two anyway. But now you understand how the calculation works. Below, we'll look at a more realistic example (but still specially chosen to illustrate some points).

One last point: you may wonder how dbacl itself decides which category to display when classifying with the -v switch. The simple answer is that dbacl always displays the category with maximal conditional probability (often called the MAP estimate). This is mathematically completely equivalent to the special case of decision theory when the prior has equal weights, and the loss matrix takes the value 1 everywhere, except on the diagonal (ie correct classifications have no cost, everything else costs 1).

Using bayesol

bayesol is a companion program for dbacl which makes the decision calculations easier. The bad news is that you still have to write down a prior and loss matrix yourself. Eventually, someone, somewhere may write a graphical interface.

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of misclassifications. For the toy example discussed earlier, the file toy.risk looks like this:

categories {
    one, two, three
}
prior {
    2, 1, 1
}
loss_matrix {
"" one   [ 0, 1, 2 ]
"" two   [ 3, 0, 5 ]
"" three [ 1, 1, 0 ]
}

Let's see if our hand calculation was correct:

% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v
two

Good! However, as discussed above, the misclassification costs need improvement. This is completely up to you, but here are some possible suggestions to get you started.

To devise effective loss matrices, it pays to think about the way that dbacl computes the probabilities. Appendix B gives some details, but we don't need to go that far. Recall that the language models are based on features (which are usually kinds of words). Every feature counts towards the final probabilities, and a big document will have more features, hence more opportunities to steer the probabilities one way or another. So a feature is like an information bearing unit of text.

When we read a text document which doesn't accord with our expectations, we grow progressively more annoyed as we read further into the text. This is like an annoyance interest rate which compounds on information units within the text. For dbacl, the number of information bearing units is reported as the complexity of the text. This suggests that the cost of reading a misclassified document could have the form (1 + interest)^complexity. Here's an example loss matrix which uses this idea

loss_matrix { 
"" one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"" two   [(1.1)^complexity, 0,                 (1.7)^complexity ] 
"" three [(1.5)^complexity, (1.01)^complexity, 0 ]
} 
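
To get a feeling for these numbers, note that with a complexity of around 500 (the order of magnitude of the sample documents used here), (1.01)^500 is roughly 145, while (1.1)^500 is an astronomically large number, so even small differences between the rates translate into very strong preferences.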

Remember, these aren't monetary interest rates, they are value judgements. You can see this loss matrix in action by typing

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
three

Now if we increase the cost of misclassifying two as three from 1.7 to 2.0, the optimal category becomes

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v
two

bayesol can also handle infinite costs. Just write "inf" where you need it. This is particularly useful with regular expressions. If you look at each row of loss_matrix above, you see an empty string "" before each category. This indicates that this row is to be used by default in the actual loss matrix. But sometimes, the losses can depend on seeing a particular string in the document we want to classify.

Suppose you normally like to use the loss matrix above, but in case the document contains the word "Polly", then the cost of misclassification is infinite. Here is an updated loss_matrix:

loss_matrix { 
""          one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"Polly"     two   [ inf,             0,                 inf ]
""          two   [(1.1)^complexity, 0,                 (2.0)^complexity ] 
""          three [(1.5)^complexity, (1.01)^complexity, 0 ]
}

bayesol looks in its input for the regular expression "Polly", and if it is found, then for misclassifications away from two, it uses the row with the infinite values, otherwise it uses the default row, which starts with "". If you have several rows with regular expressions for each category, bayesol always uses the first one from the top which matches within the input. You must always have at least a default row for every category.

The regular expression facility can also be used to perform more complicated document dependent loss calculations. Suppose you like to count the number of lines of the input document which start with the character '>', as a proportion of the total number of lines in the document. The following perl script transcribes its input and appends the calculated proportion.

#!/usr/bin/perl
# this is file prop.pl

$special = $normal = 0;
while(<STDIN>) {
    $special++ if /^ >/;   # dbacl -a prepends a space to each original line
    $normal++;
    print;
}
$prop = ($normal > 0) ? $special/$normal : 0;
print "proportion: $prop\n";

If we used this script, then we could take the output of dbacl, append the proportion of lines containing '>', and pass the result as input to bayesol. For example, the following line is included in the example3.risk specification

"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]

and through this, bayesol reads, if present, the line containing the proportion we calculated and takes this into account when it constructs the loss matrix. You can try this like so:

% dbacl -T email -c one -c two -c three sample6.txt -nav \
  | perl prop.pl | bayesol -c example3.risk -v

Note that in the loss_matrix specification above, $1 refers to the numerical value of the quantity inside the parentheses. Also, it is useful to remember that when using the -a switch, dbacl outputs all the original lines of the input document with an extra space in front of them. If another instance of dbacl needs to read this output again (e.g. in a pipeline), then the latter should be invoked with the -A switch.

Miscellaneous

Be careful when classifying very small strings. Except for the multinomial models (which include the default model), the dbacl calculations are optimized for large strings with more than 20 or 30 features. For small text lines, the complex models give only approximate scores. In those cases, stick with unigram models, which are always exact.

In the UNIX philosophy, programs are small and do one thing well. Following this philosophy, dbacl essentially only reads plain text documents. If you have non-textual documents (word, html, postscript) which you want to learn from, you will need to use specialized tools to first convert these into plain text. There are many free tools available for this.
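
For example, if you have the pdftotext utility installed (one of many free converters, and in no way part of dbacl), you could classify a hypothetical report.pdf with a pipeline like this, since dbacl happily reads its document from standard input:

% pdftotext report.pdf - | dbacl -c one -c two -v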

dbacl has limited support for reading mbox files (UNIX email) and can filter out html tags in a quick and dirty way, however this is only intended as a convenience, and should not be relied upon to be fully accurate.

Appendix A: memory requirements

When experimenting with complicated models, dbacl will quickly fill up its hash tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, which is enough for 32,000 unique features and produces a 512K category file on my system. You can use the -h switch to select hash table size, in powers of two. Beware that learning takes much more memory than classifying. Use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying but 17 bytes for learning.
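
As a rough worked example of these figures, a hash table size of 20 holds up to 2^20 = 1,048,576 unique features, so a learning run that fills it could need on the order of 1,048,576 x 17 ≈ 17MB of hash table memory on that system, while classifying against the resulting category would need only about 1,048,576 x 6 ≈ 6MB.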

For testing, I use the collected works of Mark Twain, which is a 19MB pure text file. Timings are on a 500Mhz Pentium III.

Command                                               Unique features   Category size   Learning time
dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16  49,251            512K            0m9.240s
dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20  909,400           6.1M            1m1.100s
dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22  3,151,718         24M             3m42.240s

As can be seen from this table, including bigrams and trigrams has a noticeable memory and performance effect during learning. Luckily, classification speed is only affected by the number of features found in the unknown document.

Command                                     Features                         Classification time
dbacl -c twain1 Twain-Collected_Works.txt   unigrams                         0m4.860s
dbacl -c twain2 Twain-Collected_Works.txt   unigrams and bigrams             0m8.930s
dbacl -c twain3 Twain-Collected_Works.txt   unigrams, bigrams and trigrams   0m12.750s

The heavy memory requirements during learning of complicated models can be reduced at the expense of the model itself. dbacl has a feature decimation switch which slows down the hash table filling rate by simply ignoring many of the features found in the input.

Appendix B: Extreme probabilities

Why are the probabilities reported by dbacl always so extreme?

% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two  100.00% three  0.00%

The reason for this has to do with the type of model which dbacl uses. Let's look at some scores:

% dbacl -c one -c two -c three sample4.txt -n
one 10512.00 two 7908.17 three 10420.36
% dbacl -c one -c two -c three sample4.txt -nv
one 20.25 * 519 two 15.24 * 519 three 20.08 * 519

The first set of numbers are minus the logarithm (base 2) of each category's probability of producing the full document sample4.txt. This represents the evidence away from each category, and is measured in bits. one and three are fairly even, but two has by far the lowest score and hence highest probability (in other words, the model for two is the least bad at predicting sample4.txt, so if there are only three possible choices, it's the best). To understand these numbers, it's best to split each of them up into a product of cross entropy and complexity, as is done in the second line.
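
You can check this decomposition directly; the displayed values are rounded, so the products only match approximately:

  one:    20.25 x 519 ≈ 10510   (reported 10512.00)
  two:    15.24 x 519 ≈  7910   (reported  7908.17)
  three:  20.08 x 519 ≈ 10422   (reported 10420.36)

The gap between two and its nearest rival is about 10420 - 7908 ≈ 2512 bits, so one and three come out roughly 2^2512 times less probable than two, which is why the -N switch can only display 0% and 100%.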

Remember that dbacl calculates probabilities about resemblance by weighing the evidence for all the features found in the input document. There are 519 features in sample4.txt, and each feature contributes on average 20.25 bits of evidence against category one, 15.24 bits against category two and 20.08 bits against category three. Let's look at what happens if we only look at the first 25 lines of sample4.txt:

% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 19.01 * 324 two 14.61 * 324 three 18.94 * 324

There are fewer features in the first 25 lines of sample4.txt than in the full text file, but the picture is substantially unchanged.

% head -25 sample4.txt | dbacl -c one -c two -c three -N
one  0.00% two 100.00% three  0.00%

dbacl is still very sure, because it has looked at many features (324) and found small differences which add up to quite different scores. However, you can see that each feature now contributes less information (19.01, 14.61, 18.94 bits) compared to earlier (20.25, 15.24, 20.08 bits).

Since category two is obviously the best (closest to zero) choice among the three models, let's drop it for a moment and consider the other two categories. We also reduce dramatically the number of features (words) we shall look at. The first line of sample4.txt has 15 words:

% head -1 sample4.txt | dbacl -c one -c three -N
one 13.90% three 86.10%

Finally, we are getting probabilities we can understand! Unfortunately, this is somewhat misleading. Each of the 15 words gave a score and these scores were added for each category. Since both categories here are about equally bad at predicting words in sample4.txt, the difference in the final scores for category one and three amounts to less than 3 bits of information, which is why the probabilities are mixed:

% head -1 sample4.txt | dbacl -c one -c three -nv
one 15.47 * 15 three 15.30 * 15
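
Here the totals are 15.47 x 15 ≈ 232.1 bits and 15.30 x 15 ≈ 229.5 bits, a difference of about 2.6 bits, so three is only about 2^2.6 ≈ 6 times more probable than one, which matches the 86% versus 14% split reported above.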

So the interpretation of the probabilities is clear. dbacl weighs the evidence from each feature it finds, and reports the best fit among the choices it is offered. Because it sees so many features separately, it believes its verdict is very sure. Of course, whether these features are the right ones to look at for best classification is another matter entirely, and that is up to you to decide.