Comparing dbacl with other classifiers

Note: this tutorial shows one way of comparing classifiers using the cross validation method. A different approach, which compares spam filters more realistically, has been pioneered at the TREC 2005 conference. The instructions for using dbacl with the TREC spamjig are here.

The key to comparing dbacl(1) with other classifiers is the mailcross(1) testsuite command.

Simply put, this command allows you to compare the error rates of several classifiers on a common set of training documents. The rates you obtain are of course only estimates, and likely to vary somewhat depending on the actual sample emails you use. Thus it is possible for one classifier to perform better than another with one set of documents, while performing worse with a different set.

Unfortunately, there does not exist a truly representative set of email documents for everyone on the planet. Moreover, one person's email characteristics vary slowly over time. Consequently, it makes little sense to compare the performance of different classifiers on different sets of documents. Instead, the task of choosing the best classifier for yourself can only be done reliably by referencing your own email, and by comparing classifiers on exactly the same emails.

The mailcross(1) testsuite must be given a set of categories, with sample emails from each category in mbox format. After selecting all the classifiers to be compared, it remains only to leave the script running overnight. The summary is usually inspected the next morning.

The method used to estimate classification errors is standard cross validation. The training emails are split into a number of roughly equal-sized subsets, all of which, except one, are used for learning. The remaining, held-out subset is then classified. Finally, the percentage of errors is calculated for each category by averaging the results over all possible choices of the held-out subset.
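
The splitting step can be sketched in a few lines of shell. This is only a stand-in illustration using numbered message IDs; mailcross(1) does the real splitting on actual messages under its own working directory.

```shell
# Sketch of the splitting step: divide 100 stand-in message IDs into
# 10 roughly equal folds. (illustrative data only)
dir=$(mktemp -d)
seq 1 100 > "$dir/ids"
split -l 10 "$dir/ids" "$dir/fold."
ls "$dir"/fold.* | wc -l    # 10 folds of 10 IDs each
```

Each fold in turn plays the role of the held-out prediction subset while the other nine are learned, and the reported error rate is the average over the ten runs.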

Note that this is neither the only way to estimate prediction errors, nor even accepted as a good way by all academics. However, it's independent of the classifier, widely used around the world, and easy to program.

You can cross validate as many categories as you like, provided the classifiers all support multiple categories. For example, you could compare dbacl(1) and ifile on many categories.

However, most email junk filters can only cope with two categories, representing junk mail and regular mail. When comparing such classifiers, bogofilter for example, the mailcross(1) testsuite is hard-coded to work with two categories named spam and notspam. You must use these category names, or the results will not make sense.

An Example: Preparations

Before you can run the mailcross(1) testsuite, you need a set of sample emails for each category. Here, we shall test on two categories, named spam and notspam.

Take a moment to sift through your local mail folders for sample emails.

The instructions below assume you have two Unix mbox format files named $HOME/sample_spam.mbox and $HOME/sample_notspam.mbox, containing junk email and ordinary email respectively. These will not be modified in any way during the test.
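
It is worth verifying that your files really are in Unix mbox format before starting. In an mbox, every message begins with a separator line starting with "From " (note the trailing space), so counting those lines counts the messages. A throwaway illustration:

```shell
# Build a tiny two-message mbox and count its messages.
f=$(mktemp)
printf 'From a@example.com Mon Jan  1 00:00:00 2001\nSubject: one\n\nbody\n' >  "$f"
printf 'From b@example.com Mon Jan  1 00:00:00 2001\nSubject: two\n\nbody\n' >> "$f"
grep -c '^From ' "$f"    # prints 2
```

On your real files, `grep -c '^From ' $HOME/sample_spam.mbox` gives a quick message count.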

Fill these folders with as many messages as you can. While this will lengthen the time it takes for the cross validation to complete, it also gives more accurate results. You should expect the tests to run overnight anyway.

If your emails aren't in mbox format, you must convert them. For example, if $HOME/myspam is a directory containing your emails, one file per email, you can type:

% find $HOME/myspam -type f | while read f; \
do formail <$f; done > $HOME/sample_spam.mbox

Alternatively, if you don't have many emails for testing, you can download samples from a public corpus. For example, SpamAssassin maintains suitable sets of messages at http://spamassassin.org/publiccorpus/. Be kind to their server!

The SpamAssassin corpus doesn't come in mbox format. Here is how to obtain usable files. First, download a compressed message archive; for example, the file 20021010_hard_ham.tar.bz2 contains a selection of nonjunk messages. Type

% tar xfj 20021010_hard_ham.tar.bz2

which will extract the files into a directory named hard_ham. If you inspect the directory by typing

% ls hard_ham

you will see many files with names like 0053.ccd1056dc3ff533d76a97044bac52087. These are all individual messages. Watch out for files with out-of-the-ordinary names: some archives contain a file named cmds which is NOT a mail message. Delete all such files before proceeding. Next, type:

% find hard_ham -type f | while read f; \
      do formail <$f; done > $HOME/sample_notspam.mbox
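
The check for stray non-message files such as cmds can be scripted. One simple heuristic, sketched here on a throwaway demo directory, is to flag any file that lacks a "From:" header line; on your system you would point the loop at the hard_ham directory instead.

```shell
# Flag files that don't look like mail messages (no "From:" header).
d=$(mktemp -d)
printf 'From: someone@example.com\nSubject: ok\n\nbody\n' > "$d/0001.msg"
printf 'mv 00001 spam\n' > "$d/cmds"    # a command file, not a message
for f in "$d"/*; do
    grep -q '^From:' "$f" || echo "suspect: $f"
done    # flags only the cmds file
```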

You can repeat this command for as many archives as needed, but remember to change the destination mbox name, as it will get overwritten otherwise.
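
Alternatively, you can collect several archives into a single mbox by appending with the shell's `>>` redirection: `>` truncates the destination on every run, while `>>` adds to it. A minimal illustration with stand-in content:

```shell
# '>' truncates the destination each run; '>>' appends to it.
out=$(mktemp)
echo "messages from archive one" >  "$out"    # first archive
echo "messages from archive two" >> "$out"    # subsequent archives
wc -l < "$out"    # prints 2
```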

An Example: Running the Tests

Before you cross validate, make sure you have ample disk space available. As a rough rule, expect to require up to 20 times the combined size of your $HOME/sample_*.mbox files if you do the following.

% mailcross prepare 10
% mailcross add spam $HOME/sample_spam.mbox
% mailcross add notspam $HOME/sample_notspam.mbox

Note that if you have several mbox files containing spam, you can repeat the add spam command with each mbox file. All this command does is merge the contents of the mbox file into a specially created directory named mailcross.d. Once this is done, you no longer need the original *.mbox files, at least for cross validation purposes.
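
A quick back-of-the-envelope check of the 20-times rule of thumb mentioned above (the file sizes here are fabricated for illustration; substitute your own mbox files):

```shell
# Estimate the disk budget as 20x the combined mbox size.
d=$(mktemp -d)
dd if=/dev/zero of="$d/sample_spam.mbox"    bs=1024 count=100 2>/dev/null
dd if=/dev/zero of="$d/sample_notspam.mbox" bs=1024 count=300 2>/dev/null
total_kb=$(( $(cat "$d"/sample_*.mbox | wc -c) / 1024 ))
echo "budget roughly $((total_kb * 20)) KB for mailcross.d"    # 8000 KB here
```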

You are now ready to select the classifiers you wish to compare. Type

% mailcross testsuite list
The following classification wrappers are selectable:

annoyance-filter - Annoyance Filter 1.0b with prune
antispam - AntiSpam 1.1 with default options
bmf - bmf 0.9.4 with default options
bogofilterA - bogofilter 0.15.7 with Robinson-Fischer algorithm
bogofilterB - bogofilter 0.15.7 with Graham algorithm
bogofilterC - bogofilter 0.15.7 with Robinson algorithm
crm114A - crm114 20031129-RC11 with default settings
crm114B - crm114 20031129-RC11 with Jaakko Hyvatti's normalizemime
dbaclA - dbacl 1.6 with alpha tokens
dbaclB - dbacl 1.6 with cef,headers,alt,links
dbaclC - dbacl 1.6 with alpha tokens and risk matrix
ifile - ifile 1.3.3 with to,from,subject headers and default tokens
popfile - POPFile (unavailable?) with default options
qsf - qsf 0.9.4 with default options
spamassassin - SpamAssassin 2.60 (Bayes module) with default settings
spambayes - SpamBayes x with default settings
spamoracle - SpamOracle x with default settings
spamprobe - SpamProbe v0.9e with default options

The exact list you see depends on the classifiers installed on your system. If a classifier is marked unavailable, you must first download and install it somewhere in your path. Once this is done, select the classifiers you are going to test, for example:

% mailcross testsuite select dbaclB bogofilterA annoyance-filter

Note that some of these only work with the two categories spam and notspam. You can see the state of the testsuite by typing:

% mailcross testsuite status
The following categories are to be cross validated:

notspam.mbox - counting...    2500 messages
spam.mbox - counting...     500 messages

Cross validation is performed on each of these classifiers:

annoyance-filter - Annoyance Filter 1.0b with prune
bogofilterA - bogofilter 0.15.7 with Robinson algorithm
dbaclB - dbacl 1.5 with cef,headers,alt,links

Finally, to start the test, type

% mailcross testsuite run

The cross validation may take a long time, depending on the classifiers and the number of messages. You can check progress by keeping an eye on the log files in the mailcross.d/log/ directory.

An Example: Viewing The Results

Once the cross validation test has completed, you can see the results as follows:

% mailcross testsuite summarize

Each selected classifier is scored in two complementary ways.

The first question asked is Where do misclassifications go?, which shows roughly how good the predictions are from an objective standpoint.

The percentage of notspam messages predicted as spam is often called the false positive rate, and the percentage of spam messages predicted as notspam the false negative rate. However, this terminology is not standardized and can be confusing (which error counts as "positive" depends on the purpose of the test), so it won't be used here.

The second question asked is What is really in each category after prediction?, which is really a dual form of the previous question.

Normally, the purpose of mail classification is to separate your messages so that you save time. Here you can see how "clean" your mailboxes would be after classification.
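
The two views are arithmetically related. Using the message counts from the status output earlier (2500 notspam, 500 spam) and a hypothetical 47 misclassified spam messages, the calculation goes as follows:

```shell
# Relating the two tables: error rate vs. mailbox purity.
# Counts taken from the status output; 47 errors is hypothetical.
spam=500; notspam=2500; spam_as_notspam=47
awk -v e="$spam_as_notspam" -v n="$spam" \
    'BEGIN { printf "spam predicted as notspam: %.2f%%\n", 100*e/n }'
awk -v e="$spam_as_notspam" -v n="$notspam" \
    'BEGIN { printf "notspam box contains %.2f%% spam\n", 100*e/(n+e) }'
```

The first figure (9.40%) answers "where do misclassifications go?"; the second (1.85%) answers "what is really in each category?", because the 47 stray spams now sit among the 2500 genuine notspam messages.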

Here is a sample summary output by the mailcross(1) testsuite. Remember that results such as these make no sense unless you reproduce the tests on your own emails. You have no idea what emails were used to obtain these results, and I am not going to tell you.

Annoyance Filter 1.0b with prune
Fri Nov 14 11:26:58 EST 2003
Where do misclassifications go?

  true     | but predicted as...
    *      |    notspam      spam
notspam    |    100.00%     0.00%
spam       |      9.40%    90.60%

What is really in each category after prediction?

category   | contains mixture of...
    *      |    notspam      spam
notspam    |     98.15%     1.85%
spam       |      0.00%   100.00%

bogofilter 0.15.7 with Robinson algorithm
Fri Nov 14 11:30:25 EST 2003
Where do misclassifications go?

  true     | but predicted as...
    *      |    notspam      spam
notspam    |    100.00%     0.00%
spam       |      8.40%    91.60%

What is really in each category after prediction?

category   | contains mixture of...
    *      |    notspam      spam
notspam    |     98.35%     1.65%
spam       |      0.00%   100.00%

dbacl 1.5 with cef,headers,alt,links
Fri Nov 14 11:33:33 EST 2003
Where do misclassifications go?

  true     | but predicted as...
    *      |    notspam      spam
notspam    |    100.00%     0.00%
spam       |      5.80%    94.20%

What is really in each category after prediction?

category   | contains mixture of...
    *      |    notspam      spam
notspam    |     98.85%     1.15%
spam       |      0.00%   100.00%