How to classify junk email with Bayesian statistics

(This is a short version of a longer tutorial on dbacl.)

In UNIX, incoming mail for each user can be processed with procmail(1) before it reaches the user's inbox. This can be used, for example, to filter out junk email messages and redirect them into a special spam folder. To identify junk messages, conventional filters rely on recognizing known trigger words (or variations of them) in the email. This leads to a race to keep the word lists up to date as junk email changes over time.

With Bayesian methods, the user's old email archives are analyzed statistically, and incoming messages are tested by comparison with these statistics. As the user's email archives grow, the Bayesian method learns to recognize incoming messages specific to the user ever more accurately, without relying on arbitrarily chosen word lists. The race is broken.

As an added bonus, Bayesian classification works for any number of categories, so besides a junk email category one could test for work email, personal email, email viruses, etc. However, note that each category needs a corpus of representative documents from which to build the statistics.
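As a generic illustration of the idea (not necessarily the exact model dbacl(1) uses internally), Bayes' rule for a message d and categories c_1, ..., c_K can be written as follows. The symbols d, c_k and K are introduced here for illustration only; P(d | c_k) is estimated from the corpus of category c_k, and P(c_k) is the prior weight given to that category.

```latex
P(c_k \mid d) = \frac{P(d \mid c_k)\, P(c_k)}{\sum_{j=1}^{K} P(d \mid c_j)\, P(c_j)},
\qquad
\hat{c} = \arg\max_{k} P(c_k \mid d)
```

The message is assigned to the category with the largest posterior probability. Nothing in the formula is specific to two categories, which is why adding a work or personal category only requires another corpus.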

The following instructions show how to build a simple Bayesian spam classifier using procmail(1). The method relies on dbacl(1), which can be downloaded here. A more in-depth discussion worth reading is here.

If you haven't done so already, download and install dbacl(1). You'll also need procmail(1). Check that it is installed with the following command:

% which procmail

You need a collection of old email messages, split into spam and notspam folders. We'll assume both reside in a directory $HOME/mail. To have dbacl(1) learn each category, placing the result in a hidden directory $HOME/.dbacl, type:

% mkdir $HOME/.dbacl
% dbacl -T email -l $HOME/.dbacl/spam $HOME/mail/spam
% dbacl -T email -l $HOME/.dbacl/notspam $HOME/mail/notspam
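Once both categories have been learned, a single saved message can be classified by hand as a quick check. The sketch below is written for a POSIX sh-compatible shell and assumes the -c option (classify against a learned category) and the -v option (print the name of the best category) described in the dbacl(1) manual. The function name classify_message is hypothetical.

```shell
# Hypothetical helper: print the most likely category for one message file.
# Assumes dbacl's -c option selects a learned category and that -v prints
# the name of the best-matching category.
classify_message() {
    dbacl -T email -c "$HOME/.dbacl/spam" -c "$HOME/.dbacl/notspam" -v "$1"
}
```

For example, `classify_message $HOME/sample_message.txt` (a hypothetical file holding one email) should print the name of one of the two categories.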

The last two commands should be repeated once a day or more often, either manually, through a shell script, or automatically by placing them in your crontab(1) file. For the crontab option, first save your current crontab by typing at the prompt:

% crontab -l > existing_crontab.txt

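Next, edit the file existing_crontab.txt with your favourite editor and add the following two lines at the end. The paths are written out in full because many cron implementations do not expand variables such as $HOME inside crontab variable assignments:

5 0 * * * dbacl -T email -l $HOME/.dbacl/spam $HOME/mail/spam
10 0 * * * dbacl -T email -l $HOME/.dbacl/notspam $HOME/mail/notspam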

Now you can install the new crontab file by typing:

% crontab existing_crontab.txt
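For the shell script option mentioned earlier, a minimal sketch might look like this. It assumes a POSIX sh-compatible shell and the same folder layout as above; the function name retrain is hypothetical.

```shell
# Hypothetical helper: relearn both categories from the mail folders.
# Uses only the learning command shown earlier (dbacl -T email -l).
retrain() {
    cats="$HOME/.dbacl"      # where the learned categories are stored
    maildir="$HOME/mail"     # where the spam and notspam folders live
    mkdir -p "$cats" || return 1
    for category in spam notspam; do
        dbacl -T email -l "$cats/$category" "$maildir/$category" || return 1
    done
}
```

Calling `retrain` after moving misfiled messages to the correct folder brings both categories up to date without waiting for the next scheduled run.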

You must also edit or create a procmail(1) recipe. First, verify that the file $HOME/.forward exists and contains the single line:

|/usr/bin/procmail
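If the file does not exist yet, it can be created with a small helper like the one below. This is a sketch for a POSIX sh-compatible shell; the function name ensure_forward is hypothetical, and the default path /usr/bin/procmail is an assumption that should be replaced by whatever `which procmail` reports on your system.

```shell
# Hypothetical helper: create $HOME/.forward only if it does not exist yet.
# The optional first argument is the path to procmail; the default
# /usr/bin/procmail is an assumption about the local system.
ensure_forward() {
    procmail_path="${1:-/usr/bin/procmail}"
    if [ -e "$HOME/.forward" ]; then
        # Never overwrite an existing file; show it so it can be checked.
        echo "existing .forward left unchanged:"
        cat "$HOME/.forward"
    else
        printf '|%s\n' "$procmail_path" > "$HOME/.forward"
    fi
}
```

An existing .forward is left alone on purpose, since it may already forward mail somewhere else.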

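Next, create the file $HOME/.procmailrc containing a recipe that passes each incoming message through dbacl(1) and delivers it to the folder named after the winning category. You may need to edit the paths to reflect your system. If you already have a file $HOME/.procmailrc, then the appropriate modifications are your responsibility.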
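A minimal sketch of such a .procmailrc is given below; it is not necessarily the author's original recipe. It assumes that the mail folders live in $HOME/mail, that dbacl is found on the PATH shown, and that dbacl -v reads the message from standard input and prints the bare category name (spam or notspam). The variable name CATEGORY is hypothetical.

```procmail
PATH=/bin:/usr/bin:/usr/local/bin
MAILDIR=$HOME/mail
DEFAULT=$MAILDIR/inbox

# Pipe the message through dbacl and capture the printed category name.
# Assigning the output to a variable makes this a non-delivering recipe.
:0
CATEGORY=| dbacl -T email -c $HOME/.dbacl/spam -c $HOME/.dbacl/notspam -v

# If dbacl printed a category, deliver to the folder of the same name.
:0:
* ? test -n "$CATEGORY"
$MAILDIR/$CATEGORY

# Anything not delivered above (for example if dbacl failed and CATEGORY
# is empty) falls through to $DEFAULT, the inbox folder.
```

The fall-through to $DEFAULT is standard procmail behaviour at the end of the rcfile, so no explicit final recipe is needed.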

Your incoming mail will now go directly to the spam and notspam folders in the $HOME/mail directory. Moreover, in case of a technical problem, email will be collected in the folder inbox. Make sure your mailreader knows how to find these.

If over time you see an email that has been classified into the wrong folder, you must move it to the correct folder; otherwise the wrong statistics will be learned the next time the categories are rebuilt.

If you want to stop classifying your email for some reason, just deleting the two files $HOME/.forward and $HOME/.procmailrc should be enough.
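As a more cautious alternative to deleting, the two files can be renamed so the setup is easy to restore later. This is a sketch for a POSIX sh-compatible shell; the function name disable_filter and the .disabled suffix are hypothetical choices.

```shell
# Hypothetical helper: switch the filter off by renaming, not deleting,
# the two files that route mail through procmail.
disable_filter() {
    for f in "$HOME/.forward" "$HOME/.procmailrc"; do
        if [ -e "$f" ]; then
            mv "$f" "$f.disabled"
        fi
    done
}
```

Renaming the files back to their original names re-enables classification.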

Related software: The classification engine in dbacl(1) uses the Maximum Entropy principle. At the time of writing, there are several other free Bayesian junk email filters. bogofilter, bmf, and annoyance filter appear to be based upon an influential essay by Paul Graham, which describes an ad-hoc implementation of the "Naive Bayes" assumption of Learning Theory. Two statistically correct implementations are ifile and POPFile. For an altogether different statistical approach, see SpamBayes.