How to classify junk email with Bayesian statistics
(This is a short version of a longer tutorial on
dbacl.)
In UNIX, incoming mail for each user
can be processed with procmail(1) before ending up in the user's inbox.
This can be used, for example, to filter out junk email messages and redirect them into
a special spam folder. To identify the junk messages, conventional
filters rely on (variations of)
recognizing known trigger words in the email. This leads to a race to keep
the word lists up to date as new variations of junk email appear over time.
With Bayesian methods, the user's old email archives are analyzed
statistically, and incoming messages are tested by comparison with
these statistics. As the user's email archives grow, the Bayesian method
learns to recognize incoming messages specific to the user
ever more accurately,
without relying on arbitrarily chosen word lists.
The race is broken.
As an added bonus, Bayesian classification works for any number of categories,
so besides a junk email category one could test for work email, personal email, email viruses, etc. However, note that
each category needs a corpus of representative
documents from which to build the statistics.
The following instructions show how to build a simple Bayesian spam
classifier
using procmail(1). The method relies on dbacl(1), which must be downloaded and installed separately. A more in-depth discussion is also available and worth reading.
If you haven't done so already, download and install dbacl(1). You'll also need procmail(1). Check that it is installed with the following command:
% which procmail
You need a collection of old email messages, split into spam and
notspam folders. We'll assume both reside in a directory $HOME/mail.
To have dbacl(1) learn each category, placing the result in a hidden directory
$HOME/.dbacl, type:
% mkdir $HOME/.dbacl
% dbacl -T email -l $HOME/.dbacl/spam $HOME/mail/spam
% dbacl -T email -l $HOME/.dbacl/notspam $HOME/mail/notspam
The last two commands should be repeated once a day or more frequently, either
manually, through a shell script, or automatically, by
placing them in your crontab(1) file. For the automatic route, first save your
existing crontab entries to a file. At the prompt, type
% crontab -l > existing_crontab.txt
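Next, edit the file existing_crontab.txt with your favourite editor and add the
following two lines at the end. The paths are spelled out in full in each command,
because cron does not expand variables such as $HOME inside a variable assignment
(a line like CATS=$HOME/.dbacl would be taken literally). If dbacl(1) is installed
outside cron's default search path, give its full path as well.
5 0 * * * dbacl -T email -l $HOME/.dbacl/spam $HOME/mail/spam
10 0 * * * dbacl -T email -l $HOME/.dbacl/notspam $HOME/mail/notspam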
Now you can install the new crontab file by typing
% crontab existing_crontab.txt
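If you prefer the manual or shell script route mentioned earlier, the two learning commands can be wrapped as follows. This is only a minimal sketch: the function name relearn and the variables MAILDIR, CATS and DBACL are made up for illustration, and the folder layout is the one assumed in this document.

```shell
#!/bin/sh
# Sketch: rebuild the spam and notspam categories in one go.
# MAILDIR, CATS and DBACL are illustrative names, not dbacl settings.
MAILDIR="${MAILDIR:-$HOME/mail}"   # where the spam and notspam folders live
CATS="${CATS:-$HOME/.dbacl}"       # where the learned categories are stored
DBACL="${DBACL:-dbacl}"            # command to run, overridable for testing

relearn() {
    mkdir -p "$CATS" || return 1
    for name in spam notspam; do
        if [ ! -r "$MAILDIR/$name" ]; then
            echo "skipping $name: cannot read $MAILDIR/$name" >&2
            continue
        fi
        # Same command as typed by hand above, once per category.
        "$DBACL" -T email -l "$CATS/$name" "$MAILDIR/$name" || return 1
    done
}

# Only learn when there is a mail directory to learn from.
if [ -d "$MAILDIR" ]; then
    relearn
fi
```

Saved under a name of your choosing and made executable with chmod +x, the script can be run by hand, or from a single crontab entry instead of two.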
You must also edit or create a procmail(1) recipe. First, verify that the file
$HOME/.forward exists and contains the single line:
|/usr/bin/procmail
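Next, create the file $HOME/.procmailrc and place in it a recipe that runs dbacl(1) on each incoming message and delivers the message to the folder named after the winning category. You may need to edit the paths to reflect your system.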
If you already have a file $HOME/.procmailrc, then the appropriate modifications are your responsibility.
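As a rough guide, a $HOME/.procmailrc along the following lines would give the behaviour described here. It is a sketch under assumptions, not necessarily the recipe originally published with this text: the variable name CATEGORY is made up, the paths follow the layout used above, and it assumes that dbacl(1), when run with -v and several -c options, prints just the bare name of the most likely category (spam or notspam).

```
# Sketch of $HOME/.procmailrc; adjust the paths for your system.
PATH=/bin:/usr/bin:/usr/local/bin
MAILDIR=$HOME/mail
DEFAULT=$MAILDIR/inbox

# Feed the message to dbacl and capture what it prints, assumed to be
# the bare category name, in the (made-up) variable CATEGORY.
:0
CATEGORY=| dbacl -vT email -c $HOME/.dbacl/spam -c $HOME/.dbacl/notspam

# If dbacl printed a name, deliver to the folder of that name.
:0:
* ? test -n "$CATEGORY"
$MAILDIR/$CATEGORY

# Anything not delivered above (for example, if dbacl failed and
# CATEGORY is empty) falls through to $DEFAULT, the inbox folder.
```

Because the capture recipe does not itself deliver the message, a failure in dbacl(1) leaves the mail to fall through to the default folder rather than being lost.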
Your incoming mail will now go directly to the spam and notspam folders in the $HOME/mail directory. Moreover, in case of a technical problem, email will be collected in the folder inbox. Make sure your mailreader knows how to find these.
If over time
you find an email that was classified into the wrong folder, you must move it
to the correct folder; otherwise the wrong statistics will be learned the next time the categories are rebuilt.
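To see how a single message would be classified without waiting for it to arrive, you can save it to a file and run dbacl(1) on it by hand. The following is a minimal sketch: the function name classify and the variables CATS and DBACL are made up for illustration, and it assumes that dbacl(1) with -v prints the name of the most likely category.

```shell
#!/bin/sh
# Sketch: print the category dbacl considers most likely for one message.
# CATS and DBACL are illustrative names, not dbacl settings.
CATS="${CATS:-$HOME/.dbacl}"   # where the learned categories are stored
DBACL="${DBACL:-dbacl}"        # command to run, overridable for testing

classify() {
    # -c names a category to compare against; -v asks dbacl to print
    # the name of the best category. The message is read from stdin.
    "$DBACL" -v -T email -c "$CATS/spam" -c "$CATS/notspam" < "$1"
}
```

For example, after loading these definitions into your shell, classify message.txt should print either spam or notspam for a message saved as message.txt.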
If you want to stop classifying your email for some reason, just deleting
the two files $HOME/.forward and $HOME/.procmailrc should be enough.
Related software:
The classification engine in dbacl(1) uses the Maximum Entropy principle.
At the time of writing, there are several other free
Bayesian junk email filters. bogofilter, bmf, and annoyance filter appear to
be based upon an influential essay by Paul Graham, which describes an ad-hoc implementation of the "Naive Bayes" assumption of Learning Theory. Two statistically correct
implementations are ifile
and POPFile.
For an altogether different statistical approach, see
SpamBayes.