sa-harvest

Summary

sa-harvest is a configurable script for training SpamAssassin. It combines the process from How To Train SpamAssassin with automatic generation of the whitelist based on the contents of your ham inboxes and your outbox.

The goal is to let you type one simple command (sa-harvest) rather than a series of complex commands with varying flags.

Caveats

This works for me. I have not tested it on anyone else’s mail yet.
I’m putting it out there so harebrained lunatics who don’t mind catching on fire in order to test the latest and greatest way to save time and keystrokes will play with it and give me feedback. If you don’t fit into that category this isn’t for you.
Right now it only works with Maildir, mbox and mbx formats. If you don’t know what this means, sa-harvest is not for you yet.
Support for mbox and mbx formats has not been tested on any serious level. Each requires a little extra configuration. If you wish to test this, please review the script to understand exactly what it does, and expect some sharp edges.
This will overwrite your user_prefs with abandon. See the setup section for how to keep your preferences.
It doesn’t work with mailboxes with spaces in their names (does anything else?)

Details

In your ~/.spamassassin directory you will create 6 new files:

user_prefs.base : everything in the user_prefs except whitelist_from entries – these are auto-generated by the script by using the addressbook files and your history of recent mail

addressbook : a list of any addresses you want to let through

addressbook.negative : patterns of addresses you want sacked, e.g. your own email address, because you don’t want to give a free pass to any spammer who knew to forge your address as the from address. This may be overly greedy – it’s a substring match, so an entry of paypal.com also blocks any address that merely contains paypal.com.

mail.spam : list of paths - relative to your home dir - to mailboxes you consider spam, one mailbox per line, e.g.

   Maildir/.Spam/cur

mail.ham : similarly, a list of paths to mailboxes you consider ham, e.g.

   Maildir/cur
   Maildir/.family/cur

mail.sent : mailboxes you consider sent mail, e.g. Maildir/.Sent/cur
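
The greedy substring matching described for addressbook.negative can be illustrated with a short Python sketch. This is not code from sa-harvest: the function name is_blocked and the sample addresses are made up for illustration, and the script's real matching may differ in detail (for example in how it treats case).

```python
def is_blocked(address, negative_patterns):
    """Return True if any non-empty pattern occurs anywhere in the address.

    Assumption: plain, case-insensitive substring matching, as a simplified
    stand-in for whatever sa-harvest actually does.
    """
    address = address.lower()
    patterns = [p.strip().lower() for p in negative_patterns]
    # Empty lines are skipped so that a blank entry does not block everything.
    return any(p and p in address for p in patterns)


if __name__ == "__main__":
    # Hypothetical contents of addressbook.negative.
    negative = ["paypal.com", "me@example.org"]
    for candidate in ["service@paypal.com", "friend@notpaypal.com", "alice@example.net"]:
        print(candidate, "blocked" if is_blocked(candidate, negative) else "kept")
```

The second address shows why the match is greedy: it is blocked only because it happens to contain the pattern.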

What It Does

builds a list of ham boxes, spam boxes, and sent boxes.
runs the learner for ham on the ham boxes, and spam on the spam boxes
builds a working list of “good” addresses based on from addresses in your ham boxes, the first person on the To: line in the sent boxes (this is imperfect – ideally it would grab everything in the To: header but right now it just reads the first line, which produces lousy results if you sent to more than one person, are forwarding mail, etc.), and previously cached “good” addresses.
filters that list to remove what’s in addressbook.negative
caches the clean results for future use
builds a new user_prefs by taking user_prefs.base and adding whitelist_from’s for all of the above
builds an inverted histogram
reports load statistics
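
The address-handling steps in the list above can be sketched in Python. This is an illustration under stated assumptions, not the script itself: the function names recipients and build_user_prefs are hypothetical, the learning, histogram and statistics steps are left out, and recipients shows the "ideal" behaviour of reading every address in the To: header rather than what sa-harvest currently does.

```python
from email import message_from_string
from email.utils import getaddresses


def recipients(raw_message):
    """Return every address found in the To: header(s) of one message."""
    msg = message_from_string(raw_message)
    # get_all returns each To: header as a string; getaddresses splits them.
    return [addr for _name, addr in getaddresses(msg.get_all("To", [])) if addr]


def build_user_prefs(base_text, ham_from, sent_to, cached, negative_patterns):
    """Return (new user_prefs text, sorted list of good addresses).

    Assumption: negative patterns are matched as plain substrings.
    """
    # Merge all three sources of candidate addresses, lower-cased.
    candidates = {
        a.strip().lower() for a in (*ham_from, *sent_to, *cached) if a.strip()
    }
    patterns = [p.strip().lower() for p in negative_patterns if p.strip()]
    # Drop anything that contains a negative pattern.
    good = sorted(a for a in candidates if not any(p in a for p in patterns))
    # user_prefs.base first, then one whitelist_from line per good address.
    lines = [base_text.rstrip("\n")] + [f"whitelist_from {a}" for a in good]
    return "\n".join(lines) + "\n", good
```

The returned list of good addresses is what would be cached for the next run; the returned text is what would be written over user_prefs.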

Setup

Back up your mail
cd ~/.spamassassin
cp user_prefs user_prefs.base
Using a text editor, remove all whitelist_from lines from user_prefs.base, and add all those addresses (just as addresses) to addressbook
Add anything you don’t want showing up to addressbook.negative. Keep in mind that this is a regular expression substring match. If you put a line with a lone * in this file then nothing will be added to user_prefs.
Populate mail.spam, mail.ham and mail.sent as outlined above
Download the script. Install it somewhere useful. Run it.
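
If your user_prefs already has many whitelist_from lines, the manual editing step above could be done with a small helper along these lines. This is only a sketch and not part of sa-harvest; the function name split_user_prefs is hypothetical, and it assumes each whitelist_from line holds one or more space-separated addresses.

```python
def split_user_prefs(prefs_text):
    """Split user_prefs contents into (base text, whitelisted addresses)."""
    base_lines, addresses = [], []
    for line in prefs_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "whitelist_from":
            # Keep only the addresses; they belong in addressbook.
            addresses.extend(fields[1:])
        else:
            # Everything else belongs in user_prefs.base, unchanged.
            base_lines.append(line)
    base_text = "\n".join(base_lines) + "\n" if base_lines else ""
    return base_text, addresses
```

You would write the first return value to user_prefs.base and the second, one address per line, to addressbook. Check both files by hand before the first run.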

Future Work

probable: make the addressbook rebuild optional
possible: subsets of folders so you don’t necessarily rescan everything every time
possible: some sort of setup system so you don’t have to read these docs
unlikely: some way to autogenerate the addressbook file out of popular addressbook formats

One Last Note

If you’re really feeling lucky you can set this up with cron. Note that it can be fairly processor intensive.

If you do that, you need to check your ham and spam when you log in so you quickly catch any mis-identification (and fix the resulting incorrect training). If you use Maildir then restricting your training to cur directories helps cut down on that problem, but it isn’t perfect, and mbox and mbx don’t have such an option.

Feedback

Please send me some: faisal@faisal.com.