sa-harvest
Summary
sa-harvest is a configurable script for training SpamAssassin. It combines the process from How To Train SpamAssassin with automatic generation of the whitelist based on the contents of your ham inboxes and your outbox.
The goal is to let you type one simple command (sa-harvest) rather than a series of complex commands with varying flags.
Caveats
- This works for me. I have not tested it on anyone else’s mail yet.
- I’m putting it out there so harebrained lunatics who don’t mind catching on fire in order to test the latest and greatest way to save time and keystrokes will play with it and give me feedback. If you don’t fit into that category this isn’t for you.
- Right now it only works with Maildir, mbox and mbx formats. If you don’t know what this means, sa-harvest is not for you yet.
- Support for mbox and mbx formats has not been tested on any serious level. Each requires a little extra configuration. If you wish to test this, please review the script to understand exactly what it does, and expect some sharp edges.
- This will overwrite your user_prefs with abandon. See the setup section for how to keep your preferences.
- It doesn’t work with mailboxes with spaces in their names (does anything else?)
Details
In your ~/.spamassassin directory you will create 6 new files:
user_prefs.base : everything in the user_prefs except whitelist_from entries – these are auto-generated by the script by using the addressbook files and your history of recent mail
addressbook : a list of any addresses you want to let through
addressbook.negative : patterns of addresses you want sacked, e.g. your own email address, because you don’t want to give a free pass to any spammer who knows to forge your address as the From address. This may be overly greedy – it’s a substring match, so a pattern such as paypal.com also excludes every address that merely contains paypal.com.
mail.spam : list of paths - relative to your home dir - to mailboxes you consider spam, one mailbox per line, e.g.
Maildir/.Spam/cur
mail.ham : similarly, a list of paths to mailboxes you consider ham, e.g.
Maildir/cur
Maildir/.family/cur
mail.sent : mailboxes you consider sent mail, e.g. Maildir/.Sent/cur
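As a rough illustration of the list-file format described above (one path per line, relative to your home directory), here is a minimal Python sketch that reads a file such as mail.ham and resolves each entry. The function name read_mailbox_list is hypothetical, and ignoring blank lines is an assumption; the script itself may parse these files differently.

```python
from pathlib import Path


def read_mailbox_list(list_file, home):
    """Return absolute mailbox paths from a list file (one relative path per line)."""
    boxes = []
    for line in Path(list_file).read_text().splitlines():
        entry = line.strip()
        if not entry:  # assumption: blank lines are ignored
            continue
        # Entries are relative to the home directory, e.g. Maildir/.Spam/cur
        boxes.append(Path(home) / entry)
    return boxes
```

For example, read_mailbox_list(Path.home() / ".spamassassin" / "mail.ham", Path.home()) would return the ham mailboxes as absolute paths.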
What It Does
- builds a list of ham boxes, spam boxes, and sent boxes.
- runs the learner for ham on the ham boxes, and spam on the spam boxes
- builds a working list of “good” addresses from three sources: the From addresses in your ham boxes, the first person on the To line in the sent boxes, and previously cached “good” addresses. (The sent-box step is imperfect – ideally it would grab everything in the To: header, but right now it just reads the first line, which produces lousy results if you sent to more than one person, are forwarding mail, etc.)
- filters that list to remove what’s in addressbook.negative
- caches the clean results for future use
- builds a new user_prefs by taking user_prefs.base and adding whitelist_from’s for all of the above
- builds an inverted histogram
- reports load statistics
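The address-handling steps in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's actual code: the helper names are hypothetical, the negative filter is treated as a plain case-sensitive substring test, and header_addresses reads every address in a header (the behaviour described above as the ideal) rather than only the first line.

```python
from email import message_from_string
from email.utils import getaddresses


def header_addresses(raw_message, header):
    """Return every address found in the named header, lowercased."""
    msg = message_from_string(raw_message)
    pairs = getaddresses(msg.get_all(header, []))
    return {addr.lower() for _name, addr in pairs if addr}


def filter_negative(addresses, negative_patterns):
    """Drop any address that contains one of the patterns as a substring.

    Assumption: plain substring matching, as with addressbook.negative.
    """
    return {
        a for a in addresses
        if not any(p in a for p in negative_patterns)
    }


def whitelist_lines(addresses):
    """Format addresses as sorted whitelist_from lines for user_prefs."""
    return [f"whitelist_from {a}" for a in sorted(addresses)]
```

The lines returned by whitelist_lines would then be appended to the contents of user_prefs.base to form the new user_prefs.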
Setup
- Back up your mail
cd ~/.spamassassin
cp user_prefs user_prefs.base
- Using a text editor, remove all whitelist_from lines from user_prefs.base, and add all those addresses (just as addresses) to addressbook
- Add anything you don’t want showing up to addressbook.negative. Keep in mind that this is a regular expression substring match. If you put a line with a lone * in this file then nothing will be added to user_prefs.
- Populate mail.spam, mail.ham and mail.sent as outlined above
- Download the script. Install it somewhere useful. Run it.
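If you would rather not do the whitelist_from split by hand, the step can be sketched in Python as below. This is a hypothetical helper, not part of sa-harvest: split_user_prefs is an invented name, and it assumes each whitelist_from line holds one or more whitespace-separated addresses. Other directives (including whitelist_from_rcvd) and commented-out lines are left in the base text untouched.

```python
def split_user_prefs(prefs_text):
    """Split user_prefs text into (base_text, addresses).

    base_text keeps every line except whitelist_from entries;
    addresses collects what those entries listed.
    """
    base_lines, addresses = [], []
    for line in prefs_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "whitelist_from":
            addresses.extend(fields[1:])  # a line may carry several addresses
        else:
            base_lines.append(line)
    base_text = "\n".join(base_lines)
    if base_lines:
        base_text += "\n"
    return base_text, addresses
```

You would write base_text to user_prefs.base and the addresses, one per line, to addressbook. Review the result before running the script, since it overwrites user_prefs.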
Future Work
- probable: make the addressbook rebuild optional
- possible: subsets of folders so you don’t necessarily rescan everything every time
- possible: some sort of setup system so you don’t have to read these docs
- unlikely: some way to autogenerate the addressbook file out of popular addressbook formats
One Last Note
If you’re really feeling lucky you can set this up with cron. Note that it can be fairly processor intensive.
If you do that, you need to check your ham and spam when you log in so you quickly catch any mis-identification (and fix the resulting incorrect training). If you use Maildir then restricting your training to cur directories helps cut down on that problem, but it isn’t perfect, and mbox and mbx don’t have such an option.
Feedback
Please send me some: faisal@faisal.com.