wincent knowledge base

June 21, 2006

Clamping down on spam

I run SpamAssassin on my mail server (anti-spam). I also run ClamAV (anti-virus). After extensively trialing ClamAV I was convinced of its reliability and decided to have it automatically delete all detected incoming viruses. SpamAssassin still produces far too many false positives and false negatives for me to perform such deletion but I decided today that I wanted to tighten things up a little bit.

Step 1: Getting tough on non-English messages

A huge proportion of the spam I receive is in non-English languages. I've often wondered why SpamAssassin has allowed waves of incoming messages to spill into my inbox consisting entirely of unreadable (at least for me) hieroglyphics when at other times it marks legitimate messages from my friends as spam. Almost all of my legitimate correspondence is in either English or Spanish. Very, very occasionally I get a customer inquiry written in French or German.

So I am prepared to make SpamAssassin get tough on non-English (and non-Spanish) messages at the expense of risking a few very rare false positives. Such false positives are easily distinguished from spam messages with subjects like "-6月23-24日(深-圳)开-讲-" and "Автомобильные шины и диски! Низкие цены!", thanks to familiar words like "Synergy" in their subject lines.

So the first thing I did was add this to my ~/.spamassassin/user_prefs file:

# Mail using languages used in these country codes will not be marked
# as being possibly spam in a foreign language.
# - english spanish 
ok_languages            en es 

# Mail using locales used in these country codes will not be marked
# as being possibly spam in a foreign language.
# - en Western character sets in general
ok_locales              en
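
A minimal sketch of how such settings could be added from a script without duplicating them on repeated runs. The function name ensure_pref is hypothetical, and the only assumption about the file is that each setting sits on its own line starting with its keyword:

```shell
#!/bin/sh
# ensure_pref FILE KEY VALUE (hypothetical helper, not part of SpamAssassin)
# Append "KEY VALUE" to FILE unless some line already starts with KEY.
ensure_pref() {
    file=$1
    key=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # Match KEY at the start of a line, followed by whitespace.
    if ! grep -Eq "^${key}[[:space:]]" "$file"; then
        printf '%s\t%s\n' "$key" "$value" >> "$file"
    fi
}

# Example calls, left commented out so that nothing is modified by accident:
# ensure_pref "$HOME/.spamassassin/user_prefs" ok_languages "en es"
# ensure_pref "$HOME/.spamassassin/user_prefs" ok_locales "en"
```

An existing setting is left untouched even if its value differs, so this is only suitable for first-time setup, not for changing a value.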

This change immediately made a big difference; notice how many points were derived from the non-English aspects of the first incoming spam message that I received after making the changes:

  Content analysis details:   (14.10 points, 5 required)
  UNDESIRED_LANGUAGE_BODY (4.0 points)  BODY: Written in an undesired language
  CHARSET_FARAWAY    (3.2 points)  BODY: Character set indicates a foreign language
  BODY_8BITS         (1.5 points)  BODY: Body includes 8 consecutive 8-bit characters
  MSG_ID_ADDED_BY_MTA_3 (0.7 points)  'Message-Id' was added by a relay (3)
  CHARSET_FARAWAY_HEADERS (2.1 points)  A foreign language charset used in headers
  DATE_IN_FUTURE_24_48 (2.6 points)  Date: is 24 to 48 hours after Received: date

Compare this with a similar, obviously spam message received prior to making the changes and which easily slipped past SpamAssassin without being tagged:

X-Spam-Status: No, hits=1.6 required=5.0
	tests=HTML_60_70,HTML_FONT_BIG,HTML_FONT_COLOR_BLUE,
	      HTML_FONT_COLOR_RED,HTML_MESSAGE,MANY_EXCLAMATIONS,
	      MIME_HTML_ONLY

These changes mean that foreign language spam will be caught much more often by SpamAssassin from here on, and because this is a user-level preference setting I can make the change without affecting other users on the same mail server.

Step 2: Diverting spam into a separate mailbox

The next thing I did was automatically divert messages that are very likely to be spam (those that score 15 or over) into a separate mailbox on the server. Messages that score between 5 and 15 continue to be tagged by SpamAssassin and I filter them on my local machine using SpamSieve.

This is done by creating a ~/.procmailrc file with the following contents:

# mail very likely to be spam (>= 15) quarantined in "spam" folder
:0
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
spam
#/dev/null
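
The recipe works because SpamAssassin writes one asterisk per whole point of score into the X-Spam-Level header, so fifteen escaped asterisks match any score of 15 or more. As a rough sketch, the condition line for a different threshold could be generated rather than typed by hand (spam_level_pattern is a hypothetical helper name):

```shell
#!/bin/sh
# spam_level_pattern N (hypothetical helper)
# Print a procmail condition that matches an X-Spam-Level header
# containing at least N asterisks. Each "*" is escaped as "\*"
# because a bare "*" is a regex quantifier.
spam_level_pattern() {
    n=$1
    stars=''
    i=0
    while [ "$i" -lt "$n" ]; do
        stars="${stars}\\*"
        i=$((i + 1))
    done
    printf '* ^X-Spam-Level: %s\n' "$stars"
}

# Print the condition for a threshold of 15, as used in the recipe above.
spam_level_pattern 15
```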

Ideally, messages in the less-than-5 range will be legitimate and contain very few false negatives. Messages in the 15-and-greater range will hopefully be spam 100% of the time and if an extended trial supports this then I'll mark them for automatic deletion. The remaining messages (in the 5 to 15 range) should be mostly spam with very few false positives which I can manually check with a quick visual scan in Mail.app. The truth is that SpamSieve does such an excellent job (much better than SpamAssassin) that I don't use SpamAssassin's score as a basis for sorting my mail; I just let SpamSieve do its job and SpamAssassin's role is reduced to a simple message tagger.
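
The three score ranges described above can be sketched as a small function. The labels "inbox", "review" and "quarantine" are illustrative names only; the thresholds of 5 and 15 are the ones used in this article:

```shell
#!/bin/sh
# classify_score SCORE (hypothetical helper)
# Map a SpamAssassin score to one of three buckets:
#   below 5      -> inbox       (not tagged)
#   5 up to 15   -> review      (tagged, checked on the client)
#   15 or above  -> quarantine  (diverted on the server)
# awk is used because scores are decimals and sh only does integer math.
classify_score() {
    awk -v s="$1" 'BEGIN {
        if (s + 0 >= 15)     print "quarantine"
        else if (s + 0 >= 5) print "review"
        else                 print "inbox"
    }'
}

classify_score 14.10
classify_score 1.6
```

With the two scores quoted earlier, 14.10 lands in "review" and 1.6 in "inbox".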

Step 3: Automatically deleting spam

Once I am convinced that the set-up works I'll change my .procmailrc file to this:

# mail very likely to be spam (>= 15) piped directly to /dev/null
:0
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
#spam
/dev/null

Now the only remaining issue for me with SpamAssassin is this bug, long ago fixed but probably never to be backported to the older version that comes with Red Hat Enterprise Linux ES 3. I find that I can't override the weighting of the defective FORGED_MUA_OUTLOOK rule because it would allow too much spam to slip under the radar; so the workaround is to instead use whitelist_from directives in my user preferences file to handle the few cases in which this bug actually causes legitimate email to be misclassified as spam.

So that auto-deleted messages don't just disappear into the ether without a trace you could add logging for them like this:

LOGFILE=$HOME/procmail.log
VERBOSE=off
LOGABSTRACT=all

# mail very likely to be spam (>= 15) piped directly to /dev/null
:0
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
/dev/null

# clean the environment before continuing
LOGFILE=
LOGABSTRACT=
VERBOSE=

This will produce log messages like this in the procmail.log file for each auto-deleted spam:

From yuumi_basel0903@grate.half-time.info  Tue Jun 27 10:24:03 2006
 Subject: [SPAM] 
  Folder: /dev/null                                                       14890

Needless to say it is probably a good idea to check this log file frequently to be sure that legitimate messages are not getting blown away.
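
To make that check quicker, the log can be condensed to one line per deleted message. This is only a sketch, assuming the three-line From/Subject/Folder abstract format shown above; summarize_deleted is a hypothetical name:

```shell
#!/bin/sh
# summarize_deleted LOGFILE (hypothetical helper)
# Print "sender<TAB>subject" for every log abstract whose Folder is
# /dev/null, followed by a total count.
summarize_deleted() {
    awk '
        /^From / { sender = $2 }
        /^[ \t]*Subject:/ {
            subject = $0
            sub(/^[ \t]*Subject:[ \t]*/, "", subject)
        }
        /^[ \t]*Folder:/ {
            # $2 is the delivery target; the last field is the size.
            if ($2 == "/dev/null") {
                count++
                printf "%s\t%s\n", sender, subject
            }
            sender = ""
            subject = ""
        }
        END { printf "%d message(s) delivered to /dev/null\n", count }
    ' "$1"
}

# Example call, using the log path from the settings above:
# summarize_deleted "$HOME/procmail.log"
```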

Update: Step 4: Enabling auto-whitelisting

The old version of SpamAssassin that comes with Red Hat Enterprise Linux ES 3 doesn't have the auto-whitelisting feature enabled by default. I've decided to turn it on to see if it improves SpamAssassin's abysmal accuracy. Overnight I received 213 spam messages; of these:

  • None crossed the 15 point threshold, not even the "spammiest" ones.
  • SpamSieve identified all 213 messages correctly as spam (100% accuracy, 0 false positives, 0 false negatives).
  • Only 92 were tagged as spam by SpamAssassin (43% as accurate as SpamSieve).

So I changed my procmail recipe from:

# Filter messages through SpamAssassin:
# - use a lock file to reduce load
# - only filter files less than 256KB in size (save CPU/memory)
:0fw: spamassassin.lock
* < 256000
| /usr/bin/spamassassin

To:

# Filter messages through SpamAssassin:
# - use a lock file to reduce load
# - only filter files less than 256KB in size (save CPU/memory)
# - the "-a" switch enables the auto-whitelisting feature of SpamAssassin
:0fw: spamassassin.lock
* < 256000
| /usr/bin/spamassassin -a

Update: Step 5: Bayesian training

In an effort to improve on SpamAssassin's poor accuracy (only 43% as accurate as SpamSieve even with auto-whitelisting turned on) I decided to do some Bayesian training. It is necessary to train the filter using both spam and ham (non-spam) messages.

I had 1558 recent spams on hand that had been sent to my business account, and 547 sent to my personal account. I moved these into a new IMAP mailbox called "train-spam" in each account.

I then grabbed the 1558 most recent ham messages for my business account using Mail.app's powerful search facility (which allows you to select multiple mailboxes and sort search results by date). I did the same to locate 547 ham messages sent to my personal account. I copied (not moved) these messages to a new IMAP mailbox called "train-ham" in each account. Copying can be achieved in Mail.app just like in the Finder by holding down the Option key while dragging.
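
Before training it can be worth confirming on the server that both folders hold the expected number of messages. The sketch below assumes the IMAP folders are stored as mbox files (the sa-learn --mbox invocations in this article suggest so) and counts the "From " separator lines, which is an approximation; count_mbox is a hypothetical name:

```shell
#!/bin/sh
# count_mbox FILE (hypothetical helper)
# Approximate the number of messages in an mbox file by counting the
# "From " separator lines that start each message.
count_mbox() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    grep -c '^From ' "$1" || true
}

# Example calls, using the folder paths that appear in this article:
# count_mbox "/home/john%example.com/mail/train-spam"
# count_mbox "/home/john%example.com/mail/train-ham"
```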

My first attempts at using sa-learn on the server were problematic because the training has to be run as the user to whom the account belongs, but on my server such accounts do not have shell access. I tried using sudo but this didn't work:

$ sudo -u john%example.com sa-learn --showdots --spam --mbox /home/john%example.com/mail/train-spam
Password: [enter password of admin account]
Failed to create default user preference file /home/xxxxxx/.spamassassin/user_prefs
lock: 12345 cannot create tmp lockfile /home/xxxxxx/.spamassassin/bayes.lock.x.12345 for /home/xxxxxx/.spamassassin/bayes.lock: Permission denied

As you can see this allowed the command to run as the correct user but it still tried writing to files in another home directory. I then tried using su but that didn't work because the email accounts do not have login shells.

$ su john%example.com ls
Password: [enter password of email account]
This account is currently not available.

My first workaround was to temporarily enable login shells for those accounts and su to them before training, but I then went in search of a better solution. It turns out that the -H switch to sudo was what was needed; this sets the HOME environment variable before executing the command:

$ sudo -u john%example.com -H sa-learn --showdots --spam --mbox /home/john%example.com/mail/train-spam

There was still one problem to solve. Training using this command produced lots of messages about problems untainting paths:

security: cannot untaint path: "/home/john%example.com/.spamassassin"
security: cannot untaint path: "/home/john%example.com/.spamassassin/user_prefs"
security: cannot untaint path: "/home/john%example.com/.spamassassin"
security: cannot untaint path: "/home/john%example.com/.spamassassin/bayes"
security: cannot untaint path: "/home/john%example.com/.spamassassin"
security: cannot untaint path: "/home/john%example.com/.spamassassin/bayes"

I filed a bug report with the SpamAssassin team and it looks like the problem will be fixed for version 3.2.0 of SpamAssassin. I also made the necessary changes on my local install.

My plan is to use the initial training run as a base and from here on do mistake-based training. This means feeding only false positives and false negatives back to SpamAssassin for training. My reading indicates that mistake-based training is the best in the long run. Here is a short script I threw together to automate future runs of the training process:

#!/bin/sh

USERS="john%example.com barry%example.com"

echo "Starting SpamAssassin training run: will use sudo to run sa-learn as appropriate user(s)."
sudo -v

for USER in ${USERS}
do
  echo "Spam training for user ${USER}:"
  sudo -u "${USER}" -H sa-learn --showdots --spam --mbox "/home/${USER}/mail/train-spam"
  echo "Ham training for user ${USER}:"
  sudo -u "${USER}" -H sa-learn --showdots --ham --mbox "/home/${USER}/mail/train-ham"
done

echo "Training run complete: you should now empty the train-spam and train-ham folders for these users:"
echo "  ${USERS}"

So after training for a few days accuracy is greatly improved but there is still a long way to go before SpamAssassin catches up with SpamSieve. It seems that without Bayesian training, SpamAssassin is next to useless. We'll see how accurate it can become after more training.

Last night in an approximately 12 hour period I obtained the following results:

  • 86 spams received in total
  • 100% correct identification rate by SpamSieve; 0 false positives, 0 false negatives
  • SpamAssassin was 64% as accurate as SpamSieve (55 messages classified as spam); 0 false positives, 31 false negatives
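
The "as accurate as SpamSieve" percentages in this article are simply the share of spam that SpamAssassin tagged out of the total that SpamSieve identified. A minimal sketch of the arithmetic (percent is a hypothetical helper name):

```shell
#!/bin/sh
# percent PART WHOLE (hypothetical helper)
# Print PART as a percentage of WHOLE, rounded to the nearest integer.
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f%%\n", 100 * n / d }'
}

percent 92 213   # first night, auto-whitelisting only: prints 43%
percent 55 86    # after a few days of Bayesian training: prints 64%
```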

Update: Performance after one week

  • The total incoming message count was approximately 1904 messages.
  • Approximately 1020 ham messages made it through to the client.
  • Approximately 884 spam messages were received in all.
  • That figure includes the 332 messages that SpamAssassin scored at 15 or higher and that were auto-deleted at the server (8 messages sent to my personal address, 324 messages sent to my business address); these messages were never downloaded to the client and were never seen by SpamSieve.
  • SpamSieve incorrectly classified 4 spam messages as ham (4 false negatives).
  • SpamSieve incorrectly classified 2 ham messages as spam (2 false positives).
  • SpamSieve's overall accuracy rate was 99.68% (6 errors in total).
  • Of the spam messages that made it through to the client, SpamAssassin correctly labelled 464 as spam (163 sent to my personal address and 301 sent to my business address).
  • In other words, SpamAssassin failed to label about 420 spam messages as spam.
  • SpamAssassin's accuracy rate was approximately 78% overall.
  • SpamAssassin's error rate for false negatives was 70 times worse than SpamSieve's.

So, I will continue to train SpamAssassin on its errors, but it seems unlikely that it will ever catch up with SpamSieve. It seems that I'm approaching the limit of possible accuracy using the SpamSieve/SpamAssassin pair, so I have a few more steps in mind that I can use to reduce the amount of spam that gets to the filters in the first place... watch this space for more details.

Posted by wincent at June 21, 2006 3:40 PM