Mail Matching Research

Statement of Problem

Yes, there's an awful lot of unsolicited email, and most of it comes from email addresses that have been found by crawling the web and extracting them from old and current documents. Our good-natured willingness to share information has made us victims of unsolicited emails. After a few years of using kjw@rightsock.com, I currently get about 10 of these messages per day. Below, I describe a fairly effective, albeit computationally expensive, method of discovering these messages.

Existing solutions

Mail header analysis

Several systems look at the mail headers and try to figure out whether an email was really meant for the recipient. I created a set of header analysis filters, but they catch only about 30-50% of the unwanted messages. The order of the rules is very important, and is also partly to blame for why my particular filters do not catch very much. Here's what I use:

If the 'Charset' header indicates a non-English language (I don't speak Russian, Korean, or Chinese), throw the message away.
If the mail is 'To:' one of my mailing lists, put it into its appropriate folder. (This is where a bunch of the messages slip by the rules below.)
If the mail is defective and important headers are either missing or damaged, such as:
- missing the 'From:' header
- only addressed as 'To: ', thus missing the @[host].domain.com portion
- addressed as 'To: <@[host].domain.com>'
- addressed as 'To: .* @' -- space in front of the @
- addressed as 'To: .*@localhost' -- invalid host/domain portion
- if certain headers are present, but blank (note that cc: is not in this list; some mailers insert it by accident!): ^(X-Mailer|Apparently-To|Bcc|Reply-to):
Reject email if it contains known headers inserted by bulk mail programs: ^X-Mailer:.* (Aureate Group Mail Free Edition|DiffondiCool|fastmail)
Accept friendly users from free email (thus abused) domains: '^FROM.*(user@yahoo.com)', @hotmail.com, and more.
Reject all other email from free email domains (see list above).
If addressed to me, keep it. This rule accepts probably 30-50% of the unsolicited email that makes it to this point. Emails that are bcc'ed to me, or that come from lists I forgot to add to the accepted mailing lists section, are not kept here and fall through to the next rule.
Last rule: by default, store all remaining emails in the 'suspect' folder.
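As a rough illustration, the rule order above could be sketched in Python roughly as follows. This is not the filter set actually in use: the charset list, the mailing-list address, and every folder name except 'suspect' are assumptions made up for the example, and the defect checks are only an approximation of the patterns listed.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

# All of these values are illustrative assumptions, not the real configuration.
FOREIGN_CHARSETS = {"koi8-r", "euc-kr", "gb2312", "big5"}
LIST_FOLDERS = {"some-list@example.org": "some-list"}   # hypothetical list
FRIENDS = {"user@yahoo.com"}
FREE_DOMAINS = {"yahoo.com", "hotmail.com"}
ME = "kjw@rightsock.com"
BULK_MAILERS = re.compile(r"Aureate Group Mail Free Edition|DiffondiCool|fastmail")
MUST_NOT_BE_BLANK = ("X-Mailer", "Apparently-To", "Bcc", "Reply-To")


def is_defective(msg):
    """Approximate the 'missing or damaged headers' rule."""
    if msg.get("From") is None:
        return True
    to = msg.get("To")
    # No '@' at all, '<@host>', a space before '@', or '@localhost'.
    if to is not None and ("@" not in to or re.search(r"<@|\s@|@localhost\b", to)):
        return True
    # Headers that are present but blank.
    return any(msg.get(h) is not None and not msg.get(h).strip()
               for h in MUST_NOT_BE_BLANK)


def classify(raw):
    """Return a folder name for one raw message; the rule order matters."""
    msg = message_from_string(raw)
    if (msg.get_content_charset() or "us-ascii") in FOREIGN_CHARSETS:
        return "discard"
    recipients = {addr.lower() for _, addr in getaddresses(msg.get_all("To", []))}
    for list_addr, folder in LIST_FOLDERS.items():
        if list_addr in recipients:
            return folder
    if is_defective(msg):
        return "discard"
    if BULK_MAILERS.search(msg.get("X-Mailer", "")):
        return "discard"
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in FRIENDS:
        return "inbox"
    if sender.rpartition("@")[2] in FREE_DOMAINS:
        return "discard"
    if ME in recipients:
        return "inbox"
    return "suspect"
```

Because the first matching rule wins, moving the mailing-list check below the defect checks would change which messages slip through, which is the ordering sensitivity described above.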

This works relatively well, but the headers keep changing and these rules require a lot of maintenance, too much for my busy schedule these days.

Proposed mail content matching algorithm

Step one. Trick the senders of these emails into sending messages to a separate email address. This address should NEVER be used for anything legitimate, because everything it receives will be used as input to the mail removal algorithm. For example, I use trawler@rightsock.com.

Step two. Use the incoming mails as a database. Edit them into the 'internal' direct comparison format. This means stripping off things that easily break the algorithm and that probably don't affect the comparison, such as leading ">"'s and various symbols. Run the bodies of the messages through 'fmt -1' and store them.
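A minimal sketch of this normalization in Python might look like the following. The function name is hypothetical, the exact set of symbols to strip is an assumption (here, anything that is not a word character or whitespace), and splitting into a word list only approximates what 'fmt -1' does, namely leaving roughly one word per line.

```python
import re


def normalize(body: str) -> list[str]:
    """Return the message body as a list of words, one per entry."""
    words = []
    for line in body.splitlines():
        line = re.sub(r"^[>\s]+", "", line)    # drop leading ">" quote markers
        line = re.sub(r"[^\w\s]", " ", line)   # assumed symbol set: replace with spaces
        words.extend(line.split())             # one word per entry, like 'fmt -1'
    return words
```

For example, `normalize("> > Buy now!!!\n>cheap *pills*\n")` returns `['Buy', 'now', 'cheap', 'pills']`.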

Step three. Apply the database against the desired mailbox. Use diff (a line-by-line comparison algorithm), and convince it to output only one symbol per whitespace-separated word: a + if the word is in file A but not in B, a - if it is in B but not in A, or an = if the word matches identically between the two files. Since 'fmt -1' leaves roughly one word per line, diff's line-by-line comparison effectively becomes a word-by-word comparison. Count the characters, and you can now calculate how sequentially similar the messages in the database and the inbox are.
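The comparison can be approximated in Python with difflib instead of the diff command. This is a sketch under assumptions: SequenceMatcher does not use the same algorithm as diff, the function names are made up, and taking the share of = marks as the similarity score is only one reading of "count the characters".

```python
from difflib import SequenceMatcher


def diff_marks(a_words, b_words):
    """One mark per word: '+' only in A, '-' only in B, '=' in both, in order."""
    marks = []
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marks.append("=" * (i2 - i1))
        else:  # 'replace', 'delete', or 'insert'
            marks.append("+" * (i2 - i1))   # words present only in A
            marks.append("-" * (j2 - j1))   # words present only in B
    return "".join(marks)


def similarity(a_words, b_words):
    """Fraction of marks that are '=' (assumed scoring; 0.0 for two empty inputs)."""
    marks = diff_marks(a_words, b_words)
    return marks.count("=") / len(marks) if marks else 0.0
```

For the word lists `buy cheap pills now` and `buy cheap watches now`, this yields the marks `==+-=` and a similarity of 0.6.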

Results

I tried this with a test set of about 1000 messages, and I was amazed that this simple algorithm resulted in a COMPLETELY bi-modal distribution. Either the messages were 0-30% alike or 90-100% alike. There were virtually no messages in the middle (fewer than 10!). One could very easily set the threshold at 70% and catch virtually everything.
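Applying the 70% threshold could look something like the sketch below. The function names are hypothetical, and `SequenceMatcher.ratio()` over words is used as a stand-in for the diff-based count, so the exact percentages will not match the ones measured with diff.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.70  # the cut-off suggested above


def word_similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; ratio() is 2 * matches / total words."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def is_unsolicited(message: str, database: list[str], threshold: float = THRESHOLD) -> bool:
    """True if the message is at least `threshold` similar to any database entry."""
    return any(word_similarity(message, known) >= threshold for known in database)
```

With a bi-modal distribution like the one observed, the exact threshold matters little: anything between the two clusters separates them about equally well.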

The problem is that this method requires input, and that input may arrive after the database has been applied to the mailbox. Therefore, the process must be reapplied. But doing this to the entire mailbox (my main inbox typically runs 200-600 messages backed up) is incredibly slow. Therefore, one must incrementally reapply the algorithm in two ways:

When a message gets added to the database, apply only that message to the mailboxes. This filters after the fact of receipt, and means there will be a delay in removal. But much activity happens overnight, so by morning the cleanup should have already happened.
When a message gets added to the mailbox, compare just that one message to the database. Therefore, a list of all messages already processed needs to be kept. IMAP may be of some help here, since it guarantees unique IDs for all messages.
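The two incremental paths above might be organized like this. It is only a sketch: the class and method names are invented, messages are held in memory rather than read over IMAP, the `uid` values stand in for IMAP unique IDs, and `SequenceMatcher.ratio()` again substitutes for the diff-based comparison.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.70) -> bool:
    """Stand-in comparison: word-level ratio at or above the threshold."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio() >= threshold


class IncrementalFilter:
    def __init__(self):
        self.database = []      # known unsolicited message bodies
        self.mailbox = {}       # uid -> body, messages still in the mailbox
        self.processed = set()  # uids already compared against the database
        self.removed = []       # uids flagged for removal, in order

    def add_to_database(self, body):
        """New database entry: compare only it against the existing mailbox."""
        self.database.append(body)
        for uid, text in list(self.mailbox.items()):
            if similar(text, body):
                self._remove(uid)

    def add_to_mailbox(self, uid, body):
        """New mailbox message: compare only it against the database, once."""
        if uid in self.processed:
            return
        self.processed.add(uid)
        self.mailbox[uid] = body
        if any(similar(body, known) for known in self.database):
            self._remove(uid)

    def _remove(self, uid):
        del self.mailbox[uid]
        self.removed.append(uid)
```

Each arrival then costs one pass over the other collection instead of a full mailbox-times-database comparison.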

Have I done anything more than test out my theory? No, since I don't have a reliable input. I hope that by publishing the address trawler@rightsock.com here on this page, some crawler will pick it up, and over the next few months it should start to get fed. I could also reply to known unsolicited emails, masquerading as the trawler account, but that takes time that I haven't decided to dedicate yet. Anyway, once the database gets a good, regular, and maintenance-free continuous feeding, I will start putting it to use.

Further needs

MIME, honeypot, more

last modified - 2002.12.06 kjw
created - 2002.12.06 kjw