What is spam?

The most common definition of spam is Unsolicited Commercial E-mail (UCE for short), but an e-mail does not need to be commercial to qualify as spam. The more general definition of spam is e-mail that is not personal and that you did not request. However, the latter definition is too broad and is therefore hard to use for automatic recognition of spam.

Why is it a problem?

Spam is a problem because it wastes resources. It takes up capacity in your mailbox and costs you time and money to download. More importantly, it wastes several orders of magnitude more resources at the Internet service providers who have to handle it.

Categories of spam

  1. Commercial spam. This is probably the most common kind of spam. These messages usually describe how good and cheap a service or product is, claim that this is a very special offer just for you, and say other things which might or might not be true (usually not).
  2. Charity letters. These letters usually request that you send some amount of money to help someone. Usually these are also chain letters.
  3. Chain letters. Chain letters contain a request that you should forward this e-mail to your colleagues or friends. These usually contain charity requests, false virus warnings or other false information.
  4. E-mail viruses. Technically these are also unwanted mail messages and therefore spam but they can be more serious than the other categories because they might take over your computer and do whatever they like with it – at least on Microsoft systems. If you do not use Microsoft products, you are more or less safe (at least today).

Differences between spam and real e-mail

There can be several differences between spam and legitimate e-mail and some of them might be used for detection of spam.
  1. Fake sender address. Most spam seems to come from an address that does not exist and never has. So if you reply to it saying you are pissed off, the reply will bounce back to you and fill your mailbox again. There are, however, spammers who use real sender addresses, and if you reply to them they will know that your e-mail address is valuable because you actually read your messages, and they will send you more spam. So it is usually not a good idea to reply to spam messages.
    Verifying that the sender of a message really exists can be hard (or even impossible) and is usually not done. What is much more common is verifying at least the host name of the sender's e-mail address: if it does not exist, the mail is almost surely spam ("almost" because a problem in the DNS configuration at the time of testing could produce a false negative answer, but this is very, very rare).
    A technique that unfortunately seems to be becoming more common is to use a sender address that does exist but is picked at random and has nothing to do with the real spammer. It is very annoying to receive tons of spam complaints because a spammer happened to choose your e-mail address when sending out spam.
  2. Sender is a dial-up connection. Normally if you are using a dial-up connection and you want to send an e-mail to someone else, you send it to the mail server of your ISP who then forwards it to the real recipient. This method however would allow the ISP to filter out known spam so spammers very often avoid it and send their messages directly from their own machine. The other advantage of using dial-up connections is that it is very hard to determine the real user behind the connection so the spammer cannot be tracked back.
  3. Forged headers. E-mail messages contain some technical information which is normally not shown to the user reading them. This information includes the complete route the message took before it arrived in your mailbox. It can be used to detect mail coming from untrusted sources, so spammers usually try to forge it.
  4. Formatting. Much spam comes in HTML form. There are people who use HTML-formatted e-mail regularly, but the majority of users still use plain text, so HTML formatting can be a good indication of spam.
  5. Encoding. If an e-mail contains Chinese characters but you do not speak Chinese, it is more than likely that the e-mail is spam.
  6. Style. Many spam mails use lots of capital letters (which means yelling in e-mail) and excessive punctuation. Unfortunately a (fortunately small) number of people also write this way in normal e-mail. But if you do not know any such strange person, these style properties can be very useful for detecting spam.
  7. Strange attachments. Nearly all e-mail viruses use them.
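
The style difference (item 6) is the easiest of these to measure. The following Python sketch is only an illustration: the function names, the set of punctuation characters and both thresholds are assumptions made for this example, not values taken from any real filter.

```python
def style_features(text):
    """Return (share of letters that are uppercase, longest run of '!', '?' or '$')."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    longest = run = 0
    for c in text:
        if c in "!?$":
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return upper_ratio, longest


def looks_shouty(text, max_upper=0.5, max_run=3):
    """Flag text that is mostly capitals or has a long punctuation run.

    Both thresholds are arbitrary illustrative values.
    """
    upper_ratio, longest = style_features(text)
    return upper_ratio > max_upper or longest >= max_run
```

With these thresholds, looks_shouty("EARN MONEY NOW!!!") is true and looks_shouty("Hi Bob, see you at lunch tomorrow.") is false. A check like this is a weak signal on its own and is best treated as one rule among several.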

Detecting and filtering spam

Well, if you have an idea what makes the difference between spam and non-spam, you can use this information to automatically detect spam and filter it out. The most common spam detection methods:
  1. Pattern matching. This is one of the oldest, simplest and most widely used methods. It works by specifying a pattern (actually a list of patterns) of text that you expect not to be found in legitimate e-mails but common in spam messages (like "FREE SEX", "$$$$$", "Earn a lot of money" and so on). The main advantage of this method is that it is very easy to implement and rather fast. The disadvantages are numerous:
    1. The hit rate is usually low and the list of patterns needs constant updating.
    2. The more complex the pattern is the more resources it requires.
    3. Even small modifications not important to humans (like putting in extra spaces, changing a word to a synonym) can result in false negatives.
    4. It makes discussing spam with your friends really hard or impossible, because such messages contain the very patterns being filtered.

    Due to these difficulties pattern matching nowadays is rarely used alone but only in combination with other methods.

  2. Checking the sender address. It has already been discussed above.
  3. Checking the "Received:" headers. These headers record the full path the message took before it arrived in your mailbox. The most common checks performed on them are the following:
    1. Checking for open relays. There are mail servers that accept mail from anyone to anyone; these are called open relays. A spammer can easily use such a machine to send out a large number of messages. Even worse, spammers can easily mask their own identity by forging the headers of messages, because recipients can only be sure that the mail passed through the open relay; they have no information about where it really came from.
    2. Checking for dial-up addresses. As already mentioned, spammers often use dial-up accounts to send out spam directly, while real users very seldom do so.
    3. Checking for known spam sources. There are some domains and IP address ranges that are known to send out large amounts of spam. Blocking them can reduce the amount of spam received.

    Most of the above checks use specially constructed DNS zones to do the checking. Example systems providing such services are OpenRBL or MAPS.

  4. Fingerprint checking. This method calculates a checksum for every message and uses some central database of known spam fingerprints to decide whether the e-mail is one of them or not. Unfortunately simple checksums are very easy to circumvent as changing only one character changes the checksum completely. There are solutions:
    1. Selecting a random part of the message and calculating the checksum for that part only. This way the spammer does not know which part of the message will be checked and cannot modify it to circumvent the checksum.
    2. Using fuzzy hash algorithms that are robust against small changes in the input. Unlike regular checksums, which are designed to detect the smallest changes in the input, these hash algorithms are designed to produce the same or very similar output for similar messages.

    The major problem with every checksum-based solution is that spam is a relative term. For example, a conference call for papers might be spam to many people but important to me, and if someone has registered its checksum in the database I also use, I might not get it.

  5. Checking the attachments. This technique is used mostly by virus scanners.
  6. Scoring. None of the above methods is foolproof, and even legitimate e-mail might match one or more of them. What you can do is combine them in some fashion. One such combination technique is scoring, in which every rule that matches a message contributes a numerical score. Spam messages usually match several rules at the same time, so if you treat messages with a high total score as spam you are very likely to get the desired effect. The hard part is finding the right scores and the threshold between spam and non-spam scores.
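
Pattern matching and scoring fit together naturally: each pattern becomes a rule with a score, and only the total decides. The Python sketch below illustrates this using the example patterns mentioned earlier. The rule names, the individual scores and the threshold are invented for the example; real filters tune such values on large collections of mail.

```python
import re

# Hypothetical rules: (name, pattern, score). The scores are made up.
RULES = [
    ("free_sex", re.compile(r"free\s+sex", re.IGNORECASE), 3.0),
    ("dollar_run", re.compile(r"\${4,}"), 2.0),
    ("earn_money", re.compile(r"earn\s+a\s+lot\s+of\s+money", re.IGNORECASE), 2.5),
    ("exclaim_run", re.compile(r"!{3,}"), 1.0),
]

# Hypothetical limit between non-spam and spam totals.
THRESHOLD = 4.0


def score_message(text, rules=RULES):
    """Return (total score, names of the rules that matched)."""
    matched = [(name, score) for name, pattern, score in rules if pattern.search(text)]
    total = sum(score for _, score in matched)
    return total, [name for name, _ in matched]


def is_spam(text, threshold=THRESHOLD):
    """Treat a message as spam when its total score reaches the threshold."""
    total, _ = score_message(text)
    return total >= threshold
```

For "Earn a lot of money $$$$$ now!!!" three rules match and the total is 5.5, so the message is classified as spam. A message such as "We discussed the FREE SEX spam problem at lunch" matches only one rule (3.0) and passes, which softens the fourth disadvantage of plain pattern matching listed above.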

Software solutions for fighting spam

All major mail server software nowadays contains facilities for sender address verification, and many packages support other address-based checks such as open relay or dial-up host detection. For other checks, special filter programs can be used.

Postfix

Postfix is a Mail Transport Agent with very good anti-spam capabilities. It can be configured to detect spam in every phase of message delivery.

Razor

Razor is a distributed spam detection and filtering network. It stores multiple kinds of hashes and checksums, which identify known spam. Mail filters can connect to the central database and verify if the incoming message can be found in the database. Razor uses traditional checksum-like signatures as well as ephemeral randomized signatures and fuzzy Nilsimsa signatures. E-mail messages can be preprocessed to remove non-significant parts of the message like HTML formatting, or decode BASE64 and QP encoded messages. Razor has a trust system where every spam reporter is given a trust value which changes over time and users can specify the level of confidence a signature must have in order to qualify a message as spam. Messages which turn out not to be spam can be revoked from the database.
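
The difference between an exact checksum and a signature that tolerates small changes can be shown in a few lines of Python. This is not Razor's algorithm and not Nilsimsa. It is a simplified stand-in that compares sets of overlapping character 4-grams, and all function names and the 4-gram length are assumptions made for the illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_fingerprint(text):
    """Conventional checksum: any remaining difference changes it completely."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, n=4):
    """Set of overlapping n-character pieces of the normalized text."""
    s = normalize(text)
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def similarity(a, b, n=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)
```

Replacing a single letter (for example "today" with "t0day") gives a completely different exact_fingerprint, while similarity stays close to 1.0 because only the few 4-grams that overlap the changed character differ. A filter built on this idea would compare the similarity against a chosen cut-off instead of testing for equality.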

Spamassassin

Spamassassin is a mail filtering tool which uses pattern matching and internet-based blacklists (like ORBL or MAPS) to filter out spam. The various detection rules all have scores automatically generated by using a genetic algorithm on a large database of known spam and non-spam messages. Spamassassin can also use Razor (and can assign a certain score to a match in the Razor database). It also supports automatic whitelisting which prevents false positives from known-not-to-be-spam sources.
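
Internet-based blacklists of this kind are usually queried through ordinary DNS lookups: the octets of the sending IP address are reversed, the name of the blacklist zone is appended, and the resulting name resolves only if the address is listed. The Python sketch below shows that convention. The zone dnsbl.example.org is a placeholder and not a real service, and the function names are made up for the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the blacklist zone has an address record for this IP.

    The resolver is a parameter so the function can be tried without network
    access. Any lookup failure is treated as "not listed" here.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

For example, dnsbl_query_name("192.0.2.1", "dnsbl.example.org") gives "1.2.0.192.dnsbl.example.org". Treating every failed lookup as "not listed" is a deliberate simplification: it means that a DNS outage lets mail through instead of blocking it.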

Bogofilter

Bogofilter implements a statistical method based on Bayesian filtering of words. This works by breaking up incoming mail messages into tokens (words, numbers etc.) and counting how many times each token appears in a database of known spam and known non-spam messages. These counts are then turned into probabilities, which are used to calculate a final probability that the original message is spam. An extensive explanation can be found here.
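
The general idea of token counting can be sketched in Python as follows. This is a plain naive-Bayes style combination chosen for brevity and is not claimed to be the formula Bogofilter itself uses. The class name, the tokenizer, the smoothing constants and the implicit assumption that spam and non-spam are equally likely beforehand are all choices made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class TokenStats:
    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of non-spam messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Record the tokens of one message of known class."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Probability that a message containing this token is spam.

        Add-one smoothing keeps the result strictly between 0 and 1,
        and an unseen token comes out as a neutral 0.5.
        """
        p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def spam_probability(self, text):
        """Combine the per-token probabilities by summing their log-odds."""
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))
```

After training on a few spam messages about cheap pills and a few legitimate messages about meetings, spam_probability returns a value near 1 for "buy cheap pills" and near 0 for "meeting notes tomorrow". Tokens the filter has never seen leave the result unchanged.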

Procmail

While procmail isn't directly a spam filter, it can be used to pass incoming messages to the filters mentioned above and can be configured to discard or redirect messages that have been identified as spam. Procmail can also be used as a pattern-matching filter, and there are projects such as SpamBouncer that provide up-to-date spam filtering rules for procmail.