What is spam?
The most common definition of spam is Unsolicited Commercial E-mail (UCE for
short), but an e-mail does not need to be commercial to qualify as spam. A
more general definition is e-mail that is not personal and that you did not
request. However, this definition is too broad and is therefore hard to use
for automatic recognition of spam.
Why is it a problem?
Spam is a problem because it wastes resources. It wastes capacity in your
mailbox, and it wastes your time and money when you download it. More
importantly, it wastes several orders of magnitude more resources at the
Internet service providers who have to handle it.
Categories of spam
- Commercial spam. This is probably the most common kind of spam. These
messages usually tell you how good and cheap a service or a product is, that
this is a very special offer just for you, and other things which might or
might not be true (usually not).
- Charity letters. These letters usually request that you send some amount of
money to help someone. Usually these are also chain letters.
- Chain letters. Chain letters contain a request that you should forward this
e-mail to your colleagues or friends. These usually contain charity requests,
false virus warnings or other false information.
- E-mail viruses. Technically these are also unwanted mail messages and
therefore spam but they can be more serious than the other categories because
they might take over your computer and do whatever they like with it – at
least on Microsoft systems. If you do not use Microsoft products, you are more
or less safe (at least today).
Differences between spam and real e-mail
There can be several differences between spam and legitimate e-mail and some of
them might be used for detection of spam.
- Fake sender address. Most spam seems to come from an address which in
fact does not exist and never existed. So if you reply to it saying you
are pissed off, the reply will bounce back to you and fill your mailbox again.
There are, however, spammers who use real sender addresses, and if you reply to
them they will know that your e-mail address is valuable because you actually
read your messages, and they will send you more spam. So it is usually not a
good idea to reply to spam messages.
Verifying that the sender of a message really exists can be hard (or even
impossible) and is usually not done. What is much more common is verifying at
least the host name of the sender e-mail address: if it does not exist, the
mail is almost surely spam ("almost" because a problem in the DNS
configuration at the time of testing could produce a false negative answer,
but this is very rare).
Another technique, which unfortunately seems to be becoming more common, is to
use a sender address that really exists but is picked at random and has
nothing to do with the real spammer. It is very annoying to receive tons of
spam complaints because a spammer happened to choose your e-mail address when
sending out spam.
- Sender is a dial-up connection. Normally if you are using a dial-up
connection and you want to send an e-mail to someone else, you send it to the
mail server of your ISP, which then forwards it to the real recipient. This
method, however, would allow the ISP to filter out known spam, so spammers very
often avoid it and send their messages directly from their own machine. Another
advantage of dial-up connections for the spammer is that it is very hard to
determine the real user behind the connection, so the spammer cannot be traced.
- Forged headers. E-mail messages contain some technical information which
is normally not shown to the user reading them. This information includes the
complete route the message took before it arrived in your mailbox. It can be
used to detect mail coming from untrusted sources, so spammers usually try to
forge it.
- Formatting. A lot of spam comes in HTML form. There are people who use
HTML-formatted e-mail regularly, but the majority of users still use plain
text, so HTML formatting can be a good indication of spam.
- Encoding. If an e-mail contains Chinese characters but you do not speak
Chinese, it is more than likely that the e-mail is spam.
- Style. Many spam mails use lots of capital letters (which means yelling in
e-mail) and excessive punctuation. Unfortunately there is a (fortunately
small) number of people who use these in normal e-mail. But if you do not know
such a strange person, these style properties can be very useful for detecting
spam.
- Strange attachments. Nearly all e-mail viruses use them.
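As a rough illustration of the style cues listed above, the following Python
sketch measures the share of capital letters and counts runs of repeated
punctuation. The function names and the thresholds (more than half of the
letters in upper case, runs of three or more characters) are assumptions chosen
for the example, not values taken from any real filter.

```python
import re


def uppercase_ratio(text):
    """Return the fraction of alphabetic characters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def punctuation_runs(text):
    """Count runs of three or more '!', '?' or '$' characters."""
    return len(re.findall(r"[!?$]{3,}", text))


def looks_shouty(text, max_upper=0.5, max_runs=0):
    """Flag text that exceeds either (assumed) style threshold."""
    return uppercase_ratio(text) > max_upper or punctuation_runs(text) > max_runs
```

On its own such a check is weak evidence, since some legitimate senders write
this way too, which is why it is better used as one signal among several.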
Detecting and filtering spam
Well, if you have an idea what makes the difference between spam and
non-spam, you can use this information to automatically detect spam and filter
it out. The most common spam detection methods:
- Pattern matching. This is one of the oldest, simplest and most widely used
methods. It works by specifying a pattern (actually a list of patterns) of text
that you expect to be common in spam messages but not to be found in legitimate
e-mail (like "FREE SEX", "$$$$$", "Earn a lot of money" and so on). The
main advantage of this method is that it is very easy to implement and rather
fast. The disadvantages are numerous:
- The hit rate is usually low and the list of patterns needs constant updating.
- The more complex the pattern is, the more resources it requires.
- Even small modifications not important to humans (like putting in extra
spaces or changing a word to a synonym) can result in false negatives.
- It makes talking about spam with your friends really hard or impossible.
Due to these difficulties pattern matching nowadays is rarely used alone but
only in combination with other methods.
- Checking the sender address. This has already been discussed above.
- Checking the "Received:" headers. These headers record the full path
the message took before it arrived in your mailbox. The most common checks
performed on them are the following:
- Checking for open relays. There are mail servers that accept mail from
anyone to anyone; these are called open relays. It is evident that a spammer
can easily use such a machine to send out a large number of messages. Even
worse, the spammer can easily mask their own identity by forging the headers
of messages, because the recipients can only be sure that the mail passed the
open relay and have no information about where it really came from.
- Checking for dial-up addresses. As already mentioned, spammers often
use dial-up accounts to send out spam while real users very seldom do so.
- Checking for known spam sources. There are some domains and IP address
ranges that are known to send out large amounts of spam. Blocking them can
reduce the amount of spam received.
Most of the above checks use specially constructed DNS zones to do the checking.
Example systems providing such services are OpenRBL and MAPS.
- Fingerprint checking. This method calculates a checksum for every message
and uses some central database of known spam fingerprints to decide whether the
e-mail is one of them or not. Unfortunately simple checksums are very easy to
circumvent as changing only one character changes the checksum completely.
There are two solutions:
- Selecting a random part of the message and calculating the checksum for
that part only. This way the spammer does not know which part of the message
will be checked and cannot modify it to circumvent the checksum.
- Using fuzzy hash algorithms that are robust against small changes in the
input. Unlike regular checksums, which are designed to detect the smallest
changes in the input, these hash algorithms are designed to produce the same
or very similar output for similar messages.
The major problem with every checksum-based solution is that spam is a
relative term. For example, a conference call-for-papers might be spam for
many people but important to me, and if someone has registered its checksum in
the database I also use, I might not get it.
- Checking the attachments. This technique is used mostly by virus
scanners.
- Scoring. None of the above methods is foolproof, and even legitimate e-mail
might match one or more of them. What you can do is combine them in some
fashion. One such combination technique is scoring, where every rule that
matches a message contributes a numerical score. Spam messages usually match
several rules at the same time, so if you treat messages with a high resulting
score as spam you are very likely to get the desired effect. The hard part is
finding the right scores and the limit between spam and non-spam scores.
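The scoring idea can be sketched in a few lines of Python. The rule names,
patterns, scores and the threshold below are made up for illustration; a real
filter would tune them against a collection of known spam and non-spam
messages.

```python
import re

# Hypothetical rules: (name, compiled pattern, score added when it matches).
RULES = [
    ("free_sex", re.compile(r"free\s+sex", re.IGNORECASE), 3.0),
    ("dollar_run", re.compile(r"\${4,}"), 2.0),
    ("earn_money", re.compile(r"earn\s+a\s+lot\s+of\s+money", re.IGNORECASE), 2.5),
    ("shouting", re.compile(r"\b[A-Z]{5,}\b"), 1.0),
]

# Assumed limit between non-spam and spam scores.
THRESHOLD = 5.0


def score_message(body):
    """Return (total score, names of matching rules) for a message body."""
    matched = [(name, score) for name, pattern, score in RULES
               if pattern.search(body)]
    total = sum(score for _, score in matched)
    return total, [name for name, _ in matched]


def is_spam(body, threshold=THRESHOLD):
    """Treat the message as spam when its total score reaches the threshold."""
    total, _ = score_message(body)
    return total >= threshold
```

With these example values a message that only mentions one suspicious phrase
stays below the threshold, while a message that hits several rules at once is
classified as spam.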
Software solutions for fighting spam
All major mail server software nowadays contains facilities for sender
address verification, and many packages support other address-based checks
like open relay or dial-up host detection. For other checks, special filter
programs can be used.
Postfix
Postfix is a Mail Transport Agent with very good anti-spam capabilities.
It can be configured to detect spam in every phase of message delivery:
- Client checking. When a client connects to an MTA it has to identify
itself. Postfix can accept or reject clients based on the client's IP address
and the validity of the greeting message.
- Header checking. Invalid or missing headers can be detected and filtered.
- Sender checking. If the sender address is not legitimate the connection
can be rejected.
- Body checking. The body of the incoming message can be matched against
a set of patterns which can be used for spam detection.
- External filters. Postfix can be easily configured to interact with
external filters (see below for a list of them).
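The sender check from the list above can be illustrated outside Postfix with a
short Python sketch. It is not how Postfix itself is implemented: the function
names are hypothetical, and the check only asks whether the sender's domain
resolves to an address, whereas a real mail server would typically also
consult MX records. The resolver is passed in as a parameter so the logic can
be tried without network access.

```python
import socket
from email.utils import parseaddr


def sender_domain(from_header):
    """Return the lower-cased domain of a From-style header, or None."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        return None
    domain = address.rsplit("@", 1)[1].lower()
    return domain or None


def domain_resolves(domain, resolver=socket.getaddrinfo):
    """Return True if the resolver finds an address for the domain."""
    try:
        resolver(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def sender_is_plausible(from_header, resolver=socket.getaddrinfo):
    """Accept a sender only if its domain can be extracted and resolved."""
    domain = sender_domain(from_header)
    return domain is not None and domain_resolves(domain, resolver)
```

A temporary DNS failure would make this sketch reject a legitimate sender, so
a production setup would more likely defer the message than discard it.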
Razor
Razor is a distributed spam detection and filtering network. It stores
multiple kinds of hashes and checksums, which identify known spam. Mail filters
can connect to the central database and verify if the incoming message can be
found in the database. Razor uses traditional checksum-like signatures as well
as ephemeral randomized signatures and fuzzy Nilsimsa signatures. E-mail
messages can be preprocessed to remove non-significant parts of the message
like HTML formatting, or decode BASE64 and QP encoded messages. Razor has a
trust system where every spam reporter is given a trust value which changes
over time and users can specify the level of confidence a signature must have
in order to qualify a message as spam. Messages which turn out not to be spam
can be revoked from the database.
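To make the idea of an ephemeral randomized signature more concrete, here is a
minimal Python sketch. It is not Razor's actual algorithm: the normalization
step, the 64-byte window, the use of SHA-1 and the way the seed selects the
window are all assumptions. The point is only that a seed shared between
client and server, and changed regularly, decides which part of the message
gets hashed, so a spammer cannot know in advance which part to vary.

```python
import hashlib
import random


def normalize(body):
    """Collapse whitespace and lower-case the text (assumed preprocessing)."""
    return " ".join(body.split()).lower()


def ephemeral_signature(body, seed, window=64):
    """Hash a seed-selected window of the normalized message body."""
    text = normalize(body).encode("utf-8")
    if len(text) <= window:
        chunk = text  # short messages are hashed as a whole
    else:
        # The same seed always selects the same offset for a given length.
        start = random.Random(seed).randrange(len(text) - window + 1)
        chunk = text[start:start + window]
    return hashlib.sha1(chunk).hexdigest()
```

Two copies of a message that differ only in whitespace or letter case yield
the same signature for a given seed, while changing the seed moves the window
to a different part of a long message.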
SpamAssassin
SpamAssassin is a mail filtering tool which uses pattern matching and
internet-based blacklists (like ORBL or MAPS) to filter out spam. The various
detection rules all have scores automatically generated by using a genetic
algorithm on a large database of known spam and non-spam messages.
SpamAssassin can also use Razor (and can assign a certain score to a match in
the Razor database). It also supports automatic whitelisting which prevents
false positives from known-not-to-be-spam sources.
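Internet-based blacklists of this kind are commonly queried through DNS: the
octets of the client's IPv4 address are reversed and prepended to the
blacklist's zone name, and an answer means the address is listed. The Python
sketch below shows that lookup. The zone dnsbl.example.org is a placeholder,
the function names are hypothetical, and this is not the code of any
particular filter.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolver=socket.gethostbyname):
    """Return True if the blacklist zone has an entry for the address."""
    try:
        resolver(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on this list.
        return False
```

For example, checking 192.0.2.99 against the placeholder zone looks up the
name 99.2.0.192.dnsbl.example.org. In a scoring filter, a positive answer
would add to the message's score rather than reject it outright.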
Bogofilter
Bogofilter implements a statistical method based on Bayesian filtering of words.
This works by breaking up incoming mail messages into tokens (words, numbers
etc.) and counting how many times each token appears in a database of known
spam and known non-spam messages. These counts are then turned into
probabilities and are used to calculate a final probability that the original
message was spam. An extensive explanation can be found
here.
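The statistical approach can be sketched in Python as follows. This is a
simplified illustration, not Bogofilter's actual algorithm: the tokenizer, the
smoothing of the counts, the choice to count each token once per message and
the way per-token probabilities are combined are all assumptions made for the
example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case word and number tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TokenStats:
    """Per-token counts over known spam and known non-spam messages."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Record each distinct token of one training message."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Estimate how strongly a token indicates spam (0.5 means neutral)."""
        # Add-one smoothing keeps the estimate strictly between 0 and 1.
        p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def spam_probability(self, text):
        """Combine the token estimates into one probability for the message."""
        prod_spam = 1.0
        prod_ham = 1.0
        for token in set(tokenize(text)):
            p = self.token_probability(token)
            prod_spam *= p
            prod_ham *= 1.0 - p
        return prod_spam / (prod_spam + prod_ham)
```

Tokens that were never seen during training come out as neutral (0.5) and do
not move the result. For very long messages the plain products could
underflow, so a more careful version would sum logarithms instead.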
Procmail
While procmail isn't directly a spam filter, it can be used to forward incoming
messages to the filters mentioned above and can be configured to discard or
redirect messages that have been identified as spam. Procmail can also be used
as a pattern-matching filter and there are projects such as
SpamBouncer providing up-to-date
spam filtering rules for procmail.