Fighting and avoiding ‘Spam,’ or Unsolicited Advertising E-mail

What is spam?

If you don't know what spam is: aside from being a brand of canned meat when spelled “SPAM”, it is a general name for those irritating advertising mails with subjects like:
E-mail marketing works!, Credit card problems? The solution is RIGHT HERE, Generic Viagra!, UNIVERSITY DIPLOMAS, or Are You Getting the Best Rate on Your Mortgage?
Synonyms are ‘junk mail’ and ‘UCE’ (Unsolicited Commercial E-mail). If you still don't know what I am talking about, consider yourself lucky and hope that you'll never receive any, because once you have received one, you can be pretty sure that thousands will follow soon…

The term “spam” originates from a certain Monty Python's Flying Circus sketch in which the word is repeated countless times. The first major case of ‘spamming’ occurred in April 1994, when the same advertising message was sent to thousands of Usenet newsgroups. After this incident, the term was increasingly used to indicate unwanted commercial mails. When I first started using the internet back in 1996, it was mostly spam-free. I have seen the rise of spam over the years, and it was not a pretty sight.

The things advertised in spam mails range from mortgages to medical products. However, a large share of these products are either cheap imitations of what they are supposed to be, or they don't exist at all. So if you do pay, you are likely to just lose money and get either nothing or total rubbish in return. Taking a drug ordered via a spam message is playing with your health. If there is one thing you should remember from this page, it is to treat all spam mails as total garbage that deserves no more attention than is required to get rid of it.

Aside from all this, and of course from being anywhere between slightly and extremely irritating, the largest problem with spam is that it causes an unbelievable amount of useless network traffic. Spammers send their garbage to millions of addresses, in the hope that at least a few of those belong to people dumb enough to buy their product. The rest delete the mail, bounce it back, or ignore it. All of this wastes network bandwidth, and network bandwidth is not free.

The problem with spam is that it yields a net profit even when just one single person responds to it positively. The reason is that sending spam costs next to nothing, so the worst-case scenario is merely a wasted effort. A relatively simple proposal to solve this would be to charge people for sending e-mail. Even a puny 1 (euro/dollar) cent per e-mail would already discourage sending a million mails, knowing that only a tiny fraction of those mails will ever generate revenue. Normal people send at most a few dozen mails per day, so the cost for them would be negligible. Although these are all very interesting ideas, they are unfortunately very hard to implement in a watertight way. The only other thing that can be done is making the sending of spam illegal. However, this will only work if there is strict, uniform regulation across the entire world.
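
To put rough numbers on this (purely as an illustration): at 1 cent per message, a run of one million mails would cost 10,000 euro/dollar up front, so the campaign would only pay off if the few responses it generates bring in more than that. Today, with sending essentially free, a single sale already means profit.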

Remember, spam only exists because there are people who respond to it. Spread the news and tell everyone you know to ignore spam mails. Replying to spam in any way is asking for more. Ignoring spam makes it die.

Types of spam

Nowadays, there are multiple kinds of unwanted e-mail. While originally they were mostly just attempts to lure people to a site to buy things, soon other, more malicious types emerged. In the rest of this text, I'll use the word ‘spam’ for all types, but here is an overview of the correct names for each type.

Keeping Spam Out of Your Mailbox

When I first set up this page, its main purpose was to provide statistics about spam subjects and senders. The idea was to allow people to use these resources to set up mail filters in an optimal way. Nowadays however, simple mail filters simply won't do because most spammers use random subjects. Moreover, collecting the statistics became intractable due to the sheer volume of spam I started receiving, so they are no longer available. What I recommend is reading the advices below, and if necessary installing a specialised spam filter like SpamAssassin or a Bayesian filter.
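
To give an idea of how a Bayesian filter works, here is a deliberately tiny Python sketch of the underlying word-scoring principle. This is purely illustrative: it is not how SpamAssassin works internally, and a real filter needs far more training data and many more tests than this.

import re
from collections import Counter

def tokenize(text):
    # Split a mail body into lowercase words.
    return re.findall(r"[a-z']+", text.lower())

class TinyBayesFilter:
    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_mails = 0
        self.ham_mails = 0

    def train(self, text, is_spam):
        # Count in how many spam/non-spam mails each word occurs.
        words = set(tokenize(text))
        if is_spam:
            self.spam_words.update(words)
            self.spam_mails += 1
        else:
            self.ham_words.update(words)
            self.ham_mails += 1

    def spam_probability(self, text):
        # Combine per-word spam probabilities into one overall score (0..1).
        score, inv = 1.0, 1.0
        for word in set(tokenize(text)):
            # +1/+2 smoothing so words seen in only one class don't give 0 or 1.
            p_spam = (self.spam_words[word] + 1) / (self.spam_mails + 2)
            p_ham = (self.ham_words[word] + 1) / (self.ham_mails + 2)
            p = p_spam / (p_spam + p_ham)
            score *= p
            inv *= 1.0 - p
        return score / (score + inv)

# Usage: train on a few known mails, then score a new one.
f = TinyBayesFilter()
f.train("generic viagra best rate on your mortgage", is_spam=True)
f.train("meeting notes attached, see you tomorrow", is_spam=False)
print(f.spam_probability("are you getting the best mortgage rate"))

A message scoring close to 1 would be flagged as spam. Real filters also keep retraining themselves on the mails you mark as spam or not-spam by hand.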

A few pieces of good advice to avoid having your address added to a Spam list

The short list:

The long list:

Help! I'm receiving massive amounts of Spam, what can I do?

Unfortunately, you will have to learn to live with the fact that you will keep on receiving junk mail, unless you completely destroy your e-mail account and create a new one with an address that cannot be easily guessed. There is no remedy against spam, except putting your poor mail account out of its misery by killing it.
I already said this, but I'll repeat it because it is so important. Rule N°1 when receiving spam: never reply to it! In most cases your reply will never arrive because the sender's address doesn't exist anyway. In some other cases, another innocent victim will receive your reply. In the remaining cases where the sender does receive your reply, (s)he will be happy with the attention you have given and will probably be encouraged to send more junk! It really doesn't matter what's in your reply: those people see every incoming mail, especially insults, as begging for more.
Also, don't bounce messages. Some programs allow you to send fake “unknown account” messages back to the sender, in the hope that the spammer will think your address doesn't exist. Don't do this, because in most cases you are only doubling the amount of useless network traffic. Spammers will likely be able to recognise these fake bounces after a while, and then you're screwed.

If you can't afford to destroy your mail account, there are a few things you can do to recognise the inevitable junk mails, so you can delete or even filter them without wasting your time.

Protecting internet forums, guestbooks, and blogs against spambots

This section is intended for people who run a website which contains a forum, guestbook or blog (in other words, for ‘webmasters’). In the early days of the internet, a guestbook was as simple as a CGI script which appended the input from a web form to a webpage. If you did this today, your guestbook would be stuffed with utter garbage within a few months. Moreover, the few real people who would sign the guestbook or leave a message on the forum together with their e-mail address would be spammed to death after a few weeks. These two phenomena are due to two types of ‘spambots’ that roam the internet today:

Using a honey pot

The robots.txt file is not a miracle solution against crawlers that are specifically designed to gather mail addresses or other data from websites. Those crawlers will simply ignore the robots.txt file, or worse: use it to figure out which URLs are forbidden and hence potentially interesting. Even if your webpages do not contain sensitive information, such crawlers can still be a major nuisance because they eat up your bandwidth. They are often written in such a primitive manner that they will download everything, including large data files.

What you need in such a case is some way to detect that a crawler is misbehaving and stop it in its tracks immediately. The trick is to create one or more special URLs that will never be visited by normal visitors. As soon as such a ‘honey pot’ page is requested, the visitor's address is added to a blacklist and is instantly served error pages for every subsequent request. There is a rather simple but effective way to implement this if you can run scripts and configure .htaccess files for your website.

Create a few special webpages in the root of your site and put ‘Disallow’ directives in robots.txt for all these pages (see the example below). The root is a good place because this is where most crawlers will scan first. Put invisible links to these URLs on a few normal pages. I recommend two places to put these links: your main page, and any page that contains links to large download files. Put the links at both the start and end of the page. You can simply open and close an <A> tag without any content in between, or with only a space or dot. To make sure that no human visitor will see the link, put it inside a DIV with the style display:none. To really make sure that the links will not be followed by a human (for instance by someone who uses a screen reader that ignores the hidden style), you can give the links obvious names like “do_not_follow_this_link.html”. Be a little creative and add some variation: if everyone used the same URLs, the bots would eventually be programmed to avoid them.
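
For instance, sticking with the hypothetical file name do_not_follow_this_link.html used throughout this page, the robots.txt entry and one of the hidden links could look like this:

User-agent: *
Disallow: /do_not_follow_this_link.html

<div style="display:none"><a href="/do_not_follow_this_link.html"></a></div>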

If you really want zero risk that a normal visitor will ever accidentally visit one of these ‘honey pot’ URLs, you can drop the links altogether and only put a ‘Disallow’ directive in robots.txt with the URLs. This will not be as effective, but it will still catch the worst crawlers of them all: the ones that intentionally abuse robots.txt in the hopes of finding the most crawl-worthy pages.

Next, create a script in Perl, Python, or whatever, that when invoked takes the visitor's IP address and creates an empty marker file inside a directory. Then configure the .htaccess file in your website's root as follows:

# mod_rewrite must be enabled; ‘/blocked_ips’ is the filesystem directory holding the marker files.
RewriteEngine On
RewriteRule do_not_follow_this_link.html$ /cgi-bin/blockmyip.pl [L]
RewriteCond /blocked_ips/%{REMOTE_ADDR} -f
RewriteRule .* - [F]

This does two things: first, it sends anyone trying to open the forbidden file to the script. Second, it checks whether the visitor's IP address exists as a marker file inside the directory ‘blocked_ips’ and, if so, serves a 403 Forbidden error page. You can precede this with other RewriteRules to allow a few pages to be visited even by blocked users, and so on. I will not go into the details of .htaccess here.
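
The blocking script itself is left open above (“Perl or Python or whatever”). As a minimal sketch, a Python equivalent of the blockmyip.pl referenced in the config could look like this; the /blocked_ips directory is assumed to match the path in the RewriteCond and must be writable by the web server:

#!/usr/bin/env python3
# Minimal CGI sketch: record the visitor's address as an empty marker file.
import os
import re

BLOCKED_DIR = "/blocked_ips"  # must match the path used in the RewriteCond

addr = os.environ.get("REMOTE_ADDR", "unknown")
# Keep only characters that can occur in an IPv4/IPv6 address, just to be safe.
addr = re.sub(r"[^0-9A-Fa-f.:]", "_", addr)

# Create (or touch) the empty marker file named after the visitor's address.
open(os.path.join(BLOCKED_DIR, addr), "a").close()

# Send a minimal response so the request completes cleanly.
print("Content-Type: text/plain")
print()
print("This page is off-limits to robots; your address has been blocked.")

The equivalent in Perl or a shell script is just as short; the only essential part is that the file name matches exactly what the RewriteCond tests for.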

I will however give a slightly more advanced variation on the above, which allows serving a custom 403 page specifically to blocked users. This is a good idea so you can explain why the user is blocked and how they may be able to contact you, in case a human visitor still managed to set off the honey pot trap despite all precautions. You could even do something fancy like allowing the visitor to unblock themselves through a captcha of some kind.

# mod_rewrite must be enabled.
RewriteEngine On
# Send anyone requesting the trap page to the blocking script.
RewriteRule do_not_follow_this_link.html$ /cgi-bin/blockmyip.pl [L]
# Always allow the custom error page and robots.txt, even for blocked visitors.
RewriteRule ^(forbidden-Crawl.html|robots\.txt)$ - [L]

# Serve a custom 403 page for the virtual ‘blocked403’ path.
<Files blocked403>
ErrorDocument 403 /forbidden-Crawl.html
</Files>
# If a marker file exists for this address, rewrite any other request to ‘blocked403’…
RewriteCond /blocked_ips/%{REMOTE_ADDR} -f
RewriteRule !blocked403 /blocked403 [PT]
# …and answer requests for that path with a 403 error, which triggers the custom page.
RewriteRule blocked403 - [F]

This is a little hack that passes the request through to a non-existing path, and defines a custom 403 page for that path. Of course we must whitelist that error page itself, and we also whitelist robots.txt, so the bot can be reminded of why it has been naughty. The line with ‘!’ is needed to avoid an infinite loop.

It may be wise to also rely on the X-Forwarded-For header, to prevent blocking an entire network behind a NAT just because one idiot behind that NAT runs a bad crawler. For instance, update the ‘-f’ line as follows, and let the blockmyip script create marker files whose names are the same underscore-joined concatenation of REMOTE_ADDR and HTTP_X_FORWARDED_FOR (or an empty string if the latter is not defined).

RewriteCond /blocked_ips/%{REMOTE_ADDR}_%{HTTP:X-Forwarded-For} -f
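
In the Python sketch above, the marker file name could then be built like this. Bear in mind that this header is supplied by the client, so it must not be trusted blindly:

import os

BLOCKED_DIR = "/blocked_ips"  # same directory as before

remote = os.environ.get("REMOTE_ADDR", "")
forwarded = os.environ.get("HTTP_X_FORWARDED_FOR", "")
# Same underscore-joined name that the RewriteCond above will test for.
marker = remote + "_" + forwarded
# The header is client-controlled: refuse anything that could escape the
# directory rather than sanitising it (a sanitised name would no longer
# match what the RewriteCond sees anyway).
if "/" not in marker and "\\" not in marker and "\0" not in marker and len(marker) < 255:
    open(os.path.join(BLOCKED_DIR, marker), "a").close()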

You should do an occasional clean-up of the marker files, for instance by deleting all files older than a month. Spam crawlers may move between different IP addresses, and if you never clean up, you will end up blocking access to an ever increasing part of the internet.
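
Such a clean-up can be as simple as the following sketch, run from a daily cron job for instance (the one-month limit is arbitrary):

#!/usr/bin/env python3
# Remove marker files older than 30 days, so that old blocks eventually expire.
import os
import time

BLOCKED_DIR = "/blocked_ips"
MAX_AGE = 30 * 24 * 3600  # seconds

now = time.time()
for name in os.listdir(BLOCKED_DIR):
    path = os.path.join(BLOCKED_DIR, name)
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE:
        os.remove(path)

On systems whose find command supports the -delete option, a one-liner like “find /blocked_ips -type f -mtime +30 -delete” does the same job.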

You can also implement this same robot trap without relying on robots.txt. This requires adding an extra layer of indirection between your real webpages and the spider trap links. Instead of sprinkling your real webpages with hidden links that point directly to the dangerous trap pages, make the invisible links point to simple static intermediate pages. These pages must have a ‘robots’ META tag with content="noindex, nofollow". This has several advantages:

If you want to go even further, you can use Project Honey Pot to block visitors based on globally gathered data about misbehaving crawlers and spammers, but it may require a bit more effort to set up than a few small scripts and a .htaccess file.

©2004-2020 Alexander Thomas