String blacklists for filtering blog spam

People have devised many methods for identifying spam. Some of these techniques are very sophisticated: Bayesian methods, neural networks, etc. The method I use for filtering spam on this blog, however, is very simple: string blacklists.

This approach has obvious downsides: the potential for false positives (good comments incorrectly classified as spam, perhaps due to the infamous Scunthorpe problem) and a high rate of false negatives (spam comments that are not recognized as such and have to be deleted manually). However, word blacklists are a built-in feature of WordPress, so I don’t have to use a paid subscription spam-filtering service such as Akismet. The simplicity and controllability of the approach are nice, too.
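To make the trade-off concrete, here is a minimal Python sketch of how a string blacklist behaves. It assumes case-insensitive matching anywhere in the comment, including inside longer words, which is my understanding of how the WordPress feature works. The function name and the sample comments are made up for illustration.

```python
def is_blacklisted(comment: str, blacklist: list[str]) -> bool:
    """Return True if any blacklisted string occurs anywhere in the comment.

    Matching is case-insensitive and ignores word boundaries (an assumption
    about the filter's behaviour, not a copy of WordPress's own code).
    """
    text = comment.casefold()
    # Skip blank entries so an empty line in the list cannot match everything.
    return any(entry.casefold() in text for entry in blacklist if entry.strip())


blacklist = ["ambien"]  # hypothetical one-entry list

print(is_blacklisted("Buy AMBIEN online", blacklist))     # spam, caught
print(is_blacklisted("I like ambient music", blacklist))  # false positive
print(is_blacklisted("Buy amb1en online", blacklist))     # false negative
```

The second comment is blocked only because the entry happens to sit inside an innocent word, and the third slips through because the spammer changed a single character.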

In the rest of this post, I will list and describe all of the string filters I use, so that other bloggers can copy them if so desired.

The single most effective set of blacklisted strings that I use is a short list of common Cyrillic characters. Since this is an English-language blog but a great deal of spam is written in Russian (or pseudo-Russian gibberish), this filter is very powerful for its small size. The particular list of characters that I use is taken from an article elsewhere on the Internet which, sadly, I can no longer find. The list is as follows:

Another common language that I receive spam comments in is Japanese. Almost all Japanese text can be efficiently filtered out with this even shorter list of characters:
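For readers wondering why such short character lists are enough, here is a rough Python sketch of the idea. The characters in it are illustrative picks chosen for this example (a few high-frequency Cyrillic letters and Japanese hiragana particles), not the lists used on this blog, and the names are hypothetical.

```python
# Illustrative characters only; NOT the lists used on this blog.
# Written as escapes to avoid confusion with look-alike Latin letters.
CYRILLIC_SAMPLE = "\u043e\u0435\u0430\u0438\u043d\u0442"  # o, e, a, i, n, t in Cyrillic
JAPANESE_SAMPLE = "のはをにがて"  # common hiragana particles and endings


def contains_any(text: str, chars: str) -> bool:
    """Return True if the text contains at least one of the given characters."""
    return any(ch in text for ch in chars)


def looks_foreign(comment: str) -> bool:
    """Flag a comment containing any sample Cyrillic or Japanese character."""
    return contains_any(comment, CYRILLIC_SAMPLE) or contains_any(comment, JAPANESE_SAMPLE)
```

The reasoning is that almost any real sentence in a language contains at least one of its most frequent letters, so a handful of characters catches most comments written in that script while never matching plain English text.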

Next, we have the medications. This is a very effective filter, but unfortunately the list has to be updated frequently, as the mix of drugs being pushed in the spam I receive changes over time. Also, cialis cannot be included since, as Wikipedia notes, it is a substring of the common word specialist; neither can ambien, which is a substring of ambient. The brand name ultram is probably safe, however, unless I start posting Warhammer 40K content.
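One way to catch collisions like these before adding a term is to scan a word list for entries that contain it. The following Python sketch shows the idea; the tiny vocabulary is a hypothetical stand-in for a real dictionary file, and the function name is my own.

```python
def collisions(term: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary words that contain `term` as a proper substring."""
    needle = term.casefold()
    return [
        word
        for word in vocabulary
        if needle in word.casefold() and word.casefold() != needle
    ]


# Hypothetical stand-in for a real dictionary file.
VOCABULARY = ["specialist", "ambient", "ultramarine", "comment", "filter"]

for term in ["cialis", "ambien", "ultram"]:
    print(term, collisions(term, VOCABULARY))
```

A term with no collisions in a large word list is not guaranteed to be safe, but a term with collisions in common words is a clear sign it will cause false positives.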

Next up, we have distinctive phrases that occur in certain fixed spam messages that get posted over and over again. This filter is not very effective in the long run, since the particular spam messages tend to change over time, but as a short-term fix to get rid of individual really persistent spammers, it can work pretty well.

(Yes, the phrase “going to put you in the freezer as punishment” was actually present in a spam comment I received over and over again for a while several years ago. It’s from a joke about a guy putting his pet parrot in the freezer. Look it up if you are really desperate to know.)

Similarly, I also have a small pile of fixed URLs and website names that get spammed over and over again for a period of time. I’m not going to list them here, since including them seems likely to get this site banned from search engine results. Besides, they usually only work as filters for a short period of time before the spammers move on to greener pastures.

On the other end of the spectrum, we have common, widely used phrases that occur frequently in spam from many different sources but are unlikely to appear in legitimate comments relevant to the content on my blog. This is a particularly tricky category, since these phrases could easily occur in genuine comments if the subject matter of my blog strays too far into certain territory. Because of this, I only have two phrases of this sort blacklisted at the moment:

The largest category of filtered strings that I have at the moment is types and brand names of products that the spam purports to offer for sale at cheap prices. This is another category where a level of care is required, because it would be easy to accidentally filter out a legitimate comment that just happens to mention one of these items. Here is the list I am currently using:

And last but certainly not least, we have the vices. These should be reasonably safe to filter out as long as my blog doesn’t get too, uh, spicy.

And there you have it. A few simple word filters can catch the majority of the spam comments this blog receives. Not bad for what it is.

