String blacklists for filtering blog spam

There are a lot of methods people have devised for identifying spam. Some of these techniques are very sophisticated: Bayesian methods, neural networks, etc. The method I use for filtering spam on this blog, however, is very simple: string blacklists.

There are obvious downsides to this approach, such as the potential for false positives (good comments that are incorrectly classified as spam, perhaps due to the infamous Scunthorpe problem) as well as the high rate of false negatives (spam comments that are not recognized as such and have to be deleted manually). However, word blacklists are available as a built-in feature of WordPress, so I don’t have to use a paid subscription blog spam filtering service such as Akismet. Also, the simplicity and controllability of the approach are nice.

In the rest of this post, I will list and describe all of the string filters I use, so that other bloggers can copy them if so desired.

Continue reading String blacklists for filtering blog spam