A plam for splam

A few weeks ago I wrote a fun plugin to fight spammers everywhere; I call him Splam. I thought I wrote this up somewhere, but I can’t seem to find the article, so I must have dreamed it. I soft-launched it on GitHub, so those of you following my GitHub profile will have seen some commits.

Splam is a “Simple, pluggable, easily customizable score-based spam filter plugin for Ruby-based applications”. I couldn’t find any other Ruby projects outside of Defensio and Akismet, both hosted services, so while you might say, “but those work perfectly well!”, you can run this locally and get instant feedback. Install it as a plugin, and include it in your ActiveRecord model (or a PORO) like so:

class Comment < ActiveRecord::Base
  include Splam
  splammable :body
end

Easy, right? Splam works by looking at the field and applying a set of rules. Some of these rules are pretty simple; most forum/comment spam is pretty simple, too. For example, it looks for words like "porn", "erotic" or "viagra", and gives 10 points for each of these. Then it looks at links in the body, and gives another 20 points each time one of those words appears in the link text. (Actually, I modified the code so it gives 10^n rather than 10*n points, where n is the number of times a word appears. That means that each time you use a banned word, the score grows exponentially, and so does the chance it gets flagged as spam.)
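To make the scoring concrete, here is a rough sketch of a keyword rule along those lines. It is not Splam's actual code: the names BANNED_WORDS, count_hits and keyword_score are made up for illustration, and the link regex is a deliberately naive assumption about what a link looks like.

```ruby
# Hypothetical sketch of a keyword rule; not taken from Splam's source.
BANNED_WORDS = %w[porn erotic viagra].freeze

# Count whole-word, case-insensitive occurrences of word in text.
def count_hits(text, word)
  text.scan(/\b#{Regexp.escape(word)}\b/i).size
end

# Returns [total_points, reasons], where reasons maps a description to points.
def keyword_score(body)
  reasons = {}

  # Exponential scoring: 10 to the power of the number of hits per word.
  BANNED_WORDS.each do |word|
    hits = count_hits(body, word)
    reasons["'#{word}' in body"] = 10**hits if hits.positive?
  end

  # Naive link-text extraction (assumption: simple, well-formed <a> tags).
  link_texts = body.scan(%r{<a\b[^>]*>(.*?)</a>}im).flatten

  # A flat 20 points for every banned word found inside link text.
  BANNED_WORDS.each do |word|
    hits = link_texts.sum { |text| count_hits(text, word) }
    reasons["'#{word}' in link text"] = 20 * hits if hits.positive?
  end

  [reasons.values.sum, reasons]
end
```

With these made-up numbers, one banned word costs 10 points, two copies of the same word cost 100, and three cost 1000, which is why repeats get expensive so fast.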

Some of the other rules target the idiots who try to spam your Ruby forum with bbcode: [url= or [b]. Then it gives you points for each Chinese character, and more points for Russian glyphs. It looks for bad HTML (a href=http://...) as well as extra-long lines, sentences with lots of words and no punctuation, too many words with lots of letters, and so on.
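Rules like these boil down to a pattern plus a point value. The sketch below is an illustration only: the RULES table, the markup_score name and every point value are assumptions of mine, not what Splam ships with.

```ruby
# Hypothetical pattern rules; names and point values are assumptions.
RULES = {
  "bbcode tag"         => [/\[(?:url=|b\])/i, 40],        # [url= or [b]
  "Chinese character"  => [/\p{Han}/, 3],                 # points per character
  "Cyrillic character" => [/\p{Cyrillic}/, 5],            # a bit more per glyph
  "unquoted href"      => [/<a\s+href=(?!["'])/i, 30]     # <a href=http://...>
}.freeze

# Returns a hash of rule name => points for every rule that matched.
def markup_score(body)
  RULES.each_with_object({}) do |(name, (pattern, points)), reasons|
    hits = body.scan(pattern).size
    reasons[name] = hits * points if hits.positive?
  end
end
```

Keeping the rules in a table like this makes it easy to tweak a point value or add a pattern without touching the scoring loop.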

Splam gives you a spam? boolean, a splammable_score score, and the splam_reasons why it marked something as spam, along with the points for each infraction.
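If you're wondering how a mixin can hand those three methods to any object, here is a toy stand-in. To be clear, ToySplam is a made-up module, not Splam itself: it borrows the method names from above, but the return shape of splam_reasons, the banned-word list and the threshold are all my assumptions.

```ruby
# ToySplam is a hypothetical stand-in, NOT the real Splam plugin.
module ToySplam
  THRESHOLD = 100                          # assumed cut-off: above this is spam
  BANNED = %w[porn erotic viagra].freeze   # assumed word list

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    attr_reader :splammable_field

    # Class macro: remember which field should be scored.
    def splammable(field)
      @splammable_field = field
    end
  end

  # Hash of reason => points for each infraction found in the field.
  def splam_reasons
    text = public_send(self.class.splammable_field).to_s
    BANNED.each_with_object({}) do |word, reasons|
      hits = text.scan(/\b#{Regexp.escape(word)}\b/i).size
      reasons["banned word '#{word}'"] = 10**hits if hits.positive?
    end
  end

  def splammable_score
    splam_reasons.values.sum
  end

  def spam?
    splammable_score > THRESHOLD
  end
end

# Works on a plain Ruby object too; no ActiveRecord needed.
Comment = Struct.new(:body) do
  include ToySplam
  splammable :body
end
```

With this toy version, Comment.new("viagra viagra viagra").spam? comes back true, and splam_reasons tells you which word did it.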

I originally wrote Splam for Tender Support, the new app we've been building at entp. I've been tweaking it until it gets zero false positives compared with Defensio, Akismet AND our sturdy human spam checker, Will the Defender, on the complete set of support/help requests in the Lighthouse project. Interestingly enough, I had to add a set of "good words" that take spam points away (things related to our business).
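The "good words" idea is just the banned-word rule with the sign flipped. A minimal sketch, assuming a hypothetical GOOD_WORDS table (the words and point values here are invented, not our real list) and assuming the score is allowed to go negative:

```ruby
# Hypothetical good-word table; words and values are made up for illustration.
GOOD_WORDS = { "lighthouse" => 30, "ticket" => 20 }.freeze

# Subtract points for every whole-word, case-insensitive good-word hit.
def adjust_for_good_words(score, body)
  bonus = GOOD_WORDS.sum do |word, points|
    body.scan(/\b#{Regexp.escape(word)}\b/i).size * points
  end
  score - bonus
end
```

So a message that picked up 120 points elsewhere but mentions both words once would drop to 70 and slip back under the threshold.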

In this way, you can see Splam as this horrible manual system with no training ability outside the code. It's an arms race, but I think we're not up against a particularly clever enemy[1]. I'd really like to add some clever Bayesian magic to it, but since it works well enough for me right now, I'm gonna throw it down to you guys. I'd also like to make the points themselves a percentage rating (adding % chance that it's spam) rather than an absolute (>100 points, and it’s spam).
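One possible way to get from points to a percentage is to squash the score through a logistic curve centred on the current 100-point cut-off. This is only a sketch of the idea, not something Splam does: spam_percentage and its spread parameter are hypothetical, and the result is a rough rating, not a calibrated probability.

```ruby
# Hypothetical: map an absolute score to a 0-100 rating with a logistic curve.
# threshold is the score that should read as 50%; spread (an assumed tuning
# knob) controls how quickly the rating climbs around that point.
def spam_percentage(score, threshold: 100, spread: 25.0)
  (100.0 / (1.0 + Math.exp(-(score - threshold) / spread))).round(1)
end
```

A score sitting exactly on the threshold maps to 50.0, with lower scores heading towards 0 and higher ones towards 100.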

Splam has a test suite, so you can check it out, put some of your corpus in individual text files in test/fixtures/comment, and send me a diff or pull request with anything that gets incorrectly marked as spam or ham.

Splam is at http://github.com/courtenay/splam/tree/master.

[1] This isn’t really meant to be a challenge to spammers; interestingly enough, most of the spam we get is just mass-blasted crap that isn’t really targeted at all. We were talking in the company Campfire about how you could really make a bunch of money out of “spam”; if you’re intent on selling shady drugs, targeting would work better than blasting sex enhancers at everyone: peddle nootropics to programmers, HGH to competitive cyclists, and so on. You could probably build some clever Markov chains to interact on forums, leading people back to your own site where you start the pitch.