Comments is comments, mostly
I’ve been muttering about this in my comments, and a few other people’s, long enough that I thought I should bring it up into an entry. I’m a comment slut. I always have been. If I give you three lines of code, and it changes your world, and makes you stop drinking (or, better yet, start), I love hearing about it. If I’m wrong as wrong can be, I love hearing about it. If you just haven’t said hello for a while, I’m delighted to hear from you in a comment.
Commenting is also a great way to get yourself known among webloggers. If you write something interesting, or just write well, in someone’s comments, I’m quite likely to click your link, to see what else you have to say. I found a good share of my favorite blogs that way, following links from comments. I don’t want to see that go away. I also don’t want to see the tiny bit of Google-juice that comes from leaving a comment and a link go away: most of the best content around here is in the comments, and a little bump in Google is hardly payment enough for all you do for me.
But, the inevitable but, I’ve been noticing lately that there are quite a few people who are spamming comments, only with a link to their blogs rather than to a casino, a pill, or a naked woman. If you only have six words to say to me, that’s cool, but if I then see that someone in my blogroll has just updated, and I go over there only to find that you had six similar words to say to her, and searching for link:yoursite.com shows that there are dozens or hundreds of links to your blog, all coming from six-word, nothing-much comments, I’m going to feel used, and even a slut doesn’t like to feel used.
Despite having hated the MT 2.661 shoved-down-your-throat redirect so much that I copied the previous version back into my upgraded file, I’ve installed David Raynes’ Optional Redirect plugin (with a couple of fixes I left in his comments), so that I can get a redirect link from <MTCommentAuthorLink redirect="1">. And despite having little faith in the ability of a Bayesian filter to tell comment spam from real comments in general, I’ve installed James Seng’s MT-Bayesian plugin/hack, so that the filter and I, working together, can call some comments spam, and others ham. Then, with <MTIfSpam>, if one or the other of us (hard to tell which, isn’t it?) doesn’t think highly of your comment, you get redirected rather than directly linked. At the moment, it seems to mostly think that any stranger is spam, and anyone who has left a non-spam comment is ham, which works for me as a first approximation: I usually remember to train it whenever someone comments, so at most you get redirected for a few hours until I introduce you.
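For anyone who thinks better in code than in template tags, here is a rough sketch of that decision in Python. It is purely illustrative: the function name, the redirect URL, and the example addresses are assumptions, not anything taken from MT, the Optional Redirect plugin, or MT-Bayesian. A comment classed as ham gets a direct link, and anything classed as spam gets a link routed through a redirect script.

```python
from html import escape
from urllib.parse import quote

# Hypothetical redirect endpoint; the real plugin's URL format may differ.
REDIRECT_BASE = "https://example.com/mt/redirect.cgi?url="


def author_link(name: str, url: str, is_spam: bool) -> str:
    """Return an HTML link for a comment author.

    Ham gets a direct link; anything flagged as spam is sent
    through the redirect script instead.
    """
    if is_spam:
        href = REDIRECT_BASE + quote(url, safe="")
    else:
        href = url
    return f'<a href="{escape(href, quote=True)}">{escape(name)}</a>'


print(author_link("Sol", "http://example.org/", is_spam=False))
print(author_link("Stranger", "http://example.net/", is_spam=True))
```

Either way the reader can still click through; only the form of the link changes.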
If you find yourself redirected for more than just a few hours, then things get interesting. Anyone who knows me at all will of course scream at me, throwing things, and threatening flying monkeys until I get on the stick and tell MT-Bayesian not to be so foolish, but if you don’t know me, and do notice, you’ve got a delicate situation. Did I just miss it? Did I think you were just self-promoting? If you mention it, will I give you your direct link, plus a post extolling the brilliance of your last several entries? Or, will I slip you into my blacklist as a poison pill of sorts, and call you out in public, insisting that you wear a scarlet S for all time? I’d recommend leaving another comment, one so obviously relevant, brilliant, and useful that I’ll be ashamed of my unseemly thoughts. As I said, that’s where it gets interesting :)
This Comment’s Just Six Words Long :D
So is this not the time to tell you about my online casino, cheap viagra, and hot xxx action? I thought by doing so I’d be providing a friendly service, but I guess not. Let me know if you’re interested, though.
Damn cheap filter, should have redirected the both of you :(
It must have treated “comment’s” as “comment is”. That must be it. Made it seven words. And of course, it has no idea what Viagra is. V1agra, /14gra, sure, but Viagra?
This isn’t going to help its education, is it? Heh.
Hi Phil! We miss you in Whole Wheat World. Best of luck on your Spaminator crusade!
Hey, Sol! I’m coming back, I swear I am, but every time I start over that way, something comes up that either wants all my connection bandwidth, or all my mental bandwidth.
The upside, though, is how many wonderful new surprises (musical and code) there will be by the time I do get back.
Huh. Odd. I thought that 50% (which is what it called both of you) amounted to <MTIfSpam>, but apparently not, since neither of you were redirected until I told it Adam was a dirty spammer (no sense in telling Google that you are the lyrics for a Weird Al song, anyway ;)). Might have to do a little altering, either in the plugin or in the weighting, since what I mostly wanted it for was to redirect questionable things that came in while I was sleeping, so that I could get to them before Googlebot did.
I really like the spam vs ham idea. Will this filter be available to us plebs?
I’ve resorted to turning comments off on posts that don’t warrant a conversation, or that I don’t particularly want other people’s opinions on.
But being able to label a comment as spam or ham, and then treat it appropriately (delete, KILL KILL KILL!) would be wonderful.
hey phil have you heard of spamnet? its a filter for email spam and it works on the general concept of distributed computing. the idea is hundreds of people download and install this email client plugin [i think they only support outlook right now (poo on that)] and when they get email it reads through the email and checks with a master list to see if its spam. the master list is updated by the users. so when a new piece of spam is released the filter wont catch it. in that case you select the email and hit the ’this is spam’ button. spamnet learns this and then from then on knows its spam.
isnt there a way we could apply this to comment spam? if not in a general, open standard kind of way, at least in an MT client kind of way?
just food for thought, i dont have the skill to try and do this myself [yet ;)] otherwise id give it a shot.
as soon as i hit that post button i thought of a better way to explain that…wheres the edit button when you need it?
so basically what im thinking is instead of a local blacklist [like the plug-in i know someone made] it would be nice to have a remote blacklist everyone could contribute to. sound feasible at least?
Well, I think something like that (only peer-to-peer, rather than centralized) is where Jay’s headed with MT-Blacklist. But for the most part, I’m not sure I want to go there, at least not automatically. You can share your blacklist, and import other people’s blacklists, and I tried it with two people’s, both people whose technical acumen and good sense I trust, and each of them had one entry (out of six or eight hundred, mind you), that I very much did not want to have blocked. One was a difference of opinion, the other was a poison pill (where a spammer induces you to include something that you shouldn’t include). I subscribe to the RSS feed of Jay’s additions to the master list, because I trust Jay to think very carefully before he adds something, but I don’t think I trust anyone else to think that carefully, and that widely about the ramifications of adding something. I know I’ve got a few URLs blacklisted that other people shouldn’t necessarily block.
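A minimal sketch of the "import, but not automatically" idea, in Python. The function name, the `never_block` list, and the example domains are all hypothetical; this is not how MT-Blacklist itself imports lists. Imported entries are merged in unless they appear on a local list of things that must not be blocked, and the skipped ones are returned for a look by hand.

```python
def merge_blacklists(mine, imported, never_block):
    """Merge an imported blacklist into mine, minus entries I refuse to block.

    Returns (merged, skipped) so the skipped entries can be reviewed by hand.
    """
    veto = set(never_block)
    skipped = sorted(set(imported) & veto)
    merged = sorted(set(mine) | (set(imported) - veto))
    return merged, skipped


mine = ["casino.example"]
imported = ["pills.example", "friend.example"]  # one entry I disagree with
merged, skipped = merge_blacklists(mine, imported, never_block=["friend.example"])
print(merged)
print(skipped)
```

This only catches entries already known to be a problem, so it does not replace reading through someone else's list before trusting it.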
hmmmm, point taken. it seems the spammers are getting closer and closer to hitting my current posts..ima have to give the current blacklist plugin a go
or I could just download both plugins, implementing the fixes you mention in the other comments.
d’oh!
Yep. It’s not quite a simple turnkey thing, since MT-Bayesian’s fairly complicated to install, especially if you’re using MySQL, and have to modify your database, but none of it’s really hard. Just more work than dropping one file in the plugins directory.
And I think now it should work the way I thought it was, going forward.
{mt dir}/lib/MT/Bayesian.pm has a $threshold_spam variable that’s 0.9 by default, and a new comment has a probability of 0.5 to start (and stays there, if there aren’t enough ham-or-spam words it knows about to change its mind). That works fine for the normal use, where with spam you don’t show it at all, or show it with a big red Border Of Shame like James does, but for my subtle alteration, I think setting that to 0.49 (or maybe lower, as I get more ham-words in the training corpus) should do the trick. It treats URLs and email addresses as very important, so anyone that is introduced as being a good person should be able to get away with a few spammy words, without going over, and even if they do, it’s a minor penalty for a short time.
So you are really committed to collecting enough spam to train MT-Bayesian? My hat’s off to you.
My total spam count over the past 15 weeks (which is to say, since I first noticed comment spam on my blog and decided to do something about it) is 5 spam comments.
Not enough to train a cocker spaniel, let alone MT-Bayesian.
I’m still thinking about the <MTCommentAuthorLink> redirect. In principle, it’s the right bit of social engineering until Google catches up and starts punishing the comment spammers (there are rumblings that this has begun to happen). In practice …. ?
In practice? Baby, bathwater. The price is too high, for me.
Actually, no, I’m not collecting that much. More than you, because I’m not quite as scary on preview, so I get a few more hand-entered ones. But all I really want from MT-Bayesian is an over-complicated whitelist, which it seems to be providing, so far. If it knows someone, like it does you, then it recognizes your URL and email, and gives you major ham-points for them. The rest? It would be nice if it learned enough words to really know a few things are spam without help, but as long as it remembers that strangers are bad until proven otherwise, and there isn’t any easier way to do that, I’ll keep it around.
I certainly don’t expect miracles from it, like I do with email, since most words in spam comments don’t matter in the least (I saw someone who was already blocked by MT-Blacklist trying to get in by using a single period or comma as his anchor text, not realizing that the problem was URLs, not words), and I’m not married to it the way I am with POPFile. Just experimenting, and enjoying what it’ll do so far.
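The numbers mentioned in this thread (a new comment sitting at 0.5, a spam threshold of 0.9 by default, lowered to 0.49) can be illustrated with a small Python sketch. This is a generic naive-Bayes combination under stated assumptions, not MT-Bayesian's actual code: the token names, the per-token probabilities, and the `>=` comparison are all made up for illustration.

```python
import math

DEFAULT_THRESHOLD = 0.9   # default value the thread gives for $threshold_spam
LOWERED_THRESHOLD = 0.49  # lowered value suggested in the thread


def spam_probability(tokens, token_probs):
    """Combine per-token spam probabilities, naive-Bayes style.

    Tokens the filter has never seen are skipped, so a comment made
    entirely of unknown tokens stays at the neutral 0.5.
    """
    known = [token_probs[t] for t in tokens if t in token_probs]
    if not known:
        return 0.5
    spam = math.prod(known)
    ham = math.prod(1.0 - p for p in known)
    return spam / (spam + ham)


def is_spam(tokens, token_probs, threshold):
    # Assumption: a score at or above the threshold counts as spam.
    return spam_probability(tokens, token_probs) >= threshold


# Hypothetical training state: a known friend's URL and email are strongly
# ham, and one word is strongly spam.
token_probs = {
    "url:http://example.org/": 0.01,
    "email:friend@example.org": 0.01,
    "casino": 0.99,
}
stranger = ["hello", "url:http://example.net/"]
friend = ["url:http://example.org/", "email:friend@example.org", "casino"]

print(is_spam(stranger, token_probs, DEFAULT_THRESHOLD))
print(is_spam(stranger, token_probs, LOWERED_THRESHOLD))
print(is_spam(friend, token_probs, LOWERED_THRESHOLD))
```

With these made-up numbers, a stranger stays at 0.5 and is only flagged once the threshold drops below that, while a known URL and email outweigh a single spammy word.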
This seems like much more fun than my method, which is to replace the topic text with “[Removed (off-topic)]” or “[Removed (spam)]” (depending on my whim), delete the URL, change the email address to “x@y.com”, and truncate the author name to four characters.
But what’s to stop me leaving a comment as Jacques (thereby stealing his URL and email ham points) and linking to one of my own posts? Come to think of it, isn’t it surprising that there appears to have been so little forging of other people’s identities in weblog comments?
Shhhhh!
Well, really, wholesale spammers mostly don’t expect HTML to be enabled in posts, so the URL field is all they use, and it wouldn’t do them any good. But a retail spammer who really wanted to slip in a comment while I was asleep, hoping Googlebot would get there first, would do well to borrow someone’s URL.
As to forging, yep, it’s odd. I’ve seen Dave Winer forged a few times, but other than that, well, if I’ve seen it, I haven’t recognized it.
I’d like to think: ethics, ego, respect —
but in reality, probably just no one thought about it before now.
PGP signing of email is relatively common, I guess PGP signed comments would be good for more than one reason.
And of course, pb’s ahead of us, by a year and a half or so. At the time, I wasn’t sharp enough to realize that his approach, not doing any verification, just making the signed comment available in case anyone else wanted to check it, was actually a very slick way to get PGP’s foot in the door. If you know what it means, and how to go about it, you can easily sign your comments, or verify comments for people with either a posted public key or a key you already trust, and otherwise it stays completely out of the way. To do, to do, to do.