<Searching where I’m not hated>

As (Google employee) Nelson Minar notes, sometimes changing URLs by returning a 301 Moved Permanently with the new URL works out fine with search engines: if you search Google for [for fear they’ll lose visibility in search engines], his post at his new URL is the first result.

And as (Google employee) Matt Cutts notes, sometimes it doesn’t work out quite so well. Back at the end of October, when I switched to WordPress, I moved my weblog to the subdomain “weblog.” and changed to WordPress’s default URL scheme, with the words from the post title separated by hyphens (because Underscores are bad (for search engines)) and without the “.php” extension (because, among other things, it makes search results for [foo php] horribly cluttered with things only published using PHP, not about PHP), and redirected all my old URLs to the new ones. Now when you search for a set of words I’ve used in a post, like [when Gmail first added Atom feeds], you will not find my post.

If you want to find me in your Google search results, you need to do one of two things: remember that the parameter to include “duplicate results” is &filter=0 and add that to your query URL, or work your way to the end of the results (in this case, that would be around result 747, despite their claim to have 412,000 results) and click the link to include omitted results. Then, once again work your way out to the end of the results (now at result 976, despite the complete and utter lie of “412,000 results”), and you’ll find me: the last damn thing they think anyone would ever want, with my useless front page, my worthless permalink, my crap-filled paged archive (now blocked off in robots.txt, because there’s really no value a search engine can extract from that view), and my spammy HTMLized version of the Atom autodiscovery RFC (which at least serves the purpose of increasing my spam load).

The other, and faster, way to find me in Google results, searching for the words as a phrase, might hold a partial clue to why I’ve become persona non grata in Google’s view: [“when Gmail first added Atom feeds”] returns one result, from Sameer D’Costa’s aggregator. Include duplicates on that search, and you’ll also see a couple more Gregarius installs, along with my own worthless results. You won’t see my Gregarius install, though, because I know search engines hate duplicate content, so in Gregarius’s Admin/Config I set rss.config.robotsmeta to “noindex, follow” so that every page gets a <meta name="robots" content="noindex,follow" /> to tell search engines not to index my aggregator as duplicate content, though they are welcome to discover any URLs they don’t already know about. I don’t have any way of knowing whether Google first decided I was garbage, and then decided that made me a duplicate of other people aggregating me, or decided I was garbage because I was a duplicate of people duplicating me, but I know it isn’t helping: the only time you should let search engines index the output of an aggregator is when that’s the only place the feeds appear as HTML. Otherwise, it’s duplicate content, and that’s going to wind up hurting someone, whether it’s you or the source.
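
For the curious, here is a minimal sketch of how a well-behaved crawler might read that robots meta tag. It is not Gregarius's or any search engine's actual code; the names RobotsMetaParser and robots_policy are made up for illustration, and only Python's standard library is used.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; whitespace and case do not matter.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_policy(html):
    """Return (may_index, may_follow) as a polite crawler would read them."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = not ({"noindex", "none"} & parser.directives)
    may_follow = not ({"nofollow", "none"} & parser.directives)
    return may_index, may_follow


# Hypothetical page carrying the tag described above.
page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
print(robots_policy(page))  # (False, True): do not index, but do follow links
```

A page with no robots meta tag at all comes back as (True, True), which is why an aggregator that says nothing ends up indexed by default.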

Now, maybe it’s just me, but I really don’t like getting fucked by someone who spends the whole time punching me and swearing at me and insulting me. If that’s the way Google feels about me, I’d rather just end our relationship. That takes two things. First, I went to about:config and changed keyword.URL to http://myweb2.search.yahoo.com/search?ei=UTF-8&p= so typing search terms in the address bar will take me to Yahoo! rather than doing a Google “I’m feeling lucky” search, changed browser.search.defaultenginename to Yahoo so that selecting text and choosing “Search Web for” from the context menu would search at Yahoo, and changed browser.search.order.1 to Yahoo so Google wouldn’t be on top. I should probably delete the google.src plugin file, too, but for now I’m not quite willing to burn every last bridge. (And I didn’t switch to MSN Search because they make Google look like the kindest and gentlest of lovers: the only way you’ll find me in any results there is to include weblog.philringnalda.com in your search terms. Otherwise, I simply don’t exist.)

Then there’s the other half of breaking up: taking away Googlebot’s and msnbot’s keys to my house. Despite the fact that Google just flat out hates me, they’ve requested 6226 pages from me already this month (msnbot has only requested 1836, but they’ve requested robots.txt 104 times, to Googlebot’s 31 — maybe they know what’s coming).

That ain’t free: to keep my host from pushing me into a dedicated server, I had to install a caching plugin, and then reconfigure it to an even longer cache time to prevent misses when they came back for more, just for search engines, which are the only things that ever look at most old weblog posts (well, along with confused and misled searchers, if you happen to be returned as something other than the very last result after duplicate results are included). But, block Googlebot? That seems pretty serious, somehow, even though most of the “visitors” they deliver are just comment spammers (one of the clues that I’d become invisible in search engines was just how little comment spam I’m getting).

So, just like every abused woman on a crappy talk show that you want to shake some sense into, I’ll give Googlebot and his buddy msnbot another month, hoping he’ll change his abusive ways. After all, maybe one of his friends is reading this, and can talk some sense into him.


Comment by Sameer D'Costa #
2006-01-08 18:05:59

Oh, I thought I was just helping you out by sending less traffic to your blog. :)

Anyway, I shall change my robots meta in Gregarius to “noindex, nofollow”. Maybe we should change the default value in the next version… I wonder…

Comment by Phil Ringnalda #
2006-01-08 20:19:51

But if you go noindex, and so does Eric Pierce, and the person whose name I can’t puzzle out of lomaji.com, will I suddenly exist again, or do I now only exist insofar as I’m reflected by you three?

The nice thing is that having thought enough about how I feel about being ostracized by Google and MSN Search, I really don’t care. It would be sort of interesting to find out, or not, either way. I’m not losing a thing as it is: the same friends will comment, or on rare and wonderful days link, whether or not the faceless and silent masses of search engine visitors come by looking for their “blogger templates” and “http 500 errors.”

I never looked, and now I can’t, but I doubt that I’ve gotten more than a couple dozen non-junk comments from search engine visitors over the years. If I never again get a comment from someone who wants me to create them a Yahoo Messenger account with an illegal character, or to let them see their friends while they want to be invisible, well, that’s a win for me. And that’s why I’m fairly serious about just saying “Disallow: /” if things don’t improve. Last month, Googlebot, msnbot, and Ask Jeeves (which isn’t even able to comprehend a redirect, apparently) amounted to 51,611 hits with not one shred of benefit for me. Turn them away, and I could almost afford to let the crazy Italian who has been getting a 503 on my feed once a minute for months now back in.

Comment by Phil Ringnalda #
2006-01-08 20:23:58

Um, though, none of that should be considered in any way a vote against changing the default: index, follow is wrong, it should absolutely be noindex, follow. Duplicate content isn’t a very directed weapon, it could just as easily cut the other way, and spill over from /rss/ to /weblog/. Installing Gregarius shouldn’t mean dropping the people you subscribe to out of search engines, and it sure shouldn’t mean dropping yourself out, either.

Comment by Sameer D'Costa #
2006-01-08 21:05:42

Let me file a ticket for changing the default to “noindex,follow” and let us see if anyone else has an opinion.

Personally I have set noindex,nofollow for two reasons.

  • I have lots of feeds with spam links in them (babysitting a wiki does that) and I would not want search engines to follow those links.
  • Search engines crawling an installation will exact a performance toll, especially if you have many tens of thousands of items in your db and permalinks enabled. (DreamHost wouldn’t like that.)

Comment by Phil Ringnalda #
2006-01-08 21:13:20

If you want them to stay out completely, you’re probably better off with robots.txt instead, so that they only request one file of a few hundred bytes, rather than requesting random pages that they already know about, “just to see if they still say not to index them.” And I’d be very surprised if anything is obeying your /rss/robots.txt — one of the unfortunate things about the robots.txt pseudostandard is that it only calls for one file, at the root of the domain.
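
A rough sketch of both points, using Python's standard urllib.robotparser. The host example.org, the rules themselves, and the helper names robots_url and allowed are all assumptions made up for illustration, not anyone's real configuration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Crawlers look for robots.txt only at the root of the host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical root robots.txt: shut two crawlers out entirely and keep
# everyone else out of an aggregator living under /rss/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: msnbot
Disallow: /

User-agent: *
Disallow: /rss/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the rules above let this user agent fetch the path."""
    return rules.can_fetch(agent, path)


# A page inside /rss/ still maps to the root file, never /rss/robots.txt.
print(robots_url("http://example.org/rss/index.php"))
print(allowed("Googlebot", "/weblog/"))
print(allowed("SomeFeedReader", "/rss/index.php"))
```

The point of the sketch is the first function: whatever page a crawler starts from, the only robots.txt it asks for is the one at the root, so rules for an aggregator in a subdirectory have to live there too.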

Comment by Aristotle Pagaltzis #
2006-01-08 19:31:03

“in this case, that would be around result 747, despite their claim to have 412,000 results”

Google never shows more than 1,000 results. It’s a hard limit.

Comment by Phil Ringnalda #
2006-01-08 20:05:22

Yeah, I know that it makes sense in a way, because they do something along the lines of selecting the 1,000 best results, then removing the 24 which they will not show under any circumstances, then removing the 229 that they consider duplicate content. None of my business why they can’t just select 1,000 things they like, and since it’s rare for anyone to look at 11-20, who cares whether they organize the world’s information, or organize the top quarter of a percent of the world’s information? With nobody wanting more than a tenth of a percent of that quarter of a percent, what do the numbers matter? It’s just a measure of how very popular and exclusive the club at the top is.

And I suppose I should be honored to be (the very last) in the top 1,000 that they like before they think about it, rather than being in the bottom 411,000. Assuming the bottom 411,000 actually exists, a fact of which I have no more knowledge than someone in an exclusive New York nightclub has of the existence of people shopping in an Ohio Wal-Mart. It’s just that when you’re the last one allowed in, to sit at a table that gets whacked by the kitchen door every time a waitress goes in or out, you can’t help but start thinking about all those people from Ohio that never have a moment more exclusive than scratching off a winning Monopoly piece at McDonald’s.

Comment by Mike Mariano #
2006-01-08 19:37:37

Aw, Google Blog Search still loves you! It displays entries both old and new, right up to and including this one.

And of course, everyone uses Google Blog Search! It’s orange! And has “beta” in the logo! So that means it’s like the Google of the future!

Comment by Mike Mariano #
2006-01-08 19:40:37

Let’s pretend I can link correctly:

[when Gmail first added Atom feeds]

Comment by Phil Ringnalda #
2006-01-08 20:34:44

Another point I meant to get to, but ran out of room and steam too soon. Anyone that I really want to find, or be made aware of, one of my posts is quite likely to be cunning enough to search somewhere like Technorati or Google BS, or to pick up the post in a PubSub subscription feed, or something else that doesn’t involve my being one of the first anointed ten results from Google.

At least, as long as weblog search engines either make duplicate content decisions based on (accurately) determining what they first saw, or just flat out have lousy spam filters, I’ll still be somewhat discoverable to the people who really matter to me.

Comment by Jeremy Zawodny #
2006-01-08 19:52:32

Doesn’t it strike you ass odd that you’d change your software because some broken search engines can’t cope with it? Whose problem is that, really?

Comment by Phil Ringnalda #
2006-01-08 20:28:16

Earlier mental drafts of this post emphasized that a lot more, the changes I’ve made to both WordPress and Gregarius and the way I write titles and choose words for their effect on search engines, in part because I knew I could use it as a hook to try to get you to link to it.

Now, how incredibly twisted is that, given that I already knew that the primary consumer and valuer of links already thinks I’m scum?

Comment by Jeremy Zawodny #
2006-01-09 07:53:34

BTW, shortening Google Blog Search to Google BS is … well, I like it on several levels.

Comment by Aristotle Pagaltzis #
2006-01-09 10:26:00

Much like your “ass odd” typo. :-)

Comment by Phil Ringnalda #
2006-01-09 10:42:33

Okay, I give up, <kbd>-boy: smilie conversion is off ;)

Comment by Aristotle Pagaltzis #
2006-01-09 12:48:46

Whee. :-D

(Actually, I’m still reflexively trying to <tt> that, because smileys in proportional fonts, I’ve always thought, just look weird.)

2006-01-09 14:38:00

[…] phil ringnalda a digital magpie « <Searching where I’m not hated> […]

Comment by Phil Ringnalda #
2006-01-09 14:48:13

Interesting. I wonder if that’s a known “link at the end of a paragraph doesn’t get excerpted properly” bug.

Comment by didier #
2006-01-09 16:28:51

That’s “normal” behavior for WordPress as far as I know. Is HTML even allowed in excerpts?

Comment by Phil Ringnalda #
2006-01-09 16:47:55

Well, no, normal behavior would be to take a chunk of text around the link that caused the pingback to be sent (in that case, the phrase “Underscores are bad”), but this one apparently choked on excerpting, and settled for concatenating the blog name and post title (misusing a quotation mark for glue, something I’m going to have to start bitching about now that I’m a Yahoo! Search user).

Comment by fluffy #
2006-01-22 10:03:55

301 seems to work fine as long as the original URLs are still current, but as soon as the original URLs go away, Googlebot seems to forget about that. I recently changed domains (from trikuare.cx to beesbuzz.biz), and for a long time I was enjoying being the #2 search result for “fluffy.” So for the last two months I had trikuare.cx generating 301s to the appropriate new URL at beesbuzz.biz for Googlebot (and just putting up an explanatory interstitial page for everyone else, with the hope people would fix their links), and beesbuzz.biz was at #6, which was fine, but then as soon as the domain expired, beesbuzz.biz dropped off of the first page of Google results entirely.

Unfortunately, beesbuzz.biz hasn’t had the benefit of 7+ years of inbound links, and of course now most new inbound links are rel="nofollow" anyway (hooray for comment spammers), so I don’t know if I’ll ever get my status back. Le sigh.

But I think the entire site’s pagerank dropped as well; beesbuzz.biz/blog/ used to be the #1 search result for “i like plaid” (because I have a link to it with that name from many of my forum signatures), but as soon as trikuare.cx expired, the beesbuzz.biz result dropped off the front page!

I’m tempted to renew trikuare.cx for another year just to have more time for pageranks to converge and so on, but it seems so silly.
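
The setup described above (a 301 for Googlebot, an interstitial for everyone else) might be sketched roughly like this. It is only an illustration: the respond function is hypothetical, and it assumes every path on the old domain maps to the same path on the new one, which the comment does not actually say.

```python
NEW_SITE = "http://beesbuzz.biz"  # the new domain named in the comment


def respond(user_agent, path):
    """Return (status, headers, body) for a request to the old domain."""
    # Assumption: old paths carry over to the new domain unchanged.
    new_url = NEW_SITE + path
    if "googlebot" in user_agent.lower():
        # Permanent redirect, so the crawler can update the URL it stores.
        return 301, {"Location": new_url}, ""
    # Everyone else gets a short explanatory page rather than a redirect.
    body = "This site has moved to %s. Please update your links." % new_url
    return 200, {"Content-Type": "text/plain"}, body


print(respond("Googlebot/2.1 (+http://www.google.com/bot.html)", "/blog/"))
print(respond("Mozilla/5.0", "/blog/"))
```

Note that the redirect only exists for as long as the old domain keeps answering requests, which is the catch the comment runs into.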

Comment by a-giâu #
2006-08-06 22:56:11

The stampede was getting too much, so I disallowed bots from my Gregarius and now all’s well.

–the lomaji.com guy.

2007-05-08 22:17:42

[…] phil ringnalda : […]
