What Google could do with weblogs
While I don’t know what Google will do about weblogs in search results (which puts me ahead of Andrew “I see the Googlebots walking among us” Orlowski, since I at least know what I don’t know), I do know one thing that they could do.
If you remember back to our last flurry of talking about Google and weblogs, when the purchase of Blogger was announced, lots of people were talking about Google using the changes.xml file from weblogs.com as a way to find out what weblogs they ought to index immediately. Anyone whose job depends on having Google index them well would have fallen off their chair laughing at the thought of Google letting websites choose to be instantly indexed: weblogs.com would have been completely overwhelmed as every single commercial site on the entire web began to ping it, and ping it, and ping it. Being indexed quickly by Google is valuable. Until quite recently, practically speaking you needed to allow two months for new content to get into Google’s index. But, despite the fact that a self-selected list of things to crawl wouldn’t work for Google, Blogger recently released its own changes.xml file.
Even though Andrew “I Can’t Remember How Search Engines Work, Because I Hate Blogs So Much” Orlowski doesn’t think so, weblogs are useful to search engines. Google News is useful for fresh content about the boring things that big media likes to cover, but people use Google for more than just the latest Britney Spears gossip. If someone creates a “Flog Britney Spears” Flash game, and someone who deleted their forty times forwarded email wants to find it, they won’t be pleased to know that Google will add it to their index during the next monthly crawl. They probably don’t want to read three hundred weblog posts about how fun it is to flog Britney, they want to flog Britney now! Although the actual weblog posts themselves may not be as useful to Joe Searcher as their current ranking in Google results makes it seem, the links within them, especially in aggregate, are. If three hundred blogs link to the same page, two hundred of them including “Flash game” in the link text, then that page might be a good candidate for a jump in the “Flash game” results, at least for a while (though that sort of thing reopens the whole Googlebombing issue). At the very least, that page needs to get in the index, pronto, if it isn’t already there.
Google can’t just remove every weblog post from the main index without reducing the quality of their search results: there are things that only appear in weblogs, or where weblogs are the best results. However, there’s no need to give a weblog’s front page prime results, when it’s just a temporary view of the posts that permanently appear elsewhere. By treating any URL that appears in a changes.xml file as the front page of a weblog, Google could more intelligently handle weblogs, returning any other page from the same site with the same keywords instead of returning the front page, so that they would no longer deliver frustrated searchers looking for that post that was on your front page two weeks ago.
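A rough sketch of that front-page idea, in Python. The sample XML follows the shape of the weblogs.com changes.xml file, but the URLs, the `normalize` helper, and the rule of "drop the front page only when a deeper page under it also matched" are all assumptions for illustration, not anything Google is known to do.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Sample in the shape of a changes.xml file; the names and URLs are made up.
CHANGES_XML = """<weblogUpdates version="1" updated="Mon, 12 May 2003 06:00:00 GMT" count="2">
  <weblog name="Example Blog" url="http://blog.example.com/" when="12"/>
  <weblog name="Another Blog" url="http://example.org/weblog/" when="340"/>
</weblogUpdates>"""


def normalize(url):
    """Lowercase the host and drop any trailing slash so URLs compare cleanly."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def front_pages(changes_xml):
    """Collect every URL that pinged, treating each as a weblog front page."""
    root = ET.fromstring(changes_xml)
    return {normalize(w.get("url")) for w in root.iter("weblog")}


def prefer_permanent_pages(results, fronts):
    """Drop a front page from ranked results when a page beneath it also matched."""
    kept = []
    for url in results:
        key = normalize(url)
        if key in fronts:
            has_deeper_match = any(
                normalize(other).startswith(key + "/") for other in results
            )
            if has_deeper_match:
                continue  # the permanent page is the better answer
        kept.append(url)
    return kept
```

If the front page is the only match, it stays in the results, so nothing that appears only there gets lost.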
With a little cunning to determine what’s a part of your weblog, and what’s an unrelated part of your site, they could also damp down the extremely high rank they give to things like Movable Type comment and TrackBack popups, which are so well marked up semantically that The Register’s Chief Foamer-At-The-Mouth is actually right (for all the wrong reasons) when he fingers TrackBack (and comments, though he doesn’t mention them because Joi doesn’t use comment popups, so one doesn’t come up first when he ego surfs for andrew orlowski googlewash) as being a source of weblog noise in search results. It really has nothing to do with TrackBack the spec, and everything to do with good markup. Google loves small pages with good HTML, so if your TrackBack listing popup has your entry title as the HTML <title>, and especially if it then repeats it in a <h2> in the body, then it’s going to rank high for keywords in the title, and if you bury the title for your actual entry, either by having date-based archives so there isn’t a single page with the entry title as the HTML title, or, as Joi does, by not using semantic markup to let Google know that the entry title is an important part of the entry, then your TrackBack popup may well outrank your entry itself. Knowing that something is a weblog, based on it having pinged, would let Google damp down that sort of thing with a weblog-specific filter, without having to completely ghettoize us on a tab that nobody would ever search (I pass through Google twenty or thirty times a day, but I would have been hard pressed to name all five tabs, because I use Google to search everything for me, not to have me tell it how to search and where to look).
I don’t know in detail how to use weblog links to keep Google fresh while avoiding Googlebombing, or how to keep beautifully marked up but meaningless pages from outranking more confused but richer pages, but then, that’s why I don’t work at Google. Judging by the more than a thousand posts in various threads in the WebmasterWorld Google forum about Google updates in just the last week or so, I’d say the folks at Google haven’t quite run out of ideas for how to change their index around.
Preposting update: while looking around at just how many posts there really were over there, I ran across GoogleGuy (an actual Google employee) saying rather diplomatically that Orlowski’s full of shit. Ev saying it was one thing, but they might not tell him everything about the search side of the business, just yet. GG, other than his Googlish way of saying as little as possible, has always seemed quite authoritative. And as an aside, what’s up with Ev’s archives, with the link in the RSS feed pointing to a weekly archive page that includes his original “Orlowski is full of crap. Again.” post, while the front page and its permalink go to a monthly archive with the post snipped? Also, I noticed that the template for the weekly archive was much easier to read, and the blogroll is, um, er, very nice company to keep. I’d roll that version of the template back onto the main page, if it was me (calculating how much an Evhead link is worth in Blogshares money, thinking about a stock split).
A while back I was looking at the Google WebQuotes:
Anders Jacobsen’s Blog: How Google uses weblogs to enhance search results today: Google WebQuotes
When I heard about the Google announcement, I thought they were now taking WebQuotes out of the Labs and on to the “production” site, but that’s only my guess…
Maybe I just never search it with good terms, but for my money Technorati’s Cosmos kicks WebQuotes ass. Of course, it’s operating on a much simpler to deal with dataset…
WebQuotes did send me to Orlowski’s curiously named badpress.net site, though, with his page of “Things I like” (bit of a blogroll), with Online featuring ur-blog robotwisdom and someone’s “personal news filter site that I decided not to turn into a blog even though I’m adding an RSS feed but since I don’t have permalinks it’s not really a blog.” Along with some comments I’ve seen from people who know him personally, I really have to wonder if he’s just trolling for the fun of it.
It’s worth remembering that a user defeated by clicking straight through to the front page of a weblog where an article used to be can still easily obtain what they’re looking for by using the cached version (possibly sans graphics of course). From there, they should be able to spot a permalink to the “real” entry and visit it directly.
Admittedly this is a slightly idiosyncratic and less than desirable process, and your changes.xml idea would simplify things a great deal. Something for the next Movable Type release perhaps? Even if Google didn’t pick it up, I’ve got the feeling it would be useful for something.
Site search functionality also helps, I’ve found. Last month I got linked to by a “newsletter” I’d criticized for being spam, and the guy who writes it apparently just took the address of my front page from Google without bothering to check that the entry was still there; I ended up with a sudden surge in people searching for “Howdy’s Thought & Humor” (the name of the newsletter) and a good laugh at the spammer when quite a few folks wrote me asking for help getting off his list . . .
Oh, sure, someone who already knows what a blog is and how they get indexed can usually find what they want, either through a second search in the blog’s own search or by going back to Google to try again. But that only works when they are already convinced that you are the one true result. But for most people, it just makes Google look bad – maybe they actually know that they can search for text in the page with their browser, and when they do they find that Google’s number one result doesn’t even include what they searched for. POS. The search engine at MSN doesn’t do that, plus I can check my email at Hotmail.
I don’t know what MT could do about front page search engine referrals, but if the people doing PHP search term highlighting haven’t already done an “if the terms don’t appear in the requested page, redirect to the blog search for those terms” script, they should have. Other than that, not much you can do, unless there’s a way that I’ve never heard of to say “NOINDEX, FOLLOW” for just one section of text within a page, so you could persuade a search engine to index sidebars and follow your permalinks without indexing the posts in the front page.
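For what it’s worth, the redirect idea might look something like this. It is a sketch in Python rather than PHP, and it assumes the search engine passes the query in a `q=` referrer parameter; the `/search` path and both function names are hypothetical, so swap in whatever your own site search uses.

```python
import re
from urllib.parse import urlsplit, parse_qs, urlencode


def search_terms(referrer):
    """Pull the words out of a search-engine referrer's q= parameter."""
    query = parse_qs(urlsplit(referrer).query).get("q", [""])[0]
    return [t.lower() for t in re.findall(r"\w+", query)]


def redirect_target(referrer, page_text, search_url="/search"):
    """Return a site-search URL if the page lacks any referred term, else None."""
    terms = search_terms(referrer)
    if not terms:
        return None  # not a search referral; serve the page as usual
    words = set(re.findall(r"\w+", page_text.lower()))
    if all(t in words for t in terms):
        return None  # every term is on the page; nothing to do
    return search_url + "?" + urlencode({"q": " ".join(terms)})
```

The caller would issue a redirect only when `redirect_target` returns a URL. Matching on whole words is a deliberate simplification: it misses plurals and phrases, so a real script would probably want something looser.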
but if the people doing PHP search term highlighting haven’t already done a “if the terms don’t appear in the requested page, redirect to the blog search for those terms” script, they should have
I need to get some sleep just at the moment (it’s past 6 AM here), but tomorrow I’ll start working on it.
“I don’t know what MT could do about front page search engine referrals… but the people doing PHP search term highlighting…”
That’s the kind of thing I was thinking about, in my roundabout layman way. It always bugs me the way one can arrive on a page via Google and a clever PHP Google search term highlighter will tell you “I could not find the search terms you asked for on this page”. So what on earth am I doing there then?!
Evidently, Google does pay attention to meta robot commands. Following a suggestion I saw Brad Choate post some time ago, I’ve added
<meta name="robots" content="noindex, follow" />
to my index/archives pages. It seems to have helped.
You can have google crawl you almost instantly by clicking the “:)” on the google toolbar when you’re on the site that you want indexed.
I never quite worked out what the smileys were all about… I think Google uses the toolbar in general for new URL discovery (i.e. if someone’s toolbar requests the PageRank of a page not in the index, they go out and spider it), but do they really use the smileys for something as well?
Of course there are always a ton of other factors (is Googlebot in the middle of a deep crawl? is the page already queued? do they use a different bot?), but so far it doesn’t seem to have made any difference. I voted for this page, and more than three hours later all Googlebot has asked for is a bunch of old URLs, ones that I’m damn sure I’ve given it 301s for before, and will again, and it’s past time that it paid attention and respected my authori… Oops.
I don’t see any direct evidence from one test that it adds a page for my value of quickly. Beyond that, who really knows?
Just a tiny little comment:
your RSS feed entry for this posting contains “<title>” unescaped (i.e. no &lt;..), slightly screwing up my webpage-based aggregator :-)
Cheers for fixing it?
To quote my front page, I do good RSS! I good blogger!.
It is (was, actually, now) in a CDATA section, which should tell your parser “pass this on exactly as it appears” with single encoded angle-brackets (&lt;title&gt;), so your parser should give it to you exactly like that, and when you then put that in the source of your page, it should display as <title>. So I can’t fix it, because it isn’t broken. However, I may have “fixed” it for you as a side-effect, if you are only using <description>, since when I looked I discovered that for some forgotten reason I was putting the full item in <description> rather than putting the plain text excerpt there and putting the content in <content:encoded>. So I’m afraid I may have fixed it by giving you less data than I was, depending on which elements you look for.
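Here is a small way to check the CDATA claim, using Python’s standard XML parser instead of whatever Rippy uses. The two `<description>` strings are made-up stand-ins for a feed item, and `html.unescape` stands in for what a browser does when it renders the page source.

```python
import xml.etree.ElementTree as ET
from html import unescape

# The same item text carried two ways: inside CDATA, and entity-escaped once more.
cdata_item = (
    "<description><![CDATA[has your entry title as the HTML &lt;title&gt;]]>"
    "</description>"
)
escaped_item = (
    "<description>has your entry title as the HTML &amp;lt;title&amp;gt;"
    "</description>"
)

# A conforming parser hands back CDATA content untouched, and undoes exactly
# one level of escaping for ordinary character data.
cdata_text = ET.fromstring(cdata_item).text
escaped_text = ET.fromstring(escaped_item).text

# Both strings are HTML source with single-encoded angle brackets, so a
# browser showing that source displays the literal text <title>.
rendered = unescape(cdata_text)
```

An aggregator that ends up with a bare `<title>` in its page source has decoded the entities one time too many somewhere between parsing and output.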
Yeah; it’s fixed now. I’ve set up a quick and dirty version of “Rippy the Aggregator” and it seems it uses the description…
Cheers & happy blogging :-)
I remember playing with Rippy a little, and as I vaguely remember it you can tell it what elements to use, and your preference order. What I can’t remember is how it actually parses, so I’d have to look at why it apparently isn’t handling CDATA properly.
From the “too freaky” files: I was checking on whether the toolbar smilies noted above actually do prompt a crawl, so I was watching tail -f access.log when I saw a request for the reply to my previous comment page, and got over here about two seconds before you left that comment. Realtime comment notification through obsessive log watching!
Boy, with obsessive behavior like that, who needs Technorati?
I think it’s time we put our money where our mouths are. I think Orlowski talks rubbish but thinking alone doesn’t bake pies, so let’s find a way to test it. Then we’ll be at least more honourable people than him, since our opinion will be based on research rather than simple assertion. Proposal over at Hammersley’s site: http://www.benhammersley.com/archives/004808.html
Always be searching
My never-ending fascination with Google results continues. Right now, I’m number 8 when you Google for "always be closing" be…
Googleblogs
I’ll stand by what I wrote last July: Google could clearly change the way they rank pages, but anything other than a drastic change won’t work. Maybe they’ll need to add a new tab, just for blogs, similar to images and newsgroups, but then that margina…
Google and Blogs
I’ve been thinking about this notion that google might start segregating blog search results. I have questions about how they…
Who gets helped by de-blogging Google?
I’m certainly not the first person to roll my eyes at Andrew “I Can’t Remember How Search Engines Work, Because I Hate Blogs So Much” Orlowski. In the wake of his columns on Google’s “blog noise problem,” others have already explained why search engine…