Planetary damage

Danny Ayers writes:

I can’t really be bothered blogging all the new SemWeb stuff these days, most of it ends up on Planet RDF anyhow.

What Danny probably doesn’t realize is that he’s pretty much my only connection to the Semantic Web stuff, leaving me with a rather false impression about how much actual stuff is coming out these days. Of course, I read Shelley Powers, but she’s more of an RDF Comet than a planet, and the other five or six RDF-heads I subscribe to almost never post about anything, RDF-related or not. I’m not about to subscribe to the Planet RDF feed, since I’d wind up reading Danny twice, and other people I don’t choose to read, so Planet RDF actually deprives me of RDF news.

And, prior to this post, Planet Mozilla was doing the same thing to my readers: since most of what I might write about has already long since shown up there, I don’t write about it, and as a result I’ve certainly got at least one reader who would have and should have installed Firefox 1.5 rc 1, but hasn’t because I didn’t write about it being available, and what’s in it, and who ought to be testing it, because it had already been on planet.m.o half a dozen times before I might have written about it. But no more: the main reason I’m writing this post, besides wanting to put the idea in some heads that a Planet isn’t always a good thing, is to create yet another category, so I can redirect the feed Planet Mozilla fetches to only get things I put in the “Planet Fodder” category, which will let me start writing about Mozilla again. Odd, that.


Comment by Stuart Langridge #
2005-11-04 23:44:51

Planets ought to be good things; it’s just that feed aggregators don’t know that post N on the planet’s feed is post X on the individual poster’s feed, and so you see it twice. Surely this problem should have gone away by now?

Comment by Phil Ringnalda #
2005-11-05 01:05:07

The duplicate problem turns out to be fairly difficult (and, unfortunately, nobody really realized that until Atom was almost baked, so it doesn’t particularly do anything to solve it).

Take the easiest possible situation: say someone syndicated on planet.m.o uses ID numbers for their permalinks and guid/atom:ids, and say I don’t like them. If aggregators assume that any subsequent instances they see are duplicates, I can block their next post; if they assume subsequent different instances are updates, I can retroactively shut them up.

Doing that on a planet is likely to get me exiled pretty quickly, but where it gets really interesting is in search result feeds: picture how valuable it would be to have your post replace a post by someone with tens or hundreds of thousands of subscribers, just by injecting a few million keyword stuffed posts with faked guids.
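To make those two failure modes concrete, here is a minimal Python sketch of a naive de-duper keyed only on the claimed guid. The `Item` class, the `dedupe` function, and the example feed URLs and ID are all made up for illustration; this is not how any particular aggregator works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    feed: str   # URL the item was fetched from (hypothetical)
    guid: str   # guid / atom:id the item claims for itself
    title: str


def dedupe(items, policy):
    """Naive de-duplication that trusts the claimed guid.

    policy="first": a later item with a known guid is dropped as a duplicate.
    policy="last":  a later item with a known guid replaces the earlier one,
                    as if it were an update.
    """
    shown = {}
    for item in items:
        if item.guid not in shown or policy == "last":
            shown[item.guid] = item
    return list(shown.values())


genuine = Item("http://example.org/stuart/feed", "tag:example.org,2005:42", "The real post")
spoof = Item("http://example.net/attacker/feed", "tag:example.org,2005:42", "Keyword-stuffed spam")

# The attacker guesses the next ID and gets in first: the real post is blocked.
blocked = dedupe([spoof, genuine], "first")

# The attacker shows up afterwards: the real post is retroactively replaced.
replaced = dedupe([genuine, spoof], "last")
```

Under either policy the reader ends up seeing only the spoofed item, because nothing in the guid says which feed is entitled to use it.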

The only thing I remember seeing that sounded like it had a hope of working was a proposal for Gregarius, that you be allowed to say which feeds could contain duplicates of which other feeds. It would require double-heinous UI, since for a Planet feed you need to be able to say “ignore posts in this feed that duplicate IDs from [select checkboxes for two dozen feeds]” while for a search feed you say, what, “posts in this feed are always the duplicate”? That still doesn’t work for when you see an ID in a search feed first (or when you get a post from a Planet before you get it directly, as I sometimes have). “When a post in this feed is part of a duplicate-pair, remove the post in this feed, and show the other as new if this has been shown, and that differs in user-visible ways”?

And, looking at the thread again, I see that the proposal was only talking about optimizing by having duplicate groups, and using that to avoid spoofing was the comment I was going to add, except that my application for forum membership never was approved, and I didn’t ever send my draft email to the mailing list either.

But in fact, that’s not the primary reason I don’t subscribe to Planet RDF: in that case, I don’t want to read everyone who talks about RDF, I want Danny to be my filter. I know plenty of people who don’t want to read Planet Mozilla and learn all about the reflow branch and triaging of auto-UNCOs in layout, but would like to hear about some of the more user- and outside-developer-related things. And there’s just no getting around the problem that with a Planet you are writing for a team blog while not writing for a team blog: if something big happens, either everybody writes about it, and Planet readers see the same news n times, or only one person writes about it, and everyone else’s non-Planet readers don’t hear about it. planet.m.o has something like 80 contributors now: when Firefox 1.5 ships, should it have 75 posts saying “Firefox 1.5 is out: get it here!”? (Figuring that the SeaMonkey blogs and at least three others would never stoop to mentioning Firefox, even scornfully.)

Comment by Stuart Langridge #
2005-11-05 01:30:41

Just in case I’m thick here, can’t aggregators dedupe based on permalink?

This is where you point out to me (gently!) that I’m the most naive person in the world and this was discussed and rejected fifty times in the last three years :)

Comment by Phil Ringnalda #
2005-11-05 02:42:07

Discussion is probably in atom-syntax, maybe late winter or sometime in the spring? Maybe with some supplemental material in Bob Wyman’s blog, since he did most of the job of pointing out to us that we hadn’t actually thought atom:id through.

Say you have two items:

<title>Phil is an idiot</title>
<description>Thick as a brick</description>


<title>Stuart is an idiot</title>
<description>Thick as a brick</description>

One you fetched from, the other came from a search feed, or a planet, or some other random URL. Which one is the true

With http guids returning HTML, you can require autodiscovery that returns all the feeds which are approved sources for a particular post, but if we both <link> to the same external URL, both with for an atom:id, what external authority can you refer to for a ruling on who should be able to claim to be your take on that link?

Comment by Phil Ringnalda #
2005-11-05 02:44:52

Right. Someone needs to fix this silly cropped overflow with a tits-on-a-boar scrollbar at the very bottom of the comment.

Oh, I guess that’s my job, isn’t it?

Comment by Darren Chamberlain #
2005-11-05 06:24:25

The duplicated items thing is one that has been bothering me for a long time, and it seems like it should be a straightforward thing to resolve. In your (RSS 2.0) example, the guid element is the same in each post; in an Atom feed, the id element would be the same. It seems to me that an aggregator could run a series of tests: if the guid element is present, it is the canonical identifier; else if the id element is present (and it’s an Atom feed), then it is the identifier; otherwise use the link element. It seems like that simple heuristic would get us most of the way towards deduping.
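As a rough sketch of that series of tests, assuming each parsed item is a plain Python dict with optional `guid`, `id`, and `link` keys (the dict layout and the names `canonical_id` and `group_by_id` are assumptions for illustration, not any feed library's API):

```python
def canonical_id(item, is_atom=False):
    """Pick an identifier using the guid, then id, then link fallback order."""
    if item.get("guid"):
        return item["guid"]
    if is_atom and item.get("id"):
        return item["id"]
    return item.get("link")


def group_by_id(items):
    """Group (item, is_atom) pairs that share the same canonical identifier."""
    groups = {}
    for item, is_atom in items:
        groups.setdefault(canonical_id(item, is_atom), []).append(item)
    return groups


rss_item = {"guid": "http://example.org/post/1", "link": "http://example.org/post/1.html"}
atom_item = {"id": "http://example.org/post/1", "link": "http://example.org/post/1.html"}
link_only = {"link": "http://example.org/post/2.html"}

groups = group_by_id([(rss_item, False), (atom_item, True), (link_only, False)])
```

This only finds candidate duplicates. It says nothing about which member of a group should be the one that gets shown.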

Comment by Phil Ringnalda #
2005-11-05 09:00:26

Yes, in both of those items the guid is the same, and a naive de-duper would say that one is a duplicate of the other, and only show one. However, one of them is Stuart’s post saying that I’m an idiot, and the other is my maliciously inserted post making him say that he’s an idiot. To decide which one is the One True Post, you have to have some way of figuring out which feed URL is the authoritative source of items

It’s a tiny and contrived problem for most subscriptions, but once you add the biggest source of duplicates, search feeds, it goes from unlikely to absolutely certain, as inevitable as comment spam.

Comment by Aristotle Pagaltzis #
2005-11-05 21:32:13

The only thing I remember seeing that sounded like it had a hope of working was a proposal for Gregarius, that you be allowed to say which feeds could contain duplicates of which other feeds. It would require double-heinous UI

No, don’t think so. You can just watch the entry IDs coming in, and when you detect a new pairing of feeds carrying a dupe, you ask the user which of the two is more authoritative.

So when you subscribe a search feed, and an entry comes in via the search feed before it comes in via the original feed (at which point you look outside the window and the sky is pink, but I digress), you ask the user whether the search feed is more authoritative or the other one.

This is a bit annoying when you individually subscribe to 8 out of 10 feeds aggregated on a Planet, plus the planet itself, as you’d get to answer the roughly-same question 8 times, but then doing so is silly anyway, so that I think this is a case which can be ignored, and it will work just fine for the vast majority of users.

This scheme comes with the added bonus that the user will be notified of possibly-malicious recycling of entry IDs across feeds that he would otherwise never notice. Whether that is an unequivocally good thing remains to be seen.
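A minimal Python sketch of that scheme, assuming the aggregator can call back into its UI to ask the question. The class name `PairwiseDeduper`, the `ask_user` callback, and the feed names are hypothetical, and a real aggregator would also need to persist the rulings:

```python
class PairwiseDeduper:
    def __init__(self, ask_user):
        # ask_user(feed_a, feed_b) must return whichever feed is more authoritative.
        self.ask_user = ask_user
        self.rulings = {}  # frozenset({feed_a, feed_b}) -> authoritative feed
        self.shown = {}    # entry id -> (feed, entry) currently displayed

    def receive(self, feed, entry_id, entry):
        if entry_id not in self.shown:
            self.shown[entry_id] = (feed, entry)
            return
        current_feed, _ = self.shown[entry_id]
        if current_feed == feed:
            # Same feed, same ID: treat it as an ordinary update.
            self.shown[entry_id] = (feed, entry)
            return
        pair = frozenset((current_feed, feed))
        if pair not in self.rulings:
            # First time these two feeds carry the same ID: ask once, remember it.
            self.rulings[pair] = self.ask_user(current_feed, feed)
        if self.rulings[pair] == feed:
            self.shown[entry_id] = (feed, entry)


questions = []

def ask(feed_a, feed_b):
    questions.append((feed_a, feed_b))
    return "original"  # pretend the user always trusts the original feed

deduper = PairwiseDeduper(ask)
deduper.receive("search", "id1", "copy seen via search")
deduper.receive("original", "id1", "the author's own copy")
deduper.receive("original", "id2", "second post")
deduper.receive("search", "id2", "second post via search")
```

The user is asked once per pair of feeds, not once per entry, which is what keeps the question count down to the number of overlapping subscriptions.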

Comment by Darren Chamberlain #
2005-11-05 06:32:45

I’ve taken the opposite approach: I’ve unsubscribed from individual feeds (like Danny Ayers) because I am subscribed to Planet RDF. In some cases, the duplication is not a big deal, for example Shelley Powers, whose presence on Planet RDF is limited to the SemanticWeb category on her blog (I believe).

I do agree with you about Danny, though: I find him to be the single most reliable source of semantic web and RDF information on the web, because he constantly points to the most interesting bits of it. Yes, much of it is duplicated on Planet RDF, but sometimes, without his endorsement, I might miss important or interesting things. A lot of it is not duplicated, though; for example, I hadn’t heard of the Atom OWL stuff anywhere else.

Comment by Chris Cunningham #
2005-11-05 08:45:33

I really don’t understand how it’s Planet’s fault that people stop blogging. It might make it more *obvious* that you’re blogging about the same things as everyone else, but in the end if Planet didn’t exist you’d still be blogging about the same stuff as everyone else. That’s just the nature of blogging.

While I realise that Planet is so popular now that several communities have bloggers who blog solely for their respective Planet, I wouldn’t imagine that established bloggers see it as their primary source of eyeballs anyway.

– Chris

Comment by Phil Ringnalda #
2005-11-05 09:54:02

I’m not quite sure, but somehow I think that just asserting the problem away isn’t actually a solution. Or are you saying that having p.m.o consist of 75 “Firefox 1.5 is out, get it here” posts on the day it ships is somehow a good thing?

Comment by Aristotle Pagaltzis #
2005-11-05 21:35:05

Now if the web were all tripes and the entries all asserted they’re about the Firefox release machine-readably…

Comment by Phil Ringnalda #
2005-11-05 22:03:40

I’m nominating “if the web were all tripes” for Aphoristic Typo of the Month :)

Comment by Aristotle Pagaltzis #
2005-11-06 09:32:25

Oh man. :)

Comment by Shelley #
2005-11-05 16:18:34

Just call me Hale Bopp.

Comment by Phil Ringnalda #
2005-11-05 22:06:24

Fair enough: you put the Bopp in the bop hale bop RDF bop.

2005-11-06 03:13:41

[…] Phil Ringnalda has a good point in Planetary damage about the downside of the Planet aggregators, quoting this blog as an example. He reads this for SemWeb news, but I can’t be bothered blogging all the goodies because they usually appear on Planet RDF anyhow. But Phil doesn’t want to have to read everything on the Planet. […]

2005-11-10 02:36:26

[…] Phil Ringnalda wrote recently about the potential damage to the blogosphere of planet-style aggregators. I’ve thought of another problem. If you’re being aggregated onto a planet site then none of the other people on that planet will link to your posts. Why should they? After all, it’ll just look like noise when it hits the aggregator. This could have serious repercussions for your GoogleRank. […]

2005-11-13 10:29:24

[…] Ok, my first attempt at a round-up (in response to Phil’s observation of Planetary damage). Thanks to the conference there’s loads more here than there’s likely to be subsequent weeks, although it’s still only a fairly random sample and some of the links here are to heaps of other resources… Incidentally, if anyone’s got a list/links for SemWeb-related blogs that aren’t on Planet RDF, I’d be grateful for a pointer. […]

2006-01-30 20:13:51

[…] Beyond categorization, Planet Atom might benefit from some editorial oversight. Like many “planets”, Planet Atom takes every entry from every participant, whether or not the entry fits the topic. Currently (using the Friday afternoon update) eleven entries out of 40 mention Atom. Only 11 out of 40? Not exactly “A fusion of atom-related news.” Heaven knows what planetary damage this is causing Phil Ringnalda. […]
