phil ringnalda : Ah, sweet irony

Ah, sweet irony

Some twisty path, I think Danny Ayers talking about Edd Dumbill talking about Mono and his editorial talking about handing XML.com over to Kendall Grant Clark, left me at the front page of XML.com noticing that there are quite a few interesting articles there which nobody seems to have linked and discussed in my hearing, so apparently it’s up to me to keep track. I wanted to see whether they were putting enough in their Atom feed to make it worth subscribing, so I started down that twisty path.

Their Atom feed is (fairly) properly served, as application/atom+xml, which makes it a complete pain to just look at the bloody thing. What I want is to just open the link in a new Firefox tab, and have the XML pretty-printed, or at least shown as text, but Firefox respects the MIME type’s authoritah, so it insists that I either save the file, or use something else. After trying a couple of new ideas for “Open” that worked about as well as the last dozen or two, I settled for saving it as “unnamed” (since it’s served by /meerkat/?_fl=atom&t=ALL&c=47, which doesn’t involve anything a browser would see as a filename – the RSS gets served with a content-disposition: filename=meerkat.rss header that at least makes Firefox suggest the workable-if-odd name meerkat.rss.xml (a thing I’m mentioning partly so I can look it up here later, when I want to know how to suggest a filename for something served by a directory with a query string)).
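For the look-it-up-later file, here is a minimal sketch of suggesting a filename for something served from a directory with a query string. The WSGI app, the placeholder body, and the `inline` disposition type are assumptions for illustration, not what Meerkat actually runs; only the meerkat.rss name comes from the header described above.

```python
def feed_app(environ, start_response):
    """Hypothetical WSGI app serving a feed from a query-string URL."""
    # Placeholder body; a real feed would be generated here.
    body = b'<?xml version="1.0"?><rss version="2.0"><channel/></rss>'
    headers = [
        ("Content-Type", "application/rss+xml"),
        # The URL has no usable filename, so suggest one explicitly.
        ("Content-Disposition", 'inline; filename="meerkat.rss"'),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any server-side framework can set the same header; the only part that matters to the browser is the `filename` parameter.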

Open it up, and what do I see? XML Parsing Error: undefined entity. Meerkat started out serving RSS 1.0, which it does with a DOCTYPE declaring the HTML entities:

<!DOCTYPE rdf:RDF [
<!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>

That lets them put an unescaped entity from HTML’s Latin 1 entity module directly in their feed. I’m not quite sure I understand how externally defined entities in XML work, but I think that means that a validating parser will then download the entity DTD and convert entities to characters before passing the content off to its consuming application, while a nonvalidating parser can do whatever it wants, either parse them or just pass them on. Someone aggregating the feed might get Sacré SVG: SVG and Typography: Animation or they might get Sacr&eacute; SVG: SVG and Typography: Animation. Not sure. It’s certainly the sort of thing I would do, to keep aggregator developers on their toes, though not exactly something I would recommend in production use. However, the Atom flavor doesn’t include the DTD declaring the entities, so it’s just purely invalid, not-well-formed XML.
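The difference a declaration makes can be seen in a rough sketch with Python's standard-library parser. The `feed` and `title` element names are made up for the demonstration, and the internal `eacute` declaration is only a stand-in for the external Latin 1 entity set, which this sketch does not fetch; how a given parser treats the external set is exactly the part that varies.

```python
import xml.etree.ElementTree as ET

TITLE = "<title>Sacr&eacute; SVG</title>"

# No DOCTYPE: eacute is not one of XML's five predefined entities.
without_dtd = "<feed>" + TITLE + "</feed>"

# Internal subset declaring just the one entity (a stand-in for the
# external Latin 1 entity set, which this sketch does not fetch).
with_dtd = (
    '<!DOCTYPE feed [<!ENTITY eacute "&#233;">]>'
    "<feed>" + TITLE + "</feed>"
)


def title_or_error(document):
    """Return the title text, or a short description of the parse error."""
    try:
        return ET.fromstring(document).findtext("title")
    except ET.ParseError as err:
        return "parse error: %s" % err


if __name__ == "__main__":
    print(title_or_error(with_dtd))     # entity resolved to a character
    print(title_or_error(without_dtd))  # parser refuses the document
```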

Looking beyond the unparseable nature of the feed, though, things get a bit funky. The entry “content” consists of the same one-sentence summary that appears on the front page of XML.com, but it’s in an entry/content element, not an entry/summary element. That’s survivable, but not very helpful: there are a number of ways that you might want to display things differently when you know you’ve got a summary that isn’t the full content.

What’s less survivable is the actual structure of the content element: it’s CDATA-escaped text/html, or so it says, but without any markup, and with an xml:space="preserve" attribute on the content element. The way I understand that, it tells an XML parser to tell the consuming application that whitespace is significant in the content of the element. If you tell me “here’s some text/html, and whitespace is significant” then the best thing I can do is wrap it in a <pre>, to preserve your whitespace. In this case, that whitespace is a couple of newlines, four spaces, the entire summary on one line, a newline, three spaces, then another newline. I’m completely unable to grasp how that whitespace is significant to anything, but I do know that throwing a single long line into a <pre> is the last thing you want to do in virtually any layout, or browser, or embedded HTML control. I first saw xml:space="preserve" done that way in Blogger’s Atom feeds, and didn’t like it then; I like it even less now that the infection seems to be spreading. If there’s something useful that can be done with that combination, I’d sure like to know what it is. If your feed consists of text/plain Python code, or patches, or something where the whitespace really is significant, then tell me it is, but if you’ve just got some text, please don’t.
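To make the complaint concrete, here is a small sketch of what a consumer sees. The summary sentence is a placeholder and the `render` helper is hypothetical; the whitespace pattern follows the description above, and the parser hands it over untouched either way, since xml:space is only a hint to the application.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

entry_content = (
    '<content type="text/html" xml:space="preserve">'
    "<![CDATA[\n\n    A one-sentence summary.\n   \n]]>"
    "</content>"
)


def render(element, honor_preserve):
    """Return display markup for a content element (already HTML)."""
    text = element.text or ""
    if honor_preserve and element.get(XML_SPACE) == "preserve":
        # Taking the feed at its word: every space and newline is kept.
        return "<pre>" + text + "</pre>"
    # Treating it as ordinary text: collapse runs of whitespace.
    return "<p>" + " ".join(text.split()) + "</p>"


element = ET.fromstring(entry_content)

if __name__ == "__main__":
    print(repr(render(element, honor_preserve=True)))
    print(repr(render(element, honor_preserve=False)))
```

Honoring the attribute yields one long unwrapped line padded with stray newlines inside a `<pre>`; ignoring it yields an ordinary paragraph.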

Then there’s the repetitive funk: we’re told that a given entry is in xml:lang="en" on the feed/entry, and also that it’s <dc:language>en-us</dc:language>. Which, and why tell me twice? The entry/author/name is repeated in entry/dc:creator, the type attribute on the link rel="alternate" (or does that mean the type attribute on the entry/content?) is repeated as dc:format.
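A consumer faced with doubled-up metadata has to decide which copy wins. The sketch below only reports the duplicates; the sample entry, the author name, and the `redundancies` helper are invented for illustration, and the Atom namespace shown is the draft one in use at the time, assumed here rather than taken from the feed itself.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://purl.org/atom/ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry that states its language and author twice.
sample = """<feed xmlns="http://purl.org/atom/ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry xml:lang="en">
    <author><name>Example Author</name></author>
    <dc:creator>Example Author</dc:creator>
    <dc:language>en-us</dc:language>
  </entry>
</feed>"""


def redundancies(entry):
    """List values an entry states twice, noting whether they agree."""
    pairs = [
        ("language", entry.get(XML_LANG), entry.findtext(DC + "language")),
        ("author", entry.findtext(ATOM + "author/" + ATOM + "name"),
         entry.findtext(DC + "creator")),
    ]
    found = []
    for label, atom_value, dc_value in pairs:
        if atom_value is not None and dc_value is not None:
            verdict = "same" if atom_value == dc_value else "different"
            found.append((label, atom_value, dc_value, verdict))
    return found


if __name__ == "__main__":
    entry = ET.fromstring(sample).find(ATOM + "entry")
    for row in redundancies(entry):
        print(row)
```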

Now, Anil tells me that what I need to do in this case is contact whoever’s in charge, and explain the changes they need to make. But the Atom feed doesn’t include an X-Atom-Error header or a link rel="service.error", and I’ll bet it wouldn’t do any good to send a request with the HTTP ERR verb, and the RSS feed doesn’t include an admin:errorReportsTo element. The Meerkat about page (which appears to really be from March 2000, not just a Cool URI first used then) suggests that bug reports go to their RSS Forum, which I believe has been saying “Thank you for your interest, but the forums have closed.” for longer than I’ve been interested in RSS. I know that Rael Dornfest wrote Meerkat in the first place, but is he still responsible for it?

No wonder I’d rather just give things a gentle fisking here, rather than quietly email the person in charge. And, of course, there wasn’t any chance that I’d pass up the opportunity to roll around in the delicious irony of XML.com producing an Atom feed that isn’t even well-formed XML, what with the way that Tim Bray keeps saying that anyone who can’t produce well-formed XML is a bozo, and an incompetent fool. Make no mistake: this shit is complicated, and hard to get right, and easy to mess up, and if you are in charge of designing it, and you don’t accept that, well, you might be a little bit of a bozo. Or, at the very least, you’ve spent far too much time in the rarefied air amongst people with very uncommon abilities and skills.

The best part? After spending most of my free time this evening writing this, I almost forgot to subscribe to the feed. I sure hope Bloglines is using a liberal parser, and will let me subscribe even though it’s malformed XML, because I’m starting a vacation pretty soon that’s going to drive every thought of subscribing to feeds out of my head (though the offline part doesn’t actually start for another week), and by the time I get back I’ll have completely forgotten the desire to subscribe to it. But, err, I’m sure Draconian error handling’s the right thing to do. Probably if I fire off an email to Rael tonight, he’ll have it fixed by morning, and I’ll still remember to subscribe. ROTFC&GWPMS.
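For the curious, one way a liberal parser might cope with this particular breakage is to retry after swapping HTML named entities for numeric character references. This is a sketch of the general idea only, not a description of what Bloglines or any other aggregator does; the helper names are invented, and the regex would also rewrite entity-looking text inside CDATA sections, which a careful implementation would need to avoid.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML predefines these five; they must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(text):
    """Swap known HTML named entities for numeric character references."""
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return ENTITY.sub(swap, text)


def parse_liberally(document):
    """Try a strict parse first; on failure, retry with entities patched."""
    try:
        return ET.fromstring(document)
    except ET.ParseError:
        return ET.fromstring(numeric_entities(document))


if __name__ == "__main__":
    broken = "<feed><title>Sacr&eacute; SVG &amp; more</title></feed>"
    print(parse_liberally(broken).findtext("title"))
```

If the document is broken in some other way, the second parse still raises, so the Draconian behavior is only deferred, not removed.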

This entry was posted on Thursday, July 1st, 2004 at 8:42 pm and is filed under feeds and syndication.

9 Comments

Comment by Phil Ringnalda #

2004-07-01 22:34:41