Ah, sweet irony

Some twisty path, I think Danny Ayers talking about Edd Dumbill talking about Mono and his editorial talking about handing XML.com over to Kendall Grant Clark, left me at the front page of XML.com noticing that there are quite a few interesting articles there which nobody seems to have linked and discussed in my hearing, so apparently it’s up to me to keep track. I wanted to see whether they were putting enough in their Atom feed to make it worth subscribing, so I started down that twisty path.

Their Atom feed is (fairly) properly served, as application/atom+xml, which makes it a complete pain to just look at the bloody thing. What I want is to just open the link in a new Firefox tab, and have the XML pretty-printed, or at least shown as text, but Firefox respects the MIME type’s authoritah, so it insists that I either save the file, or use something else. After trying a couple of new ideas for “Open” that worked about as well as the last dozen or two, I settled for saving it as “unnamed” (since it’s served by /meerkat/?_fl=atom&t=ALL&c=47, which doesn’t involve anything a browser would see as a filename – the RSS gets served with a content-disposition: filename=meerkat.rss header that at least makes Firefox suggest the workable-if-odd name meerkat.rss.xml (a thing I’m mentioning partly so I can look it up here later, when I want to know how to suggest a filename for something served by a directory with a query string)).
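For that look-it-up-later note, here is a minimal sketch in Python of building a header value that suggests a filename. The function name is made up for illustration, and the "inline" disposition type is an assumption, since the header as quoted above shows only the filename parameter.

```python
def content_disposition(filename, disposition="inline"):
    """Build a Content-Disposition header value that suggests a filename.

    Hypothetical helper: the name and the default disposition type are
    illustrative assumptions, not anything Meerkat is known to use.
    """
    # Escape backslashes and double quotes so the quoted-string stays valid.
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return f'{disposition}; filename="{escaped}"'


header_value = content_disposition("meerkat.rss")
print("Content-Disposition:", header_value)
```

Whatever serves the query-string URL would send that header alongside the feed, so the browser has a name to offer in its save dialog.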

Open it up, and what do I see? XML Parsing Error: undefined entity. Meerkat started out serving RSS 1.0, which it does with a DOCTYPE declaring the HTML entities:

<!DOCTYPE rdf:RDF [
<!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>

That DOCTYPE lets them put an unescaped entity from HTML’s Latin 1 entity module directly in their feed. I’m not quite sure I understand how externally defined entities in XML work, but I think that means that a validating parser will then download the entity DTD, and convert entities to characters before passing the content off to its consuming application, and a nonvalidating parser can do whatever it wants, either parse them or just pass them on. Someone aggregating the feed might get Sacré SVG: SVG and Typography: Animation or they might get Sacr&eacute; SVG: SVG and Typography: Animation. Not sure. It’s certainly the sort of thing I would do, to keep aggregator developers on their toes, though not exactly something I would recommend in production use. However, the Atom flavor doesn’t include the DTD declaring the entities, so it’s just purely invalid, not-well-formed XML.
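A rough way to see the difference, assuming Python’s standard xml.etree.ElementTree (an expat-based, non-validating parser). The two documents are tiny stand-ins I made up, not the actual Meerkat feeds, and the second declares the entity in an internal subset rather than pulling in the external xhtml-lat1.ent file, so it sidesteps the “will the parser download it?” question entirely.

```python
import xml.etree.ElementTree as ET

# Stand-in for the Atom flavor: the entity is used but never declared.
WITHOUT_DTD = "<title>Sacr&eacute; SVG</title>"

# Stand-in with the entity declared in an internal DTD subset
# (an assumption for illustration; the RSS feed uses an external one).
WITH_INTERNAL_DTD = (
    '<!DOCTYPE title [<!ENTITY eacute "&#233;">]>'
    "<title>Sacr&eacute; SVG</title>"
)


def parse_title(xml_text):
    """Return the parsed text, or a description of the parse failure."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as err:
        return f"parse error: {err}"


undeclared = parse_title(WITHOUT_DTD)
declared = parse_title(WITH_INTERNAL_DTD)
print(undeclared)
print(declared)
```

The undeclared case fails with an undefined-entity error, much like the one Firefox reports, while the declared case comes back with the accented character already substituted.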

Looking beyond the unparseable nature of the feed, though, things get a bit funky. The entry “content” consists of the same one-sentence summary that appears on the front page of XML.com, but it’s in an entry/content element, not an entry/summary element. That’s survivable, but not very helpful: there are a number of ways that you might want to display things differently when you know you’ve got a summary that isn’t the full content.

What’s less survivable is the actual structure of the content element: it’s CDATA-escaped text/html, or so it says, but without any markup, and with an xml:space="preserve" attribute on the content element. The way I understand that, it tells an XML parser to tell the consuming application that whitespace is significant in the content of the element. If you tell me “here’s some text/html, and whitespace is significant” then the best thing I can do is wrap it in a <pre>, to preserve your whitespace. In this case, that whitespace is a couple of newlines, four spaces, the entire summary on one line, a newline, three spaces, then another newline. I’m completely unable to grasp how that whitespace is significant to anything, but I do know that throwing a single long line into a <pre> is the last thing you want to do in virtually any layout, or browser, or embedded HTML control. I first saw xml:space="preserve" done that way in Blogger’s Atom feeds, and didn’t like it then; I like it even less now that the infection seems to be spreading. If there’s something useful that can be done with that combination, I’d sure like to know what it is. If your feed consists of text/plain Python code, or patches, or something where the whitespace really is significant, then tell me it is, but if you’ve just got some text, please don’t.
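For the curious, a small sketch of what a consumer actually receives, again assuming Python’s xml.etree.ElementTree. The content element is a made-up fragment shaped like the whitespace described above, not a copy of the real feed, and the collapsing step is just one thing a consumer might reasonably do when the “significant” whitespace plainly isn’t.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical fragment: two newlines, four spaces, the summary on one
# line, a newline, three spaces, then another newline, all inside CDATA.
CONTENT = (
    '<content type="text/html" xml:space="preserve"><![CDATA['
    "\n\n    One-sentence summary of the article.\n   \n"
    "]]></content>"
)

elem = ET.fromstring(CONTENT)

# The xml: prefix is predefined, so the attribute shows up namespaced.
space = elem.get(f"{{{XML_NS}}}space")

# The parser hands over every newline and space untouched.
raw_text = elem.text

# Collapsing runs of whitespace recovers the single line of prose.
collapsed = " ".join(raw_text.split())

print(repr(space))
print(repr(raw_text))
print(repr(collapsed))
```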

Then there’s the repetitive funk: we’re told that a given entry is in xml:lang="en" on the feed/entry, and also that it’s <dc:language>en-us</dc:language>. Which, and why tell me twice? The entry/author/name is repeated in entry/dc:creator, the type attribute on the link rel="alternate" (or does that mean the type attribute on the entry/content?) is repeated as dc:format.

Now, Anil tells me that what I need to do in this case is contact whoever’s in charge, and explain the changes they need to make. But, the Atom feed doesn’t include an X-Atom-Error header or a link rel=”service.error”, and I’ll bet it wouldn’t do any good to send a request with the HTTP ERR verb, and the RSS feed doesn’t include an admin:errorReportsTo element. The Meerkat about page (which appears to really be from March 2000, not just a Cool URI first used then) suggests that bug reports go to their RSS Forum, which I believe has been saying “Thank you for your interest, but the forums have closed.” longer than I’ve been interested in RSS. I know that Rael Dornfest wrote Meerkat in the first place, but is he still responsible for it?

No wonder I’d rather just give things a gentle fisking here, rather than quietly email the person in charge. And, of course, there wasn’t any chance that I’d pass up the opportunity to roll around in the delicious irony of XML.com producing an Atom feed that isn’t even well-formed XML, what with the way that Tim Bray keeps saying that anyone who can’t produce well-formed XML is a bozo, and an incompetent fool. Make no mistake: this shit is complicated, and hard to get right, and easy to mess up, and if you are in charge of designing it, and you don’t accept that, well, you might be a little bit of a bozo. Or, at the very least, you’ve spent far too much time in the rarefied air amongst people with very uncommon abilities and skills.

The best part? After spending most of my free time this evening writing this, I almost forgot to subscribe to the feed. I sure hope Bloglines is using a liberal parser, and will let me subscribe even though it’s malformed XML, because I’m starting a vacation pretty soon that’s going to drive every thought of subscribing to feeds out of my head (though the offline part doesn’t actually start for another week), and by the time I get back I’ll have completely forgotten the desire to subscribe to it. But, err, I’m sure Draconian error handling’s the right thing to do. Probably if I fire off an email to Rael tonight, he’ll have it fixed by morning, and I’ll still remember to subscribe. ROTFC&GWPMS.

9 Comments

Comment by Phil Ringnalda #
2004-07-01 22:34:41

And, when I did subscribe, to the feed for XML.com, I got the feed title ”Meerkat” and the tagline ”An Open Wire Service.” That’s truly awful metacrap. You get zero points for being all semantic and including a bunch of Dublin Core elements if the feed for one of your crown jewel sites has a totally generic title and description. We’re not talking about getting your weblog scraped by a community college in West Virginia (which, by the way, will at least get your site title included in the title along with them).

 
Comment by Tim #
2004-07-02 02:43:58

You have a gift for hoisting people with their own petard, Phil. It is
a pleasure to read — I hope that someday you get a paying job
writing things like this so we can read more of them.

In the meantime, have a great vacation, and leave a few fish in the
water.

 
Comment by Pete Prodoehl #
2004-07-02 08:37:28

Phil, you mean you don’t right-click on the link, select ’Copy Link Location’ and then open a new tab, type ’view-source:’, paste in the URL and hit return?

Seriously, someone needs to write a Firefox plugin that adds to the contextual menu ’View Source of Link’ when you right-click a link… (Or perhaps it exists but I’ve not found it yet!)

 
Comment by Sam Ruby #
2004-07-02 10:20:03

I’ve kicked off a thread on atom-syntax: Validating parser required?.

 
Comment by Edd Dumbill #
2004-07-02 10:28:34

While I enjoy a good petard-hoisting as much as the next man, it would have been courteous if you’d written by email to us first. I could have given you some background, some of which I would have been prepared to convey privately, but don’t feel happy broadcasting in a public response.

The main issue seems to be relying on Meerkat to produce our site’s feeds. I have asked whether they can’t be generated directly.

I do hope, and indeed I suspect, that the current brokenness of the feeds has not detracted from your enjoyment of the content of XML.com over the last few years in my time as editor.

regards

Edd Dumbill (now former editor, XML.com)

Comment by Phil Ringnalda #
2004-07-02 11:35:19

I do understand that some things can’t be said in public, and that the dividing line for what can and can’t varies by individual. For me, the private things are very few, very rare, and I usually don’t bother. For my own purposes, it doesn’t matter a bit: of course every aggregator I use is using a liberal feed parser, and doesn’t insist on well-formed XML to subscribe: I’ve been around this block enough times to not have it any other way. But I thought it was instructive enough (and, yes, amusing enough ;)) that some part of talking about it might tell someone a useful bit that they didn’t know, or more likely cause them to learn something while making sure I got it wrong.

That distinction, between ”I want to be able to view-source/subscribe to this feed, but I can’t” and ”I tried to view-source/subscribe to this feed, and these interesting things happened along the way” apparently isn’t clear enough in the way I write, since I quite often have people mistake my trying to use something as a learning experience for us all as simply a poorly written and poorly addressed bug report. I do want it to serve as that, too, but I’m far more interested in having someone explain external entities well enough that I get the tradeoffs, or in having Pete prod me into doing a ”view-source on this link” extension (I never would have thought of that way around application/* mime-types!), or something else I haven’t even guessed might happen.

Comment by Edd Dumbill #
2004-07-02 17:52:24

We partially fixed things for now, I’m told. Certainly with the title tag stuff.

Re public/private: I’m happy to divulge almost anything about *me*, but I figure I owe O’Reilly a little more delicacy of touch.

Either way, we do appreciate that you care enough to comment. This falls under the case of things we’ve noticed but for some reason not fixed. Occasionally a public berating provides the required impetus to get it sorted!

cheers

Edd.

 
 
 
Comment by Henri Sivonen #
2004-07-02 11:37:39

Re: Entities: Using any XML features that non-validating XML processors are not required to support is a bad idea on the Web. See my replies to Jacques Distler on Dave Shea’s site. (IMO, Atom should be DTDless.)

Re: bozo: Well-formedness really is not that hard if you serialize from a tree instead of using ad hoc printf (and use Unicode all the way). Actually, syntax trees are not a new idea at all. Why is it that theory about dealing with formal languages is so often forgotten when dealing with markup?
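To make the tree-versus-printf point concrete, here is a small sketch assuming Python’s xml.etree.ElementTree. The element names, the function names and the sample title are invented for illustration and are not taken from any real feed generator.

```python
import xml.etree.ElementTree as ET


def entry_by_formatting(title):
    """Ad hoc string formatting: nothing escapes the markup characters."""
    return "<entry><title>%s</title></entry>" % title


def entry_from_tree(title):
    """Build a tree and let the serializer handle the escaping."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "title").text = title
    return ET.tostring(entry, encoding="unicode")


def is_well_formed(xml_text):
    """Report whether the text parses as XML at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


title = "Q&A: SVG <text> & Typography"
formatted = entry_by_formatting(title)
serialized = entry_from_tree(title)

print(is_well_formed(formatted), formatted)
print(is_well_formed(serialized), serialized)
```

The formatted string fails to parse as soon as a title contains a bare ampersand or angle bracket, while the serialized one round-trips back to the original title.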

Comment by jgraham #
2004-07-02 15:54:52

Well-formedness really is not that hard if you serialize from a tree instead of using ad hoc printf

It’s true that it’s quite possible to achieve a well-formed (and even valid!) site if you design from the ground-up with the single goal of ensuring that the site is well-formed. It’s also clear that using XML trees rather than text fragments as the objects that your CMS manipulates is one way to go about this with a high probability of success. I believe that Mozilla Composer uses a similar system that is disliked by many potential users since it prevents it from integrating with arbitrary existing templating software and prevents direct alteration of the source. Clearly this must be implemented at the CMS level although I’m not aware of any existing CMS that takes this approach.

I’d describe writing your own CMS without the help of any existing templating libraries as ”that hard”.

In order to make this operation worthwhile, the requirement for well-formedness must override any other requirements like backwards compatibility (for most sites ~0% of existing content is well formed) or the desire to incorporate random external content (such as ads). As you note in your paper (which I only briefly skimmed), XML-based templating systems make existing HTML editors useless. Presumably one has to provide training to all the system’s users about well-formedness and the art of going from obtuse error messages to the realisation that a URI they copied into their content wasn’t properly escaped.

That’s a lot of existing infrastructure that has to be thrown away in order to achieve something that’s supposed to be easy.

I feel more or less the same way about converting HTML entities to Unicode before transmitting a document over the wire. Whilst it may be desirable, there are, as far as I know, no packages that will do it automatically and no practical benefit to doing it because every future browser will employ whatever hacks are necessary to make sure that DTD-defined entities show up the way they’re supposed to.

Why is it that theory about dealing with formal languages is so often forgotten when dealing with markup?

Because the success of the web has been primarily a result of the fact that one doesn’t need a computer science degree in order to write a webpage or even to write a CMS or web based application. Bring up formal language theory and most people contributing to the web will give you a blank stare.

 
 