Your bread hurts me: Atom text constructs revisited
Uche Ogbuji’s XML.com column this month, Handling Atom Text and Content Constructs, is mostly a nice introduction to dealing with the Atom elements that hold the text you actually see (titles, summaries, content, and a few other things), with one absolutely horrifying mistake. After showing examples of plain text titles, he says
[…] and you should not even have tunnelled markup through encoding. Atom does not strictly prohibit the form in listing 3, but it does violate the spirit of the specification.
Listing 3: Bogus (unsignalled) encoded markup in plain text construct
<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
Nooooooooo! No, no, no!
That is an absolutely, positively, perfectly valid Atom title, which should be rendered as One <strong>bold</strong> foot forward. It isn’t prohibited, it doesn’t violate the spirit of the spec, it is the spirit of the spec. The only thing that violates the spirit (and the letter) of the spec is the idea of deciding, based on the sins of the past, that any instance of &lt; means that it is markup, no matter what the author said that it is.
Doing that is like deciding that any sentence which contains the word “pain” is in English, whether or not the author is a French breadmaker writing about baking, with every other word having no meaning in English, in a sentence marked up with <p xml:lang="fr">.
An Atom <title> or <title type="text"> is text: there is no escaping, no tunnelling, no markup, it is text. To treat it as anything else, to not escape it before handing it to an HTML renderer, to strip things which would be unsafe or undesirable if it were HTML, which it is not, is a bug. No negotiation, no wiggle room, no examples of producer error that make you need to do it: either you display that example as One <strong>bold</strong> foot forward or you are not just wrong, you are destroying the very reason Atom exists.
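For consumers that hand titles to an HTML renderer, the rule can be sketched roughly as follows. This is a minimal illustration using Python's standard library, not anything from Uche's column or the Atom spec itself; the function name `title_as_html` and the sample entry are assumptions made up for the example. The XML parser undoes the XML-level escaping, and the consumer then escapes the resulting text again for HTML so that the reader sees the literal angle brackets.

```python
import html
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Sample entry (made up for this sketch): a plain text title whose
# angle brackets are entity-escaped at the XML level.
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
</entry>"""


def title_as_html(entry_xml: str) -> str:
    """Return an HTML fragment that displays a type="text" title literally."""
    entry = ET.fromstring(entry_xml)
    title = entry.find(ATOM_NS + "title")
    # A missing type attribute means type="text".
    if title.get("type", "text") != "text":
        raise ValueError("this sketch only handles plain text titles")
    # The parser has already resolved the XML entities, so this is the
    # author's actual text, angle brackets and all.
    text = title.text or ""
    # It is text, not HTML, so escape it before it reaches an HTML renderer.
    return html.escape(text, quote=False)


rendered = title_as_html(entry_xml)
print(rendered)
```

The fragment that comes out still contains `&lt;strong&gt;`, which a browser displays as the characters `<strong>` rather than turning "bold" bold. No stripping or sanitizing is involved, because nothing in the title is markup.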
Just a small correction to your correction. I’m assuming you should at least have unescaped the entities and converted them to angle brackets since that is part of the XML escaping. It’s easiest to think of in terms of a CDATA section where, in text mode, you would display exactly the text that is included. You wouldn’t suddenly start escaping certain characters before displaying the content as plain text.
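The CDATA comparison can be checked with a small sketch, again assuming Python's standard `xml.etree.ElementTree` parser and two made-up title elements: one with entity-escaped angle brackets, one wrapping the same characters in a CDATA section. Both should yield the same text with no child elements, which is the string a plain text display would show.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same plain text title (both made up for this sketch).
escaped = ET.fromstring(
    "<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>"
)
cdata = ET.fromstring(
    "<title><![CDATA[One <strong>bold</strong> foot forward]]></title>"
)

# Neither title has child elements; the angle brackets are character data.
same_text = escaped.text == cdata.text
print(escaped.text)
print(same_text)
```

Escaping for HTML only enters the picture when that text is later placed inside an HTML page, which is a separate step from reading the XML.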