Telling markup from text

I have a simple problem: I need to pass a piece of text, a title, through Atom aggregators so their users see it correctly. That piece of text is “The <form> and the <button> should not be interpreted”.

It would be easy to misunderstand what Atom (RFC 4287, w00t!) is all about: since the community-backed and open-standardized parts of it took years, you might think that’s what it’s all about. It’s not. It’s about two things: relative URLs, and absolute certainty about who is a moron when something involving the less-than character doesn’t display correctly. Go back to the start, when Sam first began talking about it. Relative URLs and a clear content model.

Atom text constructs, including atom:title, can have a type attribute, with one of three values: text, html, or xhtml. If there’s no type attribute, the default is text. That tells you what’s coming out of your XML parser.

With either <title>The &lt;form> and the &lt;button> should not be interpreted</title> or <title type="text">The &lt;form> and the &lt;button> should not be interpreted</title>, which are the exact same thing, your parser will hand you The <form> and the <button> should not be interpreted along with the knowledge that it is text. For browser-based aggregators, if you need to display text in a browser, you replace & with &amp; and < with &lt; and bung it in (or, you do a bunch of fiddling with JavaScript, possibly needing to escape quotes along the way, but if the end result isn’t the same, you’ve gotten it wrong).
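For the text case, here is a minimal sketch in Python (my choice; nothing above commits to a language), using only the standard library. The variable names are made up for illustration and are not from any real aggregator. One caveat: html.escape also turns > into &gt;, which goes a step beyond the two replacements described above but displays identically.

```python
import html
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# The title as it travels in the feed: no type attribute, so it is text.
entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<title>The &lt;form> and the &lt;button> should not be interpreted</title>"
    "</entry>"
)

title = entry.find(ATOM + "title")
kind = title.get("type", "text")   # a missing type attribute means "text"
parsed = title.text                # what the XML parser hands you

# Plain text has to be escaped before it goes anywhere near a browser.
for_browser = html.escape(parsed, quote=False)
print(for_browser)
```

If for_browser is written into a page, the reader sees the angle brackets as characters, which is the whole point.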

With <title type="html">The &amp;lt;form> and the &amp;lt;button> should not be interpreted</title>, your XML parser will hand you The &lt;form> and the &lt;button> should not be interpreted along with the knowledge that it is HTML. That’s what you put in the browser, so bung it in directly.

With <title type="xhtml">… well, things get complicated depending on your parser’s API, and I’m willing to cut you some slack if you screw up. Too bad for the people that feel they have to use it.

Bloglines interprets text as HTML

So, what do you give me? With type="text", Bloglines fails to escape the text before trying to display it as HTML, so the title has the start of a form, and an actual button labelled “shouldn’t be interpreted”. Google Reader, My Yahoo!, Gregarius and NewsGator Online decide that the example tags are actual tags that they don’t like, and strip them, displaying The and shouldn’t be interpreted. In all five cases, those are bugs in the aggregators, and that’s the point of Atom: there’s no question, no wiggle room: if you treat text as markup, whether by letting it be interpreted by a browser or by removing it as unsafe markup, you have a bug.

With type="html", Bloglines, My Yahoo!, Gregarius and NewsGator get it right: apparently they simply treat all titles as HTML, so they correctly handle an Atom title in escaped HTML. Google Reader decides that the example tags are actual tags that it doesn’t like, and strips them.

On a (not-so-)wild hunch (since it’s the source of the XSS vulnerability I reported to them a month ago), I gave Google Reader a type="html" title with the less-than characters double-escaped, as &amp;amp;lt;form>, at which point it correctly displayed the title. The bug I discovered was that they were double-decoding and then interpreting the results; they apparently fixed it not by decoding only once, but by double-decoding and then stripping things they don’t like.
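To make the double-decoding failure concrete, here is a small Python sketch. It is my guess at the shape of the bug, not Google Reader's actual code: html.unescape stands in for whatever extra decoding pass the aggregator runs on the string the XML parser has already decoded once.

```python
import html

# A type="html" title after the XML parser has done its one decoding pass.
from_parser = "The &lt;form> and the &lt;button> should not be interpreted"

# The bug: decode a second time, then treat the result as HTML.
over_decoded = html.unescape(from_parser)
# over_decoded now holds a literal "<form>", which a browser (or a tag
# stripper) will take for real markup.

# The double-escaped workaround: &amp;amp;lt; in the feed comes out of the
# parser as &amp;lt;, and the extra decoding pass brings it back to &lt;.
workaround_from_parser = "The &amp;lt;form> and the &amp;lt;button> should not be interpreted"
rescued = html.unescape(workaround_from_parser)
```

The fix is to drop the extra pass, not to clean up after it: rescued is correct only because the feed was wrong in exactly the way that cancels the bug.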

Displaying text from an XML feed in a browser is a four-step process, and you don’t get to skip any steps, and you can’t do them out of order:

  1. Parse the XML.
  2. Put the character data in a deterministic state.
  3. Sanitize the data.
  4. Prepare the data for your particular display needs.

If you don’t have the parsing down, you’re not even in the game.

The very next thing you need to do with the characters your parser handed you is to decide what they were, and put them all in the same state. Because there’s no way to tell the difference between “a less-than I intended to display as a less-than” and “a less-than I intended as the start of markup” in plain text, that state needs to be HTML, whether or not you are going to allow HTML markup, or even whether or not you actually have an HTML renderer.

With RSS titles, it’s pretty much up to you what they were supposed to be. There has never been a spec which says that the content of the title element is escaped HTML, and one spec explicitly said no HTML markup, but if you look at actual examples, whether it’s things that look like markup or things that look like named entities, you’ll find that most people are treating it as escaped HTML, wanting &amp;copy; to display © rather than &copy;. Write a heuristic to guess, be an asshole and treat them all as text, or bow to the inevitable and treat them all as escaped HTML; I really don’t care, and this is exactly why I will no longer produce RSS. Just don’t get distracted thinking about whether or not you want to display HTML in your titles. That comes later: right now you aren’t doing anything other than determining what state of escaping the original is probably in and, if you believe that state to be plain text, converting & and < to &amp; and &lt;.
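If you go the heuristic route, it might look something like this Python sketch. The regular expression and the function name are hypothetical, invented for illustration. It is a guess about what "looks like" escaped HTML, and it will be wrong for some feeds, which is rather the problem with RSS.

```python
import html
import re

# Hypothetical pattern: something shaped like an entity or character
# reference, or something shaped like a tag.
LOOKS_LIKE_HTML = re.compile(
    r"&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);"
    r"|</?[a-zA-Z][^<>]*>"
)

def rss_title_as_html(parsed):
    """Guess the escaping state of an RSS title and return it as HTML."""
    if LOOKS_LIKE_HTML.search(parsed):
        return parsed                       # assume it was escaped HTML
    return html.escape(parsed, quote=False) # assume it was plain text

print(rss_title_as_html("&copy; 2005 Example"))
print(rss_title_as_html("AT&T < IBM"))
```

Note that this guess treats a title like “The <form> and the <button> should not be interpreted” as markup, because nothing in RSS says otherwise.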

With Atom, it’s simple, and that’s the whole point of this ramble: if there’s a type="html" attribute on the element, then it is HTML when it comes out of the parser; if there’s a type="text" attribute or no type attribute, it is plain text. Convert the text, leave the HTML alone, and just don’t treat them as the same thing.
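As a rough Python sketch of that rule, assuming the title has been parsed with the standard library's xml.etree.ElementTree: title_as_html is a name I made up, and the xhtml branch is deliberately left unimplemented, since that one depends on your parser's API.

```python
import html
import xml.etree.ElementTree as ET

def title_as_html(title_el):
    """Return an Atom text construct as HTML, whatever it started as."""
    kind = title_el.get("type", "text")   # no type attribute means text
    content = title_el.text or ""
    if kind == "text":
        return html.escape(content, quote=False)  # convert the text
    if kind == "html":
        return content                            # leave the HTML alone
    raise NotImplementedError("type='xhtml' needs parser-specific handling")

as_text = ET.fromstring(
    "<title>The &lt;form> and the &lt;button> should not be interpreted</title>"
)
as_html = ET.fromstring(
    '<title type="html">The &amp;lt;form> and the &amp;lt;button> '
    "should not be interpreted</title>"
)

print(title_as_html(as_text))
print(title_as_html(as_html))
```

The two results differ only in whether > was escaped; a browser displays both the same way.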

Once you’ve gotten everything into the same deterministic state, then and only then can you remove the things you don’t want to include, because now you know that any < is the start of markup, and any &lt; is not. Filter out markup, remove XSS attacks, convert named entities to characters or NCRs, whatever you want, but except for things you’ve intentionally removed, your final display should be exactly the same as it would be if you just took the string at that point and put it in a browser.
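One way to do the "filter out markup" part, sketched in Python with the standard library's html.parser. TagStripper and strip_markup are hypothetical names, and this is the bluntest possible policy (drop every tag, keep every piece of text), not a vetted XSS filter. What it shows is that real markup goes away while an escaped less-than survives as an escaped less-than.

```python
from html import escape
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Drop all tags; keep character data, re-escaped so it stays HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # data arrives with character references already decoded,
        # so escape it again to keep the output in the HTML state.
        self.parts.append(escape(data, quote=False))

def strip_markup(html_fragment):
    parser = TagStripper()
    parser.feed(html_fragment)
    parser.close()   # flush any buffered text
    return "".join(parser.parts)

print(strip_markup("<b>bold</b> &amp; 1 &lt; 2"))
print(strip_markup("The &lt;form> and the &lt;button> should not be interpreted"))
```

Because the input is already known to be HTML, the stripper never has to guess whether a < was meant as text.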

And now-and-only-now, once you have $item->sanitized_title_in_html, you can do whatever sort of odd character substitution and massaging and escaping or plain-textification you need to do to make it work with your display.
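For instance, in Python, and assuming the sanitizing step removed all markup so that only text and character references remain, the last step might be as small as this. The variable names are mine, and the second title is an invented example, there only to show quotes being handled.

```python
import html

sanitized_title_in_html = "The &lt;form&gt; and the &lt;button&gt; should not be interpreted"

# Plain-textification, e.g. for a window title or a text-only widget.
plain_title = html.unescape(sanitized_title_in_html)

# Attribute context, e.g. title="...": quotes have to be escaped as well.
other_plain = html.unescape("Tom &amp; \"Jerry\"")
attribute_value = html.escape(other_plain, quote=True)

print(plain_title)
print(attribute_value)
```

Each display context gets its own conversion, but they all start from the same sanitized HTML string.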

So, could you, as a special favor to me, get it right, especially with type="text"? See, I’ve got these 300,000 <entry>s headed your way, many of them with a < that doesn’t signal the start of markup.

32 Comments

Comment by Anne van Kesteren #
2005-12-05 15:39:23

Did they just fix this or so? I don’t get it in Bloglines anymore… Good to know that Atom finally has a number by the way. Yay!

Comment by Phil Ringnalda #
2005-12-05 15:58:15

They don’t seem to have fixed this, anyway: previewing this feed, I still get interpreted markup and character entity references that shouldn’t be interpreted.

Comment by Anne van Kesteren #
2005-12-05 16:44:20

I get the error there as well. Oh well.

 
 
 
Comment by Robert Sayre #
2005-12-05 16:57:17

Comment by Phil Ringnalda #
2005-12-05 19:08:09

Firefox, too, but then the bug you took on non-text is one of the reasons I want to push text ;)

 
 
Comment by Mark #
2005-12-05 18:30:16

I’m pretty sure Universal Feed Parser 4.0 (currently in CVS) gets this right. I added a bunch of Atom 1.0 test cases a few weeks ago and checked in code to pass them.

Comment by Phil Ringnalda #
2005-12-05 19:06:07

I wonder what the cycle’s going to be like, from you releasing to Planet Planet including to Planet Mozilla upgrading. It feels so… dirty to be slipping it a deprecated feed, ’round the back door.

 
 
Comment by Aristotle Pagaltzis #
2005-12-05 19:27:37

Liferea does this correctly. Then again, of course it would: it was yours truly who beat its Atom 1.0 support to a pulp repeatedly in order to make it work.

(Internally Liferea kicks escaped markup around, which makes things a nightmare to get right. What’s a unit test, mommy?)

(Uh, that reminds me, I should check if it really has all of my patches.)

Comment by Phil Ringnalda #
2005-12-05 19:44:59

I do feel sort of guilty that I haven’t pounded on/patched Gregarius or MagpieRSS (or both, probably), but there are so many projects, and every time I look at MagpieRSS I get wound up in architectural things, and yadda yadda that boils down to I don’t feel quite guilty enough, yet.

 
 
Comment by Roger Benningfield #
2005-12-05 23:27:23

Well… that was an adventure. Tweaking JournURL’s parser really only took a few minutes, but it exposed a feature of jTidy that I hadn’t noticed previously.

Coldfusion’s XmlFormat() turned the apostrophe in one title into an apos entity, but when jTidy got hold of said title thirty steps down the line, it said ”what is this ’apos’ of which you speak?” and promptly double-escaped the sucker. Unfortunately, it took an hour of fiddling to realize this… if I’d had my head screwed on straight, Tidy would have been the obvious culprit. Duh.

With that said… it works now. :)

Comment by Phil Ringnalda #
2005-12-06 00:01:28

Heh. ”Never escape anything unless and until you actually need it escaped” would be a good rule, though I clearly don’t follow it myself (I thought about taking out the &apos; in my sample feed, which is there because what I started with had been through a Template Toolkit filter that didn’t care whether it was dealing with element content where &lt; and &amp; would do, or attribute content where &apos; and &quot; matter, or CDATA content where &gt; matters iff it follows ]], but I just left it, because what could it hurt?).

Comment by Aristotle Pagaltzis #
2005-12-06 04:12:42

To be fair, this matters only in case of &apos;, because it’s a known named entity in XML whereas it’s not defined in HTML. If you take that into account and simply always escape apostrophes as ', you’ll stay clear of any problems.

Comment by Aristotle Pagaltzis #
2005-12-06 04:33:32

Err, that was supposed to be &#39;.

 
Comment by Phil Ringnalda #
2005-12-06 07:30:45

Interesting: why do I have a double-decode bug for that NCR? And if typing it correctly, as you did, displays the character, will doubling display ”properly” as &amp;#39;?

Comment by Phil Ringnalda #
2005-12-06 07:34:31

Sigh.

simply […] stay clear of any problems was too much provocation for Murphy, apparently.

 
Comment by Aristotle Pagaltzis #
2005-12-06 11:27:31
  • & amp; #39; gets double-decoded.
  • I tried to see if & #38; #39; would do what I mean, but that comment seems to have been eaten (Akismet again?).

(Blanks added to fool decoding.)

Comment by Phil Ringnalda #
2005-12-06 12:56:05

Akismet indeed: it (reasonably enough) thinks that NCRs of low-ASCII are a very strong sign of spam.

Since I’ve been using it, I’ve had three false positives, all you ;)

Comment by Mike Mariano #
2005-12-06 13:23:09

Wait—Akismet eats funny-looking comments, regardless of whether or not someone’s commented before?

”Blanks added to fool decoding”? This reaction is to something less like a spam filter and more like a Slashdot-style manner policeman. Does Akismet weed out first posts and ASCII penises?

==> of the world, unite!

Comment by Phil Ringnalda #
2005-12-06 14:45:03

regardless of whether or not someone’s commented before?

A huge swath of the world knows that Mark Pilgrim always comments as ”a@b.com” (even with broken things like Radio Userland’s comments, where the next person to use that will change the name on all his comments), so ”someone’s commented before” has two broken choices, either to conflate IP addresses and identity, or to allow insider spammers free rein.

And odds are Akismet felt sorry for you, after the orchiectomy, let’s see:

8===>
 
 
Comment by Aristotle Pagaltzis #
2005-12-06 20:59:22

So &amp; gets a free pass but & #38; is filtered? I don’t consider that reasonable.

Sigh. Can we please burn the spammers at the stake now and throw out all these pain-in-the-bottocks fences they made us put into things and then go back to getting useful stuff done, thanks.

I might start signing my comments that much earlier (one day I plan to anyway, but I haven’t gotten around to it) just to avoid the spam traps. (Assuming signed comments do get a not-spam score bonus, that is…)

 
Comment by Aristotle Pagaltzis #
2005-12-06 21:17:46

And (sorry for being such a chatterbox on this post of yours) now I remember, at least one of those was because I used an NCR in order to keep WordPress’ dirty graphical smiley substitutions’ fingers off my comment. Punished as a spammer for wanting to keep it real with the no-silly-graphics posse – oh, the irony.

Funnily enough, knowing of your double-decode bug, I actually could use low-ASCIIs now.

This thing is getting funnier by the minute.

Comment by Aristotle Pagaltzis #
2005-12-06 21:19:37

Err, low-ASCIIs. La la la…

Comment by Aristotle Pagaltzis #
2005-12-06 21:26:04

WTF? So I see it was WordPress that was the culprit, not me forgetting something. I wrote low-<abbr title="Numerical Character Reference">ASCII</abbr>s both times.

Comment by Phil Ringnalda #
2005-12-06 22:06:04

WTF indeed. If you didn’t just decide to tell me three times that what I always thought was an acronym for American Standard Code for Information Interchange is actually an acronym for Numerical Character Reference, could you duck around the outside of WordPress, and email me what you actually typed? There’s a catchall for philringnalda.com, so choose any pseudo-mailbox you like.

 
 
Comment by Phil Ringnalda #
2005-12-07 10:17:58

Okay, color me baffled. Those are all stored in the database as they appear here, even though as far as I can tell, I don’t have WordPress doing any comment munging on the way in, only on the way out. And as should have been obvious to me, that’s supposed to be two <abbr> elements separated by a space, one for low-ASCII followed by one for NCRs to produce low-ASCII NCRs.

Comment by Phil Ringnalda #
2005-12-07 10:22:39

Which WFM (and not just because it’s signed, since I submitted it unsigned first).

 
Comment by Aristotle Pagaltzis #
2005-12-07 10:41:26

Ok, now I feel extremely stupid. The problem was at my end. I use Markdown plus a custom filter for abbrs to write HTML; you guessed it, that filter is buggy. Sorry for the false alarm.

 
 
 
 
 
 
Comment by Aristotle Pagaltzis #
2005-12-06 11:30:08

(But note that, curiously, named entities at the second level, like & amp; amp;, work as intended.)

 
 
 
 
 
Comment by Matt Nordhoff #
2005-12-06 02:38:45

Hmm, shoot. The two Firefox extension newsreaders I’ve used don’t handle the feed well. Sage strips <form> and <button> out and Habari Xenu doesn’t escape them so they get interpreted. Now I feel dirty using a reader that doesn’t do it right. :(

I should probably go complain to their authors now, I guess (though I don’t use Habari Xenu anymore).

Comment by Matt Nordhoff #
2005-12-06 02:41:50

Ack, apparently WordPress strips ’em out too if I forget to use &lt;. Oops.

Comment by Phil Ringnalda #
2005-12-06 07:23:27

I really need to get around to adding preview, and validation to help tell you what’s wrong in your preview.

Luckily, at least my alterations for signed comments mean that I save what you type, and it’s only stripped on display, so all I had to do was edit it, not guess at what it said.

Comment by Matt Nordhoff #
2005-12-06 07:58:54

Oh, gee, thanks. :)

Well, Preview would be useful (and that sole ”Add comment” button looks kind of lonely), but it isn’t a hugely important thing. :)

Why isn’t the sanitized version stored? For when this happens? It would be more efficient to only sanitize each comment once instead of on every page load or however it works if it’s cached.

 
 
 
 