My brother’s (feed’s) keeper

One thing I didn’t expect, though perhaps I should have, about moving from Bloglines to Feed on Feeds, was the way that I am now responsible for making sure that the feeds I read are well-formed, usable XML. Bloglines is apparently fairly liberal in its parsing, and even when it isn’t, it doesn’t exactly put feed errors right in your face: in moments of extreme boredom or extreme avoidance behavior, I would sometimes scroll clear through my hundreds of subscriptions looking for the red “[!]” that it uses to indicate a problem with a feed, but rarely, rarely. With Feed on Feeds, it’s just a click (or two) on the “age↑” column header to sort them with the ones that were least-recently workable first. And then, since it seems that nobody else is noticing, I have to either rattle some cages, or wait for the offending post to fall out the bottom of the feed. At first, I tried emailing some people, but either email really is broken, or nobody reads email from me anymore. I was going to switch to leaving comments, at least where they are available, but then I noticed that my infatuation with short posts has left me in danger of having my sidebar wrap around underneath my suddenly shrunken content column, so instead I’ll just drop a few here. You know you love reading this stuff, anyway, right?

Le «blog personnel» de Joe Clark » ← Office ¶ Kindergarten →
This will become an all too common theme in this, and no doubt subsequent, lists: Joe’s title contains HTML entities (or, if you prefer formality, character entity references from the HTMLSymbol set), which are perfectly at home in HTML, but are undefined in the XML of his feed, and thus need to either be escaped as &amp;larr; and &amp;rarr;, or need to be the literal character rather than an entity reference. The more I learn about character encoding, and the more problems I see in feeds, the more I like literal characters (properly escaping them would let me see the post in Feed on Feeds, but because I haven’t gotten around to hacking in support for unescaping them, I’d see the literal &rarr; rather than an arrow). In Movable Type it’s a right royal pain, but I quite often type an entity, and then the next time I’m looking at the preview in its annoying separate page, I copy the character and paste it back into my title or entry. In WordPress, it’s a quick scroll down to the preview at the bottom of the posting page and back up. Joe’s using WordPress 1.2-delta, but my copy of 1.5-alpha-6 suffers from the same lack of title-escaping, so chances are 1.5-final does as well.
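For anyone who wants to see the three cases side by side, here is a minimal sketch using Python's standard XML parser. The feed fragments are made up for illustration (they are not copied from Joe's feed), and `try_parse` is just a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical <title> fragments, not taken from any real feed.
raw = "<title>&larr; Office &rarr;</title>"              # HTML entities, undefined in XML
escaped = "<title>&amp;larr; Office &amp;rarr;</title>"  # amp-escaped entities
literal = "<title>\u2190 Office \u2192</title>"          # literal arrow characters


def try_parse(xml_text):
    """Return the element text, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # An undefined entity is a fatal error for a conforming XML parser.
        return None


print(try_parse(raw))      # the parse fails
print(try_parse(escaped))  # parses, but the text still holds the entity names
print(try_parse(literal))  # parses, and the text holds real arrows
```

The escaped version survives parsing but leaves the consumer holding the entity name as text, which is why the literal character is the least fragile choice.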
World’s first? Wimax for train commuters
Dunstan mentions that Britain’s mobile companies paid 22.5 billion of something for licenses, but the currency unit is in question, or rather is a question mark. That’s not too uncommon with the euro symbol, since it’s not defined in ISO-8859-1, and pasting one into a form in an ISO-8859-1 page will tell your browser that you want to silently submit the form in Windows-1252 instead. Win-1252 uses 0x80 as the codepoint for the euro, while that’s an undefined character in ISO-8859-1. Since Dunstan publishes in UTF-8, the one true way to avoid that sort of silent recoding, I’m not sure how he managed an undefined currency symbol, but luckily his blog is powered by Naked Dunstan Technologies, so it’s his problem alone :)
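The euro mismatch is easy to poke at from a Python prompt. This is only a sketch of the byte-level disagreement, not a reconstruction of what happened on Dunstan's site. One caveat to keep in mind: Python's `iso-8859-1` codec maps byte 0x80 to the C1 control character U+0080 rather than treating it as undefined, so the misread shows up as an invisible control character instead of an error.

```python
euro = "\u20ac"  # the euro sign

# Windows-1252 has a slot for the euro, at byte 0x80.
cp1252_bytes = euro.encode("cp1252")

# ISO-8859-1 has no euro at all, so encoding fails outright.
try:
    euro.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Reading the Windows-1252 byte back as ISO-8859-1 yields a control
# character, not a currency symbol.
misread = cp1252_bytes.decode("iso-8859-1")

print(cp1252_bytes, latin1_can_encode, repr(misread))
```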
J-Walk Blog: Cocktail Generator
Not my favorite problem to explain, or to suggest how to fix. J-Walk’s HTML is interpreted as being encoded in ISO-8859-1, thanks to a <meta> tag (though the HTTP standard would say it was as well, since that’s the default for all text/* media types, including text/html). However, his RSS is delivered with a Content-Type: application/xml header without a charset parameter, and the XML declaration in the file itself doesn’t specify an encoding, so thanks to the insanely complicated rules for XML, that means his feed is interpreted as being encoded in UTF-8, whether or not it really is. Sometimes that’s just fine, and other times you paste in a ½ thinking you’ll get a “vulgar fraction one half” (why are those called vulgar, by the way?), and instead you’ll get an invalid, unparseable feed. Ideally, the fix would involve hacking whatever pMachine uses to generate the feed, to add ; charset=iso-8859-1 at the end of the content-type header, and also changing the XML declaration in the feed itself to read <?xml version="1.0" encoding="iso-8859-1"?> so it will retain the knowledge of what it is no matter where it travels.
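Here is a rough sketch of why the declaration matters, again with Python's standard parser. It only models the in-file XML declaration (the HTTP header half of the story is left out), and the two byte strings are invented stand-ins for a feed, not J-Walk's actual output.

```python
import xml.etree.ElementTree as ET

# The single byte that ISO-8859-1 uses for the one-half fraction.
half = "\u00bd".encode("iso-8859-1")  # b'\xbd'

# Hypothetical feed fragments: same bytes, different declarations.
no_decl = b'<?xml version="1.0"?><title>' + half + b" cup</title>"
with_decl = (
    b'<?xml version="1.0" encoding="iso-8859-1"?><title>' + half + b" cup</title>"
)


def title_text(feed_bytes):
    """Return the title text, or None if the bytes are not well-formed XML."""
    try:
        return ET.fromstring(feed_bytes).text
    except ET.ParseError:
        return None


# With no declared encoding the parser assumes UTF-8, where a lone 0xBD
# byte is invalid, so the whole document is rejected.
print(title_text(no_decl))
# With the declaration in place the same byte decodes cleanly.
print(title_text(with_decl))
```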
rawbrick.net › articles › 5 “love” songs
This one sucks particularly badly because Carol’s RSS feed is just fine: the two entities in the title, &ldquo; and &rdquo;, are both amp-escaped there. However, I’m subscribed to her Atom feed, where they are left unescaped, and thus are undefined entities. Judging by the URL, Textpattern g1.17 could use a little help learning about how to strip entities for filenaming, as well as learning how to escape them in Atom.
Half a Six (Apart) Pack
During Movable Type and Six Apart’s recent redesign, I think I read in one post that they were going to redirect the Six Log feed to a combined feed of everything they post everywhere, though I’m not sure because I was being flooded with posts due to their redirection of all the feeds to Feedburner, causing every <link> to change, and every item to become new again. Looking at what’s in the feed right now, it appears that maybe it is an aggregated feed, with a two-day delay on posts from the Pronet Weblog. My vague memory is that one of the benefits of Pronet membership was supposed to be earlier access to posts there, but the URL where I was subscribed to it directly is now 404, as is my subscription to the Movable Type news feed and the MT Plugins feed. Of course, since they’ve outsourced all their feeds, the exact same post in the two different feeds would have a completely different <link>, and thus look new again to anything not looking at the <atom:id>, or the element I never noticed before and now think I shall abuse, <feedburner:origLink>. In any case, my ideas about what I’m supposed to do about the three feeds I was subscribed to that are now returning a 404 are now as muddled as this post, and as my ideas about why they needed to outsource their feeds (are they incapable of producing usable feeds, despite writing software that’s supposed to? are they incapable of handling the bandwidth, despite running a hosting service? are they so desperate to know about click-through numbers that they’ll annoy their subscribers, and lose some along the way? puzzling.). Probably I’ll just take the easy way out, and unsubscribe. Oops.

35 Comments

Comment by James #
2005-02-18 23:49:16

It looks like that bug in Textpattern’s Atom code is still around as of 1.0RC1. I’ll file a ticket.

Comment by Phil Ringnalda #
2005-02-19 17:59:09

Thanks James, exactly what I hoped someone would do, without my having to figure out what the current newest version was and see if it was there and find the bug db. ’Preciate it :)

 
 
Comment by Jacques Distler #
2005-02-19 00:28:25

(X)HTML (+MathML) named entities in feeds are exactly the reason I wrote the Numeric Entities plugin for MovableType and the accompanying MathML::Entities Perl Module.

Evan Nemerson wrote a PHP implementation.

So there’s no real excuse any more for sending HTML entities in an RSS/Atom feed. If your blogging tool doesn’t convert them to numeric character references (or, possibly, utf-8 characters, depending on what encoding it uses for your feed), then your blogging tool is broken.

Comment by Mark #
2005-02-19 07:23:10

In Netscape’s RSS 0.91, the HTML entities were defined in the DTD, and were therefore valid to include verbatim in your feed. Userland’s RSS 0.91 removed the DTD, and therefore broke this very useful feature, and we’ve been DTD-less in syndication land ever since.

 
 
Comment by Pete Prodoehl #
2005-02-19 04:31:28

Sigh, I know, I have a few feeds in FoF that I never see because they’re invalid… I’ve chosen to ignore the problem, which may not be the best solution. I suppose public shaming is a possible solution.

Comment by Mark #
2005-02-19 07:20:19

Must. Control. Fist. Of. Death.

 
Comment by Phil Ringnalda #
2005-02-19 07:30:20

Wups, this version of this post does rather look like I’m trying to shame the authors into doing better. The first version did a much better job of explaining that I was writing in public rather than email only partly because people ignore my email, and mostly because I thought it much more likely that the authors of the software would see it here. For most feed problems I trip over, that’s where the problem really should be stopped. Then when I rewrote the whole thing, after accidentally closing the wrong tab, it lost that flavor.

 
 
Comment by Joe Clark #
2005-02-19 07:03:46

I thought the only XML characters you *had* to escape were less-than/greater-than and ampersand?

Anyway, does that headline display incorrectly in any known browser? It doesn’t on Mac.

Anyway2, what kind of validator does one use to determine well-formedness?

Comment by Mark #
2005-02-19 07:19:35

feedvalidator.org will catch this and many other common mistakes.

 
Comment by Phil Ringnalda #
2005-02-19 07:47:55

The only things you have to escape are less- and greater-than in element content, quotes in attributes, and any ampersand that doesn’t signal the start of an entity reference to a defined entity. In HTML, you have a DTD (either explicitly referenced or implied) that defines all the named HTML entities, but in RSS you don’t, so &rarr; is undefined, and because XML does a lot more with entity references than HTML, that has to be a fatal error. Any & not followed by a defined entity means it’s time to halt and catch fire (or, for most aggregators, time to switch to the liberal and forgiving non-XML parser or to fix the error and parse again).

If there’s any browser that fails to display it, they should be shamed into oblivion: in HTML, failing to recognize an HTML entity is unacceptable. It’s just in XML where failing to recognize it is a sign of a fatal error in the input, rather than the processor.

And while feedvalidator.org is the place to really check your feed, for problems like that just loading your feed directly in your browser will usually tell you. At least in Gecko, there’s a bug that will keep it from reporting to you that you have a character which isn’t defined in your encoding, so Dunstan would be out of luck, but if you feed them XML, browsers are happy to tell you about undefined entities.

 
Comment by Jacques Distler #
2005-02-19 08:48:02

Escaping &rarr; to &amp;rarr; in your feed is good. Recoding &rarr; to &#8594; is better. The latter will actually display as the desired → character.

Numeric character references are always safe. If you’re using utf-8, so is typing → (you, of course, have an easy way to just type “→”, don’t you? ;-).

 
 
Trackback by Sam Ruby #
2005-02-19 08:54:43

Encoding Fun

Phil Ringnalda: Joseph Walton:

 
Comment by Geof F. Morris #
2005-02-19 10:29:14

Mark Pilgrim has to be mumbling about how I long ago lamented FoF’s [well, Magpie's] strict parsing. Hi, Mark.

While waiting for Magpie to become a liberal parser, I’ve suggested a jail for bad feeds, which originally started with feeds giving a 410 [hi again, Mark] or those who changed URLs without giving a forwarding address via 301 [hi, Six Apart!]. One source of many FoF-related 500 Internal Server Errors seems to be one or two feeds hanging up and causing update-quiet.php to give up. I readily admit that my bad-feed-purgatory is a horrible kludge, but before the good software design police shoot me, remember that I’m an aerospace contractor and engineer. :)

 
Comment by Joe Clark #
2005-02-19 11:42:10

My issue is that I care an order of magnitude less about feed validity than page validity. Actually, that may be an overstatement, as zero isn’t an order of magnitude less than anything but zero.

Comment by Jacques Distler #
2005-02-19 13:04:52

“My issue is that I care an order of magnitude less about feed validity than page validity.”

Which is ironic, as — being an XML application — validity (more precisely, well-formedness) actually has an impact on whether people can use your feeds.

 
 
Comment by Joe Clark #
2005-02-19 13:44:12

I got by for a very long time without feeds. I provide them because WordPress does. If they need fixing, it’s beyond my technical ability, as are many things. So if somebody else wants to fiddle with my installation, fine. But I can’t do it myself.

What I have an investment in is providing Web pages and not feeds of Web pages. Valid feeds are nice, I suppose. If you’d like to help me fix them up, go to town.

Comment by Phil Ringnalda #
2005-02-19 17:57:36

Well, damn it all.

I’d like to claim that after another twenty or thirty posts of this general sort, I’ll learn how to say them like I mean them, to make it clear that I’m saying “here are these classes of problems that we aren’t solving in these programs, and look!, because of these instances I’m being kept from knowing that Joe posted something, and I’m being kept from reading about what Carol’s watching, and how can we fix it?” I don’t want to say “Joe’s a moron for not understanding and caring about XML,” because I don’t want you to even think about XML unless your program just can’t figure out how to make things right, and it has to ask you what it should do, but I don’t seem to be able to get that part across, and I can’t claim to think I’ll get better since I’ve been saying that for at least four years now.

The good part of dragging your dirty feed around in public, though, is that Jacques reminded me that numeric character references are always right, and always refer to the character’s codepoint in Unicode, not to something specific to that particular weblog’s character encoding, so if someone hasn’t already done it, that’s the where and how to do a WordPress plugin (or patch for the next version, or both): PHP’s a bit of a pain to do entity-to-character translations in unless you’re using ISO-8859-1 or have a new enough PHP, but translating entities to NCRs is just simple string translation, the sort of boring crap that even I can manage to program.
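The "simple string translation" really is about this small. What follows is a sketch in Python rather than the PHP a WordPress plugin would need, and `named_to_ncr` is a made-up name, not anything from WordPress or from Jacques's plugin. It leans on the standard library's HTML 4 entity table and leaves the five entities XML already defines alone.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# These are predefined in XML itself, so they must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_ncr(text):
    """Replace HTML named entities with decimal numeric character references."""

    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Unknown names are left untouched; they would still break a
            # feed, so a real tool would want to escape or report them.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY.sub(swap, text)


print(named_to_ncr("&larr; Office &para; Kindergarten &rarr;"))
```

Because the table maps names straight to Unicode code points, the output does not depend on whatever encoding the weblog happens to use.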

Comment by Mark #
2005-02-19 19:02:49

Numeric entities are *not* always right. The entities between 0x80 and 0xFF are encoding-specific and should not be used unless you are absolutely sure that your feed is declared *and served* with the proper character encoding. See Sam’s survival guide for a list of these problematic entities, and their proper encoding-independent replacements.

Also, Joe is correct — this should be handled by the blogging software, not individual end users. Unless end users are foolish enough to write their own software, like Dunstan.

Thank you for reminding me why I went on vacation. It was from this.

Comment by Phil Ringnalda #
2005-02-19 19:34:09

From this? Damn, man, this is the good stuff, people learning things and fixing things so they can read and write with fewer barriers in the way. So far, this hasn’t resulted in a single 403 Forbidden, or anyone saying that if it suits their company, they’ll lead people on and then pull the rug out from under them, or anyone calling anyone self-important, or… oh, wait. That’s something else entirely, and this time around, doesn’t even involve RSS.

But I didn’t mean that any ampersand-digits-semicolon was always safe, I meant that it’s possible to replace HTML’s named entities with an NCR which will be safe in both XML and the HTML that’s extracted from it, without needing to know the charset of the characters surrounding the NCR. You don’t have to worry about 0x80 to 0xFF if you are translating every &hellip; to &#8230;, because everything that handles the feed has to know that it replaces &#8230; with a … in whatever encoding it’s outputting. Right?

 
Comment by Jacques Distler #
2005-02-19 19:55:03

Unicode codepoints 0x0-0x08, 0x0B, 0x0C, 0x0E-0x1F are problematic, whether you use NCRs, the corresponding iso-8859-1 characters or utf-8 or ….
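A filter for those codepoints might look something like the sketch below. Both function names are hypothetical, and it simply drops the offending characters and references. Whether dropping is the right policy (as opposed to replacing them with something visible) is a choice for the tool author.

```python
import re

# C0 controls that XML 1.0 does not allow: everything below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_illegal(codepoint):
    return codepoint < 0x20 and codepoint not in (0x09, 0x0A, 0x0D)


def strip_illegal_controls(text):
    """Drop raw control characters that cannot appear in XML 1.0."""
    return _ILLEGAL_CHARS.sub("", text)


def strip_illegal_ncrs(text):
    """Drop numeric character references that point at those same controls."""

    def check(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return "" if _is_illegal(codepoint) else match.group(0)

    return _NCR.sub(check, text)


print(repr(strip_illegal_controls("a\x00b\x0bc\td\n")))
print(strip_illegal_ncrs("x&#11;y&#x1F;z&#9;"))
```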

If you don’t filter these characters out of your utf-8 input, you are in just as much trouble as if you don’t filter out the corresponding NCRs.

 
Comment by Sam Ruby #
2005-02-20 17:25:39

Sorry to be pedantic, but numeric character references are *not* encoding specific in either HTML or XML. In both cases, they are based on the “Universal Character Set” defined in ISO 10646.

Comment by Phil Ringnalda #
2005-02-20 18:10:20

Bah, I misread what Mark miswrote. &#149; always unambiguously refers to the control character MESSAGE WAITING. Because that makes no sense to include in an HTML page, if you use it, at least Firefox and IE/Win will assume you meant code point 149 in Windows-1252, BULLET. That’s a problem if you are expecting people to type NCRs, and no problem whatsoever if you are expecting people to continue to use HTML named entity references, and only want to have WordPress have a translation table that converts them to NCRs in the XML feeds, because (one hopes) the translation table will not make that mistake, and will convert &bull; to the proper &#8226;.
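The two readings of reference number 149 can be seen side by side in Python. Treat this as a sketch: `html.unescape` follows the HTML5 parsing rules, which I am using here as a stand-in for what the browsers do, while the XML parser takes the number at face value.

```python
import html
import xml.etree.ElementTree as ET

# An XML parser reads 149 as Unicode code point U+0095, a C1 control.
xml_view = ET.fromstring("<title>&#149;</title>").text

# HTML5-style unescaping remaps it through Windows-1252 to a bullet.
browser_view = html.unescape("&#149;")

# Byte 149 in Windows-1252 really is the bullet character.
cp1252_view = bytes([149]).decode("cp1252")

# The encoding-independent reference a translation table should emit.
proper_ncr = "&#%d;" % ord("\u2022")

print(repr(xml_view), browser_view, cp1252_view, proper_ncr)
```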

Comment by Jacques Distler #
2005-02-20 20:27:36

It is, as I said, a no-brainer that named entities should be converted to NCRs in XML feeds.

Are there any tool vendors who don’t read Phil’s blog, or does this dead horse need more flogging?

For completely incomprehensible reasons, I’ve found that it’s also useful to convert named entities to NCRs in XHTML+MathML documents (despite the presence of a DTD which defines those named entities). But that’s not the sort of thing tool vendors need to know about … yet.

Comment by Phil Ringnalda #
2005-02-20 23:20:30

“Are there any tool vendors who don’t read Phil’s blog”

Many, many who don’t, a few who read it with all the pleasure of a mongoose reading The Cobra Times, and one who is even now chuckling to himself and saying “You know what open source means, don’t you Phil? You see it, you don’t like it, you fix it.” Soonish.

 
 
 
 
 
 
 
Comment by carol o #
2005-02-19 16:10:45

Thanks for the heads up, Phil. I fixed the Atom feed so the entities should be properly escaped now. Figuring out how to strip the entities for the permalink is beyond me though, so I’ll await a Textpattern update for that. :)

Comment by carol o #
2005-02-19 16:19:46

Actually, some kind soul forwarded me a regular expression line to strip the entities, so that’s fixed now as well (I think).

Comment by Phil Ringnalda #
2005-02-19 18:01:43

It’s all fixed well enough for me, anyway: you popped back up in my aggregator sometime this afternoon, while I was at work and certainly not checking feeds. Thanks!

 
 
 
Comment by Phil Ringnalda #
2005-02-19 18:14:05

Heh. Far too fine a post title to not hand-track-back-forward: Carol’s explanation of how to fix Textpattern’s Atom feed titles, and have permalinks properly strip entities:

your entities bring all the boys to the yard

 
Comment by Jacques Distler #
2005-02-21 08:13:52

So, a quick postmortem. Of the 5 problem feeds in your post,

  • 2 instances of HTML named entities in feeds (solvable if the blogging tool converted them to NCRs)
  • 1 instance of an entry-posting form which is iso-8859-1, whereas the output pages are supposed to be utf-8 (probably not too hard to fix; the comment posting form works fine).
  • 1 instance of generated feeds having no charset declaration (cured by adding a charset declaration to the XML prolog).
  • 1 instance of startup fever leading to broken or unusable feeds.

Aside from the last, they all seem eminently fixable.

 
Comment by Anil #
2005-02-22 11:49:12

Gah, we’re still tracking down some of the feed issues and thanks for pointing out the ones we’ve still got to nail down. I’ll try to make the feeds compelling enough for you to resubscribe to shortly. :)

Comment by Phil Ringnalda #
2005-02-22 19:41:52

Though of course as you know, I’m too lazy to actually unsubscribe, when I can keep filing bug reports via 404 errors with less effort. And I see that my pronet feed popped back into being this afternoon, so now it’s just my rather oddly chosen, probably never advertised, MT news and MT plugins subscriptions at http://www.movabletype.org/news/news.xml and http://mt-plugins.org/atom.xml that haven’t redirected yet.

Makes you wonder what my point might have been, choosing those URLs that probably weren’t ever linked from either site.

 
 
Comment by Paul Roberts #
2005-02-22 13:30:44

re: World’s first? Wimax for train commuters

I just looked at 1976design.com, and “paid 22.5 billion of something for licenses” is pounds sterling; it should be a £ sign. We don’t use euros in the UK.

Comment by Phil Ringnalda #
2005-02-22 19:36:49

Agreed, it should be, in the sense of “should now be, because Dunstan fixed it.” And apparently my first instinct, the evils of an ISO-8859-1 page, was right; that phpMyAdmin’ll bite you every time.

 
 
Comment by Etan Wexler #
2005-03-16 21:22:09

Validity constraint: ID

Values of type ID MUST match the Name production.

[5] Name ::= (Letter | '_' | ':') (NameChar)*

In other words, it is invalid to identify comments with an ID value entirely comprising digits. “Phil ringnalda dot com” misses the mark, although your humble reporter knows of not a single user agent that cares. Should Phil Ringnalda care (I notice the obligatory badge of honor on the home page), the fix is to simply prepend a letter of his choosing to the string of digits.

Comment by Phil Ringnalda #
2005-03-16 22:15:11

Indeed. Were my tool not so completely ahistoric, I could do that going forward: all comment ids after any given time could be prefixed with a letter.

However, you may not change fragments of your past: if a UA requests, say, blog/2004/05/holy_crap_thats_blogger.php, I have no way of knowing whether the UA’s user wanted to see that, or /blog/2004/05/holy_crap_thats_blogger.php#cid008351, or, most likely of the three given the relative number of links, /blog/2004/05/holy_crap_thats_blogger.php#008351.

I screwed up, back in February of 2002, there’s no doubt about it. The only way I can think of to fix it, without causing actual harm to people to avoid hypothetical harm to unknown machines, would be to hard-code a magic number into a hacked tag in a plugin: “id < 9703 ? id : "cid" . id”. Bah. I’d do that in a PHP template, where it would remain in my face reminding me that it exists, but a magic number hidden away, either as a plugin or more likely (since I’m lazy) just a hack in the MT source? I’ll do it within ten minutes of being notified by someone that my invalid ids are hurting them, but because there will still have to be 4609 (4610, counting this one) invalid ids, no matter what, otherwise I can’t see a reason to change now.
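Spelled out, that magic-number hack might look like the sketch below (in Python, though the real thing would live in a PHP template or the MT source). The function names are invented, the six-digit zero padding is a guess based on anchors like #008351, and the validity check is an ASCII-only simplification of the Name production rather than the full grammar.

```python
import re

FIRST_PREFIXED_ID = 9703  # the magic number from the comment above

# ASCII-only approximation of the XML Name production: a letter,
# underscore or colon first, then name characters.
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def comment_anchor(comment_id):
    """Keep old all-digit anchors as they were; prefix the newer ones."""
    padded = "%06d" % comment_id  # assumed padding, as in #008351
    return padded if comment_id < FIRST_PREFIXED_ID else "cid" + padded


def is_valid_id(value):
    """Check a candidate ID against the simplified Name pattern."""
    return _NAME.fullmatch(value) is not None


for cid in (8351, 9703):
    anchor = comment_anchor(cid)
    print(anchor, is_valid_id(anchor))
```

Old links keep working because the old anchors never change; the price is that every id below the cutoff stays invalid forever.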

 
 