My brother’s (feed’s) keeper
One thing I didn’t expect, though perhaps I should have, about moving from Bloglines to Feed on Feeds, was the way that I am now responsible for making sure that the feeds I read are well-formed, usable XML. Bloglines is apparently fairly liberal in its parsing, and even when it isn’t, it doesn’t exactly put feed errors right in your face: in moments of extreme boredom or extreme avoidance behavior, I would sometimes scroll clear through my hundreds of subscriptions looking for the red “[!]” that it uses to indicate a problem with a feed, but rarely, rarely. With Feed on Feeds, it’s just a click (or two) on the “age↑” column header to sort them with the ones that were least-recently workable first. And then, since it seems that nobody else is noticing, I have to either rattle some cages, or wait for the offending post to fall out the bottom of the feed. At first, I tried emailing some people, but either email really is broken, or nobody reads email from me anymore. I was going to switch to leaving comments, at least where they are available, but then I noticed that my infatuation with short posts has left me in danger of having my sidebar wrap around underneath my suddenly shrunken content column, so instead I’ll just drop a few here. You know you love reading this stuff, anyway, right?
- Le «blog personnel» de Joe Clark » ← Office ¶ Kindergarten →
- This will become an all too common theme in this, and no doubt subsequent, lists: Joe’s title contains HTML entities (or, if you prefer formality, character entity references from the HTMLSymbol set), which are perfectly at home in HTML, but are undefined in the XML of his feed, and thus need to either be escaped as &amp;larr; and &amp;rarr;, or need to be the literal character rather than an entity reference. The more I learn about character encoding, and the more problems I see in feeds, the more I like literal characters (properly escaping them would let me see the post in Feed on Feeds, but because I haven’t gotten around to hacking in support for unescaping them, I’d see the literal &rarr; rather than an arrow). In Movable Type it’s a right royal pain, but I quite often type an entity, and then the next time I’m looking at the preview in its annoying separate page, I copy the character and paste it back into my title or entry. In WordPress, it’s a quick scroll down to the preview at the bottom of the posting page and back up. Joe’s using WordPress 1.2-delta, but my copy of 1.5-alpha-6 suffers from the same lack of title-escaping, so chances are 1.5-final does as well.
- World’s first? Wimax for train commuters
- Dunstan mentions that Britain’s mobile companies paid 22.5 billion of something for licenses, but the currency unit is in question, or rather is a question mark. That’s not too uncommon with the euro symbol, since it’s not defined in ISO-8859-1, and pasting one into a form in an ISO-8859-1 page will tell your browser that you want to silently submit the form in Windows-1252 instead. Win-1252 uses 0x80 as the codepoint for the euro, while that’s an undefined character in ISO-8859-1. Since Dunstan publishes in UTF-8, the one true way to avoid that sort of silent recoding, I’m not sure how he managed an undefined currency symbol, but luckily his blog is powered by Naked Dunstan Technologies, so it’s his problem alone :)
- J-Walk Blog: Cocktail Generator
- Not my favorite problem to explain, or to suggest how to fix. J-Walk’s HTML is interpreted as being encoded in ISO-8859-1, thanks to a <meta> tag (though the HTTP standard would say it was as well, since that’s the default for all text/* media types, including text/html). However, his RSS is delivered with a Content-Type: application/xml header without a charset parameter, and the XML declaration in the file itself doesn’t specify an encoding, so thanks to the insanely complicated rules for XML, that means his feed is interpreted as being encoded in UTF-8, whether or not it really is. Sometimes that’s just fine, and other times you paste in a ½ thinking you’ll get a “vulgar fraction one half” (why are those called vulgar, by the way?), and instead you’ll get an invalid, unparseable feed. Ideally, the fix would involve hacking whatever pMachine uses to generate the feed, to add ; charset=iso-8859-1 at the end of the content-type header, and also changing the XML declaration in the feed itself to read <?xml version="1.0" encoding="iso-8859-1"?> so it will retain the knowledge of what it is no matter where it travels.
- rawbrick.net › articles › 5 “love” songs
- This one sucks particularly badly because Carol’s RSS feed is just fine: the two entities in the title, &ldquo; and &rdquo;, are both amp-escaped there. However, I’m subscribed to her Atom feed, where they are left unescaped, and thus are undefined entities. Judging by the URL, Textpattern g1.17 could use a little help learning about how to strip entities for filenaming, as well as learning how to escape them in Atom.
- Half a Six (Apart) Pack
- During Movable Type and Six Apart’s recent redesign, I think I read in one post that they were going to redirect the Six Log feed to a combined feed of everything they post everywhere, though I’m not sure because I was being flooded with posts due to their redirection of all the feeds to Feedburner, causing every <link> to change, and every item to become new again. Looking at what’s in the feed right now, it appears that maybe it is an aggregated feed, with a two-day delay on posts from the Pronet Weblog. My vague memory is that one of the benefits of Pronet membership was supposed to be earlier access to posts there, but the URL where I was subscribed to it directly is now 404, as is my subscription to the Movable Type news feed and the MT Plugins feed. Of course, since they’ve outsourced all their feeds, the exact same post in the two different feeds would have a completely different <link>, and thus look new again to anything not looking at the <atom:id>, or the element I never noticed before and now think I shall abuse, <feedburner:origLink>. In any case, my ideas about what I’m supposed to do about the three feeds I was subscribed to that are now returning a 404 are now as muddled as this post, and as my ideas about why they needed to outsource their feeds (are they incapable of producing usable feeds, despite writing software that’s supposed to? are they incapable of handling the bandwidth, despite running a hosting service? are they so desperate to know about click-through numbers that they’ll annoy their subscribers, and lose some along the way? puzzling.). Probably I’ll just take the easy way out, and unsubscribe. Oops.
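The J-Walk item above comes down to whether the XML declaration names an encoding. Here is a minimal Python sketch of that one point, using the standard library parser. The `<title>` content is made up for illustration and is not taken from the actual feed.

```python
import xml.etree.ElementTree as ET

# A made-up title containing "½", stored as the single ISO-8859-1 byte 0xBD.
body = "<title>1\u00bd oz gin</title>".encode("iso-8859-1")

# Same bytes, with and without an encoding in the XML declaration.
undeclared = b'<?xml version="1.0"?>' + body
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body


def try_parse(data: bytes):
    """Return the element text, or None if the bytes are not well-formed XML."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# With no declared encoding the parser assumes UTF-8, and a lone 0xBD
# is not valid UTF-8, so the first document is rejected.
print(try_parse(undeclared))
print(try_parse(declared))
```

The sketch only covers the declaration inside the file; the charset parameter on the HTTP header is a separate fix and is not modelled here.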
It looks like that bug in Textpattern’s Atom code is still around as of 1.0RC1. I’ll file a ticket.
Thanks James, exactly what I hoped someone would do, without my having to figure out what the current newest version was and see if it was there and find the bug db. ’Preciate it :)
(X)HTML (+MathML) named entities in feeds are exactly the reason I wrote the Numeric Entities plugin for MovableType and the accompanying MathML::Entities Perl Module.
Evan Nemerson wrote a PHP implementation.
So there’s no real excuse any more for sending HTML entities in an RSS/Atom feed. If your blogging tool doesn’t convert them to numeric character references (or, possibly, utf-8 characters, depending on what encoding it uses for your feed), then your blogging tool is broken.
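For a sense of how little work that conversion is, here is a rough Python sketch of a named-entity-to-NCR pass. It is not the plugin or the Perl module mentioned above; the function name `named_to_ncr` is hypothetical, and it only knows the HTML 4 entity names in Python's `html.entities.name2codepoint` table (so no MathML names).

```python
import re
from html.entities import name2codepoint

# XML predefines these five, so they are safe to leave alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_ncr(text: str) -> str:
    """Replace HTML named entities with decimal numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # predefined or unknown: leave untouched
        return f"&#{name2codepoint[name]};"

    return NAMED_ENTITY.sub(replace, text)


print(named_to_ncr("&larr; Office &para; Kindergarten &rarr;"))
```

Unknown names are passed through unchanged rather than guessed at, so a feed containing them would still fail to parse; a real tool would have to decide what to do with those.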
In Netscape’s RSS 0.91, the HTML entities were defined in the DTD, and were therefore valid to include verbatim in your feed. Userland’s RSS 0.91 removed the DTD, and therefore broke this very useful feature, and we’ve been DTD-less in syndication land ever since.
Sigh, I know, I have a few feeds in FoF that I never see because they’re invalid… I’ve chosen to ignore the problem, which may not be the best solution. I suppose public shaming is a possible solution.
Must. Control. Fist. Of. Death.
Wups, this version of this post does rather look like I’m trying to shame the authors into doing better. The first version did a much better job of explaining that I was writing in public rather than email only partly because people ignore my email, and mostly because I thought it much more likely that the authors of the software would see it here. For most feed problems I trip over, that’s where the problem really should be stopped. Then when I rewrote the whole thing, after accidentally closing the wrong tab, it lost that flavor.
I thought the only XML characters you *had* to escape were less-than/greater-than and ampersand?
Anyway, does that headline display incorrectly in any known browser? It doesn’t on Mac.
Anyway2, what kind of validator does one use to determine well-formedness?
feedvalidator.org will catch this and many other common mistakes.
The only things you have to escape are less- and greater-than in element content, quotes in attributes, and any ampersand that doesn’t signal the start of an entity reference to a defined entity. In HTML, you have a DTD (either explicitly referenced or implied) that defines all the named HTML entities, but in RSS you don’t, so &rarr; is undefined, and because XML does a lot more with entity references than HTML, that has to be a fatal error. Any & not followed by a defined entity means it’s time to halt and catch fire (or, for most aggregators, time to switch to the liberal and forgiving non-XML parser or to fix the error and parse again).
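A quick way to see the difference is to hand a strict XML parser a few variations of the same title. This is a small Python sketch using the standard library's parser as a stand-in for a strict aggregator; the titles are shortened examples, not the real feed.

```python
import xml.etree.ElementTree as ET


def well_formed(xml_text: str) -> bool:
    """True if a strict XML parser accepts the text, False on a fatal error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<title>Kindergarten &rarr;</title>",      # named entity, no DTD to define it
    "<title>Kindergarten &amp;rarr;</title>",  # amp-escaped
    "<title>Kindergarten &#8594;</title>",     # numeric character reference
    "<title>Kindergarten \u2192</title>",      # the literal character
    "<title>Fish & Chips</title>",             # bare ampersand
]

for sample in samples:
    print(well_formed(sample), sample)
```

Only the first and last samples are rejected; the three in the middle are the acceptable ways of getting the arrow (or at least its name) through.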
If there’s any browser that fails to display it, they should be shamed into oblivion: in HTML, failing to recognize an HTML entity is unacceptable. It’s just in XML where failing to recognize it is a sign of a fatal error in the input, rather than the processor.
And while feedvalidator.org is the place to really check your feed, for problems like that just loading your feed directly in your browser will usually tell you. At least in Gecko, there’s a bug that will keep it from reporting to you that you have a character which isn’t defined in your encoding, so Dunstan would be out of luck, but if you feed them XML, browsers are happy to tell you about undefined entities.
Escaping &rarr; to &amp;rarr; in your feed is good. Recoding &rarr; to &#8594; is better. The latter will actually display as the desired → character.
Numeric character references are always safe. If you’re using utf-8, so is typing → (you, of course, have an easy way to just type “→”, don’t you? ;-).
Encoding Fun
Phil Ringnalda: Joseph Walton:
Mark Pilgrim has to be mumbling about how I long ago lamented FoF’s [well, Magpie's] strict parsing. Hi, Mark.
While waiting for Magpie to become a liberal parser, I’ve suggested a jail for bad feeds, which originally started with feeds giving a 410 [hi again, Mark] or those who changed URLs without giving a forwarding address via 301 [hi, Six Apart!]. One source of many FoF-related 500 Internal Server Errors seems to be one or two feeds hanging up and causing update-quiet.php to give up. I readily admit that my bad-feed-purgatory is a horrible kludge, but before the good software design police shoot me, remember that I’m an aerospace contractor and engineer. :)
My issue is that I care an order of magnitude less about feed validity than page validity. Actually, that may be an overstatement, as zero isn’t an order of magnitude less than anything but zero.
Which is ironic, as — being an XML application — validity (more precisely, well-formedness) actually has an impact on whether people can use your feeds.
I got by for a very long time without feeds. I provide them because WordPress does. If they need fixing, it’s beyond my technical ability, as are many things. So if somebody else wants to fiddle with my installation, fine. But I can’t do it myself.
What I have an investment in is providing Web pages and not feeds of Web pages. Valid feeds are nice, I suppose. If you’d like to help me fix them up, go to town.
Well, damn it all.
I’d like to claim that after another twenty or thirty posts of this general sort, I’ll learn how to say them like I mean them, to make it clear that I’m saying “here are these classes of problems that we aren’t solving in these programs, and look!, because of these instances I’m being kept from knowing that Joe posted something, and I’m being kept from reading about what Carol’s watching, and how can we fix it?” I don’t want to say “Joe’s a moron for not understanding and caring about XML,” because I don’t want you to even think about XML unless your program just can’t figure out how to make things right, and it has to ask you what it should do, but I don’t seem to be able to get that part across, and I can’t claim to think I’ll get better since I’ve been saying that for at least four years now.
The good part of dragging your dirty feed around in public, though, is that Jacques reminded me that numeric character references are always right, and always refer to the character’s codepoint in Unicode, not to something specific to that particular weblog’s character encoding, so if someone hasn’t already done it, that’s the where and how to do a WordPress plugin (or patch for the next version, or both): PHP’s a bit of a pain to do entity-to-character translations in unless you’re using ISO-8859-1 or have a new enough PHP, but translating entities to NCRs is just simple string translation, the sort of boring crap that even I can manage to program.
Numeric entities are *not* always right. The entities between 0x80 and 0xFF are encoding-specific and should not be used unless you are absolutely sure that your feed is declared *and served* with the proper character encoding. See Sam’s survival guide for a list of these problematic entities, and their proper encoding-independent replacements.
Also, Joe is correct — this should be handled by the blogging software, not individual end users. Unless end users are foolish enough to write their own software, like Dunstan.
Thank you for reminding me why I went on vacation. It was from this.
From this? Damn, man, this is the good stuff, people learning things and fixing things so they can read and write with fewer barriers in the way. So far, this hasn’t resulted in a single 403 Forbidden, or anyone saying that if it suits their company, they’ll lead people on and then pull the rug out from under them, or anyone calling anyone self-important, or… oh, wait. That’s something else entirely, and this time around, doesn’t even involve RSS.
But I didn’t mean that any ampersand-digits-semicolon was always safe, I meant that it’s possible to replace HTML’s named entities with an NCR which will be safe in both XML and the HTML that’s extracted from it, without needing to know the charset of the characters surrounding the NCR. You don’t have to worry about 0x80 to 0xFF if you are translating every &hellip; to &#8230;, because everything that handles the feed has to know that it replaces &#8230; with a … in whatever encoding it’s outputting. Right?
Unicode codepoints 0x0-0x08, 0x0B, 0x0C, 0x0E-0x1F are problematic, whether you use NCRs, the corresponding iso-8859-1 characters or utf-8 or ….
If you don’t filter these characters out of your utf-8 input, you are in just as much trouble as if you don’t filter out the corresponding NCRs.
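A filter for both cases might look something like the following Python sketch. The function name `strip_forbidden` is hypothetical, and it covers only the code points listed in the comment above (everything below 0x20 except tab, line feed and carriage return), whether they arrive as raw characters or as NCRs.

```python
import re

# Raw characters in the ranges 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
FORBIDDEN_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

# Decimal (&#11;) and hexadecimal (&#x0B;) numeric character references.
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

ALLOWED_BELOW_SPACE = (0x09, 0x0A, 0x0D)  # tab, line feed, carriage return


def _drop_forbidden_ncr(match: re.Match) -> str:
    code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
    if code < 0x20 and code not in ALLOWED_BELOW_SPACE:
        return ""  # an NCR for a forbidden control character
    return match.group(0)


def strip_forbidden(text: str) -> str:
    """Remove forbidden control characters, raw or written as NCRs."""
    text = FORBIDDEN_CHARS.sub("", text)
    return NCR.sub(_drop_forbidden_ncr, text)


print(repr(strip_forbidden("a\x0bb&#11;c&#x1F;d&#9;e\tf")))
```

Silently dropping the characters is one choice among several; replacing them with a space or U+FFFD would be just as defensible.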
Sorry to be pedantic, but numeric character references are *not* encoding specific in either HTML or XML. In both cases, they are based on the “Universal Character Set” defined in ISO 10646.
Bah, I misread what Mark miswrote.
&#149; always unambiguously refers to the control character MESSAGE WAITING. Because that makes no sense to include in an HTML page, if you use it at least Firefox and IE/Win will assume you meant code point 149 in Windows-1252, BULLET. That’s a problem if you are expecting people to type NCRs, and no problem whatsoever if you are expecting people to continue to use HTML named entity references, and only want to have WordPress have a translation table that converts them to NCRs in the XML feeds, because (one hopes) the translation table will not make that mistake, and will convert &bull; to the proper &#8226;.
It is, as I said, a no-brainer that named entities should be converted to NCRs in XML feeds.
Are there any tool vendors who don’t read Phil’s blog, or does this dead horse need more flogging?
For completely incomprehensible reasons, I’ve found that it’s also useful to convert named entities to NCRs in XHTML+MathML documents (despite the presence of a DTD which defines those named entities). But that’s not the sort of thing tool vendors need to know about … yet.
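The &#149; case discussed a few comments up is the one place where hand-typed NCRs go wrong, and a tool could repair it the same way the browsers guess. Below is a hedged Python sketch; the function name `fix_c1_ncrs` is hypothetical, it only handles decimal NCRs, and it assumes that anything in the 128-159 range was meant as the Windows-1252 character at that byte.

```python
import re

NCR_DECIMAL = re.compile(r"&#([0-9]+);")


def fix_c1_ncrs(text: str) -> str:
    """Rewrite &#128;-&#159; as the NCR for the Windows-1252 character."""

    def replace(match: re.Match) -> str:
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)  # byte is unassigned in Windows-1252
            return f"&#{ord(char)};"
        return match.group(0)

    return NCR_DECIMAL.sub(replace, text)


print(fix_c1_ncrs("&#149; first item"))
```

A translation table that starts from named entities never needs this step; it is only a cleanup for references that people typed by number.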
Many, many who don’t, a few who read it with all the pleasure of a mongoose reading The Cobra Times, and one who is even now chuckling to himself and saying “You know what open source means, don’t you Phil? You see it, you don’t like it, you fix it.” Soonish.
Thanks for the heads up, Phil. I fixed the Atom feed so the entities should be properly escaped now. Figuring out how to strip the entities for the permalink is beyond me though, so I’ll await a Textpattern update for that. :)
Actually, some kind soul forwarded me a regular expression line to strip the entities, so that’s fixed now as well (I think).
It’s all fixed well enough for me, anyway: you popped back up in my aggregator sometime this afternoon, while I was at work and certainly not checking feeds. Thanks!
Heh. Far too fine a post title to not hand-track-back-forward: Carol’s explanation of how to fix Textpattern’s Atom feed titles, and have permalinks properly strip entities:
your entities bring all the boys to the yard
So, a quick postmortem. Of the 5 problem feeds in your post,
Aside from the last, they all seem eminently fixable.
Gah, we’re still tracking down some of the feed issues and thanks for pointing out the ones we’ve still got to nail down. I’ll try to make the feeds compelling enough for you to resubscribe to shortly. :)
Though of course as you know, I’m too lazy to actually unsubscribe, when I can keep filing bug reports via 404 errors with less effort. And I see that my pronet feed popped back into being this afternoon, so now it’s just my rather oddly chosen, probably never advertised, MT news and MT plugins subscriptions at http://www.movabletype.org/news/news.xml and http://mt-plugins.org/atom.xml that haven’t redirected yet.
Makes you wonder what my point might have been, choosing those URLs that probably weren’t ever linked from either site.
re: World’s first? Wimax for train commuters
I just looked at 1976design.com and “paid 22.5 billion of something for licenses” is pounds sterling, it should be a £ sign, we don’t use euros in the UK.
Agreed, it should be, in the sense of “should now be, because Dunstan fixed it.” And apparently my first instinct, the evils of an ISO-8859-1 page, was right; that phpMyAdmin’ll bite you every time.
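One plausible way for a £ to turn into a question mark, sketched in Python. This is an assumption about the general mechanism (a Latin-1 byte ending up in a UTF-8 page), not a reconstruction of what actually happened in Dunstan's database.

```python
pound = "\u00a3"  # the pound sign

latin1_bytes = pound.encode("iso-8859-1")  # one byte
utf8_bytes = pound.encode("utf-8")         # two bytes

# The single Latin-1 byte dropped into text that is then read as UTF-8:
# it is not a valid UTF-8 sequence, so a forgiving decoder substitutes
# U+FFFD, which many browsers draw as a question mark.
shown = (b"paid " + latin1_bytes + b"22.5 billion").decode("utf-8", errors="replace")

print(latin1_bytes, utf8_bytes)
print(shown)
```

The euro is worse off still, since ISO-8859-1 has no byte for it at all, which is where the silent switch to Windows-1252 described in the post comes from.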
In other words, it is invalid to identify comments with an ID value entirely comprising digits. “Phil ringnalda dot com” misses the mark, although your humble reporter knows of not a single user agent that cares. Should Phil Ringnalda care—I notice the obligatory badge of honor on the home page—the fix is to simply prepend a letter of his choosing to the string of digits.
Indeed. Were my tool not so completely ahistoric, I could do that going forward: all comment ids after any given time could be prefixed with a letter.
However, you may not change fragments of your past: if a UA requests, say, blog/2004/05/holy_crap_thats_blogger.php, I have no way of knowing whether the UA’s user wanted to see that, or /blog/2004/05/holy_crap_thats_blogger.php#cid008351, or, most likely of the three given the relative number of links, /blog/2004/05/holy_crap_thats_blogger.php#008351.
I screwed up, back in February of 2002, there’s no doubt about it. The only way I can think of to fix it, without causing actual harm to people to avoid hypothetical harm to unknown machines, would be to hard-code a magic number into a hacked tag in a plugin: “id < 9703 ? id : "cid" . id”. Bah. I’d do that in a PHP template, where it would remain in my face reminding me that it exists, but a magic number hidden away, either as a plugin or more likely (since I’m lazy) just a hack in the MT source? I’ll do it within ten minutes of being notified by someone that my invalid ids are hurting them, but because there will still have to be 4609 (4610, counting this one) invalid ids, no matter what, otherwise I can’t see a reason to change now.
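Spelled out, the magic-number hack above amounts to something like this Python sketch. The function name `comment_anchor` is hypothetical, and the six-digit zero padding is an assumption read off the example URLs, not something the post states.

```python
CUTOFF = 9703  # the magic number from the post


def comment_anchor(comment_id: int, cutoff: int = CUTOFF) -> str:
    """Old comments keep their all-digit id; newer ones get a "cid" prefix."""
    number = f"{comment_id:06d}"  # zero-padded width is an assumption
    return number if comment_id < cutoff else "cid" + number


print(comment_anchor(8351))
print(comment_anchor(9703))
```

Existing links to the old fragments keep working, at the cost of the old ids staying invalid forever.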