While I wasn’t paying attention to it, after my seemingly-offtopic and certainly poorly-put comment, Joshua Allen’s post-and-comments about how Microsoft Vista will handle encoding in RSS feeds came back around to where I wanted to go, though not with exactly the conclusion I had in mind. In the post itself, Joshua said
The RSS platform uses MSXML (in XML conforming mode) to fetch and parse the data, which would imply an XML-aware HTTP client, though in his last comment he says
the XML stack doesn’t know about HTTP, and the HTTP stack doesn’t know about XML, which sounds to me like a bug in the HTTP stack, for not providing an XML-aware interface.
Even people who know vastly more about it than me tend to say “Just use XML” as though it were a monolithic block. It is not. “Well-formedness” is a fatal error constraint on conforming XML processors, what we generally call XML parsers, and not on any other part of the stack of programs that take a feed from someone’s server and end up displaying characters on a screen. It is not a permanent taint on a file that means it may never be parsed. While an XML parser has to halt and catch fire on well-formedness errors, not one word in the XML spec says what the program feeding a stream of bytes to the parser may or may not do, or what might happen while or after an HTTP client fetches a file from a remote server.
In this particular case, there are two things which are not conforming XML parsers but which, to hear Joshua tell it, will behave as if they somehow were, in order to make the experience of Vista users more painful, and to make life difficult for anyone else who wants to actually behave correctly.
Vista will apparently have code (code which is not a conforming XML parser) which goes out and fetches RSS and Atom feeds from web servers, and either (as seems most likely) saves them as local files, or (less likely, but what do I know?) passes the incoming HTTP stream off to a conforming XML parser.
When a file is served over HTTP with a
Content-type: text/xml header, RFC 3023 says that no matter what encoding information might appear in the body of the file, the actual character encoding is either what the charset parameter in the content-type header says it is, or it is US-ASCII. Whether or not you like it, whether or not it happens often, things which only know about HTTP and not XML are permitted to take
Content-type: text/xml; charset=utf-8 and turn it into
Content-type: text/xml; charset=shift-jis without ever knowing what the
<?xml version='1.0' encoding='utf-8'?> meant. If you are a program module which uses an HTTP client to fetch XML and save it locally, or feed it to an XML parser, you are responsible for checking the content-type header, and if the type is
text/xml you are responsible for munging the XML declaration to make it match the charset parameter on the content-type header, or the US-ASCII default that RFC 3023 gives text/xml when there is no charset parameter (even over HTTP), or re-encoding it to what you or your parser want, or telling your parser to ignore the encoding in the XML declaration.
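To make that responsibility concrete, here is a minimal sketch of one of those options: decode the bytes with the charset the transport claims, drop the encoding from the XML declaration, and hand the parser UTF-8. The function names `transport_charset` and `normalize_for_parser` are made up for illustration, and nothing here is meant to describe what Vista or MSXML actually do.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_ATTR = re.compile(
    r"""^(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)


def transport_charset(content_type):
    """Return the charset RFC 3023 assigns to a text/xml response.

    Returns None for any other media type, where this rule does not apply.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        return None
    # No charset parameter on text/xml means US-ASCII, whatever the body says.
    return msg.get_content_charset("us-ascii")


def normalize_for_parser(body, content_type):
    """Re-encode a text/xml body as UTF-8 with no encoding declaration."""
    charset = transport_charset(content_type)
    if charset is None:
        return body  # not text/xml: leave the bytes alone
    text = body.decode(charset)  # raises if the bytes do not fit the charset
    text = _ENCODING_ATTR.sub(r"\1", text, count=1)
    return text.encode("utf-8")
```

A body that is served as `text/xml` with no charset parameter and contains non-ASCII bytes raises `UnicodeDecodeError` here, which is the honest result: under RFC 3023 those bytes are not what the feed claims they are.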
You don’t get to appeal to the XML spec, because the XML spec doesn’t get to cover the behavior of HTTP clients. If that seems wrong to you, you can write an RFC to supersede RFC 3023, but you cannot wrap yourself in any other spec and claim the high ground, because no other spec covers what happens to text/xml over HTTP. Nor can you claim any moral position about well-formedness, because the first step in determining whether something is well-formed XML is to determine what characters the bytes represent, and if you have mangled the encoding before feeding it to your parser, it simply cannot do that. If you want to parse at all costs, you are absolutely free to run the bytes through your parser as many times as it takes, switching encodings until you find one that works, though best practice says you should tell the user that you had to clean up after a bozo to get the feed to work. But if you refuse to parse some feeds, saying that they are not well-formed because you misunderstood them, and parse others and claim that they were well-formed as you found them, without mentioning that you ignored the rules to be able to call them well-formed, then you are doing wrong. You are also making life difficult for everyone else who knows what is right and what is wrong, but has to convince people both of that and of how you are wrong.
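The parse-at-all-costs option might look something like the sketch below: try the charset the transport declared first, then a list of fallbacks, and report whether a fallback was needed so the caller can flip the bozo bit. The name `parse_with_fallback` and the particular fallback list are assumptions of mine, not anything a spec or an existing library prescribes.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r"""^(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2"""
)


def parse_with_fallback(body, declared, fallbacks=("utf-8", "windows-1252")):
    """Parse body, trying the declared charset first and then each fallback.

    Returns (root, charset_used, bozo). bozo is True when the declared
    charset did not work and a fallback had to be used.
    """
    for position, charset in enumerate((declared, *fallbacks)):
        try:
            text = body.decode(charset)
            # Drop the declaration's encoding so the parser trusts our choice.
            text = _DECL_ENCODING.sub(r"\1", text, count=1)
            root = ET.fromstring(text.encode("utf-8"))
        except (LookupError, UnicodeDecodeError, ET.ParseError):
            continue  # unknown charset, wrong charset, or not well-formed
        return root, charset, position > 0
    raise ValueError("no candidate encoding produced well-formed XML")
```

The point of returning the flag rather than quietly succeeding is the one made above: cleaning up is allowed, but pretending the feed was fine as found is not.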
The other thing that the XML spec does not constrain is what you do to those bytes before you hand them to your parser.
One of the most common well-formedness errors in RSS feeds is whitespace before the XML declaration, usually from things like a blank line at the end of a section of PHP. If you pass
\n\n\n<?xml… to a conforming XML parser, it is required to catch fire, but you aren’t required to pass that. Nowhere in the XML spec does it say “it is a well-formedness constraint that files which will be treated as XML at some point must not have newlines before the first < character at any time after the moment of their creation.”
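As a small sketch of what "you aren’t required to pass that" can mean in practice, the hypothetical helper below trims leading whitespace before the bytes reach the parser and reports whether it had to. It assumes an ASCII-compatible encoding (it would not help a UTF-16 feed), and the helper name is my own invention.

```python
import xml.etree.ElementTree as ET


def strip_leading_whitespace(body):
    """Remove whitespace before the first byte of content.

    Returns (cleaned_bytes, bozo), where bozo is True if anything was removed.
    """
    cleaned = body.lstrip(b" \t\r\n")
    return cleaned, cleaned != body


raw = b"\n\n\n<?xml version='1.0'?><rss/>"

# The conforming parser rejects the raw bytes, as it must.
try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# The program feeding the parser is free to clean up first.
cleaned, bozo = strip_leading_whitespace(raw)
root = ET.fromstring(cleaned)
```

The parser’s behavior has not changed at all between the two calls; only the bytes it was handed did.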
While it says of the other common error that a conforming XML parser must stop passing parsed character data and information about the logical structure of the document once it sees an ampersand which is not the start of an entity reference, not only does it not say that no such document may ever be parsed at any time in the future, it goes so far as to say that
to support correction of errors the parser may look at the rest of the document for other errors, may report them, and may hand back the remainder of the document after the first fatal error unparsed (and, one assumes, while holding its nose). To
support correction of errors. While the spec writers were probably thinking of an editing program employing a parser when they said that, it’s still the case that the XML spec only defines what a conforming XML parser does when faced with a stream of bytes, not what a program employing it may or may not do to that stream of bytes, because it’s up to that program to decide what fatal parsing errors are also fatal to it.
If Vista makes life less pleasant for the people who pay for it, by punishing them for easily corrected problems rather than flipping the bozo bit and displaying a warning icon linked to an explanation (that it cleaned up some errors, and that if this is the feed which controls your pacemaker you had better warn the feed producer to fix the problem), then that’s great for me: I use an aggregator which uses a feed parsing library that doesn’t bother to fix them, so I quite often have to tell people they have a problem they need to fix for my sake, even though most other people don’t have a problem. Vista doesn’t need to do that; there is no spec that compels it to do so. But if they want to make my life better by making the lives of their users worse, go go Microsoft!
However, by ignoring RFC 3023 they are hurting me: ignoring a spec because it is inconvenient not only makes it impossible for others to follow the spec, it weakens the whole assemblage of specs that let us work together. Saying that a spec requires you to fail when it doesn’t is odd, but it tends to be your problem; refusing to follow a spec tends to be our problem.
Bah. Still clear as mud. If you fail to fix something that you are allowed to fix, but not required to fix, that’s okay, though it doesn’t seem very smart or helpful, but if you fail to fix something that you are required to fix, and then claim that you are doing the right thing, that’s actively harmful. By saying you only parse well-formed feeds, when in fact you will parse something which is not well-formed because it is not actually in the encoding you use to parse it, you hurt us all.