PHP turns evil

If there’s one thing that the RSS Draconian Wars taught us, it’s that you don’t want to be involved in any discussion of XML and error handling.

No, that’s not it. It’s, err, that any recovery from error by a parser, as distinct from an application which employs a parser, is evil.

PHP, meet your evil patch. Even better, if I understand Daniel Veillard’s comment correctly, PHP’s only going to be doing it because it’s already there in the widely (and rightly) respected libxml2 library.

Me? I don’t care. No matter how badly people misunderstand it, what the XML spec says is that a conforming XML parser must halt and catch fire when it hits a fatal error. It doesn’t say that an application which handed the purported XML to the parser can’t massage it and try to run it through again (in fact, it makes a strong effort to make clear the difference between a parser and an application that employs it). Does it matter to me whether the massaging code lives in the same library as the parser? ‘Fraid not. In fact, who’s more likely to get the massaging right, libxml2, or me? You try to parse purported XML as XML, and if it doesn’t work, you decide on a case-by-case basis whether it’s better to catch fire, or just set the bozo bit and get what you can. Kudos to Daniel for doing it once, in one place, as well as he can, rather than having it done poorly in tens of thousands of places.
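To make the parser/application split concrete, here is a rough sketch in Python rather than PHP. It assumes the only damage worth repairing is a bare ampersand. The function name parse_with_bozo, the allow_recovery flag, and the regular expression are all made up for illustration; none of this is how libxml2 or the PHP patch actually works. The strict parser still halts on the first fatal error, and it is the application that decides whether to massage the text and try again.

```python
import re
import xml.etree.ElementTree as ET

# Matches an ampersand that does not start an entity or character reference.
# (Hypothetical repair rule: the only damage this sketch knows how to fix.)
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def parse_with_bozo(text, allow_recovery=True):
    """Return (root, bozo); bozo is True if the input was not well-formed."""
    try:
        # First attempt: strict parsing, no recovery inside the parser.
        return ET.fromstring(text), False
    except ET.ParseError:
        if not allow_recovery:
            # The caller chose to catch fire.
            raise
    # Application-level massaging: escape bare ampersands and reparse.
    # If the massaged text is still malformed, ParseError propagates.
    massaged = BARE_AMP.sub("&amp;", text)
    return ET.fromstring(massaged), True


if __name__ == "__main__":
    root, bozo = parse_with_bozo("<title>Tom & Jerry</title>")
    print(root.text, bozo)
```

The caller gets the bozo bit back and can decide what to do with suspect data; a nesting error such as `<a><b></a>` still fails, because this sketch does not pretend to know how to fix it.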

9 Comments

Comment by Jacques Distler #
2004-08-19 21:50:30

Does it matter to me whether the massaging code lives in the same library as the parser?

That really depends, now doesn’t it?

Presumably, the application author knows what kind of errors can be expected/tolerated/corrected in the offending piece of purported XML. He has some meta-idea of what that data is supposed to look like.

The library author, however, doesn’t. There are general heuristics you could apply, but the point that Tim Bray seems condemned to forever repeat is that these heuristics will fail as often as they succeed, making the result thoroughly untrustworthy.

This is not “like tag-soup HTML.” It’s worse. Browser-writers have a lot of empirical data on what sort of errors occur frequently in web pages, and can tailor their error-recovery appropriately. But a general-purpose XML parser … ?!

Comment by Phil Ringnalda #
2004-08-19 22:11:13

Well, truth is I want what James Robertson always describes Smalltalk giving him, every time the subject comes up: just subclass the parser and override the error methods that don’t suit you: character not valid in the charset error, count is only one in a row, not a possible UTF-7 < set of bytes? Pass. Nesting error, in title/description? Set a flag to start escaping every element until you get to the end tag. So, I guess that means I do agree: I just don’t want to have to do it the way my tools would make me do it, cleaning up the source and attempting to reparse after every error.

Comment by Jacques Distler #
2004-08-19 23:08:30

Ah, well that sounds completely different from what Daniel Veillard is doing.

I’m very sympathetic to your proposal.

As to what’s happening in PHP 5.1, you just know that 99 PHP authors out of 100 are going to reflexively set DOMDocument->recover to “true” every time they go to parse some XML.

Comment by Phil Ringnalda #
2004-08-19 23:23:23

Actually whether 99 or 1 set recover depends entirely on what the first big article on using it does in the example code: we’re a bunch of copy-n-pasters, we PHP authors. Look at articles about parsing with the old 4.x SAX parser: almost every single one not only does it the same way, it uses the same variable names, and sets the same options, reasonable or not.

Comment by Tim #
2004-08-20 00:33:46

Gotta agree with Phil on this one. I’m certainly guilty as charged, re PHP and cutting and pasting.

Whatever the top article that shows up in the Google search for parse XML does is likely to be what I will do.

That would suggest that the proponents of strict parsing, instead of spending lots more time arguing their point of view, would be better served by turning their energies to providing the best sample code and article on parsing XML with PHP 5.1, so that the rest of us can steal from the best.

Comment by Christian Stocker #
2004-08-20 01:31:44

I added a big disclaimer “Do not use this feature by default” to the mentioned post, so that at least my example hopefully won’t be taken as a copy-and-paste example everywhere ;)

I also wrote a second post about all the comments I got and why I won’t remove the feature.

Comment by Adam Trachtenberg #
2004-08-20 17:17:47

The good news is that this patch is *not* in PHP 5.0, but only in 5.1, so all the definitive XML in PHP 5 articles *will not* include this feature.

I also know it won’t be there because Christian and myself have written most of those pieces and conference slides. :)

2004-09-14 05:46:14

I’ve tried to write a generic program that corrects malformed XML. It doesn’t work. For instance, consider the simple case of a missing end-tag. You don’t know for sure the tag is missing until you hit the end of the document, and then there are normally multiple possible places it could be inserted. Where does it go? Any decision you make will be wrong as often as it’s right, probably more often.
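A small sketch of why the missing end-tag case is ambiguous, in Python with only the standard library. The helper name candidate_repairs and the sample document are invented for illustration. It brute-forces every position where the missing end-tag could be inserted and keeps the ones a strict parser accepts.

```python
import xml.etree.ElementTree as ET


def candidate_repairs(broken, missing_tag):
    """Return every distinct well-formed document made by inserting one end-tag."""
    end_tag = f"</{missing_tag}>"
    repairs = []
    for i in range(len(broken) + 1):
        # Try the end-tag at every character offset, including the very end.
        candidate = broken[:i] + end_tag + broken[i:]
        try:
            ET.fromstring(candidate)
        except ET.ParseError:
            continue  # Not well-formed at this offset; skip it.
        if candidate not in repairs:
            repairs.append(candidate)
    return repairs


if __name__ == "__main__":
    # Hypothetical input: the </b> end-tag has been left out.
    for repaired in candidate_repairs("<a><b>x<c>y</c>z</a>", "b"):
        print(repaired)
```

Every document it prints is well-formed, and each one gives the b element different content. Nothing in the input says which one the author meant.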

The only way to fix a malformed document is if you know what the malformed document is supposed to look like, and even then you aren’t always sure. There are big arguments on the TagSoup mailing list about how and where TagSoup should fix broken HTML because sometimes it’s broken one way and sometimes another. Applying *human* intelligence is still the most reliable way to fix this stuff.

Comment by Mark #
2004-09-14 06:52:46

Paul Prescod, May 7, 1997:

There is still so much room for a document author to screw up that well-formedness is a very minor step down the path. The idea that well-formedness-or-die will create a “culture of quality” on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system.

On a completely unrelated note, Elliotte, your home page is well-formed XML but doesn’t validate as XHTML.

 
 