A little chip in the concept

I don’t much mind admitting that I don’t really understand DTDs. The last time I tried to do anything with them, they soundly whipped me, and I wound up saying things I regret twice over: once for saying them, and again because they are virtually the only words of mine on w3.org, in the only entry of mine that has ever been (briefly) linked from w3.org. DTDs are either beyond me, or not obviously useful enough to me to persuade me to do the work required to really understand them.

Still, I was a bit surprised when Xiven linked to a post to the validator mailing list pointing out that the utterly wrong markup <a href=""><b><a href=""></a></b></a>, which the validator reports as invalid in HTML, passes without complaint in XHTML. Nesting links is one of those basic, there’s-absolutely-no-way-you-can-ever-do-this things, but in XHTML, if you put the nested link inside an inline element, the validator won’t catch it. According to Hixie’s answer, that’s because the validator uses an XML DTD for XHTML and an SGML DTD for HTML, and while you can say that a/b/a is wrong in an SGML DTD, you can’t in an XML DTD. As he puts it, in XHTML it’s XML-valid but non-compliant.
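For what it’s worth, the rule itself is easy to check outside a DTD. Here is a minimal sketch, assuming Python’s standard xml.etree.ElementTree and well-formed input; the helper name count_nested_links is made up for illustration and has nothing to do with how the validator works. It walks the tree and counts every a element that has another a anywhere among its ancestors, which is the "at any depth" condition an XML DTD has no way to state.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, so XHTML-namespaced tags also match."""
    return tag.rsplit("}", 1)[-1]


def count_nested_links(xml_text):
    """Count <a> elements that have an <a> ancestor at any depth.

    Hypothetical helper for illustration; assumes xml_text is well-formed XML.
    """
    root = ET.fromstring(xml_text)
    count = 0

    def walk(elem, inside_link):
        nonlocal count
        is_link = local_name(elem.tag) == "a"
        if is_link and inside_link:
            count += 1
        for child in elem:
            walk(child, inside_link or is_link)

    walk(root, False)
    return count
```

Run on the markup above wrapped in a p element, this should report one nested link, while two sibling links report none.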

I don’t suppose it really means much, other than that you can’t just trust the validator completely and blindly (as I generally have). Still, it knocks a little chip off my concept of how things are: I’ve always thought that you should validate your (X)HTML, always, and that if you did, and it was valid (modulo the possibility of validator bugs), then it was right. Now I can’t trust the validator to catch that error, and I don’t know what other known things there are that the validator doesn’t catch because it can’t catch them. It would be a little overboard to say it’s rocked the foundations of my world, or that it means XHTML is evil because it validates farther away from compliance (assuming there aren’t more problems with HTML that can’t be caught than there are with XHTML, which I don’t know, and don’t really know how to find out). But it still bothers me, days later, and it has left me wondering how many other things I trust to be absolutely correct arbiters are actually just approximate, with known flaws sanded over and patched with Bondo.

6 Comments

Comment by Dominic Mitchell #
2004-05-02 02:43:34

There are a number of prohibitions listed in the XHTML spec. Nested form elements are one that’s personally caught me out before. I do wonder whether a RelaxNG validator would be better able to catch this sort of error…

 
Comment by Dare Obasanjo #
2004-05-02 03:22:43

No XML schema language (not DTDs, XSD, RELAX NG, Schematron, etc.) can validate all the possible rules of an XML vocabulary. Some schema languages are more limited than others [Schematron being the least limited and XSD & DTD being the most], but you can always come up with rules in your vocabulary that cannot be expressed using conventional schema languages. E.g. some attribute value must be a prime number.

I tend to have to point this out to someone on our newsgroups or our internal mailing lists on a weekly basis. I find it interesting that it always comes as a surprise.
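To make the prime-number example concrete, here is a minimal sketch of the sort of check that ends up in application code instead of the schema. It assumes Python’s standard xml.etree.ElementTree; the function name non_prime_values and the attribute name count are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_prime(n):
    """Trial division; fine for the small values an attribute would hold."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def non_prime_values(xml_text, attr="count"):
    """Return every value of `attr` that is not a prime number.

    Hypothetical rule for illustration: elements without the attribute
    are skipped, and non-numeric values are reported as violations.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        value = elem.get(attr)
        if value is None:
            continue
        is_number = value.isascii() and value.isdigit()
        if not (is_number and is_prime(int(value))):
            bad.append(value)
    return bad
```

A document can pass its schema and still fail a check like this, so the check has to run as a separate step after validation.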

 
Comment by Simon Jessey #
2004-05-02 06:35:50

I have adopted a policy for validation that gets around this problem. I serve XHTML as application/xhtml+xml to user agents that accept that MIME type (and prefer it, going by the q-values in their Accept header), as outlined in my article on the subject. This means that if the page is not well-formed XML, the user agent chokes on it. I confirm that all is well by feeding the same page to the validator.

However, for user agents that do not accept application/xhtml+xml, or prefer not to, I serve HTML as text/html, after transforming it with some PHP trickery. I then validate that page as HTML. This combined approach is not as complicated as my description makes it sound, and it has the effect of making sure the same content validates against both the XML DTD and the SGML DTD.
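A rough sketch of the negotiation step, written in Python instead of the PHP mentioned above, and not the actual code from the article. The function names are hypothetical, and two choices are assumptions of the sketch: XHTML is only served when application/xhtml+xml is listed explicitly (a bare */* does not count), and a tie in q-values goes to XHTML.

```python
def parse_accept(header):
    """Parse an Accept header into a {media_type: q_value} dict."""
    prefs = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means q=1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_mime_type(accept_header):
    """Pick application/xhtml+xml only when it is listed and not less preferred."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    # Fall back to wildcards for text/html, but never for XHTML.
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"
```

The returned string would go into the Content-Type response header, with the text/html branch also triggering the transformation to HTML.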

 
Trackback by David Dorward's Blog #
2004-05-02 03:33:19

http://blog.dorward.me.uk/archives/000129.html

Validation is a step on the road of quality control, not the entire path.

 
Trackback by Caveat Lector #
2004-05-02 06:13:09

Valid != right

Gonna warm up my gimpier-than-ever hands before the last three pages of Org of Info paper with a few words on markup validation. Phil Ringnalda just discovered that XML-valid doesn’t mean right. This is a highly useful lesson that I wish everybod…

 
Trackback by Dare Obasanjo's WebLog #
2004-05-06 08:20:32

Knowing the Limitations of XML Schema Validation

 