Validation sustained

A mere 13.5 months after Jacques overruled my objection to serving a blog as application/xhtml+xml by twisting Alexei Kosut’s MTValidate plugin into a “you will submit valid XHTML comments!” plugin, I’ve finally caught most of the way back up (yeah, I still need to start serving the right Content-type:. Soon.).

For reasons beyond the ken of mortal man, my NightmareHost doesn’t actually include a copy of the oh-so-standard OpenSP SGML parser, and my last bout with compiling it led to lots of screaming and no working binary, but it seems that 1.5.1 is delighted to compile in user-space with a simple

./configure --prefix=$HOME/OpenSP
make
make install
cd ~/OpenSP/bin
./onsgmls -h

which gave me (along with much delight) Usage: ./onsgmls [OPTION] SYSID…: a working onsgmls!

Then, with Jacques’ helpful advice (which turned out not to be quite enough to overcome the demons of perl -MCPAN -e shell in user-space) and a package of the Perl modules I was missing (which was), all I had to do was drop his patched MTValidate in my plugins directory and learn the horrors of Perl’s $0. That’s the full path and name of the currently running script on the machine your developer uses, but only the name of the script on your machine, leading to vast confusion until you finally just put PluginPath /home/you/there/mt/plugins/ in your mt.cfg so you can use my $vdir = File::Spec->catfile(MT::ConfigMgr->instance->PluginPath, 'validator'); for an absolute path to the directory in plugins (thank you, bloody Config::General and your Apache-style Includes and your bloody IncludeRelative), and be done with it. A little “borrowed” template code for the Comment Preview template, a little patching up of the entry preview code in CMS.pm, and now when I preview an entry with my usual botched nested lists, my local copy of the validator will tell me about it, before I have to show it in public.
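
For the record, the workaround boils down to something like this (a sketch of the idea, not Jacques’ actual patch; the path is the same placeholder as above):

# mt.cfg: an absolute PluginPath, so $0 games no longer matter
PluginPath /home/you/there/mt/plugins/

# in the plugin: build an absolute path to plugins/validator
use File::Spec;
use MT::ConfigMgr;

my $vdir = File::Spec->catfile(MT::ConfigMgr->instance->PluginPath, 'validator');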

What do you have to do now, to comment? Nothing different, really. I’m still forcing previews, just like I have for months, and if you just type and don’t use any HTML, the preview will pat you on the back, say “You’re valid!” and give you a Post button. If you use HTML, poorly, or slip in an unencoded ampersand, it will do its best to tell you just what’s wrong, and give you a Preview button to see how your fixing went.
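
(What “do its best to tell you just what’s wrong” boils down to, more or less: wrap the comment in a skeleton XHTML document, hand it to onsgmls, and report whatever comes back on stderr. A rough sketch, not the actual MTValidate/validator code, with DTD and catalog handling hand-waved:)

use File::Temp qw(tempfile);

sub validate_fragment {
    my ($fragment) = @_;
    # wrap the comment in a minimal XHTML document so onsgmls has a
    # complete document to parse (real catalog/DTD setup not shown)
    my $doc = <<"XHTML";
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>preview</title></head>
<body><div>$fragment</div></body>
</html>
XHTML
    my ($fh, $file) = tempfile(UNLINK => 1);
    print $fh $doc;
    close $fh;
    # -s suppresses normal output; errors come back on stderr
    my @errors = `$ENV{HOME}/OpenSP/bin/onsgmls -s $file 2>&1`;
    return @errors;    # an empty list means “You’re valid!”
}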

What do I still have to do? Check a bunch of pages with Trackbacks, and with old comments that had interesting characters in them, to make sure that Alexei’s UTF-8 Hack plugin is really doing the job of turning random characters in random charsets into something that’s at least valid UTF-8, if not correct, then turn on Content-type: application/xhtml+xml for those browsers brave enough to ask for it, and then find something to make use of it: there was a time when I understood enough math to make some use of MathML, but that time’s long past. Meanwhile, I’ll just keep previewing my entries, and being told “Valid!” (and yes, I do enjoy using a debit card, since the “Approved” may be my only approval of the day).
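
(The “brave enough to ask for it” part is plain Accept-header sniffing; what I have in mind is no fancier than this sketch, which is not running here yet:)

# serve application/xhtml+xml only to browsers that explicitly accept it
my $accept = $ENV{HTTP_ACCEPT} || '';
my $type   = $accept =~ m{application/xhtml\+xml}
    ? 'application/xhtml+xml'
    : 'text/html';
print "Content-Type: $type; charset=utf-8\n\n";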

Oh, yeah, and I upgraded to MT 3.0D. A word of advice: if you are using Sean Willson’s rebuild type mod, so that your values for whether or not to automatically rebuild index templates don’t make sense to MT, after you upgrade be sure to run through the templates checking the now-unchecked boxes to rebuild, before you start to panic thinking you’ve broken something because you don’t get any indexes built.

15 Comments

Comment by Anne #
2004-06-22 00:19:54

Great! I hope you can switch to ”real” XHTML soon.

Comment by Phil Ringnalda #
2004-06-22 21:06:26

Wow. Aimed the HTML Help Validator’s spider at myself, and it looks like it might be a little while yet. Between the existing invalid comments (not that high a percentage, but out of 3000+, there’s still a lot) and all the times my referral script went mad (the regex (three problems, as it happens) that grabs the title went through several terrible iterations, where it would do things like grab an entire RSS feed, or big chunks of HTML from things with horribly broken <title> elements, and I don’t seem to have cleaned up nearly as many as I thought), I’ve got a ways to go still.

Comment by Pete #
2004-06-29 05:45:46

I’ve never understood why everyone makes a fuss over application/xhtml+xml when there are alternatives that are allowed. I’ve used text/xml a couple of times now and it seems to even work fine on the in-game browser for a MMOG I play (it only supports html 1!). If I’m missing something here please slap me round the face with a wet fish and explain it to me gently. Thanks.

Comment by Jacques Distler #
2004-06-29 07:09:08

The distinction that is relevant is between text/html and any of the allowed XML MIME types (application/xhtml+xml, application/xml or text/xml).

If you use text/html, then your document is handled by the tag-soup parser, and you do not have access to any of the cool extensions (like MathML) which have been introduced in recent years. If you use an XML MIME type, then your document is handled by the XML parser. That allows new features, but — if your document is not well-formed — the XML parser will puke and not render your page.

Assuring well-formedness at all times is a nontrivial task (once you allow comments and trackback and such). That’s what Phil’s grappling with here.
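
(If you want to see the difference for yourself, feed a page to any XML parser and watch it die on the first well-formedness error, which is exactly what an XHTML-as-XML browser does. A throwaway illustration, nothing to do with Phil’s actual setup:)

use XML::Parser;

my $page = do { local $/; <STDIN> };   # read a whole page from stdin
eval { XML::Parser->new->parse($page) };
print $@ ? "Not well-formed, no page for you: $@" : "Well-formed.\n";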

Comment by Jacques Distler #
2004-06-22 01:17:27

Well, it’s about time, Phil!

In light of your experience, I’ve revised my instructions a bit.

I’m curious what you intend to do with Alexei’s UTF-8 hack. Is this for ISO-8859-1 text in your database? For Trackbacks (which, notoriously, don’t have a declared charset)?

It doesn’t deal with the infamous Windows-1252 problem (which, unless you do some transcoding, instantly renders your document ill-formed when you switch from ISO-8859-1 to UTF-8).

But, hey, once you’ve got built-in comment/entry validation, the rest seems quite doable …

Comment by Phil Ringnalda #
2004-06-22 08:24:06

Oy. I hadn’t actually counted the bits, to see whether it translated Windows-1252 to something sane, before translating iso-8859-1 to utf-8. Maybe I need to have iconv go 1252 to utf-8, rather than 8859-1 to utf-8.

Yep, going forward I mostly need it for Trackback: the entry right before this had an 8859-1 ping that was both invalid and improperly displayed.

Looking backward, I have (a very few) 8859-1 characters in the database, mostly names in comments or pings, and a large number of troublesome characters in my referrer listings, since it has just blindly grabbed page titles without thinking about encoding, and in PHP, it’s going to have a hard time ever thinking about it. Might have to just drop it, for now at least. But first, a more pressing problem: I searched out and copied-and-pasted a couple of names with accented vowels from comments, previewed and saw that I had a valid comment with the correct characters, but in the textarea I had transcoded garbage. I haven’t got a clue where, but somewhere along the line either I’m screwing up or it’s making a bad assumption, so if you preview more than once, you’ll have crap.

Comment by Phil Ringnalda #
2004-06-22 09:15:00

Ah, NoHTMLEntities 1 in mt.cfg fixes the preview transcoding problem. Wonder what else it breaks?

Comment by Jacques Distler #
2004-06-22 19:13:33

To fix Windows-1252 characters in Trackbacks (and legacy crap in your database), what you still need is MTStripControlChars. This converts the offending byte sequences, which are non-existent in UTF-8 (by Spec, a fatal error in an XML parser), and invalid (control) characters in ISO-8859-1.

Other characters, like é, may need to be transcoded, but I expect you have many fewer of those to deal with.

Comment by Phil Ringnalda #
2004-06-22 20:16:22

Oh, maybe so maybe no: all UTF8-Hack is doing is looking for characters that make it think it’s seeing 8859-1, and if it finds them it runs Text::Iconv to go from 8859-1 to UTF-8. Well, isn’t every 8859-1 character in the same place in Windows-1252? Look for the same 8859-1 characters, plus the Windows-1252 characters, and if you find any at all of either, tell Text::Iconv to do Windows-1252 to UTF-8, and you should be set in one shot, no?

Unless I’m misremembering the 8859-1:1252 relation. But isn’t that why browsers do Windows-1252 when they shouldn’t, because it’s 8859-1 with other stuff shoved into the wrong places? iconv -l says my server (apparently supported encodings and even names are server-dependent) would be happy to do WINDOWS-1252 to UTF-8.
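
(Something like this is what I’m picturing; a sketch, not what UTF8-Hack actually does, and it still blithely assumes the input isn’t already UTF-8, which is exactly the trap that bites me a couple of comments down:)

use Text::Iconv;

sub one_shot_to_utf8 {
    my ($text) = @_;
    # no high bytes at all? it’s plain ASCII, leave it alone
    return $text unless $text =~ /[\x80-\xFF]/;
    # treat the whole string as Windows-1252, since 1252 is 8859-1
    # plus printable characters stuffed into 0x80-0x9F
    my $conv = Text::Iconv->new('WINDOWS-1252', 'UTF-8');
    my $out  = $conv->convert($text);
    return defined $out ? $out : $text;   # iconv choked; punt
}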

Then, for major happy-fun-time: before I blew my hack away, I’d been GETting the page that was purportedly sending me a Trackback ping, to look for Trackback RDF in it, thinking that at some point I might want to only accept pings from people who play well with others. I don’t remember getting a single valid ping that didn’t, but what just occurred to me is that with only moderate pain, I ought to be able to guess from that same GET what encoding they probably use, and get around the incredibly awful lack of encoding information in the ping itself. Dump my list of possible encodings in a text file, and if it’s not something I can convert to UTF-8, sorry, $app->_response(Error => "I'm sorry, I can't understand a word you're saying in $guessed_encoding - would you mind repeating that in UTF-8?");
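
(Roughly this, if I ever get around to it; a sketch, nothing like it is actually running: GET the page that claims to be pinging, and look for a declared charset in the headers or in a meta tag.)

use LWP::UserAgent;

sub guess_ping_encoding {
    my ($url) = @_;
    my $res = LWP::UserAgent->new(timeout => 15)->get($url);
    return undef unless $res && $res->is_success;
    for ($res->header('Content-Type'), $res->content) {
        next unless defined;
        return lc $1 if /charset\s*=\s*["']?([\w.:-]+)/i;
    }
    return undef;   # no declared charset, so assume the worst
}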

Comment by Phil Ringnalda #
2004-06-23 01:25:28

Well, after staring at Tantek Çelik’s last name, properly encoded in UTF-8 in my database, properly displaying in the editing interface, but being turned into garbage in the page, I finally realized that, no, I don’t want MTStripControlChars, at least not in its current form. It thinks that characters are encoded in a single byte, and that if it sees one of its enemy bytes, that byte is a Windows-1252 character. Unfortunately, in a multibyte encoding, that means it turns a character into two (or more) bytes of crap.
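
(To make the failure concrete, a toy illustration, not the plugin’s actual code:)

# “Ç” in UTF-8 is the two bytes 0xC3 0x87, and 0x87 happens to be one
# of the Windows-1252 “enemy bytes” (the double dagger), so a
# single-byte substitution clobbers half of a perfectly good character:
my $utf8 = "Tantek \xC3\x87elik";            # already valid UTF-8
(my $broken = $utf8) =~ s/\x87/&#8225;/g;    # byte-wise “repair”
# $broken now has a bare 0xC3 lead byte followed by an entity:
# ill-formed UTF-8 *and* the wrong character.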

Comment by Jacques Distler #
2004-06-23 07:10:49

Ouch!

0x80-0x9F don’t occur as the lead byte of any legal UTF-8 sequence, but they can occur as a continuation byte. Ç is 0xC3 0x87, and my bozotic plugin will attack that continuation byte.

I guess I need to retract the suggestion that it’s safe for UTF-8. Presumably, some modified version of Alexei’s hack (which transcodes Windows-1252 to UTF-8) will work in that case.

Comment by Scott Johnson #
2004-06-22 19:56:24

“… now when I preview an entry with my usual botched nested lists, my local copy of the validator will tell me about it, before I have to show it in public.”

That sounds really nice. I’d love to see some additional output from the MT interface when I’m working on an entry.

Also, it’s good to see some sites upgrading to MT3.0D. I’ve started a new blog with it, but I’ve been afraid to tackle upgrading my 2.661 blogs so far. Perhaps that time has come. Other than the snafu with the rebuild type mod, were there any showstopping issues with the upgrade process?

Comment by Phil Ringnalda #
2004-06-22 20:26:35

It’s sweet. Both for (X)HTML and for RSS/Atom, I really like having my own copy of the validator (oops, I need to hack the Feedvalidator back into my saving process), where I can throw any little half-done hack at it.

Um, showstoppers. Not really. I’m not very happy with the way that the upgrade script loads up plugins and then shows the errors that you’ve never seen, because MT itself hides them from you, but I don’t think it actually affects the upgrade, just scares hell out of you at the worst possible time. The only other stumble I had was failing to upload CMS.pm, because I’d been hacking on my 2.661 copy more recently than they had changed the one in the tarball, and I missed seeing my FTP client decide that meant it didn’t need to bother uploading the 3.0 version. Made for an odd and (again) scary effect, where I only had permission to create entries, nothing more, and some other things acted very odd (amazing that anything ran at all, really), but after a bit of checking file mtimes I did figure it out. But other than that (three things which could easily be called my fault), despite having everything massively hacked up, it all just worked. I had to kill MT-Blacklist, of course, which worries me a bit, but then I hardly ever got blacklist denials, since so few spammers actually preview and then post. Without a pressing reason to upgrade, I’d give Jay another couple weeks (or is it down to just one now? I’ve lost track) to get 2.0 out, but the upgrade itself isn’t very painful. Mostly. I hope. I’m jinxing you, aren’t I?

Comment by Scott Johnson #
2004-06-22 20:44:04

You’re not jinxing me. I’ve researched plugin compatibility fairly thoroughly already. Other than plugins, I don’t have anything hacked up on my MT install, so I think I’ll be ok. And I don’t use MT-Blacklist because I haven’t really ever had a huge problem with crapflooding. The small amount of spam I’ve received has been very manageable in the MT 2.6x interface. I guess I have been lucky in that regard.

Thanks for the info on your upgrade. I’m feeling confident in upgrading now. Perhaps I’ll backup and upgrade as soon as I get back from vacation.

Comment by Joshua Kaufman #
2004-07-04 04:19:43

Tangentially related: did either you or Jacques make any progress on a plugin that strips things like smart quotes?
