Stray CDATA end tags fix

First of a series of quick fixes for early adopters of MT 2.6:

If you use the newly redesigned encode_xml filter on a tag that is automatically run through the new sanitize filter (as you probably do if you have a comment feed), or if you combine encode_xml with remove_html (as you quite likely do), then you’ve got a problem with your RSS feed. encode_xml now wraps anything that looks to it like HTML in a CDATA section. (A CDATA section starts with the less-than character followed by ![CDATA[, and ends with ]] followed by the greater-than character; it can cause surprising results if misinterpreted, so I’m talking around an example.) That’s mostly okay, since it makes for faster parsing. With old-style entity-encoded HTML, an XML parser has to go through the text character by character, looking for XML tags and entities that it needs to parse, while a CDATA section tells the parser that it doesn’t need to look. You are also a little less likely to break a parser with a CDATA section than with parsed text including entities.

However, because of the order in which the filters appear in {your MT directory}/lib/MT/Template/Context.pm, encode_xml runs first, adding the CDATA section if it sees HTML. Either sanitize or remove_html then runs after it, and both will remove the opening CDATA tag, leaving you with nothing but the closing tag and possibly naked, invalid XML. For a quick fix, edit Context.pm, find the section in sub post_process_handler that reads:

if ($local_args{'encode_xml'}) {
    $str = encode_xml($str);
}

cut it out of where it is, and paste it below

if (my $spec = $local_args{'sanitize'}) {
    require MT::Sanitize;
    if ($spec eq '1') {
        $spec = $ctx->stash('blog')->sanitize_spec ||
            MT::ConfigMgr->instance->GlobalSanitizeSpec;
    }
    $str = MT::Sanitize->sanitize($str, $spec);
}

so that both remove_html and sanitize get to run before encode_xml starts sticking in CDATA that will confuse them.
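
To see why the order matters, here is a minimal sketch in plain Perl. The two subs are simplified stand-ins I made up for illustration (remove_html_sketch and encode_xml_sketch are hypothetical names, and their bodies are assumptions, not the real MT code); they only imitate the behavior described above: one strips anything shaped like a tag, the other wraps anything that looks like HTML in a CDATA section.

```perl
use strict;
use warnings;

# Hypothetical stand-in for remove_html: strip anything shaped like a tag.
# (An assumption for illustration, not MT's actual implementation.)
sub remove_html_sketch {
    my ($str) = @_;
    $str =~ s/<[^>]+>//g;
    return $str;
}

# Hypothetical stand-in for the 2.6 encode_xml: wrap text that looks like
# HTML in a CDATA section, otherwise entity-encode it.
sub encode_xml_sketch {
    my ($str) = @_;
    if ($str =~ /<[^>]+>/) {
        return '<![CDATA[' . $str . ']]>';
    }
    $str =~ s/&/&amp;/g;
    $str =~ s/</&lt;/g;
    $str =~ s/>/&gt;/g;
    return $str;
}

my $entry = '<b>Bold</b> text';

# Shipped order: encode_xml first, then remove_html eats the CDATA opener.
my $broken = remove_html_sketch(encode_xml_sketch($entry));

# Fixed order: remove_html first, so encode_xml sees no HTML to wrap.
my $fixed = encode_xml_sketch(remove_html_sketch($entry));

print "$broken\n";   # Bold text]]>
print "$fixed\n";    # Bold text
```

In the first ordering the tag stripper treats the CDATA opener as just another tag and removes it, so only the stray closing marker survives; in the second there is nothing left for encode_xml to wrap.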

1 Comment

Trackback by Sam Ruby #
2003-02-15 18:12:00

CDATA in RSS descriptions

I’ve verified that the Radio’s aggregator omits the descriptions in Brad Wilson’s valid rss feed. My suggestion for supporting the widest range of aggregators is to add a remove_html filter for the description, and to provide the full content in cont

 