Now I have two problems
Movable Type gives you a choice for the content of comments: either you can have naked URLs converted to links, or you can allow HTML. That simplifies things for MT: if you don’t allow HTML, then any URL is naked, because any URL that was in an attribute in an HTML element has already been stripped out. However, it’s annoying: nobody ever reads anything, and hardly anybody ever actually looks at their preview, even when you force them to preview, so if you allow HTML in comments, you will get comments with naked, unlinked URLs. I would like to fix that, which is why I now have two problems, since MT uses a regular expression to autolink URLs.
The current code that does the linking is
$text =~ s!(http://\S+)!<a href="$1">$1</a>!g;
Simple as pie: anything starting with http://, followed by one or more characters that aren’t whitespace (space, tab, whatever), gets plopped into the href attribute and the content of a link. <a href="$1">[link]</a> would be a bit prettier, since people (not looking at their preview) tend not to notice that their huge URL could use a trip through a URL shortening service, but still, simple as pie.
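To make the difference between the two replacements concrete, here is a small sketch that runs both on a made-up comment. The sample text and URL are assumptions for illustration only; the substitution itself is the one quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample comment; the URL is made up for illustration.
my $comment = "See http://example.com/a/very/long/path for details";

# The substitution as it stands: the URL is both the href and the link text.
(my $full = $comment) =~ s!(http://\S+)!<a href="$1">$1</a>!g;

# The "[link]" variant: same href, short fixed link text.
(my $short = $comment) =~ s!(http://\S+)!<a href="$1">[link]</a>!g;

print "$full\n";
print "$short\n";
```

The first line printed repeats the whole URL as the link text; the second keeps the comment compact no matter how long the URL is.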
When you add in the possibility that things starting with http:// might already be in attributes of HTML tags, though, simple goes out the window. We don’t want to autolink <img src="http://whatever"/>. We could assert that the http:// either has nothing before it or is preceded by a character that isn’t a single or double quote ((^|[^'"])), but sadly enough, quotes aren’t actually required around attributes in HTML, and we would end up clobbering <a href = http://whatever> if it weren’t for the fact that Sanitize already clobbers that into <a>. That means that we don’t have to deal with unquoted attributes (as long as we are willing to depend on that near-bug in Sanitize), but what about spaces elsewhere? <a href=" http://example.com "> with spaces inside the quotes makes it through Sanitize, and seems to work in a browser. Whether or not it’s correct would seem to require deep study of RFCs and the SGML spec, and quite frankly I’m getting a bit tired of these two problems, before I’ve even gotten to problems like (http://example.com) or http://example.com., so I think I’ll live with the hope that nobody ever does that in my comments.
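The cases above are easier to judge when you see what the quote-guarded pattern does to each of them. This is only a sketch: autolink is a hypothetical helper name, not something in MT, and it assumes the captured leading character ($1) is put back in the replacement so it isn’t swallowed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MT): autolink any http:// URL that is
# at the start of the text or preceded by a non-quote character.
sub autolink {
    my ($text) = @_;
    # $1 is the character before the URL (or empty at the start); it is
    # re-emitted so the substitution does not eat it.
    $text =~ s!(^|[^'"])(http://\S+)!$1<a href="$2">[link]</a>!g;
    return $text;
}

my @samples = (
    '<img src="http://example.com/a.png"/>',    # quoted attribute: left alone
    '<a href=" http://example.com ">x</a>',     # space inside quotes: mangled
    '(http://example.com)',                     # closing paren joins the URL
    'Go to http://example.com.',                # trailing period joins the URL
);

print autolink($_), "\n" for @samples;
```

Only the first sample comes through untouched. The second ends up with a link nested inside the href attribute, and the last two drag the trailing punctuation into the href, because \S+ has no idea where a URL really ends.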
Changing munge_comment in lib/MT/Util.pm from:
sub munge_comment {
    my($text, $blog) = @_;
    unless ($blog->allow_comment_html) {
        $text = remove_html($text);
        if ($blog->autolink_urls) {
            $text =~ s!(http://\S+)!<a href="$1">$1</a>!g;
        }
    }
    $text;
}
to:
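sub munge_comment {
    my($text, $blog) = @_;
    unless ($blog->allow_comment_html) {
        $text = remove_html($text);
    }
    if ($blog->autolink_urls) {
        # Skip URLs preceded by a quote (already inside an attribute).
        # $1 puts back the character matched before the URL, which the
        # substitution would otherwise swallow.
        $text =~ s!(^|[^'"])(http://\S+)!$1<a href="$2">[link]</a>!g;
    }
    $text;
}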
appears to me to work unless you put spaces inside your quotes, and if you do, very bad things will happen and I’ll probably just delete your whole comment out of spite. Bloody regexthp.
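For anyone who wants to poke at the change outside of MT, here is a rough, self-contained sketch. StubBlog and this remove_html are hypothetical stand-ins (the real MT blog object and MT’s own remove_html are more involved), and the substitution here puts the captured leading character ($1) back in the replacement, which is an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an MT blog object: only the two settings
# that munge_comment consults.
package StubBlog;
sub new                { my ($class, %opt) = @_; return bless {%opt}, $class }
sub allow_comment_html { $_[0]{allow_comment_html} }
sub autolink_urls      { $_[0]{autolink_urls} }

package main;

# Crude stand-in for MT's remove_html: strips anything that looks like a tag.
sub remove_html {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    return $text;
}

# Autolinking now happens whether or not HTML is allowed in comments.
sub munge_comment {
    my ($text, $blog) = @_;
    unless ($blog->allow_comment_html) {
        $text = remove_html($text);
    }
    if ($blog->autolink_urls) {
        $text =~ s!(^|[^'"])(http://\S+)!$1<a href="$2">[link]</a>!g;
    }
    return $text;
}

my $html_on  = StubBlog->new(allow_comment_html => 1, autolink_urls => 1);
my $html_off = StubBlog->new(allow_comment_html => 0, autolink_urls => 1);
my $comment  = '<b>Look:</b> http://example.com/page and <a href="http://example.com/">this</a>';

print munge_comment($comment, $html_on),  "\n";
print munge_comment($comment, $html_off), "\n";
```

With HTML allowed, the naked URL gets linked and the existing quoted href is left alone; with HTML disallowed, the tags are stripped first and the naked URL is still linked.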
(Of course my first thought was just to steal from Mombo, but I can’t even figure out what Sam is or isn’t allowing before and after. Doing it pre-sanitize, so you escape mistakes, does seem to produce better results on things that slip through, though.)
This is exactly why I’m not such a big fan of regular expressions for automatic link conversions. I think you can do a better job using straightforward string manipulation; I’ve written up a technique that does exactly that here: Converting links without regular expressions