Now I have two problems

Movable Type gives you a choice for the content of comments: either you can have naked URLs converted to links, or you can allow HTML. That simplifies things for MT: if you don’t allow HTML, then any URL is naked, because any URL that was in an attribute in an HTML element has already been stripped out. However, it’s annoying: nobody ever reads anything, and hardly anybody ever actually looks at their preview, even when you force them to preview, so if you allow HTML in comments, you will get comments with naked, unlinked URLs. I would like to fix that, which is why I now have two problems, since MT uses a regular expression to autolink URLs.

The current code that does the linking is

$text =~ s!(http://\S+)!<a href="$1">$1</a>!g;

Simple as pie: anything starting http://, followed by one or more characters that aren’t whitespace (space, tab, whatever), gets plopped into the href attribute and the content of a link. <a href="$1">[link]</a> would be a bit prettier, since people (not looking at their preview) tend not to notice that their huge URL could use a trip through a URL shortening service, but still, simple as pie.
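To see what that substitution does in isolation, here is a minimal sketch that wraps it in a helper. The name autolink_naive is made up for illustration and is not part of MT; only the regular expression comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not an MT function): applies the one-line
# substitution quoted above to a string and returns the result.
sub autolink_naive {
    my ($text) = @_;
    $text =~ s!(http://\S+)!<a href="$1">$1</a>!g;
    return $text;
}

# A URL surrounded by whitespace is linked cleanly.
print autolink_naive('See http://example.com/page for details'), "\n";

# \S+ stops only at whitespace, so a sentence-ending period
# is swallowed into both the href and the link text.
print autolink_naive('It is at http://example.com.'), "\n";
```

The second call is the interesting one: the trailing period ends up inside the link, which is the closing-punctuation problem in miniature.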

When you add in the possibility that things starting http:// might already be in attributes of HTML tags, though, simple goes out the window. We don’t want to autolink <img src="http://whatever"/>. We could assert that the http:// either have nothing before it, or have any character that isn’t a single or double quote ((^|[^'"])), but sadly enough, quotes aren’t actually required around attributes in HTML, and we would end up clobbering <a href = http://whatever> if it weren’t for the fact that Sanitize already clobbers that into <a>. That means that we don’t have to deal with unquoted attributes (as long as we are willing to depend on that near-bug in Sanitize), but what about spaces elsewhere? <a href=" http://example.com "> with spaces inside the quotes makes it through Sanitize, and seems to work in a browser. Whether or not it’s correct would seem to require deep study of RFCs and the SGML spec, and quite frankly I’m getting a bit tired of these two problems, before I even got to problems like (http://example.com) or http://example.com., so I think I’ll live with the hope that nobody ever does that in my comments.
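A rough way to check which of those cases the quote guard catches is to run the guarded pattern as a plain match. The helper name would_autolink is an assumption for this sketch, and the test strings are only examples of the shapes discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the guarded pattern finds a URL
# it would turn into a link, 0 otherwise.
sub would_autolink {
    my ($text) = @_;
    return $text =~ m!(^|[^'"])(http://\S+)! ? 1 : 0;
}

# Quoted attribute: the character before http:// is a quote, so no match.
print would_autolink('<img src="http://example.com/a.png"/>'), "\n";

# Space inside the quotes: the character before http:// is a space,
# so the guard is fooled and the URL would be linked inside the tag.
print would_autolink('<a href=" http://example.com ">x</a>'), "\n";

# Ordinary naked URL in running text: matched, as intended.
print would_autolink('plain http://example.com text'), "\n";
```

Under these assumptions the three calls print 0, 1 and 1: the guard protects normally quoted attributes but not the space-padded one.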

Changing munge_comment in lib/MT/Util.pm from:

sub munge_comment {
    my($text, $blog) = @_;
    unless ($blog->allow_comment_html) {
        $text = remove_html($text);
        if ($blog->autolink_urls) {
            $text =~ s!(http://\S+)!<a href="$1">$1</a>!g;
        }
    }
    $text;
}

to:

sub munge_comment {
    my($text, $blog) = @_;
    unless ($blog->allow_comment_html) {
        $text = remove_html($text);
    }
    if ($blog->autolink_urls) {
        $text =~ s!(^|[^'"])(http://\S+)!<a href="$2">[link]</a>!g;
    }
    $text;
}

appears to me to work unless you put spaces inside your quotes, and if you do, very bad things will happen and I’ll probably just delete your whole comment out of spite. Bloody regexthp.
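One thing to watch in the replacement version of munge_comment above: the character captured by the guard is part of the match but is not put back, so the space (or paren) in front of a URL disappears. The following is only a sketch of one possible variant, not what MT ships: the helper name autolink_guarded is invented, and the set of trailing punctuation it refuses to link is a guess.

```perl
use strict;
use warnings;

# Hypothetical variant (not MT code): keeps the guard from above, puts the
# guarded character back via $1, and requires the URL to end on a character
# that is not whitespace or common closing punctuation.
sub autolink_guarded {
    my ($text) = @_;
    $text =~ s{(^|[^'"])(http://\S*[^\s.,;:!?)])}{$1<a href="$2">[link]</a>}g;
    return $text;
}

print autolink_guarded('see http://example.com. ok'), "\n";
print autolink_guarded('(http://example.com)'), "\n";
print autolink_guarded('<img src="http://example.com/a.png"/>'), "\n";
```

The trade-off is that a URL which legitimately ends in one of those punctuation characters loses it, and the space-inside-quotes case is still not handled.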

(Of course my first thought was just to steal from Mombo, but I can’t even figure out what Sam is or isn’t allowing before and after. Doing it pre-sanitize, so you escape mistakes, does seem to produce better results on things that slip through, though.)

15 Comments

Comment by Simon Willison #
2004-04-04 15:47:32

This is exactly why I’m not such a big fan of regular expressions for automatic link conversions. I think you can do a better job using straightforward string manipulation; I’ve written up a technique that does exactly that here: Converting links without regular expressions

Comment by Phil Ringnalda #
2004-04-04 16:22:14

I knew I was going to miss things not reading weblogs for four months.

Still, although you weren’t exactly targeting the same problem I am (mixed HTML and raw URLs), you suffer some of the same problems when you face that. Also, for both of us the requirement that a link start with a space (no (www.google.com) as noted in your comments) makes not requiring that it end with a space a bit less useful.

I suspect that the answer is don’t parse HTML with regular expressions in the other direction: go more complicated, use a real parser, rather than go simpler, use string manipulation.

The good news is, PHP 5 has a rather nice HTML parser :)

The bad news is, I seem to be in Perl at the moment, and I don’t have a clue how to use HTML::Parser.

 
Comment by Phil Ringnalda #
2004-04-04 17:00:23

Argh. I see by my comment feed that now I’m autolinking closing punctuation. Bloody regexthp.

Comment by Phil Ringnalda #
2004-04-04 17:03:23

And the “Key at …” comment in my PGP signature, invalidating it. Okay, never mind. If you can’t be bothered to link, your URLs won’t be linked.

Wow, there’s three hours of my life I’d like to have back.

Comment by Mark #
2004-04-04 19:28:33

You did all that in three hours?

Comment by Phil Ringnalda #
2004-04-04 19:42:33

Heh. I could feel that coming. Three hours clock-time: I aim for maximum distraction while “working” so that’s three hours of taking tangents into specs, reading blogs, filing bugs on two programs, deleting email, and kitty scritching. My mental regex block is big, but not so big that it takes three hours to come up with nine characters that will break more things than they fix.

Comment by Sam Ruby #
2004-04-04 20:18:45

Truth be told I can’t tell you with any precision what is allowed and what is not allowed by Mombo.

What exists in Mombo wasn’t designed, it has evolved… Initially all I supported was straight text, with no formatting other than respecting new lines and blank lines. Over time I found that people were entering in various markup which I would usually go back and fix up. Over time, I automated certain fixups. Eventually, preview entered into the mix, allowing people an opportunity to immediately see how what they entered would be presented.

Comment by Phil Ringnalda #
2004-04-04 21:06:34

Well, there is that: I have the same “gist growed” problem with my MT hacks. But what I was having trouble with was [\\s.:;?\\-\\]<\\(], which is what’s matched before (and after, with some things swapped) a naked URL. What little I know of regular expressions comes from PHP’s implementation of PCRE, where a fair bit of that doesn’t exactly make sense to me (- would seem to be either the range from to , or maybe a , the range from to , a ). At a guess, it’s actually “space period colon semicolon question-mark hyphen right-square-bracket less-than open-paren” (right-square-bracket? and the ending one matches both left and right?), but it doesn’t help with my problem: you encode bad things after you convert raw URLs, so <a href=" http://example.com "> winds up as &lt;a href=" <a href="http://example.com">[link]</a> "&gt; and it’s the commenter’s fault it looks stupid, but MT sanitizes before it autolinks, so I can’t count on it to escape a link-in-a-link and shift the blame to the commenter: I make invalid HTML, not an ugly display.

Then there’s the PGP problem, which is probably best solved by using pb’s method of saving two copies of signed comments, one as submitted and one with the signature stripped that you then muck around with, but even that doesn’t get around where I noticed it getting munged, when I previewed: sign, preview to get a nonce for the signed version, it gets munged in case you are just previewing to preview… bleah. I wish I only had two problems.

Comment by Jacques Distler #
2004-04-04 22:22:55

Then there’s the PGP problem, which is probably best solved by using pb’s method of saving two copies of signed comments, one as submitted and one with the signature stripped…

Actually, the best solution for PGP-signed comments was (and maybe will be) to use detached-signatures. The problem (and the reason why I never suggested them) is that

  1. Most GUI tools don’t give an easy option for creating detached signatures.
  2. You’d have to provide a separate textarea into which the commenter is supposed to paste the detached signature. No matter how well you document this, there’s no telling what dim-witted things people will do when presented with these two textareas.
 
Comment by Sam Ruby #
2004-04-04 22:41:51

Hint: backslash is an escaping character in Python, so in order to pass a backslash to the regular expression engine, it takes two (either that or you use something called raw strings).

“[+\\-*]” as a regular expression would match a single plus, minus, or a star.

Comment by Phil Ringnalda #
2004-04-04 23:11:48

Sure, I thought that had to be it, a double-backslash escapes the thing after it, well, unless the second backslash is actually a part of the thing, like \s.

In PHP’s PCRE? “[+*-]”. Four characters with special meaning in character classes, and ^ is only special as the first character and - is only special if it could be forming a range. It may not be powerful or elegant, but PHP sure is capable of telling me how to do something before my attention span runs out.

 
 
 
Comment by Mark A. Hershberger #
2004-04-06 11:04:41

Truth be told I can’t tell you with any precision what is allowed and what is not allowed by Mombo.

Heh. I thought people said this was a problem unique to Perl. Python has readability problems?

;)

 
 
Comment by Mark A. Hershberger #
2004-04-05 23:00:53

Why not try turning

s!(^|[^'"])(http://\S+)!…!g;

into

s![^'"]\s*(http://\S+)!…!g;

(Replacement text not shown.)

Comment by Phil Ringnalda #
2004-04-05 23:19:21

"Happy Fun With Regex" http://happyfun.com/regex

Now I have, what, seventeen problems?

Comment by Mark A. Hershberger #
2004-04-06 11:00:31

Ok, so I shouldn’t write code without testing it.

This works:

s{(?<!["'])(\s+)(http://\S+)}{$1 <a href="$2">$2</a>}g;

 
 
 