phil ringnalda : Smart convert breaks

Attention blogging toolmakers who want to implement a smart “convert line breaks” (Marcus, I think that’s you and only you):

One more to add to the list of places where you shouldn’t convert line breaks to <br> or <br />: anything from <pre> through </pre>.

Background:

As typically implemented by blogging tools, “convert line breaks” is boolean: either every single newline in the post is converted to an HTML <br> tag or no newlines are converted. Almost every novice or non-geeky user, and most experienced geeky users, will use “convert line breaks,” because it’s a pain to have to insert your own <p>, </p>, and <br /> tags. That’s fine up until the user wants to post a table, list, form, select, or preformatted section. Virtually nobody enters tables or lists as a single line, instead entering:

<table>
<tr>
<td>
Cell one
</td>
<td>
Cell two
</td>
</tr>
<tr>
<td>
Row two
</td>
<td>
Cell four
</td>
</tr>
</table>

which “convert line breaks” turns into:

<table><br>
<tr><br>
<td><br>
Cell one<br>
</td><br>
<td><br>
Cell two<br>
</td><br>
</tr><br>
<tr><br>
<td><br>
Row two<br>
</td><br>
<td><br>
Cell four<br>
</td><br>
</tr><br>
</table><br>

a glob of invalid HTML which browsers typically display by putting some, if not all, of the breaks first, and then showing the table. That’s not what the typical novice user expects when she pastes in the table code for yet another “What * are you” quiz. For that matter, having a <pre> section double-spaced isn’t what the typical experienced user expects—it took me several rebuilds to realize what was happening.
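To make the failure concrete, here is a minimal sketch of the boolean behavior in Python. It is not any particular tool's code; the function name is made up for illustration, and it assumes plain \n line endings.

```python
def naive_convert_breaks(text: str) -> str:
    """Put a <br> in front of every newline, with no regard for the markup.

    Hypothetical helper for illustration only; assumes "\n" line endings.
    """
    return text.replace("\n", "<br>\n")


if __name__ == "__main__":
    post = "<table>\n<tr>\n<td>\nCell one\n</td>\n</tr>\n</table>\n"
    # Every line, including the ones that hold nothing but a tag, gains a <br>.
    print(naive_convert_breaks(post))
```

Because the function never looks at what precedes the newline, the breaks end up between table tags, where no content is allowed.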

What shouldn’t be converted:

Ideally, a smart convert line breaks would not convert any breaks between <pre> and </pre>, and within a table or list would only convert line breaks in text, but not after tags. For example,

<td>
a line
a nother
</td>

should be converted to

<td>
a line<br />
a nother
</td>

but that might be a little too expensive to implement. For a simple, fairly smart convert line breaks, just skip over anything between <table and </table>, <(ol|ul|dl) and </(ol|ul|dl)>, <pre and </pre>, <select and </select>, and <form and </form>. For a fully smart version, don’t convert a newline immediately after start or end tags for any of:

p
br
table
caption
colgroup
col
thead
tfoot
tbody
tr
td
th
dl
dt
dd
li
ol
ul
blockquote
form
fieldset
legend
label
select
optgroup
option
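The simpler of the two approaches, skipping whole blocks, could be sketched roughly like this in Python. The names are made up for illustration, and two things are assumptions rather than part of the description above: elements of the same kind are not nested (a table inside a table would end the protected region too early), and one newline directly after the closing tag is treated as part of the block so that it does not pick up a break.

```python
import re

# Elements whose entire contents are left alone (the short list from the post).
SKIP = r"table|ol|ul|dl|pre|select|form"

# From an opening tag through the nearest matching close tag.
# Assumption: no same-element nesting. The trailing "\n?" is an added choice
# so the newline right after the close tag is not converted.
PROTECTED = re.compile(rf"<({SKIP})\b.*?</\1\s*>\n?", re.IGNORECASE | re.DOTALL)


def simple_smart_breaks(text: str) -> str:
    """Convert newlines to <br /> everywhere except inside protected blocks."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Ordinary text before the block: convert every newline.
        out.append(text[pos:match.start()].replace("\n", "<br />\n"))
        # The block itself: copy through unchanged.
        out.append(match.group(0))
        pos = match.end()
    # Whatever follows the last protected block.
    out.append(text[pos:].replace("\n", "<br />\n"))
    return "".join(out)


if __name__ == "__main__":
    print(simple_smart_breaks("intro\n<pre>\na\nb\n</pre>\noutro\n"))
```

The cost is that text inside a table cell or list item never gets a break, which is exactly the trade-off the simple version accepts.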

Now if only I could figure out where Movable Type actually does its conversion, I could try hacking in a huge regexp to match those, and see just how slow it would be.
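As a rough idea of what such a regexp might look like, here is a sketch in Python rather than Movable Type's Perl. It implements only the rule as stated: a newline that immediately follows a start or end tag from the list (optionally with trailing spaces or tabs) is left alone, and every other newline gets a <br />. The function names are hypothetical, and nothing here says how fast or slow the equivalent Perl would be.

```python
import re

# The element list from the post.
TAGS = (
    "p br table caption colgroup col thead tfoot tbody tr td th dl dt dd "
    "li ol ul blockquote form fieldset legend label select optgroup option"
).split()

# Group 1 (optional): a listed start or end tag sitting right before the newline.
# Group 2: the newline itself, with an optional carriage return.
NEWLINE = re.compile(
    r"(</?(?:" + "|".join(TAGS) + r")\b[^>]*>[ \t]*)?(\r?\n)",
    re.IGNORECASE,
)


def _replace(match: re.Match) -> str:
    if match.group(1):
        # Newline directly after a listed tag: leave the text as it was.
        return match.group(0)
    # Any other newline: put a break in front of it.
    return "<br />" + match.group(2)


def smart_convert_breaks(text: str) -> str:
    """Convert newlines to <br />, except right after the listed tags."""
    return NEWLINE.sub(_replace, text)


if __name__ == "__main__":
    print(smart_convert_breaks("<td>\na line\na nother\n</td>\n"))
```

Two limits are worth noting. The newline before a closing tag still gets a break, so the <td> example ends up with a trailing <br /> after its last line of text, which is more than the ideal output shows. And since pre is not in the list, a <pre> section would still need to be skipped as a whole.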

This entry was posted on Saturday, May 11th, 2002 at 10:13 am and is filed under blogging tech.

5 Comments

Comment by Marcus

2002-05-11 11:30:20

:) Thanks for the list… I’ll hack them in next time I’m pottering about with my code. (I was only handling the breaks in tables and lists up until now.)

Is the html_text_transform function in Util.pm handling the breaks/paragraphing in MT? I can’t tell.

Comment by Phil Ringnalda

2002-05-11 12:17:42

Ulp. That must be it. Now I just need three or four books on Perl and a better brain for regexps, and I’ll be ready to go. But I’m getting a call, and, um, I’m going into a tunnel now. The hack is in the mail. Um. Time to put my code where my mouth is, I guess.

And I see that Util.pm is also the home of spam_protect, which I’ve also been thinking could do with an upgrade: just hex-encoding the :, @, and . seems a bit too easy to decode, so I’ve been thinking (idly, so far) about converting it to Hossein’s style of document.writing several separate completely hex-encoded chunks. Again, code where my mouth is.

Comment by Phil Ringnalda

2002-05-11 18:09:36

Ah, yes. Now I remember why I’m not a professional programmer.

First step: divide the post into chunks of block-level HTML elements and not-block-level stuff. Wrap the not-block-level stuff in <p>/</p>, consuming zero to two \r?\n along the way. Within the formerly-not-block-level newly made paragraphs, replace any \r?\n\r?\n with </p><p>, then any remaining \r?\n with <br />.

Within a block-level element, there are three possible sorts of content: non-block-level stuff that should go in a paragraph, non-block-level stuff that shouldn’t, and nested block-level stuff that needs to be recursed. Don’t forget the block-levelish things like <td> that can contain block-level elements. By the time you are done, you’ve written most of a parsing engine for a browser.
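The first step alone, with no recursion into block-level elements, might be sketched like this in Python. The short BLOCK list, the function names, and the no-nesting assumption are all illustrative choices, not anything taken from Movable Type.

```python
import re

# Hypothetical short list of block-level elements; a real one would be longer.
BLOCK = r"table|ol|ul|dl|pre|blockquote|form"

# Assumption: top-level blocks only, with no same-element nesting.
BLOCK_RE = re.compile(rf"<({BLOCK})\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def trim_newlines(chunk: str, limit: int = 2) -> str:
    """Consume up to `limit` newlines from each end of the chunk."""
    for _ in range(limit):
        if chunk.startswith("\n"):
            chunk = chunk[1:]
        if chunk.endswith("\n"):
            chunk = chunk[:-1]
    return chunk


def to_paragraphs(chunk: str) -> str:
    """Wrap not-block-level text in <p>, then handle its inner newlines."""
    chunk = trim_newlines(chunk.replace("\r\n", "\n"))
    if not chunk.strip():
        return ""
    chunk = chunk.replace("\n\n", "</p><p>")  # blank line: paragraph boundary
    chunk = chunk.replace("\n", "<br />\n")   # remaining newline: line break
    return "<p>" + chunk + "</p>"


def convert(post: str) -> str:
    out, pos = [], 0
    for match in BLOCK_RE.finditer(post):
        out.append(to_paragraphs(post[pos:match.start()]))
        out.append(match.group(0))  # block-level chunk, copied unchanged
        pos = match.end()
    out.append(to_paragraphs(post[pos:]))
    return "\n".join(part for part in out if part)


if __name__ == "__main__":
    print(convert("one\ntwo\n\nthree\n\n<pre>\ncode\n</pre>\n\nfour\n"))
```

Everything inside a block-level chunk is copied through untouched here. Deciding which of its contents deserve paragraphs, and recursing into nested blocks, is the part that starts to look like a browser's parser.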

Comment by Jason

2002-11-16 15:58:14

I know that this is sort of an old thread, and you’ve probably solved this problem by now, but if you haven’t, Brad Choate has. You might want to take a look; I’m using it on my site, and it works beautifully.

Comment by Phil Ringnalda

2002-11-16 17:21:47

Hmm. I think I’ll stick with my hack, since I’m already overloaded with plugins, but every time I have to hack it back in after an upgrade I’ll regret not doing it as a plugin instead.