My fingerprints on your code

By an odd coincidence, in the last week I got my “code” (well, a few words or a couple of deletions) into Movable Type (a bit of HTML in bm_entry.tmpl), Blogger (style="height:100%" in the form around the posting textarea, to defeat Mozilla 1.1+ bug 161583), and Radio (getting rid of a couple of extra decodes that made it impossible to repost sample HTML from the RSS aggregator). The funny thing about the Radio fix is that I only knew where to look for the problem because of Matthew Ernest’s hint, but now he thinks they fixed it without him.

6 Comments

Comment by Shannon #
2002-10-09 12:32:55

Well then. Ask, and ye shall receive.

 
Comment by Bill Kearney #
2002-10-09 15:01:14

While it does seem like making things “more complicated” (and we know how some folks whine about that…) it’d be nice to have a few more options in Radio to work around this cruft.

Just having WYSIWYG and source seems fine until you start seeing how MT has the ability to handle line breaks. That’d be handy to control in Radio.

It’d be interesting to have a special setting that allowed control over encoding/decoding. Perhaps defaulting to utterly stripping encoding down to its barest form. This wouldn’t work for people actually wanting to embed markup. But most folks don’t want to do that. So give the markup folks a choice and then only enforce a single layer of encoding.

That’s to say in default mode rip it all out and put it in native encoding. Otherwise allow for ONE level of additional encoding.

And by native encoding I’m talking about using UTF-8 or whatever your local system handles for storage. UTF-8 with its ampersand-pound encoding can be fully handled in ASCII. (There’s some sorting headaches but that’s a whole other seizure-inducing headache…)

Stuff like HTML entities should ALWAYS get hacked out and converted to native characters. Using them just adds another layer of decoding on the reader end. The use of them is one reason handling embedded HTML is such a pain.
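
A rough sketch of the "rip it all out, then allow ONE level" idea described above, written in Python rather than anything Radio actually runs. The function names `to_native` and `encode_once` are made up for illustration, and the approach assumes nobody wanted a literal entity to survive, which is exactly the case the comment says needs a separate setting.

```python
import html


def to_native(text: str, max_rounds: int = 10) -> str:
    """Decode entities repeatedly until nothing changes (default mode)."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # no encoding layers left
        text = decoded
    return text


def encode_once(text: str) -> str:
    """Strip every layer, then apply exactly one level of escaping."""
    return html.escape(to_native(text), quote=False)


if __name__ == "__main__":
    print(to_native("&amp;amp;lt;b&amp;amp;gt;"))
    print(encode_once("&amp;lt;b&amp;gt; caf&eacute;"))
```

The `max_rounds` cap is only a safety belt; real input rarely has more than two or three layers.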

This isn’t easy but it’s not the nightmare some folks make it out to be.

 
Comment by Phil Ringnalda #
2002-10-09 15:39:48

Yep, sorry, and thanks for the reminder, sweetie.

Dunno if I’d praise MT for line breaks (other than by comparison, anyway), since that’s the one hack I’ve got that wasn’t folded into 2.5: it’s still all or nothing, either you get a <br /> after every line in stuff like <pre>, <blockquote>, <li>, etc., or you don’t get any at all.

I would have thought that HTML entities would be the easiest part: in PHP (stolen from the example in the manual) it’s just

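// Turn named HTML entities back into the characters they stand for.
function unhtmlentities ($string)
{
	// Table maps character => entity, e.g. '<' => '&lt;'
	$trans_tbl = get_html_translation_table (HTML_ENTITIES);
	// Flip it to entity => character so strtr() decodes instead of encodes
	$trans_tbl = array_flip ($trans_tbl);
	return strtr ($string, $trans_tbl);
}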

Even if you have to write your own translation table, it still seems pretty simple, compared to doing intelligent conversion from HTML to text without stripping the meaning. The usual alternative to embedding HTML, just stripping it out, seems worse than the disease to me, since if you strip <blockquote> and <a href> from most blog entries, you end up with little or no meaning (or, worse, not the original meaning). I think if I was going to try to read RSS on a cellphone (which I won’t, ever), I’d probably run it through a proxy with a little more muscle to turn it into something I could handle, rather than expecting the entire PC-using world to limit the amount of information they include in their feed just because I can’t keep up. But for the moment, mostly, I like the idea of a text <description> and HTML in <content:encoded>, even though it’s a little weird that the most widely supported RSS 1.0 module is a proposed addition that has gotten a total of zero comments.
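
For a sense of what "conversion from HTML to text without stripping the meaning" might look like, here is a small Python sketch. It handles only the two tags named above, and the class and function names (`MeaningfulText`, `html_to_text`) and the output conventions (a "> " prefix for quotes, the URL in angle brackets after link text) are assumptions for illustration, not what any aggregator actually does.

```python
from html.parser import HTMLParser


class MeaningfulText(HTMLParser):
    """Flatten HTML to text but keep quotes and link targets visible."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.parts.append("\n> ")  # mark quoted text, mail style
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.parts.append("\n")
        elif tag == "a" and self.href:
            self.parts.append(f" <{self.href}>")  # keep the link target
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = MeaningfulText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(html_to_text(
        'He said <blockquote>no</blockquote> but see '
        '<a href="http://example.com/">this</a>.'
    ))
```

Even this toy version needs a real parser, which is roughly the point: decoding entities is a table lookup, while keeping the meaning is not.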

 
Comment by Matthew Ernest #
2002-10-09 22:28:14

I’m just amazed that there’s anyone looking at this bread cast upon the waters. Hello Mr. referring page!

The mind boggles.

 
Comment by michel v #
2002-10-11 01:47:49

Bill, is there a way to output a UTF-8 character from its Unicode number?
The reason I ask is that I compiled a list of HTML entities and their numeric entity counterparts into an array that can be used for on-the-fly conversion. It also includes some conversion for smart quotes and the like (if you typed the text in MS Word and then pasted it, for example), and the numeric entities in the range 140-159 that work only on Windows are converted to standard numeric entities that work on every platform.
As a little bonus before conversion, there’s a little code to convert anything between ASCII #128 and #255 (French accents, for example) to numeric entities too. This was simple and straightforward, eh.
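
A filter along those lines might look something like the following Python sketch. It is not the filter described above: the function names are invented, and it leans on Python's built-in `cp1252` codec as an assumed stand-in for a hand-compiled table of the Windows-only values.

```python
import re


def _fix_windows_entity(match):
    """Map a Windows-only numeric entity to its real Unicode number."""
    n = int(match.group(1))
    if 128 <= n <= 159:
        try:
            n = ord(bytes([n]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # a few values are unassigned in Windows-1252; leave them
    return f"&#{n};"


def standardize_numeric_entities(text: str) -> str:
    """Rewrite entities like &#146; as platform-neutral ones like &#8217;."""
    return re.sub(r"&#(\d+);", _fix_windows_entity, text)


def escape_non_ascii(text: str) -> str:
    """Replace every character above 7-bit ASCII with a numeric entity."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


if __name__ == "__main__":
    print(standardize_numeric_entities("It&#146;s &#150; fine"))
    print(escape_non_ascii("caf\u00e9"))
```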

So, one could use that filter I made to first convert any non-7-bit-ASCII character to a numeric entity and THEN use a filter to output UTF-8 characters based on the number in the entities.

That is, only if it’s possible to output a UTF-8 char from its number. Is it?
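
It is possible: the bytes follow mechanically from the number. The Python sketch below spells out the bit-shuffling by hand so it can be ported to other languages; `utf8_bytes` is a made-up name, and in Python itself `chr(n).encode("utf-8")` gives the same result.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8, one to four bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point}")
    if code_point < 0x80:  # 7 bits: plain ASCII, one byte
        return bytes([code_point])
    if code_point < 0x800:  # up to 11 bits: two bytes
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:  # up to 16 bits: three bytes
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([  # up to 21 bits: four bytes
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


if __name__ == "__main__":
    # 8217 is the right single quote that &rsquo; stands for
    print(utf8_bytes(8217).hex())
```

The lead byte says how many bytes follow, and every continuation byte carries six bits of the number behind a `10` prefix.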


PHP’s HTML entities table is very limited; it’s a joke to use with real-life applications (if users use only stuff like &lt;, &quot; etc. it’s OK, but once they start using entities like &rsquo; the app is lost).
If you want to check, just call get_html_translation_table and then print_r the result. You’ll see the ridiculous size of it ;)

 
Trackback by Among Other Things #
2002-10-09 20:34:53

Phil Ringnalda: his hands are in my code

Sounds dirty, doesn’t it? It’s okay – it all makes sense if you read this. Incidentally, I’m also sending the ping mostly to see if he can receive pings after upgrading to MT 2.5 – I can’t, and the boards seem to suggest it’s mod_perl.

 