A better remove_html?

Has anyone done a smart replacement for Movable Type’s remove_html function? There aren’t very many people I read who still only stick an auto-generated excerpt or an HTML-free version of their posts in their RSS, but there are a few, and I’d like to be able to recommend something slightly less evil to them.

For instance, Meg’s cleanup post, featuring an unordered list written in the style common to many of us who’ve suffered through fairly dim autoparagraphing code, doesn’t have any whitespace, just markup separating the items, so in my aggregator I get

A recipe for Pickled Oysters with English Cucumber “Capellini” and DillA map of the Madaket (Nantucket) bus routeVarious torrents of things I never listened to, like Jon Stewart’s Crossfire appearanceMore strange .pdf files that I must have inadvertantly downloaded than I care to admitAn Excel spreadsheet from 4/2003 comparing the costs of purcasing an espresso machine to going to the local coffee shop to making due with my French Press pot at homeMy brother’s “updated” résumé from early 2004

Given all the lovely filters like Markdown and whatnot that will take plain text and turn it into nice HTML, it seems like we ought to have one that would do the reverse, but I’d settle for one that just knows enough to ensure there’s some form of whitespace in places where the HTML would guarantee a break. (Yes, I do realize exactly what it means that I’m not willing to fix it myself. But, hey! This ain’t open source: I’m supposed to bitch instead of fixing!)
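
For the "just enough whitespace" version, something along these lines might do. This is only a rough sketch in Python using the standard library's html.parser, not anything Movable Type ships; the names BLOCK_TAGS, BreakAwareStripper, and remove_html_keep_breaks are made up for illustration, and the list of tags that "guarantee a break" is my guess at a reasonable starting set rather than a complete one.

```python
import re
from html.parser import HTMLParser

# Assumption: an illustrative, incomplete set of tags whose rendering
# implies a line break before and after their content.
BLOCK_TAGS = {
    "p", "div", "br", "hr", "li", "ul", "ol", "dl", "dt", "dd",
    "blockquote", "pre", "table", "tr",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


class BreakAwareStripper(HTMLParser):
    """Drops all tags, but emits a newline wherever a block-level tag was."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_keep_breaks(html):
    """Hypothetical stand-in for remove_html that preserves item boundaries."""
    parser = BreakAwareStripper()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of spaces and tabs, then squeeze any whitespace
    # around a newline down to a single newline.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(remove_html_keep_breaks("<ul><li>A recipe</li><li>A map</li></ul>"))
```

Fed a list with no whitespace between the items, this puts each item on its own line instead of running them together, while inline tags like em simply disappear without adding a break.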

4 Comments

Comment by Ben Tucker #
2005-02-17 21:10:23

Aaron’s html2text isn’t bad. And as an added bonus the output is valid Markdown.

Comment by Phil Ringnalda #
2005-02-17 21:21:58