gzip: well, sometimes it’s simple

For quite a while, our mantra about weblogs and especially weblog feeds has been “enable gzip compression to cut everyone’s bandwidth, it’s simple.” Mostly, it is. But, if you have a taste for that sort of thing, bug 241085 makes for fascinating reading about when it isn’t just that simple.

For a while now, I’d been noticing that in Firefox some Blog*Spot blogs would load partway just fine, and then turn into garbage characters for the rest of the page, but I didn’t think too much about it: a reload would fix it, and I don’t seem to read that many Blog*Spotters outside my aggregator anyway, and it didn’t affect any of my blogs, so I didn’t bother trying to chase it down. As complicated as it is, I wouldn’t have chased it down beyond a report, anyway.

Blogger doesn’t know the character encoding of your blog, and you can’t tell Blog*Spot to send an HTTP header with the encoding. Ideally, you would like to send Content-Type: text/html; charset=UTF-8 if your page is using utf-8, because that way browsers know right from the start what you are using. On Blog*Spot, though, your only option is a <meta http-equiv="content-type" content="text/html; charset=UTF-8"> in the HTML itself. Amusingly enough, that was originally intended as something your server would parse, looking inside every HTML file it served and setting headers accordingly, but nobody ever implemented it that way, so it’s now treated as something for the browser to read, and then pretend that it got it as a header before it started parsing, rather than afterward.
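To make the order of precedence concrete, here is a minimal sketch of how a client might choose an encoding: the HTTP header first, then a meta tag found in the first chunk of the body, then a fallback. The function name pick_charset, the regular expressions, and the iso-8859-1 default are my own illustration, not how Firefox actually does it.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta ...> tag in raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def pick_charset(content_type_header, body_prefix, default="iso-8859-1"):
    """Return the encoding a client might assume for a page."""
    # 1. A charset in the Content-Type header wins, if the server sent one.
    if content_type_header:
        m = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type_header, re.IGNORECASE)
        if m:
            return m.group(1).lower()
    # 2. Otherwise look for a meta tag in the bytes received so far.
    m = META_CHARSET.search(body_prefix)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Otherwise fall back to the assumed default.
    return default
```

The catch is step 2: the client has to have already started reading the body before it finds out it guessed wrong.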

So, say you aim Firefox at a Blog*Spot-hosted blog that’s in utf-8. Not getting a Content-Type header, it assumes iso-8859-1, and starts saving the gzip’d file to the cache and parsing it. But before long, it hits that meta tag, and slams on the brakes. This isn’t iso-8859-1, it’s utf-8! For reasons I don’t understand, but which are just assumed to be perfectly correct in the bug comments, at that point Firefox stops, and re-requests the file. However, since it already has part of it saved, it only requests the rest of it, with a Range: bytes=4068- header to say it only wants the part after the first 4068 bytes.
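The resume request described above can be sketched in a few lines. The helper names are hypothetical, and the Content-Range value in the usage note is made up for illustration; only the Range: bytes=4068- header comes from the bug.

```python
def resume_request_headers(cached_bytes):
    """Headers asking only for the bytes after what is already cached."""
    return {"Range": f"bytes={cached_bytes}-"}


def parse_content_range(value):
    """Parse a 'bytes first-last/total' Content-Range value into three ints."""
    _unit, _, rest = value.partition(" ")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), int(total)
```

For example, resume_request_headers(4068) gives {"Range": "bytes=4068-"}, and a reply carrying Content-Range: bytes 4068-9999/10000 parses to (4068, 9999, 10000). Nothing in that exchange says whether the byte offsets count compressed or uncompressed bytes, which is where the trouble starts.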

Fine and dandy, Blog*Spot replies with a 206 Partial Content, but this time it doesn’t gzip the content, just sends it uncompressed. That’s where things go garbled. The bug makes it sound like Firefox got 4068 bytes of the compressed entity identified by Etag: "11a8085-bc3b-40c65e8b", and it wanted the rest of that compressed entity, but Blog*Spot’s server wanted to give it everything after 4068 bytes of the uncompressed page, despite claiming in the reply that it was sending the rest of that same entity, with the same Etag. However, if that were the case, it wouldn’t make sense that a refresh would clear it up (by just loading the whole page in one shot from the disk cache), since there would be a broken spot in there somewhere. So it must be that Blog*Spot sends the correct bytes, but because it claims to be sending the same entity, despite not saying it is gzip’d this time, Firefox still treats it as gzip’d and ungzips the nongzip’d content. I think.
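Whichever reading is right, the common thread is that a gzip’d first part and an uncompressed second part get run through the same ungzip step. Here is a rough sketch of that, not a reproduction of what Firefox or Blog*Spot actually did: the page content, the cut point, and the function name simulate_mixed_entity are all invented for the illustration.

```python
import gzip
import zlib


def simulate_mixed_entity(page, cut):
    """Feed a gzip'd prefix, then uncompressed bytes, to one gzip decoder."""
    compressed = gzip.compress(page)
    cached_prefix = compressed[:cut]   # what the cache holds from the first reply
    plain_tail = page[cut:]            # uncompressed bytes from the 206 reply

    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip format
    good_part = decoder.decompress(cached_prefix)      # this part decodes fine
    try:
        # The decoder still expects deflate data, so plain HTML is nonsense to it.
        result = good_part + decoder.decompress(plain_tail)
        error = None
    except zlib.error as exc:
        result = good_part
        error = str(exc)
    return good_part, result, error


# A made-up page, cut halfway through its compressed form.
page = "".join(f"<p>post {i}: {i * i}</p>\n" for i in range(500)).encode("utf-8")
cut = len(gzip.compress(page)) // 2
good_part, result, error = simulate_mixed_entity(page, cut)
```

Here good_part is a clean beginning of the page, which matches the “loads partway just fine” symptom, and the rest never comes out as the original page. This decoder may simply raise an error on the tail, where a browser apparently soldiers on and shows junk.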

After much discussion of who was right, the bug seems to have tapered off by sending Greg Stein, who is conveniently enough both the engineering manager for the Blogger team and an Apache httpd developer, off to fix Blog*Spot’s servers, so in the meantime, just let it load as junk, and then refresh to get it ungarbled. Or, evangelize whoever you are reading to enable a full-content feed, so you don’t have to read them in your browser: “really, I need full content, because you’re garbled in my browser – see, here’s a screenshot!”

(via Matt to jimfl to a comment from Bill Stilwell)


Comment by Phil Wilson #
2004-06-09 03:29:30

Ah, fascinating! I’d been wondering what had been going wrong with Blog*Spot blogs, although in my case I was sometimes having to refresh up to ten times to get it to display properly.

Comment by Jacques Distler #
2004-06-09 08:20:41