gzip: well, sometimes it’s simple

For quite a while, our mantra about weblogs and especially weblog feeds has been “enable gzip compression to cut everyone’s bandwidth, it’s simple.” Mostly, it is. But, if you have a taste for that sort of thing, bug 241085 makes for fascinating reading about when it isn’t just that simple.

For a while now, I’d been noticing that in Firefox some Blog*Spot blogs would load partway just fine, and then turn into garbage characters for the rest of the page, but I didn’t think too much about it: a reload would fix it, and I don’t seem to read that many Blog*Spotters outside my aggregator anyway, and it didn’t affect any of my blogs, so I didn’t bother trying to chase it down. As complicated as it is, I wouldn’t have chased it down beyond a report, anyway.

Blogger doesn’t know the character encoding of your blog, and you can’t tell Blog*Spot to send an HTTP header with the encoding. Ideally, you would like to send Content-Type: text/html; charset=UTF-8 if your page is using utf-8, because that way browsers know right from the start what you are using. On Blog*Spot, though, your only option is a <meta http-equiv="content-type" content="text/html; charset=UTF-8"> in the HTML itself. Amusingly enough, that was actually originally intended as something your server would parse, looking inside every HTML file it served and setting headers accordingly, but nobody ever implemented it that way, so it’s now treated as something for the browser to read, and then pretend that it got it as a header before it started parsing, rather than afterward.
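To make that fallback order concrete, here is a rough Python sketch of how a client might pick an encoding: trust the Content-Type header if it names a charset, otherwise look for the meta tag near the top of the body, otherwise fall back to iso-8859-1. The function names, the 4096-byte sniffing window, and the regular expression are all my own simplifications for illustration, not what Firefox actually does.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Pull the charset parameter out of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def sniff_charset(content_type, body, default="iso-8859-1"):
    """Header first, then the meta tag, then the default guess."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    # Only look at the start of the document (the window size is an assumption).
    match = META_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

With a header of text/html; charset=UTF-8 the answer is known before a single byte of the page is parsed. With only the meta tag, the client has already started reading under its default guess by the time it finds out it was wrong.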

So, say you aim Firefox at a Blog*Spot-hosted blog that’s in utf-8. Not getting a Content-Type header, it assumes iso-8859-1, and starts saving the gzip’d file to the cache and parsing it. But before long, it hits that meta tag, and slams on the brakes. This isn’t iso-8859-1, it’s utf-8! For reasons I don’t understand, but which are just assumed to be perfectly correct in the bug comments, at that point Firefox stops, and re-requests the file. However, since it already has part of it saved, it only requests the rest of it, with a Range: bytes=4068- header to say it only wants the part after the first 4068 bytes.

Fine and dandy, Blog*Spot replies with a 206 Partial Content, but this time it doesn’t gzip the content, just sends it uncompressed. That’s where things get garbled. The bug makes it sound like Firefox got 4068 bytes of the compressed entity identified by Etag: "11a8085-bc3b-40c65e8b", and it wanted the rest of that compressed entity, but Blog*Spot’s server wanted to give it everything after 4068 bytes of the uncompressed page, despite claiming in the reply that it was sending the rest of that same entity, with the same Etag. However, it wouldn’t make sense that a refresh would clear it up (by just loading the whole page in one shot from the disk cache) if that were the case, since there would be a broken spot in there somewhere, so it must be that Blog*Spot sends the correct bytes, but because it claims to be sending the same entity, despite not saying it is gzip’d this time, Firefox still treats it as gzip’d and ungzips the nongzip’d content. I think.
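Here is a small Python sketch of the mismatch as the bug describes it, not a reconstruction of what Blog*Spot's servers or Firefox really did. The helper names serve_range and reassemble are hypothetical, the page is made-up filler, and the split point is simply half of the compressed length, standing in for the 4068 above. The client caches the first part of the gzip'd entity and asks for the rest with a Range header; the range is then cut either from the same gzip'd bytes or from the uncompressed page.

```python
import gzip
import zlib


def serve_range(entity: bytes, range_header: str) -> bytes:
    """Return the bytes that an open-ended 'bytes=N-' range selects from entity."""
    unit, _, spec = range_header.partition("=")
    start, _, end = spec.partition("-")
    if unit != "bytes" or end:
        raise ValueError("this sketch only handles open-ended byte ranges")
    return entity[int(start):]


def reassemble(cached_prefix: bytes, rest: bytes):
    """Join the cached part with the 206 body and gunzip it; None if that fails."""
    try:
        return gzip.decompress(cached_prefix + rest)
    except (OSError, EOFError, zlib.error):
        return None


# Made-up page content, purely for illustration.
page = ("<html>" + "".join(f"<p>post {i}</p>" for i in range(2000)) + "</html>").encode("utf-8")
gzipped = gzip.compress(page)

split = len(gzipped) // 2          # stand-in for the 4068 in the real request
cached = gzipped[:split]           # what the client already saved
range_header = f"bytes={split}-"   # what the client asks for next

# Range taken from the same gzip'd entity: the two halves fit together.
good = reassemble(cached, serve_range(gzipped, range_header))

# Range taken from the uncompressed page: same offset, different entity.
bad = reassemble(cached, serve_range(page, range_header))
```

In the first case the client gets the original page back. In the second, the offset points into a different sequence of bytes entirely, so gluing the two parts together and ungzipping them cannot produce the page.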

After much discussion of who was right, the bug seems to have tapered off by sending Greg Stein, who is conveniently enough both the engineering manager for the Blogger team and an Apache httpd developer, off to fix Blog*Spot’s servers, so in the meantime, just let it load as junk, and then refresh to get it ungarbled. Or, evangelize whoever you are reading to enable a full-content feed, so you don’t have to read them in your browser: “really, I need full content, because you’re garbled in my browser – see, here’s a screenshot!”

(via Matt to jimfl to a comment from Bill Stilwell)


Comment by Phil Wilson #
2004-06-09 03:29:30

Ah, fascinating! I’d been wondering what had been going wrong with Blog*Spot blogs, although in my case I was sometimes having to refresh up to ten times to get it to display properly.

Comment by Jacques Distler #
2004-06-09 08:20:41

I’ve found that you don’t need to reload the page, merely change the character-encoding manually (View->Character_Encoding->Unicode_(UTF-8)).

Comment by Phil Ringnalda #
2004-06-09 08:55:36

Ah, I was hoping I could become even more confused!

So, before you change encodings, it hasn’t yet gotten away from its first guess of iso-8859-1? And, is that significant, the display problem is the result of an unintegrated change in encoding midstream, or is an encoding change practically speaking just another way to reload from the cache?

Comment by Jacques Distler #
2004-06-09 12:14:06

Clearly, changing the encoding setting causes Mozilla to re-render the page. I have no idea whether it reloads it from the cache or not (I would guess not).

But, certainly, you don’t issue another HTTP request to check whether the cache is stale.

Comment by aroon #
2004-06-09 12:03:05

ahhh…this was happening to a friends blog and i always wondered why. now i know!

Comment by Chris L #
2004-06-09 15:33:35

Interesting– I’m glad someone finally figured out what was causing this problem with BlogSpot and Mozilla!

Comment by kodiak #
2004-06-10 05:26:25

So now, I know. Thanks for spelling it clearly!

Comment by Anonymous #
2004-06-10 16:58:40

Let’s say I delete my blog on blogger, but I can still see some of the pages. How can I get rid of that? The main page doesn’t show anything but some of the archives are still there.

Comment by Phil Ringnalda #
2004-06-10 19:42:07

Well, my answer for two years was “yer screwed, should have asked me before you deleted it,” then for another year it was “yer screwed if you aren’t a Blogger Pro, should have…,” but now, between the fact that Blogger’s help says that it should delete files on Blog*Spot unless you are using Blog*Spot Plus, in which case you can use FTP to delete them yourself, and the fact that you can now backdate posts even without Pro, I’d say:

  • If you have Blog*Spot Plus, fire up an FTP client and delete them yourself
  • If you don’t have Plus, file a support request with Blogger, telling them that deleting a Blog*Spot blog didn’t delete all the files, and telling them which ones.
  • If they don’t respond, and leave you hanging, create a new blog at the same URL, add as many test posts as there are archive files still hanging around, then use the “more post options” to backdate the posts so there is one for each archive file’s timespan, so they will overwrite the files with nothing but “Test” or “Gone” or whatever suits you, publish the blog, then delete it. Who knows, maybe deleting the new one will delete the files.

Trackback by Neil's Smaller World #
2004-06-09 02:53:12

Gzip: Well, sometimes it’s simple

Phil on the Firefox Blog*Spot bug. Something to do with gzip and character encoding, apparently.

Trackback by linklog #
2004-06-13 06:28:23


phil ringnalda dot com: gzip: well, sometimes it’s simple. Explaining why we are sometimes presented with a load of rubbish when trying to load Blog*Spot pages….

Trackback by Gavin's Blog #
2004-09-30 16:30:23

HTTP Gzip Woes

We’ve had reports over the last couple months or so about getting garbled pages with Mozilla from our web…

Comment by Dave #
2007-04-25 10:13:10

We’ve been having this problem on a standard non-bloggy Apache install and are trying to figure out if maybe it’s the encoding. @*(!&@^#*^&!@! Firefox. Developers seem to have an ego the size of Microsoft’s sometimes – have to redo Javascripts to use GetElementById instead of simple div.form.value … and now this…

Comment by Phil Ringnalda #
2007-04-25 11:03:50

I know, it’s crazy: who would be foolish enough to think that it’s better for anyone to correctly implement published standards, instead of reverse-engineering whatever Microsoft has ever happened to implement?
