MT’s completely search-friendly dynamic URLs

The first few times I saw people worrying that the new dynamic PHP-based publishing in Movable Type 3.1x would mess up their permalinks, and make Google stop loving them, I didn’t worry too much about it. But, today I saw Mr. Dooce saying:

Why not use the dynamic page generation in MovableType 3.1?? Have you seen the instructions for that?? You have to do a lot of monkeying around and what happens to all those Google archived links in search results and links from others? Sure, we could try to generate a redirect .htaccess file, but again, pain pain pain. I think one of the biggest strengths of MovableType is that it generates a living breathing HTML page for the archives. It’s also one of the reasons that so many blogs show up so high in search results.

So maybe there’s a need for a little discussion of how URLs work with dynamic publishing. Please don’t think I’m talking down to you (or to Jon, who I’m sure only needed the prod of someone saying the URLs don’t change, and probably won’t be reading this at all) in this entry; it’s just that I’ve noticed how your eyes glaze over when you expect that someone’s going to geek beyond you. Fear not, it’s pretty simple.

Here’s an example of a URL from before dynamic publishing:

http://www.dooce.com/archives/nubbin/09_10_2004.html

And here’s an example of a URL with dynamic publishing:

http://www.dooce.com/archives/nubbin/09_10_2004.html

You’ve got to admit, that doesn’t look like it’s going to confuse Google, does it? Every single character is exactly the same, in the exact same order. How’s that happen?

When you ask Apache, the web server program most of you use (we’ll get to non-Apache and less-able Apache in a bit), for that URL, it doesn’t go straight to that directory on the disk and serve it up. Unless your server is run by someone cruel, who doesn’t allow it, first it will look for a file in / named .htaccess, and will do things to the URL based on what’s in there. Then it does the same in /archives/ and then in /archives/nubbin/, and only then, if you haven’t changed the URL around with the instructions in any of those .htaccess files, will it look for a file named 09_10_2004.html and send it off to the browser that asked for it.

With .htaccess and an Apache extension called mod_rewrite, there’s absolutely no need for anyone at the other end to know just what you are doing when you send them /archives/nubbin/09_10_2004.html. For example, when you request my /index.php file, you actually get it from another directory, that you don’t even need to know the name of: it just looks like it’s in /.

MT’s dynamic publishing just takes that a step farther: for everything that you tell it you want to be dynamic, it moves the actual file aside (by renaming it to 09_10_2004.html.static), and then the .htaccess file and Apache have this conversation:

Apache: I want /archives/nubbin/09_10_2004.html
.htaccess: Is it a real file, on the disk?
Apache: (Looks) Nope, not there.
.htaccess: Is that the name of a directory?
Apache: (Peers at 09_10_2004.html, shakes head) D’oh. No.
.htaccess: Go ask this file named mtview.php for it – tell him what you want in the REQUEST_URI variable, and he’ll hook you up. And don’t bother telling the person who asked you that you didn’t just find the file.
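
If it helps to see that conversation as code, here’s a minimal Python sketch of the same decision. It is not MT’s or Apache’s actual code: the function name resolve_request, the about.html file, and the throwaway document root are all made up for illustration, and it only models the file test, the directory test, and the fall-through to mtview.php.

```python
import os
import tempfile


def resolve_request(docroot, request_uri):
    """Decide who answers a request, in the same order as the conversation above.

    Hypothetical helper for illustration only; the real decision is made by
    Apache and mod_rewrite, not by Python.
    """
    path = os.path.join(docroot, request_uri.lstrip("/"))
    if os.path.isfile(path):
        # A real file on the disk: serve it untouched.
        return ("static", path)
    if os.path.isdir(path):
        # A real directory: leave it to the normal directory handling.
        return ("directory", path)
    # Neither: hand the original URL to mtview.php.
    return ("mtview.php", request_uri)


# Build a throwaway document root so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as docroot:
    nubbin = os.path.join(docroot, "archives", "nubbin")
    os.makedirs(nubbin)
    # The dynamic page exists only as its renamed .static copy.
    open(os.path.join(nubbin, "09_10_2004.html.static"), "w").close()
    # An ordinary static file (assumed, just for contrast).
    open(os.path.join(docroot, "about.html"), "w").close()

    dynamic = resolve_request(docroot, "/archives/nubbin/09_10_2004.html")
    static = resolve_request(docroot, "/about.html")
    directory = resolve_request(docroot, "/archives/nubbin/")

print(dynamic[0], static[0], directory[0])
```

The URL that was asked for is passed along unchanged, which is the whole point: nothing the visitor sees has to differ.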

Apache tells mtview.php what it was originally looking for, mtview.php looks in a database table called mt_fileinfo to see what would be in that file if it was really there, and builds it up from scratch. Apache grabs that, sends it to your browser (or Googlebot, or anything else that asks), and doesn’t let on in any way that it didn’t find the real file. (Well, except for an X-Powered-By: PHP/4.3.6 header that nobody will notice.)

If you don’t get to use mod_rewrite, but you can have custom error pages (either through a .htaccess file or through something like cPanel), you can probably still use dynamic publishing, though you may wind up with a whole lot of 404 reports in your stats: just tell your server to use mtview.php for your custom error page, and it will try to create whatever file someone asks for, only returning a 404 if it’s not a URL that it knows about. And that’s also how you do it if you’re on a Windows server running IIS instead of Apache: tell it to use mtview.php as a custom error page.

Either way, there’s no reason for your permalinks to change, because they don’t really exist any more: once it’s working, you can delete all the old whatever.html.static files, and nothing changes. The URL is just a handy label that mtview.php uses to look up a row in the database that tells it things like “that’s a Monthly page, build it with template number 15, starting with 20040901” or “that’s an Individual entry, entry number 237, build it with template number 17.”
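
To make the “handy label” idea concrete, here’s a rough Python sketch of that lookup. Everything in it is a stand-in: the real mt_fileinfo is a database table read by mtview.php in PHP, and the dictionary layout, the key names, the two URLs, and the build_page function are assumptions invented for the example. Only the details of the two rows come from the descriptions above.

```python
# Stand-in for the mt_fileinfo table: URL -> what to build.
# The URLs and key names are hypothetical.
FILEINFO = {
    "/archives/2004_09.html": {
        "archive_type": "Monthly",
        "template_id": 15,
        "startdate": "20040901",
    },
    "/archives/000237.html": {
        "archive_type": "Individual",
        "template_id": 17,
        "entry_id": 237,
    },
}


def build_page(request_uri):
    """Return (status, body) for a URL, using only the lookup table.

    The body is a description of what would be built, not a real page.
    """
    row = FILEINFO.get(request_uri)
    if row is None:
        # Not a URL we know about: this is the only case that is a real 404.
        return (404, "Not Found")
    if row["archive_type"] == "Individual":
        what = "entry %d" % row["entry_id"]
    else:
        what = "entries starting %s" % row["startdate"]
    body = "%s page, template %d, %s" % (
        row["archive_type"], row["template_id"], what)
    return (200, body)


print(build_page("/archives/000237.html"))
print(build_page("/no/such/page.html"))
```

Notice that the lookup never touches the disk, which is why deleting the old .static files changes nothing.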

There are a few steps and requirements, and it’s not going to be right for everyone, but one thing you don’t have to worry about at all is your URLs.

16 Comments

Comment by Adam Kalsey #
2004-09-10 22:11:47

I realize that you’re trying to show in a conversational way how all this tech wizardry happens, but your conversation between Apache and .htaccess isn’t exactly how it happens.

The files can be there, but when Apache goes to grab it off the disk it first checks with .htaccess to see if there are any special instructions it should follow. .htaccess says, absolutely! You should ignore that file that’s on the disk and instead ask mtview.php for it, following all the other stuff about the REQUEST_URI and keeping it a big secret from the person on the other end.

Comment by Phil Ringnalda #
2004-09-10 22:31:50

Are you sure about that? I would interpret

  # don't serve mtview.php if the request is for a real file
  # (allows the actual file to be served)
  RewriteCond %{REQUEST_FILENAME} !-f

as meaning, well, pretty much what I said. Maybe a touch more accurate to have Apache say “I’m gonna go get this file, got a problem with that?” to which .htaccess replies “No problem, if it’s a real file, otherwise let’s try something a little different.” but basically, if there’s a file that Apache thinks matches the request, it gets served, which is why after changing to dynamic building for a particular template, you have to rebuild, to rename all the existing files built by that template.

Comment by Adam Kalsey #
2004-09-11 18:42:17

Umm, yeah, I probably should have read the actual htaccess for MT then, huh? I assumed they were doing a straight re-write without the condition.

That’s exactly what that line means. So if there’s a static file, serve it, otherwise use dynamic rendering. That’s rather clever really. So if I had some sort of reason to serve a single one of my archive files statically (being slashdotted or something) I could throw a file up there and bypass the dynamic generation.

Comment by Phil Ringnalda #
2004-09-11 19:24:29

Well, almost, nearly kinda sorta. In theory, you would want to survive a Slashdotting by going to Templates, setting “Build all templates statically”, editing the Slashdotted entry, closing comments, and saving it, so that you would replace balm3r_wuz_fir3d.html.static with balm3r_wuz_fir3d.html showing the entry and comments to that point, then resetting to build dynamically so you can go on with the rest of your life elsewhere. But then as soon as you add a new entry, or get a comment on the previous entry, or anything else that will cause BuildDependencies to rebuild your entry, the static file gets shoved aside again (same thing if you just put your own static file there: as long as MT thinks it should be dynamic, anything that saves that entry will shove the static file aside). So you would have to go static for anything that gets rebuilt for as long as you want one entry static.

Dunno, from what I hear from Matt getting Slashdotted with WordPress, running out of Apache connections is more of a worry than the load of PHP and MySQL. Be interesting to see, once someone with dynamic MT gets hit.

Comment by Joost Schuur #
2004-09-11 00:37:31

So does Google ever do anything to identify dynamically generated pages and treat them differently? I doubt there are any HTTP headers that say ‘I didn’t come from a static file’, but would it look at WordPress index.php URLs e.g. and treat them differently?

Data from robots.txt notwithstanding.

Comment by Phil Ringnalda #
2004-09-11 01:09:13

There was a time when people in the SEO world said that a URL with a query string that included a parameter with the letters “i” and “d” in it was pure Googlebot suicide, because it feared following a million identical URLs with session ids in them. It’s now trying much harder to crawl big dynamic sites (as are all the other engines), so that’s probably no longer as true, if it ever really was, but I still think I’d be scared to use “id” in a query string if I wanted the site well indexed. However, since weblog inurl:id seems to return a fair number of things, it probably doesn’t matter in the least.

There’s no guaranteed way of identifying a dynamic page, but most people don’t do much to hide it, either. Whether it’s rewriting requests that look like a URL to a script, or having /foo/bar/whatever/stuff/things actually be calling the script bar, in the directory foo, with the path info /whatever/stuff/things, there’s no real need to have your URL show that it’s acting as a query, and you can stop PHP from adding its X-Powered-By header (at least, I think you can), and add in your own Last-Modified and ETag headers that will look just like the ones Apache would have generated.

I think people are just worried that they will have to change their URLs (which is generally a bad thing, at least short-term), and that the changed URL might be something that they sort of remember hearing once wasn’t so good in search results. Nope, and it probably wouldn’t be a problem anyway.

 
 
Comment by Mike Wills #
2004-09-11 06:07:56

My only problem is that my dynamic pages don’t display on the first try. See this post. When I hit refresh, then it displays…

 
Comment by dj blurb #
2004-09-11 09:57:22

I’m reading everything I can on moving to dynamic publishing. Actually, the process of moving to dynamic pages was painless. If you do everything that the instructions say, you can be dynamic in about 15 minutes. The real issue, and one that will bite a lot of people, is the usage of plug-ins. OUCH.

Thanks for the discussion and the comment on blurbomat.

I am going to have to do some serious code to de-cruft the URLs but first, I must find a fix for using Textile and Supplemental Categories.

Comment by Phil Ringnalda #
2004-09-11 10:15:29

Textile? Arvind has you covered. (I’m trying not to think about why it was included with the betas and not the shipping product, but as long as it seems to work…).

 
 
Comment by Robin Millette #
2004-09-11 17:24:05

Is it really wise to return content _and_ a 404 response? Are you sure google or any sane search engine will pay attention to a “Document not found” page?
Other than that, it’s all good :)

Comment by Phil Ringnalda #
2004-09-11 17:56:49

That depends: are you talking about the case where it’s done carefully, and then beta tested widely, pounded on until all the bugs are squashed, and only then released, or the case where it’s SHIPPING_DATE rather than $shipping_date, and everything else is determined solely based on when the thing has been predetermined to ship? In the first case, someone will fire up their Live HTTP Headers extension, notice that header("HTTP/1.1 200 OK"); is not overriding the 404, that it should have a header("Status: 200"); to completely hide from the outside world that there was any ErrorHandler involved, and that’ll get fixed before it ships. In the second? Sigh. Off to file a bug…

 
 
Trackback by Neil's World #
2004-09-11 00:31:57

Going dynamic with MT

If you’re debating over whether to make some of your Movable Type pages dynamic, have a read of this guide by Phil Ringnalda.

 
Trackback by Breaking Windows 2.0 #
2004-09-11 22:17:18

MT’s completely search-friendly dynamic URLs

The first few times I saw people worrying that the new dynamic PHP-based publishing in Movable Type 3.1x would mess up their permalinks, and make Google stop loving them, I didn’t worry too much about it… Source: phil ringnalda dot…

 
Trackback by ***Dave Does the Blog #
2004-09-13 13:33:25

More MT 3.1X stuff

Continuing to pre-plan on the MT311 conversion. Neil Turner has Nine steps to a quicker MT3.1x installation — though by…

 
Trackback by ***Dave Does the Blog #
2004-09-29 08:53:19

Dynamic publishing in MT 3.11

One of my major reasons for taking the 3.11 plunge is the prospect of dynamic publishing — that a blog…

 
Trackback by Movalog #
2005-06-05 04:20:35

Dynamic Publishing – Pros and Cons

Discussion cross posted on Movalog and Learning Movable Type One of the key features that Six Apart promotes about Movable Type is MT’s ability to publish dynamically. What is dynamic publishing? And what are the benefits (and downsides) to dynamic…

 
