What happens if I push this button?

Blargh. That didn’t exactly work out.

A few days back, DreamHost sent me the email they are sending so many others, saying that I’m an abusive bastard who needs to quit hogging so much of the CPU on my shared server. My hogging appears to be the result of using the one-click installer they wanted me to use to install WordPress, causing me to quite often use a quarter- to a half-second of CPU to run PHP for every requested page. Do that 16,000-some times a day and I use up three or four thousand CPU seconds, and if you only allow 24,000 CPU seconds per day for lusers, you can’t pack very many on a server.

My first solution, installing WP-Cache 2, seems to have done some good, taking me from the 0.244 CPU seconds per PHP instance I averaged before I knew I was a sinner to an average of 0.138 to 0.192, depending. They use a reporting tool that, well, apparently isn’t exactly suited to the task at hand (I’d call it a piece of crap, except that it’s something Hixie wrote back in 2002, so I assume it’s just intended to do something other than what they need): if you are sinning with PHP running as a CGI (again, something they very strongly recommend), everything gets lumped into one “php5.cgi” category, while the PHP 4 I run as an Apache module for Gregarius shows up separately as “php”. Partway through day 2, I tried to follow their instructions for getting more detailed results by not running PHP as a CGI (despite the fact that their instructions appear to actually run it as a CGI, and that their Apache module is compiled with such silly restrictions that it won’t actually run WordPress): php5.cgi had a 0.138 second average, and php had a 0.192 second average. Then, after a full day of running just their CGI-pretending-not-to-be, php had a 0.191 second average, so boo to PHP 4 and I’m back to PHP 5.

Still, I’m an evildoer: customers should use less than 30-40 CPU minutes per day, and dividing my reported CPU seconds by 60 (because they certainly couldn’t report and mandate in the same unit), in the most recent day I used 52.8. So I started looking for big wins. The most obvious one would be to ban search engine crawlers: between googlebot, msnbot, Yahoo!’s Slurp, and whatever the hell Paul Allen is up to with Everest-Vulcan, I could cut out more than 3000 hits per day, in exchange for losing 56 search engine visitors interested in such things as “ho [sic] do I turn off smartfilter,” but that seems a little extreme. So I started looking at by far my biggest request, /feed/.
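(For the record, the extreme option wouldn’t even take code: a robots.txt along these lines should do it. Googlebot, msnbot, and Slurp are the usual robot names; I’d have to check what Everest-Vulcan answers to before adding it.)

    # robots.txt: the nuclear option I decided against
    User-agent: Googlebot
    Disallow: /

    User-agent: msnbot
    Disallow: /

    User-agent: Slurp
    Disallow: /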

Oddly, despite the presence of a bunch of ETag and Last-Modified handling code in WordPress, I seem to be returning nothing but 200 OK, not a 304 Not Modified in the bunch. Not really a problem, since DreamHost gives me more bandwidth than I can possibly use, and then increases it every time I turn around, but the most obvious way to solve it, serving a static file and letting Apache handle conditional gets, would also cut out a lot of PHP use. That’s when I started foolishly button-pushing: a little WordPress plugin that hooked all the post-changing functions and deleted the static file, some mod_rewrite funky caching rules to serve the static file if it was present or call wp-atom.php if it wasn’t, and a little hacking in wp-atom.php to buffer output and dump it to the static file, and I thought I was all set. Then I checked my aggregator, and saw that I’d just syndicated everything currently in my whole feed to Planet Mozilla, since I was both serving the static file for more requests than I meant to, and also saving various things like comment feeds and individual entry feeds as the static file for the main feed. Oopsie. A bit more traffic, as people in such odd places as DebianLinux.net’s Free Software Planet wonder about the source of this big wad of posts, isn’t exactly going to cut my resource use.
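(For the curious, the general shape of the thing, reconstructed as a sketch rather than my exact rules, with static-feed.xml standing in for whatever I actually called the file. The fatal flaw is right there in the pattern: anything ending in feed/ matched, so comment feeds and entry feeds got served, and saved, as the main feed.)

    RewriteEngine On
    # if the prebuilt feed exists, let Apache serve it (and handle
    # conditional gets); otherwise fall through to PHP to build it
    RewriteCond %{DOCUMENT_ROOT}/static-feed.xml -f
    RewriteRule feed/?$ /static-feed.xml [L]

plus the buffering in wp-atom.php, roughly:

    <?php ob_start(); // very top of wp-atom.php ?>
        ... the existing feed template ...
    <?php
    // very bottom: send the page as usual, but keep a copy
    $feed = ob_get_contents();
    ob_end_flush();
    @file_put_contents(ABSPATH . 'static-feed.xml', $feed); // PHP 5
    ?>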

So, I did the un-WordPressish thing, and put the burden on me: instead of the plugin deleting the file and you having to wait for the regeneration, when I save a post I wait (a fraction of a second) while the feed regenerates, from a separate file that won’t go building other random feeds. For some reason I don’t understand, which maybe has to do with all the redirecting and mod_rewriting in the background, the access log still reports sending a 200, but with “-” for the bytes sent, which I assume means an approximate 304. I hope.
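(In plugin terms, something like this; a sketch, where the hook names are the obvious WordPress ones and build-feed.php, a stripped-down copy of wp-atom.php that only ever builds the main feed, is a name I just made up:)

    <?php
    // regenerate the static feed whenever a post changes, so I pay
    // the fraction of a second at save time instead of you waiting
    function pr_rebuild_static_feed($post_id) {
        ob_start();
        include ABSPATH . 'build-feed.php'; // only builds the main feed
        $feed = ob_get_clean();
        $fp = @fopen(ABSPATH . 'static-feed.xml', 'w');
        if ($fp) {
            fwrite($fp, $feed);
            fclose($fp);
        }
        return $post_id;
    }
    add_action('publish_post', 'pr_rebuild_static_feed');
    add_action('edit_post', 'pr_rebuild_static_feed');
    add_action('delete_post', 'pr_rebuild_static_feed');
    ?>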

While looking closely at accesses, I noticed a couple of other big wins from feed consumers refreshing vastly too quickly: someone in Italy with the oft-reviled UA string Java/1.4.2_08 who was fetching my feed (very rarely updated more than once a day, usually more like once a week or month) every minute, and two users of JetBrains’ Omea Reader (which I probably once knew was where Dmitry Jemerov went when he stopped developing the grand old aggregator Syndirella, but had since forgotten) whose random 1.5 to 4.5 minute refreshes suggest a refresh rate set shorter than the time their program needs to get through their whole set of feeds, so that they were literally refreshing as fast as possible. If you are a Brazilian or Hungarian user of Omea Reader, and you are willing to turn your refresh rate down to something reasonable (an hour pleases me most, for historical reasons, but even 30 minutes is quite tolerable), feel free to contact me, and I’ll stop turning you away with a 503 Service Unavailable.
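(The turning-away is cheap: a sketch of one way to do it, using the Java UA from above; a per-IP version would test %{REMOTE_ADDR} instead, and whether this matches what DreamHost actually honors is another question.)

    RewriteEngine On
    # shunt the once-a-minute fetcher to a three-line PHP file
    RewriteCond %{HTTP_USER_AGENT} ^Java/1\.4\.2_08$
    RewriteRule ^feed/ /503.php [L]

where 503.php is just:

    <?php
    // go away, but do come back -- later
    header('HTTP/1.0 503 Service Unavailable');
    header('Retry-After: 3600');
    ?>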

The strange part about the Java user in Italy is that while he was getting the WordPress-produced feed, with its missing or non-working support for conditional gets, he fetched it once a minute, and kept trying once a minute while I served him 503 errors for a day or so; yet as soon as I let him back in to get the static feed, he switched to a reasonable enough 30 minute refresh schedule. Whether that’s in the aggregator code or an underlying HTTP library, defaulting to a one minute refresh in the absence of any other information strikes me as a really awful idea.

So, what happens when I push this button that says “Publish” — will it notice that the feed contents have changed, and change out the static file, or will I just have broken something else? (Answer: nothing will happen, because I’m incompetent. Once more.)

22 Comments

Comment by Matt #
2005-12-03 22:47:35

Good luck?

The speedup you saw from WP-Cache is way below normal and below what I’ve seen on other sites, odd. You should ask them to put you on a faster box so you don’t use as many seconds. ;)

Comment by Phil Ringnalda #
2005-12-03 23:46:40

Part of that may be the result of averages not being a particularly good measure: since the instructions to find out what PHP script is doing what don’t seem to work, I don’t know any more than “0.33 cpu 3462k mem 0 io php”, so I’m just guessing that the 0.03-0.06 second ones are WP-Cache hits, and the others are misses. And 1000+ entries, plus 1000+ entry feeds, plus categories and date archives and whatnots, make it possible for three or four pretty active crawlers to request thousands of pages a day without ever getting a cache hit.

Maybe 3600 seconds is too short a cache expiration? Psychotic msnbot has requested the feed for an entry from August 2002 (that last got new content ten days after it was published) five times already today. Unfortunately, with its random timing even a six hour expiration would have missed two of those, that were just a hair over six hours. Maybe I should just 410 the feed URLs for ancient posts: I did get a comment on a post from 2003 today, but that’s incredibly rare, and mostly feeds on posts more than a year old are just teasing feed-aware crawlers into thinking they might get something new.

I would just take DreamHost’s suggested way out, and move up to a dedicated server, except that it’s ridiculous that I should need a dedicated server for a B- list weblog, and I suspect a large part of this push for lowered CPU usage came from the marketing department, not the server administration department.

Comment by Arve #
2005-12-04 03:06:18

I would just take DreamHost’s suggested way out, and move up to a dedicated server, except that it’s ridiculous that I should need a dedicated server for a B- list weblog, and I suspect a large part of this push for lowered CPU usage came from the marketing department, not the server administration department.

Oh well, I’ve had someone on my server periodically hogging the CPU, producing loads of 7-9, and I assure you, there’s nothing fun about serving your readers “503 Service Unavailable, Server busy” messages when all you do is serve static HTML pages. So no, I don’t think it’s only the marketing department.

Comment by Phil Ringnalda #
2005-12-04 10:52:50

Just… tell me you’re not on squirm.

Not that I hog it in big chunks, just too frequently. And perhaps not even that, now: I got the day ending at 8 this morning down to just 35.59 CPU minutes, nicely within their 30-40, with just a partial day of a static /feed/ and a few hours of running WP-Cache with an 8 hour expiration instead of just 1 hour: now I’ve got nearly 500 files in the cache, and a fighting chance of a hit on bizarre ancient posts when the crawlers decide to take an interest.
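(If memory serves, that’s one number in wp-content/wp-cache-config.php; the variable name is from memory, so check it against your copy:)

    <?php
    // wp-cache-config.php: cached pages now live for eight hours
    $cache_max_time = 28800;   // was 3600
    ?>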

Comment by Arve #
2005-12-06 08:14:13

Seeing as the referer spamstorm is also at an all-time high, I’d suggest upping any measures you have against them. I recently updated my .htaccess file to block common occurrences and spammy domains: it’s blocked over 7500 spammy requests so far in December.

(I won’t publish this list, but you can get it upon request)

Comment by Phil Ringnalda #
2005-12-06 22:57:56

Oh, how I wish I had kept my head down enough over the years (my every post had a (quite clean) list of referrers, with “Referrers” in a nice bold headline, plus various other sorts of drawing a target on my own back) to have only 7500 a week to block.

Looking at just philringnalda.com (which is currently scrubbing off most of the crud), I’ve actively blocked 39,110 referrer spams, plus another 42,538 that more-or-less blocked themselves, by requesting things that are either 410 or 404 (including an impressive 10,109 requests for “/blog/2003/02/fresh_as_a_daisy.ph” – someone’s got entirely too many zombies and too little to do with them, and I suppose I should make Apache’s life easier by knocking those out in the root .htaccess, instead of making it run clear down the filesystem before serving a 404).
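(A one-liner, up in the root .htaccess, should be all it takes; mod_alias answers before Apache goes hunting down the directory tree for a file that isn’t there:)

    Redirect gone /blog/2003/02/fresh_as_a_daisy.ph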

 
Comment by Arve #
2005-12-06 08:16:48

Since I’m still not able to do one-response replies: I’m not on your server.

 
Comment by Arve #
2005-12-04 03:17:50

augh, I obviously can’t respond in one go today…

Psychotic msnbot has requested the feed for an entry from August 2002 (that last got new content ten days after it was published) five times already today.


Maybe I should just 410 the feed URLs for ancient posts

Do real people even subscribe to single-entry feeds? I’ve always been happy with just subscribing to a regular comment feed, if I’m that interested in a particular weblog. Having psychotic bots requesting your feeds a zillion times a day can’t possibly be healthy for the server.

Comment by skippy #
2005-12-04 05:31:16

I subscribe to single-entry feeds for the entries in which I’m interested, so that I can see the follow-ups to that entry without seeing every other comment on every other post.

Comment by Phil Ringnalda #
2005-12-04 10:41:21

The problem then is, “how do I solve the problem that you will subscribe to the feed for this entry, but your aggregator won’t provide any way of reminding you that you don’t need to continue to subscribe, and to fetch it on your regular schedule, after a couple of weeks?”

I’ve always resisted the “close ’em after two weeks” solution to comment spam, partly because I’ve tended to use weblog posts as project pages, so a comment on a two or three year old post is the way to ask me about a problem importing YACCS comments into Movable Type; but if I fixed my WP “Pages” so they had comments? Probably closing comments when an entry was both more than two weeks old and hadn’t had a comment for two weeks, and then returning a quick and CPU-cheap 410 for the feed (which I’m not naive enough to think would actually produce unsubscribing action on the part of most aggregators), might be reasonable.
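(A sketch of the feed half, with a function name I just made up and era-appropriate direct $wpdb queries; it would get called near the top of the per-entry comment feed file, wp-commentsrss2.php if I remember the name right:)

    <?php
    // 410 an entry's comment feed once the entry is more than two
    // weeks old and its newest comment is too
    function pr_maybe_410_comment_feed($post_id) {
        global $wpdb;
        $post_id = (int) $post_id;
        $cutoff = gmdate('Y-m-d H:i:s', time() - 14 * 86400);
        $old_post = $wpdb->get_var(
            "SELECT ID FROM $wpdb->posts
              WHERE ID = $post_id AND post_date_gmt < '$cutoff'");
        $recent_comment = $wpdb->get_var(
            "SELECT comment_ID FROM $wpdb->comments
              WHERE comment_post_ID = $post_id
                AND comment_date_gmt >= '$cutoff' LIMIT 1");
        if ($old_post && !$recent_comment) {
            header('HTTP/1.0 410 Gone');
            exit; // quick and CPU-cheap: no feed gets built
        }
    }
    ?>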

2005-12-04 14:03:38

The problem then is, “how do I solve the problem that you will subscribe to the feed for this entry, but your aggregator won’t provide any way of reminding you that you don’t need to continue to subscribe, and to fetch it on your regular schedule, after a couple of weeks?”

This is exactly why I don’t subscribe to comment feeds ;) Sending a 410 seems pretty reasonable if they’re busting the cache and the entry is more than a few weeks old.

Comment by Phil Ringnalda #
2005-12-04 14:24:47

Except that I’d forgotten the rest of my take on closing comments, that if you really want to restart an old conversation, you should do it in your own post, and Trackback the old one, and people who were interested in the old one should find out about your new post by… being subscribed to the comment feed. Grr.

 
2005-12-03 23:53:48

Hm, I’d try writing a server-side test that calls the WordPress code directly, to try to determine if you’re spending your time there or somewhere else (Apache, FastCGI, PHP, ???). I seem to recall that just parsing the source files is one of the common performance problems with PHP. APC (docs) might help … It might be simpler to just install it, warm it up, and then see what your performance is like.

Tip if you don’t have root:
pear install apc should still work as long as you tell it to install somewhere you have access to. Then you can use an alternative php.ini by setting the PHPRC environment variable to the directory containing it (add SetEnv PHPRC /path/to/dir/ to .htaccess). A bit evil, but it’s the only way I could get work done.
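(Spelled out, with the paths as placeholders:)

    $ pear install apc              # builds apc.so somewhere you can write

    ; /home/username/php/php.ini
    extension=apc.so

    # .htaccess: PHPRC points at the directory holding php.ini
    SetEnv PHPRC /home/username/php/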

Re: Matt: If they put him on a faster machine, they’d probably expect him to use (proportionally) less time so they can cram (proportionally) more users onto it.

Comment by Aristotle Pagaltzis #
2005-12-04 00:35:44

This is at least the third time since your switch to WP that the IDs in your feed changed. *grumble* Doesn’t WP store a unique ID with each entry in the database or what?

Also, things like this are why I prefer an approach that doesn’t use any filthy databases or templating or application servers, it’s just Apache scraping HTML files off a Linux filesystem and stuffing them down the wire, boy that sure does run fast.

PS: It’d be nice to have @cite allowed for <q>.

Comment by Phil Ringnalda #
2005-12-04 01:23:53

Eh, you can’t fool me: you love reading my last ten posts over and over and over again. You love it so much that you didn’t even notice that not only did I muck up the atom:id, I also regressed to Atom 0.3. I was too lazy to copy just the caching code from the file in another WP installation where I was testing (after all, it was one line at the start, and five at the end), so I copied the whole thing, without noticing that it was an Atom 0.3 version with a broken ID. At least, I think that was the problem. If not, there shouldn’t be more than another five changes of ID. Twenty tops.

But, yes, WP stores an unchanging ID, and has rather more functions which will return something which is not the ID than functions which will return the ID. I’d claim that I’ll start only using the ones that do, when that’s what I need, and only use the ones that don’t when I need the non-ID thing that they return, but I was born bad. And I only got worse. However, I think I probably do now allow <q cite="">.

Comment by Phil Ringnalda #
2005-12-04 01:32:13

Heh. That’s a nice “Oh, you don’t use a Mozilla browser? What a shame.” feature: in IE there isn’t even any sign of a quote; in Opera it’s quoted, but I don’t see any way to access the cite; and in Firefox you can right-click the quote, choose Properties, and then, while copying and pasting the URL, ask yourself whether the bug to provide a reasonable API for chrome to open a new tab in the current window has made it far enough to reasonably link that URL.

Comment by Aristotle Pagaltzis #
2005-12-04 13:29:50

Yeah, the very nearly non-existent support for @cite is another one of those esoterica that piss me off. It seems that a lot of old, really old ideas from HTML for metadata not directly related to presentation were never implemented, or at best half-heartedly. (I can’t remember examples offhand, but I know there are at least two or three other cases.) I’d prefer not feeling silly for being a good boy and annotating things as they should be, with no more encouragement than a vague hope that in the future it will start mattering that in the then-past I made all this effort.

Sigh.

Comment by Phil Ringnalda #
2005-12-04 14:21:57

If you’re going to use million dollar markup, in HTML or in RDF or XHTML for XML’s sake, probably the most important lesson is “you better plan on using it yourself, because ‘if you mark it up, they will come’ is mostly a bad bet.” Making your own chicken omelet’s the surest way to solve the chicken-and-egg problem.

Personally, though, I sort of enjoy markupsturbation, so I quite often do it even when I don’t intend to ever use it.

Comment by Aristotle Pagaltzis #
2005-12-04 20:46:16

Oh, sure; I’d just prefer if others would benefit from my efforts as well. Like Mark in the other entry he talks about in that one, I’ve told myself I’ll write some Javascript at some point to fix the problem that browsers don’t yet do anything interesting with ins and del and their @datetime attributes – ‘cause I use those too, and so far the only result is ugly rendering of entries where I’ve made additions.

Comment by Mark #
2005-12-05 09:45:20

Top 10 forgotten HTML features that could have made the web better, but fell victim to inconsistent or nonexistent implementation:

  1. blockquote/@cite and q/@cite for referencing quotation sources
  2. a/@rel and a/@rev (now used in microformats)
  3. form/@accept-charset for specifying character encoding
  4. img/@longdesc for accessible text-only alternatives of complex images like charts and graphs
  5. input/@readonly and input/@disabled
  6. ins and del, each with a @datetime attribute
  7. ol/@start (now deprecated)
  8. fieldset and legend for logically breaking up long forms
  9. table summary="", which could have been an officially unofficial way to declare that a table was used exclusively for layout (like img alt="" for spacer images)
  10. object
 
Comment by Sander #
2005-12-04 15:43:27

you can right-click the quote, choose Properties

Oooh! Now how come I wasn’t aware of that feature?! (Ah, for the days when I had the time to spend hours each day reading all interesting incoming bugs, and their dependencies.)
Also: Heh. Checking my own weblog to see this in action, I had to resort to running SQL to actually refind those posts that contained quotes with cite. *wry*

*needs to go write a userContent q[cite]:after, blockquote[cite]:after rule*
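(Untested, and the bracket styling is just one choice, but presumably something like:)

    q[cite]:after, blockquote[cite]:after {
        content: " [" attr(cite) "]";
        font-size: smaller;
    }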

Comment by Phil Ringnalda #
2005-12-04 13:14:59

Apparently the switch to reasonable behavior by my Italian Java-using subscriber was just a momentary fluke, since he’s back to once-a-minute, non-conditional-get requests. SecFilterSelective "REMOTE_ADDR" "213.92.16.225" it is, then.
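(With its surrounding boilerplate, and assuming DreamHost lets the default-action line through in .htaccess, that’s roughly:)

    SecFilterEngine On
    SecFilterDefaultAction "deny,log,status:503"
    SecFilterSelective "REMOTE_ADDR" "213.92.16.225"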
