Joel’s RSS problem
Most of the RSS subscribers are whacking me every hour, which is actually costing me cash money in excess bandwidth charges. How can I set it up so they only visit once a day? Is this an RSS option? I rarely post more than once or twice a day.
We need to solve Joel’s RSS problem, before it becomes our problem:
Maybe I should change the RSS feed to just include headlines with links.
I subscribe to some feeds that only do title/link, or title/a few words/link, but only feeds from sites that I don’t much like. I only want to read a few articles on news sites like InfoWorld or the Reg, so headlines work just fine. For things like Joel’s site, where I want to read every post, having just titles in the feed makes it nothing more than a notification service, and I already have one of those.
We already have two ways to tell an aggregator when and how often to visit: RSS 0.91+’s skipHours and skipDays, and RSS 1.0’s sy:updatePeriod, sy:updateFrequency and sy:updateBase. Of course, neither one is very satisfactory, especially for weblogs.
If I understand skipHours and skipDays, if you never publish anything on Saturday or Sunday, or between 10pm and 6 am, then you first translate those hours to GMT, then add the following to the <channel>:
<skipHours>
<hour>22</hour>
<hour>23</hour>
<hour>24</hour>
<hour>1</hour>
<hour>2</hour>
<hour>3</hour>
<hour>4</hour>
<hour>5</hour>
</skipHours>
<skipDays>
<day>Saturday</day>
<day>Sunday</day>
</skipDays>
That ought to work, if you live in GMT (which isn’t actually called GMT anymore, is it?), assuming that 24, 1 rather than 0, 1 is right, but if you live somewhere else you not only have to translate hours, you also have to make a guess about how the days will be interpreted (midnight to midnight, GMT?). Then you have to persuade aggregators to honor skipHours and skipDays, since as the RSS 0.91 spec said, “Most aggregators seem to ignore this element”.
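To make the ambiguity concrete, here is a minimal sketch of how an aggregator might honor these elements. The function name `should_skip` is hypothetical, and the sketch bakes in two guesses that the spec does not settle: hour 24 is treated as midnight (so it folds onto 0), and days run midnight to midnight GMT.

```python
from datetime import datetime, timezone

# Fixed English day names, so the check does not depend on the machine's locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def should_skip(now_utc, skip_hours, skip_days):
    """Return True if a polite aggregator should not poll at `now_utc`.

    Assumption: hour 24 means midnight, so both the 1-24 and 0-23
    readings of <skipHours> end up covering the same hours.
    Assumption: <skipDays> days run midnight to midnight GMT.
    """
    hours = {h % 24 for h in skip_hours}
    day_name = DAY_NAMES[now_utc.weekday()]
    return now_utc.hour in hours or day_name in skip_days


# The values from the <skipHours> and <skipDays> example above.
SKIP_HOURS = [22, 23, 24, 1, 2, 3, 4, 5]
SKIP_DAYS = ["Saturday", "Sunday"]
```

An aggregator would call `should_skip(datetime.now(timezone.utc), SKIP_HOURS, SKIP_DAYS)` before each scheduled poll and simply wait for the next slot when it returns True.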
RSS 1.0’s syndication module has a slightly different model: you have an update period (“hourly|daily|weekly|monthly|yearly”), an update frequency (how often you update in that period), and an update base (the date/time to start calculating from). So,
<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>2</sy:updateFrequency>
<sy:updateBase>2000-01-01T11:00+00:00</sy:updateBase>
would say that you update twice a day, at 11 am and 11 pm GMT. That would work fine, if you publish on a precise and even schedule. Probably a good fit for some news sites, but less useful for personal sites.
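As a rough sketch of how an aggregator could turn those three values into a schedule: divide the period by the frequency to get an interval, then step forward from the base. The names `PERIODS` and `next_update` are hypothetical, and the lengths used for "monthly" and "yearly" are approximations chosen for illustration.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: "monthly" and "yearly" are approximated as 30 and 365 days.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}


def next_update(now, period, frequency, base):
    """Return the first scheduled update at or after `now`."""
    interval = PERIODS[period] / frequency
    if now <= base:
        return base
    # Number of whole intervals needed to reach or pass `now`.
    steps = math.ceil((now - base) / interval)
    return base + steps * interval


# The values from the example above: daily, 2, base at 11:00 GMT.
base = datetime.fromisoformat("2000-01-01T11:00+00:00")
now = datetime(2002, 10, 1, 12, 0, tzinfo=timezone.utc)
print(next_update(now, "daily", 2, base))
```

With the example values, a check made just after the 11 am update is told to come back at 11 pm, which is exactly the rigid schedule that suits a news site better than a weblog.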
What do we really want, instead? Smart aggregators. What a smart aggregator would do isn’t quite clear to me, though.
For starters, since for whatever reason the community standard is to only check once an hour, an aggregator shouldn’t check more often than that unless it has been told that it should, for example by
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>2</sy:updateFrequency>
However, I’d rather have aggregators check every 30 minutes from 7 to 11 pm Pacific Time (when I’m likely to be home and updating), only nine times, than once per hour every hour, whether I’m awake, asleep, or at work.
While I was trying to make some sense of that, Dave reminded me of RSS 2.0’s ttl element. I hadn’t thought of it for anything but syndicating over Gnutella, which is what the spec mentions, but suppose instead that it was used to create a flexible aggregator schedule. For someone who updates on a regular schedule, it could be a simple matter of changing the ttl based on the time of day. For someone on a random schedule like me, it might be a more complex combination of whether my WinAmp playlist is updating (if it’s not moving, my computer’s off, so you might as well not hurry back), and how long it’s been since my last update (if I post once, I’m probably in a posting mood, so you better check back a bit more often). Joel’s could be set with a bit of fudging, to roughly an hour before the earliest he might post the next day: post at 2pm, never up before 6am, set the ttl to 900 for the 2 o’clock hour, then lop 60 minutes off every hour until it gets down to 60, and leave it there until the next update. I don’t have a clue how his feed is generated or could be regenerated, but for me it’s just a bit more mod_rewrite, and a bit of PHP in what will still look to you like index.xml (or maybe a bit of SetHandler: I’m not sure which would be easier, or better).
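The fudging described for Joel's feed comes down to one line of arithmetic. The following is only a sketch: the names `ttl_minutes` and `ttl_element` are hypothetical, and it assumes the feed generator knows how many whole hours have passed since the last post.

```python
def ttl_minutes(hours_since_post, start_ttl=900, floor=60, step=60):
    """ttl in minutes for a feed served `hours_since_post` hours after a post.

    Starts at 900 minutes in the hour of the post, drops 60 minutes
    each hour, and never goes below 60.
    """
    return max(floor, start_ttl - step * hours_since_post)


def ttl_element(hours_since_post):
    """Render the value as an RSS 2.0 <ttl> element."""
    return "<ttl>%d</ttl>" % ttl_minutes(hours_since_post)


# Post at 2pm: 900 in the 2 o'clock hour, 840 at 3pm, down to 60 by 4am.
for hours in (0, 1, 14, 20):
    print(hours, ttl_element(hours))
```

The schedule reaches the 60-minute floor 14 hours after the post, at 4am in the example, so aggregators are back to hourly checks before the earliest likely update at 6am.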
Getting it implemented will, I suspect, require that Dave lead the way with Radio: it’s not exactly something that feed consumers are clamoring for (until people start pulling feeds because they can’t afford the hits, anyway), so it’ll need more pressure than just a few dozen geeks rolling their own custom ttls to get anyone excited about saving their users a bit of bandwidth checking feeds. But maybe with Radio producing and consuming ttl, it’ll catch on before people get sick of seeing hundreds of hits from people who are asleep, coming while they’re asleep too, and start dropping their RSS feeds to save bandwidth.
Why conditional-GET won’t solve everything for all time, from a message from Jeremy Bowers on the radio-dev group:
We can solve the problem now or later, but I don’t think it hurts to talk about what the solution might look like now.
HTTP already solves both of these problems, based on years of dealing with it.
Conditional GET is one half of the solution; expiration is the other half, and one of HTTP’s built-in concepts.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.21
Client side HTTP stacks already know how to handle much if not all of this. Server side, you’ll need a setting to set the Expires header right.
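For anyone unsure what the server side of this looks like, here is a minimal sketch of the two halves together: validators for conditional GET, plus an expiration time. The function name `respond`, the one-hour default lifetime, and the hash-based ETag are all assumptions for illustration, not how any particular server or library does it. Timestamps are assumed to be UTC with whole seconds.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, body, last_modified, now,
            max_age=timedelta(hours=1)):
    """Return (status, headers, body) for a feed request."""
    # Assumption: the ETag is just a hash of the feed bytes.
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        # Expiration: tells clients not to ask again before this time.
        "Expires": format_datetime(now + max_age, usegmt=True),
    }
    not_modified = False
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_none_match is not None:
        not_modified = if_none_match == etag
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            not_modified = since >= last_modified
        except (TypeError, ValueError):
            # Unparseable date: ignore it and send the full feed.
            not_modified = False
    if not_modified:
        return 304, headers, b""
    return 200, headers, body
```

A client that sends back the `ETag` or `Last-Modified` value it was given gets an empty 304 instead of the whole feed, and a client that respects `Expires` does not ask at all until the time has passed.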
I’m a bit surprised by the confusion re: conditional GET. To make it easier, people might want to look at cgi_buffer – this is a set of libraries (for Perl, Python and PHP) that automatically generate and validate ETags, as well as handle HTTP/1.0 persistent connections and do gzip compression with content encoding. It’s activated with a one-line include into your script, so it’s easy to use with just about anything.
(It’s implemented correctly and according to the specs; I’ve been active in the Web caching community for more than five years and was until recently the resident HTTP expert at Akamai Technologies)
Hope this helps,
I think Joel’s problems, a bit old by now, would be somewhat relieved by supporting RFC3229 with the ”feed” instance-manipulation method. Check out:
For a description of ”feed” IM method:
http://bobwyman.pubsub.com/main/2004/09/using_rfc3229_w.html
For a list of servers and clients that support RFC3229 delta encoding with ”feed”:
http://bobwyman.pubsub.com/main/2004/09/implementations.html
bob wyman
RSS bandwidth costs
Phil Ringnalda weighs in on Joel’s RSS problem. It is worth noting that there is an existing solution that significantly reduces bandwidth without affecting latency or content. It is called HTTP HEAD requests. Amphetadesk already supports it. Check
Sunday afternoon Ramble
I was not surprised to get negative comments in my posting, Everything to do with her being a woman. I was surprised, and gratified, when a commenter going by the name of ”Lord Trickster” seemed to defend what I was
If-Modified-Since: whenever
The Digital Magpie Crows
Joel contemplates autogenerating a different rss feed for less polite aggregators. Within hours, Dave adds the code to Radio. Phil promptly takes credit for not inventing the idea. In the process, a new and somewhat more apt description for Phil’s weblog
If-Modified-Since and dynamic content
Brent added support for Etags and If-Modified-Since headers to the latest NetNewsWire beta. It’s very cool. He added it after hints and pressure from among others Joel, Phil, Sam and Mark; I’ll refrain from pointing out that I suggested it to him sever…
RSS bandwidth and aggregators
Joel has a problem with RSS. Phil Ringnalda writes about the problem with aggregators that download RSS feeds hourly (or
Feed pounding and bandwidth issues
This worries the RSS geek in me a little bit.
Radio Killed The Internet Star.
Over the last few months there have been a few articles (one here another here) and concerns over RSS and the effec…