Joel’s RSS problem

Joel has an RSS problem:

Most of the RSS subscribers are whacking me every hour, which is actually costing me cash money in excess bandwidth charges. How can I set it up so they only visit once a day? Is this an RSS option? I rarely post more than once or twice a day.

We need to solve Joel’s RSS problem, before it becomes our problem:

Maybe I should change the RSS feed to just include headlines with links.

I subscribe to some feeds that only do title/link, or title/a few words/link, but only feeds from sites that I don’t much like. I only want to read a few articles on news sites like InfoWorld or the Reg, so headlines work just fine. For things like Joel’s site, where I want to read every post, having just titles in the feed makes it nothing more than a notification service, and I already have one of those.

We already have two ways to tell an aggregator when and how often to visit: RSS 0.91+’s skipHours and skipDays, and RSS 1.0’s sy:updatePeriod, sy:updateFrequency and sy:updateBase. Of course, neither one is very satisfactory, especially for weblogs.

If I understand skipHours and skipDays correctly: if you never publish anything on Saturday or Sunday, or between 10 pm and 6 am, then you first translate those hours to GMT, and add the following to the <channel>:
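Presumably something like this — my guess at the markup, with the hours already shifted to GMT and using the 1–24 numbering the spec seems to want:

```xml
<skipDays>
  <day>Saturday</day>
  <day>Sunday</day>
</skipDays>
<skipHours>
  <hour>22</hour>
  <hour>23</hour>
  <hour>24</hour>
  <hour>1</hour>
  <hour>2</hour>
  <hour>3</hour>
  <hour>4</hour>
  <hour>5</hour>
</skipHours>
```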


That ought to work, if you live in GMT (which isn’t actually called GMT anymore, is it?), assuming that 24, 1 rather than 0, 1 is right, but if you live somewhere else you not only have to translate hours, you also have to make a guess about how the days will be interpreted (midnight to midnight, GMT?). Then you have to persuade aggregators to honor skipHours and skipDays, since as the RSS 0.91 spec said, “Most aggregators seem to ignore this element”.

RSS 1.0’s syndication module has a slightly different model: you have an update period (“hourly|daily|weekly|monthly|yearly”), an update frequency (how often you update in that period), and an update base (the date/time to start calculating from). So,
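something along these lines (the base date itself is arbitrary; only the 11:00 GMT time of day matters):

```xml
<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>2</sy:updateFrequency>
<sy:updateBase>2002-10-20T11:00+00:00</sy:updateBase>
```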


would say that you update twice a day, at 11 am and 11 pm GMT. That would work fine, if you publish on a precise and even schedule. Probably a good fit for some news sites, but less useful for personal sites.

What do we really want, instead? Smart aggregators. What a smart aggregator would do isn’t quite clear to me, though.

For starters, since for whatever reason the community standard is to only check once an hour, an aggregator shouldn’t check more often than that unless it has been told that it should (by a <sy:updatePeriod>hourly</sy:updatePeriod>, for example). However, I’d rather have aggregators check every 30 minutes from 7 to 11 pm Pacific Time (when I’m likely to be home and updating), only nine times, than once per hour every hour, whether I’m awake, asleep, or at work.

While I was trying to make some sense of that, Dave reminded me of RSS 2.0’s ttl element. I hadn’t thought of it for anything but syndicating over Gnutella, which is what the spec mentions, but suppose instead that it was used to create a flexible aggregator schedule. For someone who updates on a regular schedule, it could be a simple matter of changing the ttl based on the time of day.

For someone on a random schedule like me, it might be a more complex combination of whether my WinAmp playlist is updating (if it’s not moving, my computer’s off, so you might as well not hurry back), and how long it’s been since my last update (if I post once, I’m probably in a posting mood, so you better check back a bit more often). Joel’s could be set, with a bit of fudging, to roughly an hour before the earliest he might post the next day: post at 2pm, never up before 6am, set the ttl to 900 for the 2 o’clock hour, then lop 60 minutes off every hour until it gets down to 60, and leave it there until the next update.

I don’t have a clue how his feed is generated or could be regenerated, but for me it’s just a bit more mod_rewrite, and a bit of PHP in what will still look to you like index.xml (or maybe a bit of SetHandler: I’m not sure which would be easier, or better).
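That lop-60-minutes-off scheme is simple enough to sketch — a guess at the logic, not Joel’s actual code, and in Python rather than whatever his server runs:

```python
def suggested_ttl(minutes_since_post, initial=900, floor=60):
    """Decaying ttl: start high right after a post, shave 60 minutes off
    for each full hour that passes, and never drop below one hour."""
    hours_elapsed = minutes_since_post // 60
    return max(floor, initial - 60 * hours_elapsed)

print(suggested_ttl(0))    # fresh 2pm post: 900, so nobody comes back before ~5am
print(suggested_ttl(600))  # ten hours later (midnight): 300
print(suggested_ttl(960))  # sixteen hours on: clamped at the hourly floor, 60
```

The nice property is that every aggregator, whenever it last polled, ends up aiming at roughly the same pre-6am expiry.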

Getting it implemented will, I suspect, require that Dave lead the way with Radio: it’s not exactly something that feed consumers are clamoring for (until people start pulling feeds because they can’t afford the hits, anyway), so it’ll need more pressure than just a few dozen geeks rolling their own custom ttls to get anyone excited about saving their users a bit of bandwidth checking feeds. But maybe with Radio producing and consuming ttl, it’ll catch on before people get sick of seeing hundreds of hits from people who are asleep, coming while they’re asleep too, and start dropping their RSS feeds to save bandwidth.


Comment by Phil Ringnalda #
2002-10-21 07:40:14

Why conditional-GET won’t solve everything for all time, from a message from Jeremy Bowers on the radio-dev group:

The problem, as Dave pointed out on the thread, is that this only creates a linear savings over time. According to my copy of RU, Scripting News has 5000+ subscriptions, and AFAIK that’s *just* RU subscribers on the main Userland RCS. It could be several times higher, I don’t know.

5000 * 500 * 24 = 60 (decimal) MB a day, 420 MB a week, about 1.5 (real) GB a month. The whole (long-term) goal of this is to scale the system in a rapidly-approaching era where 5000 subscribers is at *best* a “medium” sized site, and we’re already hitting monthly limits on many ISP accounts with *just* the conditional get.

Imagine a world where a college student’s site such as Hack The Planet (before he graduated and got a job) gets 100,000 subscriptions. He updates too often for exponential backdown to be of any serious use (you can eke out a factor of two or so with aggressive settings, but not much more), and 100,000 * 500 * 24 is a gig a day _just_ for the conditional gets; send out 10K on average every fourth hourly request for the actual content and you’re into 100,000 * 10,000 * 6 = 6GB/day. That’s an expensive site; my ISP became annoyed and charged me extra for much less than that.
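Jeremy’s arithmetic checks out, assuming his figure of roughly 500 bytes per conditional-GET round trip:

```python
# Back-of-the-envelope check of the numbers above.
subs, resp_bytes, polls = 5_000, 500, 24        # Scripting News today
daily = subs * resp_bytes * polls
print(daily)                                    # 60,000,000 bytes: 60 decimal MB a day
print(daily * 7 // 10**6, "MB/week")            # 420 MB/week

big = 100_000 * 500 * 24                        # hypothetical 100,000-subscriber site
print(big / 10**9)                              # 1.2 GB/day of nothing but 304s
full = 100_000 * 10_000 * 6                     # every 4th hourly poll sends ~10K of content
print(full / 10**9)                             # 6.0 GB/day
```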

We can solve the problem now or later, but I don’t think it hurts to talk about what the solution might look like now.

Comment by mb #
2002-10-21 15:18:04

HTTP already solves both of these problems, based on years of dealing with it.

Conditional GET is one half of it; expiration is the other half, and both are built-in HTTP concepts.

Client-side HTTP stacks already know how to handle much if not all of this. Server side, you’ll need a setting to get the Expires header right.
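What mb describes can be sketched in a few lines — a Python illustration, not anyone’s actual server code; the header names are real HTTP, the function and its arguments are made up:

```python
import email.utils
import time

def conditional_response(feed_bytes, feed_mtime, if_modified_since=None, ttl=3600):
    """Validation (If-Modified-Since -> 304) plus expiration (an Expires
    header telling well-behaved clients to stay away for ttl seconds)."""
    last_mod = email.utils.formatdate(feed_mtime, usegmt=True)
    headers = {
        "Last-Modified": last_mod,
        "Expires": email.utils.formatdate(time.time() + ttl, usegmt=True),
    }
    if if_modified_since == last_mod:
        return "304 Not Modified", headers, b""   # skip the body entirely
    return "200 OK", headers, feed_bytes

# First fetch gets the whole feed; a repeat fetch that echoes back the
# Last-Modified date gets a tiny 304 and no body.
status, hdrs, body = conditional_response(b"<rss/>", 1035158400)
status2, _, body2 = conditional_response(b"<rss/>", 1035158400, hdrs["Last-Modified"])
```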

Comment by mnot #
2002-11-11 08:32:46

I’m a bit surprised by the confusion re: conditional GET. To make it easier, people might want to look at cgi_buffer – this is a set of libraries (for Perl, Python and PHP) that automatically generate and validate ETags, as well as handle HTTP/1.0 persistent connections and do gzip compression with content encoding. It’s activated with a one-line include into your script, so it’s easy to use with just about anything.

(It’s implemented correctly and according to the specs; I’ve been active in the Web caching community for more than five years and was until recently the resident HTTP expert at Akamai Technologies)

Hope this helps,

Comment by Bob Wyman #
2004-09-19 14:26:31

I think Joel’s problems, a bit old by now, would be somewhat relieved by supporting RFC 3229 with the “feed” instance-manipulation method. Check out:

For a description of the “feed” IM method:

For a list of servers and clients that support RFC 3229 delta encoding with “feed”:

bob wyman

Trackback by Sam Ruby #
2002-10-20 09:08:33

RSS bandwidth costs

Phil Ringnalda weighs in on Joel’s RSS problem. It is worth noting that there is an existing solution that significantly reduces bandwidth without affecting latency or content. It is called HTTP HEAD requests. Amphetadesk already supports it. Check

Trackback by Burningbird #
2002-10-20 12:13:08

Sunday afternoon Ramble

I was not surprised to get negative comments in my posting, Everything to do with her being a woman. I was surprised, and gratified, when a commenter going by the name of “Lord Trickster” seemed to defend what I was

Trackback by #
2002-10-20 23:08:56

If-Modified-Since: whenever

Trackback by Sam Ruby #
2002-10-21 08:02:02

The Digital Magpie Crows

Joel contemplates autogenerating a different rss feed for less polite aggregators. Within hours, Dave adds the code to Radio. Phil promptly takes credit for not inventing the idea. In the process, a new and somewhat more apt description for Phil’s weblog

Trackback by Ask Bjørn Hansen #
2002-10-23 02:59:49

If-Modified-Since and dynamic content

Brent added support for Etags and If-Modified-Since headers to the latest NetNewsWire beta. It’s very cool. He added it after hints and pressure from among others Joel, Phil, Sam and Mark; I’ll refrain from pointing out that I suggested it to him sever…

Trackback by Rant Central #
2002-10-28 09:14:55

RSS bandwidth and aggregators

Joel has a problem with RSS. Phil Ringnalda writes about the problem with aggregators that download RSS feeds hourly (or

Trackback by Seb's Open Research #
2003-11-25 05:30:13

Feed pounding and bandwidth issues

This worries the RSS geek in me a little bit.

Trackback by Richard Giles blog #
2004-10-13 06:12:27

Radio Killed The Internet Star.

Over the last few months there have been a few articles (one here another here) and concerns over RSS and the effec…
