Joel’s RSS problem

Joel has an RSS problem:

Most of the RSS subscribers are whacking me every hour, which is actually costing me cash money in excess bandwidth charges. How can I set it up so they only visit once a day? Is this an RSS option? I rarely post more than once or twice a day.

We need to solve Joel’s RSS problem, before it becomes our problem:

Maybe I should change the RSS feed to just include headlines with links.

I subscribe to some feeds that only do title/link, or title/a few words/link, but only feeds from sites that I don’t much like. I only want to read a few articles on news sites like InfoWorld or the Reg, so headlines work just fine. For things like Joel’s site, where I want to read every post, having just titles in the feed makes it nothing more than a notification service, and I already have one of those.

We already have two ways to tell an aggregator when and how often to visit: RSS 0.91+’s skipHours and skipDays, and RSS 1.0’s sy:updatePeriod, sy:updateFrequency and sy:updateBase. Of course, neither one is very satisfactory, especially for weblogs.

If I understand skipHours and skipDays, if you never publish anything on Saturday or Sunday, or between 10pm and 6 am, then you first translate those hours to GMT, then add the following to the <channel>:


<skipHours>
<hour>22</hour>
<hour>23</hour>
<hour>24</hour>
<hour>1</hour>
<hour>2</hour>
<hour>3</hour>
<hour>4</hour>
<hour>5</hour>
</skipHours>
<skipDays>
<day>Saturday</day>
<day>Sunday</day>
</skipDays>

That ought to work, if you live in GMT (which isn’t actually called GMT anymore, is it?), assuming that 24, 1 rather than 0, 1 is right, but if you live somewhere else you not only have to translate hours, you also have to make a guess about how the days will be interpreted (midnight to midnight, GMT?). Then you have to persuade aggregators to honor skipHours and skipDays, since as the RSS 0.91 spec said, “Most aggregators seem to ignore this element”.
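The hour translation can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the offset is a fixed whole number of hours (no daylight saving), and it numbers hours 0 to 23, which is itself an assumption given the 24-versus-0 doubt above.

```python
def skip_hours_gmt(quiet_start, quiet_end, utc_offset):
    """Return the GMT hours (0-23) that cover a local quiet period.

    The period runs from quiet_start up to, but not including, quiet_end
    (local hours) and may wrap past midnight. utc_offset is local time
    minus GMT in whole hours, e.g. -8 for Pacific Standard Time.
    """
    length = (quiet_end - quiet_start) % 24
    local_hours = [(quiet_start + i) % 24 for i in range(length)]
    return sorted((hour - utc_offset) % 24 for hour in local_hours)


def skip_hours_xml(hours):
    """Render the hours as a <skipHours> element for the <channel>."""
    lines = ["<skipHours>"]
    lines += [f"<hour>{hour}</hour>" for hour in hours]
    lines.append("</skipHours>")
    return "\n".join(lines)
```

For a quiet period of 10pm to 6am at a -8 offset, skip_hours_gmt(22, 6, -8) gives the GMT hours 6 through 13. How the days line up after the shift is still left to guesswork.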

RSS 1.0’s syndication module has a slightly different model: you have an update period (“hourly|daily|weekly|monthly|yearly”), an update frequency (how often you update in that period), and an update base (the date/time to start calculating from). So,


<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>2</sy:updateFrequency>
<sy:updateBase>2000-01-01T11:00+00:00</sy:updateBase>

would say that you update twice a day, at 11 am and 11 pm GMT. That would work fine, if you publish on a precise and even schedule. Probably a good fit for some news sites, but less useful for personal sites.
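As a rough sketch of how an aggregator might turn those three values into a schedule (the function name is hypothetical, and monthly and yearly are left out because they are not fixed-length intervals):

```python
from datetime import datetime, timedelta, timezone

# Only the fixed-length periods are handled in this sketch.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_update(now, period, frequency, base):
    """Return the first scheduled update strictly after `now`."""
    interval = PERIODS[period] / frequency
    if now < base:
        return base
    elapsed_intervals = (now - base) // interval
    return base + (elapsed_intervals + 1) * interval
```

With the values above (daily, 2, base 2000-01-01T11:00 GMT), asking at noon GMT on 2002-10-20 would give 11 pm that evening as the next time worth checking.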

What do we really want, instead? Smart aggregators. What a smart aggregator would do isn’t quite clear to me, though.

For starters, since for whatever reason the community standard is to only check once an hour, an aggregator shouldn’t check more often than that unless it has been told that it should (by a <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>2</sy:updateFrequency>, for example). However, I’d rather have aggregators check every 30 minutes from 7 to 11 pm Pacific Time (when I’m likely to be home and updating), only nine times, than once per hour every hour, whether I’m awake, asleep, or at work.

While I was trying to make some sense of that, Dave reminded me of RSS 2.0’s ttl element. I hadn’t thought of it for anything but syndicating over Gnutella, which is what the spec mentions, but suppose instead that it was used to create a flexible aggregator schedule. For someone who updates on a regular schedule, it could be a simple matter of changing the ttl based on the time of day. For someone on a random schedule like me, it might be a more complex combination of whether my WinAmp playlist is updating (if it’s not moving, my computer’s off, so you might as well not hurry back), and how long it’s been since my last update (if I post once, I’m probably in a posting mood, so you better check back a bit more often). Joel’s could be set with a bit of fudging, to roughly an hour before the earliest he might post the next day: post at 2pm, never up before 6am, set the ttl to 900 for the 2 o’clock hour, then lop 60 minutes off every hour until it gets down to 60, and leave it there until the next update. I don’t have a clue how his feed is generated or could be regenerated, but for me it’s just a bit more mod_rewrite, and a bit of PHP in what will still look to you like index.xml (or maybe a bit of SetHandler: I’m not sure which would be easier, or better).
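The countdown for a feed like Joel’s could be sketched like this. It is written in Python rather than the PHP mentioned above, and the function and parameter names are assumptions made for illustration.

```python
def ttl_minutes(hours_since_post, initial=900, step=60, floor=60):
    """Start at `initial` minutes in the hour of the post, drop `step`
    minutes for each hour that passes, and never go below `floor`."""
    return max(floor, initial - step * hours_since_post)


def ttl_element(hours_since_post):
    """Render the value as an RSS 2.0 <ttl> element."""
    return f"<ttl>{ttl_minutes(hours_since_post)}</ttl>"
```

A post at 2pm would be served with a ttl of 900 during the 2 o’clock hour, 840 during the 3 o’clock hour, and so on down to 60, where it stays until the next update resets the count.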

Getting it implemented will, I suspect, require that Dave lead the way with Radio: it’s not exactly something that feed consumers are clamoring for (until people start pulling feeds because they can’t afford the hits, anyway), so it’ll need more pressure than just a few dozen geeks rolling their own custom ttls to get anyone excited about saving their users a bit of bandwidth checking feeds. But maybe with Radio producing and consuming ttl, it’ll catch on before people get sick of seeing hundreds of hits from people who are asleep, coming while they’re asleep too, and start dropping their RSS feeds to save bandwidth.

62 Comments

Comment by Kafkaesquí #
2002-10-20 00:35:28

As a slightly lower tech, producer’s side solution for the here and now*, is there an option or opportunity to customize one’s RSS feed to provide full articles for just the first, say, 1-2 entries, and have the rest included only as title/link? I may be missing something here (not the first time), but it seems for regularly visited sites, one is normally only reading the most recent few entries. Of course this wouldn’t fully alleviate the issue, but as they used to say: every bit counts, but every bit removed counts more.

* Asked from the perspective of someone who hasn’t had or received an RSS feed in a long time.

 
Comment by Phil Ringnalda #
2002-10-20 01:04:51

In Movable Type? Sure: MTEntries lastn=”2”, include content:encoded, then MTEntries lastn=”13” offset=”2”, only include description. In CityDesk? Dunno: I’m falling down on my wareblogger job, but I’ve never really looked at it.

Maybe not ideal for someone just adding the feed, but then when I add a feed, I mostly do it from the web page, where I’ve just seen the most recent several entries. That’s why I’m adding it. It sounds to me like it has potential for people who are really being pinched by RSS bandwidth (I overbought, myself, so I need to transfer about 30 times more, not less).

 
Comment by Simon Willison #
2002-10-20 04:17:46

How about tying aggregators in to a service like weblogs.com or blo.gs ? That way an aggregator could grab the new RSS feed only when new content had been made available.

 
Comment by Ian Davis #
2002-10-20 06:24:31

Aggregators and publishers should take advantage of the existing HTTP methods of detecting whether a feed has changed or not (Last-Modified and ETags: see Mark Nottingham’s caching tutorial). An aggregator should do a HEAD first to see if the feed has changed since the last time it fetched it. If it hasn’t, then it doesn’t need to perform the GET, saving bandwidth. Publishers need to make sure their systems produce the correct HTTP headers, which is not really all that hard to do, but most dynamic systems don’t bother. This is where Movable Type wins big: because it publishes all the feeds as static files, the web server can generate all the necessary headers automatically. It makes the site more efficient when Google comes for a sniff every couple of days, too.

As with the recent redirect debate, other people have gone through these hoops before and their experiences were baked into HTTP.

 
Comment by Morbus Iff #
2002-10-20 07:43:54

I’d be interested to see which User-Agents are doing him in. Some, like Radio, hit a site every hour it’s open, regardless of if there’s new content or not. Phil – can you get those stats from him?

 
Comment by Phil Ringnalda #
2002-10-20 08:52:00

Okay, I asked.

 
Comment by Phil Ringnalda #
2002-10-20 09:06:43

I’d love to be able to handle it completely in HTTP, but the first two steps look pretty damn big to me: first, convince every single author of RSS consuming apps to do it (okay, just the big ones would be a huge help), and second, convince authors of publishing programs to be more careful of what they save: I don’t think Movable Type wins quite as big as you’re thinking it does, since adding a comment rebuilds all index templates (so that the comment count in the main index page will be updated), and RSS is just an index template. Last time my RSS changed was last night, but the mod date for my RSS file is currently 8:52 this morning, and once I hit Post it will be 9:06.

 
Comment by Már Örlygsson #
2002-10-20 09:12:30

Re Ian Davis’ excellent KISS suggestion:

Instead of HEAD method calls, use conditional GET requests with If-Modified-Since: and/or If-None-Match: headers. These save even more bandwidth than HEAD requests.
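A minimal sketch of a conditional GET from the aggregator side, using Python’s standard urllib. The function names are made up for illustration, and the caller is assumed to have stored the ETag and Last-Modified values from its previous fetch.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers from whatever was saved last time."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on a 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError.
        if error.code == 304:
            return None, etag, last_modified
        raise
```

When nothing has changed, the server answers with headers only, so the cost is one round trip instead of a HEAD followed by a GET.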

 
Comment by Phil Ringnalda #
2002-10-20 09:17:38

If we could work out the details, weblogs.com/blo.gs would be a great check on ttl: I’d be much more comfortable letting a feed set a 900 minute ttl if I know that my aggregator will ignore the ttl if it spots a ping to weblogs.com in the meantime. How to get sufficiently fine-grained notification is another question, though: if we’re talking about letting aggregators go below the magic one-hour mark, then they would need to hit the notification service more often as well.
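The rule of honoring the ttl unless a ping has shown up could be sketched as below. All names are hypothetical, times are seconds since the epoch, and how the aggregator learns of the ping is left open.

```python
def should_fetch(now, last_fetch, ttl_minutes, last_ping=None):
    """Decide whether to fetch a feed again.

    A ping seen since the last fetch overrides the ttl; otherwise the
    feed is left alone until the ttl has run out.
    """
    if last_ping is not None and last_ping > last_fetch:
        return True
    return now - last_fetch >= ttl_minutes * 60
```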

 
Comment by Jason #
2002-10-20 09:27:20

Why is it harder to ask the authors of RSS-consuming apps to do it in HTTP than to ask them to support something (either the current RSS elements or new ones) that they don’t now? Or, put differently, why create new ways to do something when the old, reliable ways aren’t used at all? It seems that some of the authors of the current RSS-consuming apps are hitting the same obstacles that early authors of HTML-consuming apps (e.g., browsers) hit, and the appropriate first response would be to see how those authors overcame them (e.g., as Ian said, in HTTP).

(And as for MT redoing the index template every time there’s a new comment, I totally agree; I’ve been pondering a feature request that would set up classes of templates, such as RSS or TrackBack, that would behave slightly differently under different circumstances.)

 
Comment by VK #
2002-10-20 09:33:43

The issue here is really that we’re using a ”document-centric” transaction (”download it all”) to communicate between database-backed systems.

Nobody routinely prepares an RSS feed by hand, right? It’s machines all the way.

Ideally, one machine would ask the other machine ”Anything new for me, old chap?” and the machine would reply with

1> No, it’s all the same
or
2> Yes, here are your five new headlines

If we *insist* on staying in ”document land”, let me propose that each RSS feed incorporates an element which names a URL which will give the time of the last change to the feed, i.e.

time.xml:

Thirty bytes, once an hour, isn’t going to kill anybody.

Still not as good as machines distributing only the changes and keeping each-other up to date, but it’ll do for now, right?

VK.

 
Comment by VK #
2002-10-20 09:36:38

Sorry: it ate my code fragments:

The document feed contains:
(blah rss blah)
(last_changed_file = ”time.xml”)
(/blah rss blah)

The time file contains:
time.xml:
(last_changed_time = ”3929593929393”)

Simple, right? And, if we’re feeling fascist, we can deny RSS feeds to machines which don’t show us that they know the most recent mod time or have not checked it before coming to get our feed.
VK.

 
Comment by Phil Ringnalda #
2002-10-20 09:48:46

Jason: you caught me: it’s a shiny new toy. Spending months becoming a total nuisance to aggregator developers, nagging them to support ETags, HEAD requests, and conditional GETs, all of which I barely know how to do myself in the simplest scripting language known to man, and then doing it all over again with every content producing program developer, sounds boring as hell. Playing with a new idea for a while sounds like fun.

I’ll ”me too” anything you suggest for MT, but I can’t figure out how to suggest it so it will be something that I think Ben will like the sound of: do you add a complicated interface to let users specify the dependencies for each template? Do you add a routine that figures them out? That would work well if it wasn’t for the ”link this template to a file” feature: otherwise, you save a template, MT checks what tags it includes, and sets dependencies on that (”it’s an index template, but it doesn’t have any MTComment* tags, I won’t rebuild it when comments are added”). With ”link to a file,” you would have to check every time you wanted to rebuild, which would probably slow things down to a crawl.

 
Comment by Jeremy Bowers #
2002-10-20 10:06:27

One possibility for ’quick relief’ from the aggregator makers is doing some sort of exponential backdown, if there haven’t been any changes to the feed. I was playing with this with one scanning service I once toyed with.

Scan every hour for, say, three hours. If there haven’t been any changes, don’t scan again for three hours. Then try again in three, and if there aren’t any changes, try again in six, etc. You can make a smoother progression and there’s a certain art in picking the final maximum value, but that can give immediate relief to some feeds (like Joel’s, in fact, which is not a frequent updater), without requiring massive changes in much of anything. And even after some of the other suggested changes are implemented, this provides a good backup plan for the inevitable feeds that don’t have support for the fancier HTTP-based solutions…

… which IMHO in the long run are the correct solution to this.
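A simplified version of that back-off, sketched in Python. It doubles the interval after every unchanged scan instead of following the exact 1, 1, 1, 3, 3, 6 progression described, and the 24-hour ceiling is an arbitrary assumption.

```python
def next_interval(current_hours, changed, base_hours=1, max_hours=24):
    """Return how long to wait before the next scan.

    A change resets the interval to the base; an unchanged feed doubles
    it, up to the ceiling.
    """
    if changed:
        return base_hours
    return min(current_hours * 2, max_hours)
```

A feed that stays quiet would be scanned after 2, 4, 8 and 16 hours, then once a day, and would drop back to hourly as soon as a change is spotted.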

 
Comment by Aaron Swartz #
2002-10-20 10:09:31

Conditional GETs (If-None-Match, If-Modified-Since) are definitely the way to go. And they’re super easy (just need to add one header to your request). I’d be happy to work with developers on adding support.

As for MT rebuilding too often, have you checked to make sure this actually affects the ETags? They should only change when the file actually changes.

There’s also an easy fix for MT: do a cmp between the generated file and the file on disk and don’t touch the file on disk if they’re the same.
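The compare-before-writing idea is not specific to MT. A minimal Python sketch (the function name is made up, and MT itself is written in Perl) that leaves the file, and therefore its modification time, alone when the freshly generated output is identical:

```python
def write_if_changed(path, content):
    """Write `content` (bytes) to `path` only if it differs from what is
    already on disk. Returns True if the file was written."""
    try:
        with open(path, "rb") as existing:
            if existing.read() == content:
                return False  # identical output: leave the mtime untouched
    except FileNotFoundError:
        pass  # first build: nothing to compare against
    with open(path, "wb") as out:
        out.write(content)
    return True
```

Because the web server derives Last-Modified, and typically the ETag, from the file on disk, an untouched file keeps answering conditional requests with a 304.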

The better fix is the one Phil suggests above (grepping out the tags). ”Link to a file” isn’t an issue, since MT only updates the linked templates when you go into the template interface, not every time you rebuild (AFAICT).

 
Comment by Jeremy Bowers #
2002-10-20 10:11:11

(Sorry to double post.) For that matter, it’s a benefit to some of the aggregators, too: If someone is scanning 300 feeds an hour, the scan takes a while to complete. With a system that automatically backs down, the scans will go much faster because on any given hour, many fewer feeds will be scanned, and statistically speaking, a much larger percentage of them will actually have changes. It can also eliminate some of the pain of long-dead comment-feed subscriptions… if you have the system check, say, every week for changes, that’s not a huge load. If after, say, a year (which is only 52 scans with this scheme, rather than the current 8760(!)), there’s still no change, unsubscribe automatically; the feed is 99% likely dead.

 
Comment by Aaron Swartz #
2002-10-20 10:12:51

Argh, the ETags do change (albeit only slightly–check the last two chars). Oh well.

Last-Modified: Sun, 20 Oct 2002 17:09:34 GMT
ETag: ”25570a8-7e51-3db2e34e”

Last-Modified: Sun, 20 Oct 2002 17:11:15 GMT
ETag: ”25570ab-7e51-3db2e3b3”

 
Comment by Joel Spolsky #
2002-10-20 10:23:13

The HTTP HEAD method seems like the right way to go; in fact, it already works for me as a publisher because my site is served by a traditional web server out of static files.

Looking at my logs, it’s pretty clear that Amphetadesk is already doing this and saving about 1/2 a meg a day per subscriber as a result. Radio, NetNewsWire, and ”Mozilla/3.0+(compatible)” (whoever that is) are not.

 
Comment by Joel Spolsky #
2002-10-20 10:35:21

Technically, I suppose, I could autogenerate a different rss.xml based on the user agent so that the less-polite aggregators only get headlines while everyone else gets full content.

 
Comment by Phil Ringnalda #
2002-10-20 10:46:46

If you can make sure that the users know why they’re getting shorted, that sounds like a good way to persuade the developers of less-evolved aggregators that it’s worth their while to support more polite access.

 
Comment by Phil Ringnalda #
2002-10-20 11:06:52

template linked to file: I’m really not sure how big this issue is. Looks to me like every time you rebuild, when you call build() it calls text(), which calls _sync_from_disk(), which checks the file size and mtime. If you haven’t changed the linked file, no problem, but if you have, and MT’s trying to rebuild in a hurry while a TrackBack ping waits, it would have to rebuild the dependencies, which would not only involve grepping tags out of the template text, but also identifying plugin tags, and seeing what they do. I don’t have a good solution for them, unless you require that plugins identify the dependencies they create.

 
Comment by Dave Winer #
2002-10-20 11:24:34

One possibility for ’quick relief’ from the aggregator makers is doing some sort of exponential backdown, if there haven’t been any changes to the feed. I was playing with this with one scanning service I once toyed with.

We did that on the old Weblogs.Com and I swear we do it somewhere in Radio (although my poor mind can’t remember right now). How it worked on Weblogs.Com — if you haven’t updated in X days, we start checking once every 24 hours. Once we spot a change, we move you back up to hourly checking. If you haven’t updated in 90 days, we assume you stopped blogging altogether, and stop checking. But even with those heuristics it became impossible. I’m afraid something like that might be coming for aggregators. If so, it’ll be a much bigger problem because the software is deployed and impossible to take back. At least with Weblogs.Com there was just one instance of the software. Not true anymore. :-(
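That heuristic is small enough to sketch. The 7-day value below stands in for the unspecified ”X days” and is purely a placeholder; the function name is hypothetical.

```python
def check_interval_hours(days_since_update, slow_after_days=7,
                         give_up_after_days=90):
    """Return how often to check a site, in hours, or None to stop.

    slow_after_days is a placeholder for the unspecified "X days".
    """
    if days_since_update >= give_up_after_days:
        return None  # assume the author has stopped blogging
    if days_since_update >= slow_after_days:
        return 24    # quiet site: check once a day
    return 1         # active site: check hourly
```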

And checking for HEADs is a good idea, yes it is, but it only gets you a linear improvement. Much better idea is a new architecture. For that we’re going to need centralized resources, and that costs money — and in 2002, no one likes to spend money on centralized resources.

More food for thought.

 
Comment by Michael Bernstein #
2002-10-20 11:57:59

<blue-sky>

It seems like the problem here is partly one of centralization, in that each weblog is a central point of failure for publishing its own RSS, and vulnerable to Denial-Of-Service attacks, or the Slashdot Effect, or just becoming a victim of its own success.

What we need here to really scale is a peer-to-peer distribution network for RSS feeds.

What if participating aggregators were willing to answer each other’s questions of whether a feed has updated, and created an ad-hoc distributed hierarchy of notification?

Ideally, the feed contents could be passed around too, but you get into trust issues, and feeds would probably have to be cryptographically signed to get around them.

A malicious peer could send out ”hey, he updated again!” notifications in an attempt to create a DOS attack, but that wouldn’t necessarily work more than once for each client, as they would discover that it wasn’t true once they check the actual feed for the update.

</blue-sky>

 
Comment by Dave Winer #
2002-10-20 12:26:23

Michael, that’s what we were working on with Morpheus in the spring. A couple of things, one on each side, caused us to get distracted. I should get back in touch with them and see what’s up with them. I noted they shipped their 2.0.

 
Comment by mikel #
2002-10-20 12:30:52

Dave:

Wondering – why not implement the stop-gap in Radio? Adding the check for HEAD is straightforward. I’m looking at the code right now, it would take 10 minutes (though there are no callback hooks for me to do this personally).

Big easing on publishers and readers, even if only a temporary one, with a tiny effort. Of course, the work on a better architecture would still continue.

 
Comment by DJ #
2002-10-20 12:59:59

I remember Matt Webb posting thoughts on syndication/aggregation induced server load a couple of years ago.

DJ
(p.s. Coffee-break-sized ETag-enabled wget here :-)

 
Comment by Dave Winer #
2002-10-20 13:04:45

Mikel, I did just add the code. If you’d like to review it, here it is. I’m going to let it burn in a bit before releasing it. Please let me know if you see any problems. Thanks!!

 
Comment by michel v #
2002-10-20 13:11:44

It might seem a bit crazy, but how about a Content-md5sum header ?

The problem with weblogs.com was that weblogs’ pages are prone to display randomised content, or various counters’ updates, making checking for actual new content in a modified file quite hard if not impossible.
This problem isn’t there for RSS, unless the feed includes some random content in it; but I doubt anyone would do that, really…

So, the only way an RSS feed’s md5sum is going to change is by either editing existing stories or posting new ones. Re-building the RSS wouldn’t be an issue for Content-md5sum since the content would stay the same.

Then, checking if a document has changed becomes as easy as doing a HEAD request to check if the md5sum changed :)

 
Comment by Dave Winer #
2002-10-20 13:34:39

Michel V, I don’t see why that is better than using HEAD and checking the date against the last time you read it.

I also read Simon Fell’s concise BDG for etags, and have the same question. Why would someone use that over the HEAD method?

 
Comment by michel v #
2002-10-20 13:42:39

Dave, I’m concerned with dynamically generated RSS (say, by PHP or a CGI).

For example, here’s what tidakada.com/b2rss.xml gives for HEAD:
HTTP/1.1 200 OK
Date: Sun, 20 Oct 2002 20:38:26 GMT
Server: Apache/1.3.26 (Unix) [... a bunch of modules...]
Last-Modified: Sat, 19 Oct 2002 23:28:10 GMT
ETag: ”af43b-2d70-3db1ea8a”
Accept-Ranges: bytes
Content-Length: 11632
Content-Type: text/xml

Now here’s what tidakada.com/b2rss.php gives:
HTTP/1.1 200 OK
Date: Sun, 20 Oct 2002 20:40:42 GMT
Server: Apache/1.3.26 (Unix) [...a bunch of modules...]
X-Powered-By: PHP/4.2.3
X-Pingback: http://tidakada.com/xmlrpc.php
Content-Type: text/xml

See the b2rss.php file won’t give a Last-Modified header.
The workaround would be to make it generate a Last-Modified header, I guess…
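Producing that header from the time of the newest post takes very little code. A sketch in Python rather than PHP, with a made-up function name and assuming the post time is available as a timezone-aware datetime:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(last_post_time):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    in_gmt = last_post_time.astimezone(timezone.utc)
    return format_datetime(in_gmt, usegmt=True)
```

For a newest post at 23:28:10 GMT on 19 October 2002, this yields ”Sat, 19 Oct 2002 23:28:10 GMT”, the same form the static file’s header takes above.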

 
Comment by Dave Winer #
2002-10-20 13:51:53

Michel, Manila works the same way. The RSS feeds are dynamic. Doing it the HEAD way causes two dynamic renderings of the RSS feed where there used to be only one.

 
Comment by michel v #
2002-10-20 14:00:29

…Is that good or bad? You just got me confused. How are you going to cope with Manila feeds then, make them generate a Last-Modified header?

 
Comment by Phil Ringnalda #
2002-10-20 14:01:42

I might be way wrong, but isn’t that why you want to use ETags? I can’t find a citation offhand, but I thought that the way ETags work with dynamic scripts is that it gets rendered (once), and then if it’s the same output you get the same ETag, and you send a 304. Still more expensive than a static file, but better than rendering twice.

 
Comment by Jeremiah Rogers #
2002-10-20 14:03:52

I ran into the Last-Modified problem too, here’s the idea I came up with to fix it:

1) check and see if the feed has a last-modified header
2) if it doesn’t, get a hash
3) compare the value you got to a stored value from the last time you hit the page
4) if they’re different, the page has updated

That seems to be the comprehensive and non-bandwidth-intensive solution.
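The four steps can be sketched as follows. The names are hypothetical and the choice of MD5 is an assumption. Note that the hash branch needs the body in hand, so it only applies once the feed has actually been fetched.

```python
import hashlib


def feed_fingerprint(headers, body):
    """Use Last-Modified when the server sends it, else hash the body."""
    last_modified = headers.get("Last-Modified")
    if last_modified:
        return "lm:" + last_modified
    return "md5:" + hashlib.md5(body).hexdigest()


def has_updated(headers, body, stored_fingerprint):
    """Compare against the value stored the last time the feed was hit."""
    return feed_fingerprint(headers, body) != stored_fingerprint
```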

I think skipping hours and days isn’t very useful. Many people, especially if they’re like me, never know when they’re going to update their weblog.

I’m working on a proxy of sorts that checks for website updates. It’s nothing grand, but it would make the job easier.

I hope to support XMLRPC as well as simple file-level access (like visiting http://blogmass.org/updates/url) to check for a timestamp of the last time a page updated.

I don’t think there’s any easy way to guess when someone will update next; it’s a fairly random process. It would be easy to test, though: get a bunch of RSS feeds and see at what times they update.

 
Comment by michel v #
2002-10-20 14:08:58

Phil, do you see any Etag in my b2rss.php’s headers? None :(
Jeremiah, how are you going to get a hash without GETting the file?

 
Comment by Charles Miller #
2002-10-20 14:24:27

I’d like to add my voice (me too!) to there being absolutely no need for an esoteric solution.

HTTP’s standard ”conditional GET” using If-Modified-Since is perfect for the task. It’s a very, very simple process, I could explain it in one paragraph so my mother could understand it.

And it’s literally only a few hundred bytes both ways if the document hasn’t updated, so there’s no real need to compress further than that.

It’s a two-part solution. Placing _all_ the burden for change on the RSS aggregator is unreasonable.

Part One: RSS consumers update their tools to follow the standard (conditional GET), making the users of their product happy that it’s doing the right thing. (and avoiding punitive webmasters who may degrade their feed)

Part Two: Those weblog tools that do not support this mechanism natively will update to follow the standard, since that will make _their_ users happy. Otherwise people with limited bandwidths will just switch to other tools.

 
Comment by michel v #
2002-10-20 14:38:13

Oh well. I just added generation of a Last-Modified header based on the last post’s date in b2, in the meanwhile :)
I don’t know how to generate an ETag. Any resource?

 
Comment by Charles Miller #
2002-10-20 14:58:11

An ETag is just ”whatever is in the contents of the ETag: header”. It’s kinda like a cookie, the contents only mean something when you send it back to the server.

So it could be something high-tech like an md5 sum, or something low-tech like a version number. Implement it however you wish. The only real restriction is that you shouldn’t have it begin with ”W/”, because that implies it’s only good for weak validation.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.11

 
Comment by michel v #
2002-10-20 15:14:01

Yeah, I was just over that RFC2616 section, but I’m still wondering just why the ETag could be used for checking if the content has been updated.
Is it meant to be a unique string that would get re-generated every time the content is updated?

 
Comment by Jeremiah Rogers #
2002-10-20 16:51:41

Charles, does the conditional-if statement work when the site doesn’t have a last-modified header? The issue here, I think, is for scripts that either don’t have one set or have one that doesn’t change.

Michel – the hash would only be retrieved if there was no last-modified header set. This would work in most situations except where there is a script and it has a static last-modified header. In that situation we’d have more work to do.

Perhaps no matter what, every 24 hours we could check the RSS file to make sure it hadn’t updated, and if it HAS but the last-modified header hasn’t changed, we could start checking for hashes.

BTW: none of this is necessary with pub/sub, so that might be an incentive for people like Joel to get on the pub/sub bandwagon.

 
Comment by Phil Ringnalda #
2002-10-20 16:51:51

Ow, my head! There must be a good reason for every single bit of that RFC, but my simple mind has an easier time with this zend.com article which just says ”figure out the last time that anything affecting your output was modified, and use that for Last-Modified and ETag”, so that whichever one a client is using (or going to use on their next request), it’ll work the same way.

 
Comment by Mark Paschal #
2002-10-20 17:33:40

How I read it, you could use md5 hashes for your ETags, and so ETag will be just like a ”Content-md5sum” header. Send ”ETag: (md5 hash)” and if the client sends ”If-None-Match: (the current md5 hash)”, send a 304 instead of 200.
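On the producing side, that reading could be sketched like this. The function names are hypothetical, and the comparison is simplified to a single exact ETag: it ignores lists of tags and the ”*” form that If-None-Match also allows.

```python
import hashlib


def make_etag(body):
    """Use the MD5 of the rendered feed as a quoted entity tag."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def respond(body, if_none_match=None):
    """Return (status, headers, payload) for a freshly rendered feed."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client already has this exact content: send headers only.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

The feed still has to be rendered once to compute the hash, but an unchanged feed costs only the response headers on the wire.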

I was going to say HEAD is better than nothing though If-Modified-Since and If-None-Match are preferable, but the HTTP spec says to ignore If-None-Match if Date > If-Modified-Since. The hypothetical case is for dynamically generated content in which Date is always now but ETags may stay the same; HTTP seems to say using both If-Modified-Since and If-None-Match in that case would result in a 200. Should clients use HEAD and compare themselves, so they can honor ETag over Date?

 
Comment by Garth Kidd #
2002-10-20 17:51:39

Having all clients poll is the problem, and I’m not sure coming up with different back-off algorithms or ”only poll at these times” subtlety is going to be able to tweak it to health. Even get-if-modified-since et al aren’t going to change the fact that you’re being hit N times per hour.

The simplest tweak I can think of is to [partly] centralise the polling with XML-RPC:

  • Aggregators register with a Radio Community Server or other convenient central point. This can be as simple as ”here’s a unique ID, please give me one in return”. Say,

    myGuid = myURL + "?reg=" + string(clock.time()) # secs since epoch
    mySessionPassword = rcs.registerMe(myGuid)

  • Aggregators then supply a list of the feeds they’re interested in:

    for feed in myFeeds:
        rcs.registerFeed(myGuid, mySessionPassword, feed.url, feed.lastSeenChange) # secs since some epoch, GMT/UTC?

  • Finally, at some acceptable interval, aggregators ask which feeds have changed:

    changedFeeds = rcs.changedFeeds(myGuid, mySessionPassword)

The whole reason for the guid/password thing is that the RCS can keep all the state in a temporary table or in memory if it needs to. If it bounces and loses its state, it can just hand out error responses, in which case the aggregator should re-register.

The RCS polls the feeds, implementing whatever back-off algorithms, cloud pub-sub, weblogs.com ping, IM notification, consideration of "only poll at these times" semantics, etc it can exploit to lighten the load. The aggregators don’t need any of this code, merely having to regularly ask which feeds have changed.

For added points, we could arguably have the RCS hand out the changed items, but there’s an awful lot of detail work in there to get it right. Keeping it a simple "tell me which of my feeds have changed" protocol like above ensures that both client and server developers can whip it up in no time.
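To make the shape of that protocol concrete, here is an in-memory sketch of the server side in Python. Everything in it is hypothetical: the class name, the snake_case method names (which mirror registerMe, registerFeed and changedFeeds), and record_change, which stands in for whatever polling or ping handling the server really does. The XML-RPC transport is omitted.

```python
import secrets


class ChangeServer:
    """Keeps all state in memory, so a restart simply forgets everyone."""

    def __init__(self):
        self._passwords = {}      # guid -> session password
        self._subscriptions = {}  # guid -> {feed url: last seen change}
        self._last_changed = {}   # feed url -> last change the server saw

    def register_me(self, guid):
        password = secrets.token_hex(8)
        self._passwords[guid] = password
        self._subscriptions[guid] = {}
        return password

    def _check(self, guid, password):
        if self._passwords.get(guid) != password:
            # Unknown or stale session: the aggregator should re-register.
            raise PermissionError("unknown session; please re-register")

    def register_feed(self, guid, password, url, last_seen_change):
        self._check(guid, password)
        self._subscriptions[guid][url] = last_seen_change

    def record_change(self, url, when):
        """Called by the server's own poller when it spots a change."""
        self._last_changed[url] = when

    def changed_feeds(self, guid, password):
        self._check(guid, password)
        changed = []
        for url, seen in self._subscriptions[guid].items():
            latest = self._last_changed.get(url)
            if latest is not None and latest > seen:
                changed.append(url)
                self._subscriptions[guid][url] = latest
        return changed
```

An aggregator would then only fetch the feeds named in the reply, however often it asks.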

 
Comment by mikel #
2002-10-20 18:56:24

Ah, I think I should draw attention to this working experiment in using the RCS to solve this problem.
Perhaps this can be a jumping off point for further improvements.

http://radio.weblogs.com/0100875/outlines/rcsAggregatorCache/

With this tool, the Radio Community Server can act as a RSS Cache for Radio Userland Aggregators.

The RCS is able to gather the most popular RSS feeds on behalf of an entire community of Radio Userland Users. The Users Aggregators request those feeds from the RCS, rather than directly from the source.

Content providers should receive significantly fewer requests for RSS feeds, resulting in less drain on bandwidth and server resources.

 
Comment by Charles Miller #
2002-10-20 20:35:57
 
Comment by Garth Kidd #
2002-10-20 20:58:34

I’d forgotten about your aggregatorCache! What does the protocol look like?

 
Comment by Phil Ringnalda #
2002-10-20 22:29:50

Okay, I asked Ben to stop MT from overwriting unchanged files during an index template rebuild (at least). We’ll see.

 
Comment by Phil Ringnalda #
2002-10-20 23:00:39

Second time around, I managed to get Mark’s point: if you follow Charles’ instructions, and return a 304 when you get the correct ETag but an If-Modified-Since that’s more recent than the one you are using, you’re not being HTTP 1.1 compliant, and you should at least realize that they may have had a reason for requesting that way, and may have expected to get the 200 the spec says they should get.

 
Comment by Charles Miller #
2002-10-20 23:04:54

That was entirely due to me misreading the spec. I shall modify my document.

 
Comment by Anonymous #
2002-10-21 04:34:14

Please, please let’s handle this on the HTTP level–not the RSS level. The fact that people suggest kluges to RSS (time-to-live, entity encoded HTML) before looking at the existing stuff (HTTP expiry date and XML namespaces) worries me. Getting to know the frameworks and existing services is good.

And let’s not forget that HTTP proxies exist.

 
Comment by Phil Ringnalda #
2002-10-21 07:40:14

Why conditional-GET won’t solve everything for all time, from a message from Jeremy Bowers on the radio-dev group:

The problem, as Dave pointed out on the thread, is that this only creates a linear savings over time. According to my copy of RU, Scripting News has 5000+ subscriptions, and AFAIK, that’s *just* RU subscribers on the main Userland RCS. It could be several times higher, I don’t know.

5000 * 500 * 24 = 60(decimal)MB a day, 420MB a week, about 1.5(real)GB a month. The whole (long-term) goal of this is to scale the system in a rapidly-approaching era where 5000 subscribers is at *best* a “medium” sized site, and we’re already hitting monthly limits on many ISP accounts with *just* the conditional get.

Imagine a world where a college student’s site such as Hack The Planet (before he graduated and got a job) gets 100,000 subscriptions. He updates too often for exponential backdown to be of any serious use (you can eke out a factor of two or so with aggressive settings, but not much more), and 100,000 * 500 * 24 is a gig a day _just_ for the conditional gets; send out 10K on average every fourth hourly request for the actual content and you’re into 100,000 * 10,000 * 6 = 6GB/day. That’s an expensive site; my ISP became annoyed and charged me extra for much less than that.

We can solve the problem now or later, but I don’t think it hurts to talk about what the solution might look like now.
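The arithmetic in that message can be checked with a few lines. This is only a sketch of the same multiplication, taking the message’s own figures as given (roughly 500 bytes per conditional GET exchange, hourly polling, 10K per full feed); the function name is made up for illustration:

```python
def daily_bytes(subscribers, bytes_per_request, requests_per_day):
    """Bytes served per day if every subscriber makes the same requests."""
    return subscribers * bytes_per_request * requests_per_day

# 5000 subscribers, one ~500-byte conditional GET per hour.
medium_site = daily_bytes(5000, 500, 24)

# 100,000 subscribers: the conditional GETs alone...
big_site_polls = daily_bytes(100_000, 500, 24)

# ...plus a 10K feed body on every fourth hourly request (6 per day).
big_site_content = daily_bytes(100_000, 10_000, 6)

for label, total in [
    ("5000 subscribers, polls only", medium_site),
    ("100,000 subscribers, polls only", big_site_polls),
    ("100,000 subscribers, feed bodies", big_site_content),
]:
    print(f"{label}: {total / 1_000_000:,.0f} MB/day, {total * 7 / 1_000_000:,.0f} MB/week")
```

The totals grow in direct proportion to the subscriber count, which is the "linear savings" point: conditional GET shrinks each request but does nothing about how many requests arrive.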

 
Comment by mb #
2002-10-21 15:18:04

HTTP already solves both of these problems, based on years of dealing with it.

Conditional GET is one half of the solution; expiration is the other half, and it is one of HTTP’s built-in concepts.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.2

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.21

Client side HTTP stacks already know how to handle much if not all of this. Server side, you’ll need a setting to set the Expires header right.
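As a rough sketch of the server-side half, here is one way a feed script might compute expiration headers. The function name and the one-day lifetime are assumptions for illustration; Cache-Control: max-age is added alongside Expires because HTTP/1.1 caches prefer it, which goes slightly beyond what the comment above mentions:

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def expiry_headers(now, max_age_seconds):
    """Build headers telling clients and proxies how long a feed stays fresh.

    `now` must be a timezone-aware UTC datetime so the date can be written
    in the GMT form HTTP expects.
    """
    expires = now + timedelta(seconds=max_age_seconds)
    return {
        "Expires": format_datetime(expires, usegmt=True),
        "Cache-Control": f"max-age={max_age_seconds}",
    }

# A feed that is rarely updated more than once a day: fresh for 24 hours.
now = datetime(2002, 10, 21, 15, 18, 4, tzinfo=timezone.utc)
headers = expiry_headers(now, 24 * 60 * 60)
for name, value in headers.items():
    print(f"{name}: {value}")
```

A client or proxy that honors these headers would not ask again until the date passes, and would then fall back to a conditional GET.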

 
Comment by mnot #
2002-11-11 08:32:46

I’m a bit surprised by the confusion re: conditional GET. To make it easier, people might want to look at cgi_buffer – this is a set of libraries (for Perl, Python and PHP) that automatically generate and validate ETags, as well as handle HTTP/1.0 persistent connections and do gzip compression with content encoding. It’s activated with a one-line include into your script, so it’s easy to use with just about anything.

(It’s implemented correctly and according to the specs; I’ve been active in the Web caching community for more than five years and was until recently the resident HTTP expert at Akamai Technologies)

Hope this helps,
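For readers who want to see the shape of ETag generation and validation, here is a minimal sketch. It is not cgi_buffer’s actual code; the function names and the choice of a SHA-256 content hash as the ETag are assumptions, and it deliberately ignores weak validators and If-Modified-Since:

```python
import hashlib

def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the response body (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'

def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None:
        # The header may carry several comma-separated tags, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

feed = b"<rss><channel><title>Example</title></channel></rss>"

# First visit: no validator, so the full feed comes back with an ETag.
status, headers, payload = respond(feed)
print(status, len(payload))

# Next visit: the client echoes the ETag and gets an empty 304.
status_again, _, payload_again = respond(feed, headers["ETag"])
print(status_again, len(payload_again))
```

Because the tag is derived from the bytes actually sent, it changes exactly when the feed changes, regardless of file timestamps.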

 
Comment by Bob Wyman #
2004-09-19 14:26:31

I think Joel’s problems, a bit old by now, would be somewhat relieved by supporting RFC3229 with the “feed” instance-manipulation method. Check out:

For a description of “feed” IM method:
http://bobwyman.pubsub.com/main/2004/09/using_rfc3229_w.html

For a list of servers and clients that support RFC3229 delta encoding with “feed”:
http://bobwyman.pubsub.com/main/2004/09/implementations.html

bob wyman
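A hedged sketch of what the client side of that exchange might look like. RFC 3229 defines the A-IM request header and the 226 (IM Used) status; the helper names and the wording of the status descriptions are assumptions for illustration, and the linked description remains the authority on how servers apply the “feed” method:

```python
def delta_request_headers(etag=None):
    """Headers for a feed request that advertises RFC 3229 "feed" support."""
    headers = {"A-IM": "feed"}
    if etag:
        # The ETag from the previous response tells the server what we have.
        headers["If-None-Match"] = etag
    return headers

def describe_status(status):
    """Explain how a client would read the three likely response codes."""
    meanings = {
        200: "full feed returned; replace the cached copy",
        226: "IM Used: response holds only entries newer than the cached copy",
        304: "not modified; nothing to download",
    }
    return meanings.get(status, "unexpected status; treat as an error")

request_headers = delta_request_headers('"abc123"')
print(request_headers)
for code in (200, 226, 304):
    print(code, describe_status(code))
```

A server that does not understand A-IM simply ignores it and answers 200 or 304 as usual, so sending the header costs a client nothing.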

 
Trackback by Sam Ruby #
2002-10-20 09:08:33

RSS bandwidth costs

Phil Ringnalda weighs in on Joel’s RSS problem. It is worth noting that there is an existing solution that significantly reduces bandwidth without affecting latency or content. It is called HTTP HEAD requests. Amphetadesk already supports it. Check

 
Trackback by Burningbird #
2002-10-20 12:13:08

Sunday afternoon Ramble

I was not surprised to get negative comments in my posting, Everything to do with her being a woman. I was surprised, and gratified, when a commenter going by the name of ”Lord Trickster” seemed to defend what I was

 
Trackback by markpasc.blog #
2002-10-20 23:08:56

If-Modified-Since: whenever

 
Trackback by Sam Ruby #
2002-10-21 08:02:02

The Digital Magpie Crows

Joel contemplates autogenerating a different rss feed for less polite aggregators. Within hours, Dave adds the code to Radio. Phil promptly takes credit for not inventing the idea. In the process, a new and somewhat more apt description for Phil’s weblog

 
Trackback by Ask Bjørn Hansen #
2002-10-23 02:59:49

If-Modified-Since and dynamic content

Brent added support for Etags and If-Modified-Since headers to the latest NetNewsWire beta. It’s very cool. He added it after hints and pressure from among others Joel, Phil, Sam and Mark; I’ll refrain from pointing out that I suggested it to him sever…

 
Trackback by Rant Central #
2002-10-28 09:14:55

RSS bandwidth and aggregators

Joel has a problem with RSS. Phil Ringnalda writes about the problem with aggregators that download RSS feeds hourly (or

 
Trackback by Seb's Open Research #
2003-11-25 05:30:13

Feed pounding and bandwidth issues

This worries the RSS geek in me a little bit.

 
Trackback by Richard Giles blog #
2004-10-13 06:12:27

Radio Killed The Internet Star.

Over the last few months there have been a few articles (one here another here) and concerns over RSS and the effec…

 