Lies, damn lies, and mod_syndication

Since Bill mentioned supporting the RSS syndication module, I’ve been looking at the way it is actually used, and I have to say that I hope no aggregator I use supports it, at least not until people start using it right.

mod_syndication borrows three elements from Ian Davis’s OCS format: updatePeriod, updateFrequency, and updateBase. In OCS, all three are defined as optional, though the RSS spec doesn’t mention whether they are or not. If updatePeriod is omitted, it’s assumed to be “daily”, and if updateFrequency is omitted, it’s assumed to be 1.

If you give only updatePeriod and updateFrequency, then you are just saying “I think it’s appropriate for an aggregator to poll me this often” (to which I say “I know more about how I use my aggregator than you do, so keep your hints to yourself: if I’m only online for an hour and fifteen minutes before work, I want to poll all my subscriptions twice, and you can jolly well take that bandwidth hit”).

However, if you include an updateBase, you are saying that you update on a schedule, and that aggregators shouldn’t expect any content outside that schedule. Suppose I updated my site religiously at midnight, 6 am, noon, and 6 pm (stop that giggling about the idea of me updating regularly, you!). By saying:

<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>4</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00-08:00</sy:updateBase>

I would tell compliant aggregators “after you do your first update after noon, whenever that might be, you should then wait until after 6 pm to update again,” and the same for an update after 6 pm, midnight, and 6 am. You don’t have to check at noon precisely (in fact, you’d be foolish to risk cutting it that close), but having checked once between noon and 6 pm, I guarantee that you won’t find anything new until after 6.
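To make that reading concrete, here is a minimal Python sketch of how an aggregator might decide whether a poll is due. It assumes the interpretation of updateBase described above; the function names and the `PERIODS` table are invented for illustration and come from no spec or existing aggregator.

```python
from datetime import datetime, timedelta, timezone

# Period lengths. "monthly" and "yearly" are left out on purpose:
# their lengths vary, and it is unclear how they should be divided.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def last_scheduled_update(now, base, period="daily", frequency=1):
    """Return the most recent scheduled update time at or before `now`."""
    interval = PERIODS[period] / frequency
    elapsed_intervals = (now - base) // interval
    return base + elapsed_intervals * interval

def poll_is_due(last_poll, now, base, period="daily", frequency=1):
    """True if a scheduled update time has passed since the last poll."""
    if last_poll is None:
        return True
    return last_poll < last_scheduled_update(now, base, period, frequency)

# The example feed: four times a day, anchored at noon, UTC-8.
pst = timezone(timedelta(hours=-8))
base = datetime(2000, 1, 1, 12, 0, tzinfo=pst)
polled = datetime(2002, 11, 11, 12, 5, tzinfo=pst)

print(poll_is_due(polled, datetime(2002, 11, 11, 14, 30, tzinfo=pst), base, "daily", 4))
print(poll_is_due(polled, datetime(2002, 11, 11, 18, 1, tzinfo=pst), base, "daily", 4))
```

Having polled at 12:05, the sketch reports that nothing is due at 14:30 and that a poll is due again at 18:01.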

However, when it’s misused by a site which doesn’t update on a regular schedule, updateBase says foolish and damaging things. To pick on an anonymous site I was just looking at, using:

<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+01:00</sy:updateBase>

says “this site is updated once per day, at noon in Western Europe.” Once an aggregator which supports mod_syndication has gotten the feed at five after noon, it shouldn’t try again until after noon the next day. In fact, as is typical of the sites I’ve seen using updateBase, it updates at any old time: some days there’s an update at 9-something in the morning and another at 3-ish in the afternoon, other days there’s nothing at all. Say one day the site updates twice in the afternoon, and then once the next morning. If I use an aggregator which supports mod_syndication, and the first day I fire it up shortly after noon, then leave it running all afternoon, I won’t see a single one of the afternoon updates, since they told me that there wouldn’t be anything new until noon the next day. Next day, I start my aggregator when I get up, and shut it down shortly before noon. I still won’t see the updates from the day before, nor will I see the updates from that morning, because my aggregator is still waiting for noon to roll around so it can check again.
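That scenario can be played out in a few lines of Python. This is only a sketch under assumptions: the post times (14:00, 16:00, and 9:30 the next morning), the hourly scan times, and the helper names are all made up to match the story above, not taken from any real feed or aggregator.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1))
BASE = datetime(2000, 1, 1, 12, 0, tzinfo=CET)  # updateBase from the feed
INTERVAL = timedelta(days=1)                     # daily, frequency 1

def fetch_times(scan_times, honor_schedule):
    """Return the scans at which the aggregator actually fetches the feed."""
    fetched, last_fetch = [], None
    for scan in scan_times:
        # Most recent scheduled update at or before this scan.
        last_slot = BASE + ((scan - BASE) // INTERVAL) * INTERVAL
        if not honor_schedule or last_fetch is None or last_fetch < last_slot:
            fetched.append(scan)
            last_fetch = scan
    return fetched

def posts_seen(posts, fetches):
    """Posts that were already published at the time of some fetch."""
    return [post for post in posts if any(fetch >= post for fetch in fetches)]

def at(day, hour, minute=0):
    return datetime(2002, 11, day, hour, minute, tzinfo=CET)

# Day one: running from just after noon through the afternoon.
# Day two: running from 7 am until shortly before noon.
scans = [at(11, h, 5) for h in range(12, 18)] + [at(12, h, 5) for h in range(7, 12)]
posts = [at(11, 14), at(11, 16), at(12, 9, 30)]  # hypothetical post times

compliant = posts_seen(posts, fetch_times(scans, honor_schedule=True))
naive = posts_seen(posts, fetch_times(scans, honor_schedule=False))
print(len(compliant), len(naive))
```

With these made-up times, the schedule-honoring aggregator fetches once at 12:05 on the first day and sees none of the three posts, while the one that ignores the hints sees all three.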

So: for those rare sites that are updated by the clock, rather than by when content is available, mod_syndication with an updateBase is a good thing, but for most sites, it’s either an unwelcome suggestion about what the owner thinks is an appropriate polling period, or a positive nuisance, denying you updates if you happen to use your aggregator at the wrong time of the day. I’m pretty sure there are vastly more sites that can say “I never update during these hours,” which is what skipHours says, than there are sites that can say “I update with this exact frequency, starting at this time of day.” Using RSS 1.0 rather than 0.9x/2.0? No problem, just use mod_rss091, add the namespace, and you can use

<rss091:skipHours>
<rss091:hour>10</rss091:hour>
</rss091:skipHours>

Erm. Note to implementers of skipHours: could you please support both skipHours and http://purl.org/rss/1.0/modules/rss091#skipHours?
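Supporting both forms need not be much work. Here is a rough Python sketch using the standard library's ElementTree; the function name is hypothetical, and it assumes that the mod_rss091 namespace URI is the part of the URL above before skipHours.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for mod_rss091 (the URL above, minus "skipHours").
RSS091_NS = "http://purl.org/rss/1.0/modules/rss091#"

def skip_hours(channel):
    """Collect skipped hours from plain or rss091-namespaced skipHours."""
    hours = set()
    for prefix in ("", "{" + RSS091_NS + "}"):
        for hour in channel.findall(f"{prefix}skipHours/{prefix}hour"):
            text = (hour.text or "").strip()
            if text.isdigit():
                hours.add(int(text))
    return hours

plain = ET.fromstring(
    "<channel><skipHours><hour>10</hour></skipHours></channel>"
)
namespaced = ET.fromstring(
    '<channel xmlns:rss091="http://purl.org/rss/1.0/modules/rss091#">'
    "<rss091:skipHours><rss091:hour>10</rss091:hour></rss091:skipHours>"
    "</channel>"
)

print(skip_hours(plain), skip_hours(namespaced))
```

Both toy channels yield the same set of hours, so the rest of the aggregator never needs to know which flavor the feed used.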

17 Comments

Comment by Morten Frederiksen #
2002-11-11 05:22:39

Hi Phil,

I agree with your position, to a certain degree.

The syndication module may not be perfectly suitable for blogs with multiple and unevenly distributed daily updates, but for a lot of other cases it is much more useful.

Likewise, the skipHours (and skipDays) elements are not very useful for, e.g., something that is updated every 5 hours.

Also, consider the reverse case of using the format (as in OCS) for scheduling – it is quite useful for a wide variety of feeds, not just the ones from blogs.

I would think that a combination of the two ways of specifying updates would be preferred, but I’m not quite sure how that would look. Perhaps it would be more of a “next projected update” element (TTL?) that is updated continually by the producer, but that would to some degree ruin the If-Modified-Since etc. approaches.

BTW, the example with a misconfigured interval is not something that should be said against the format of the syndication module itself; the same negative results can be obtained by misusing skipHours.

 
Comment by Bill Kearney #
2002-11-11 12:51:25

Erm, no, how about just using the 2.0 elements instead?

xmlns:rss2="http://whatever/its/uri/uses"

and then <rss2:skipHours> etc…

The trick, of course, is getting readers to grasp this convolution.

 
Comment by Phil Ringnalda #
2002-11-11 13:08:13

“But Bill,” he said naively, “RSS 2.0 doesn’t have a namespace!”

As you well know.

Also, there’s no reason to use the RSS 2.0 namespace to get an element that’s been around since RSS 0.91, when there’s an existing module that for all we know someone already supports. Now the RSS 2.0 comments element, that’s a different matter, and one that I’ll want to pursue at some point, but not right this minute.

 
Comment by Bill Kearney #
2002-11-11 13:35:49

I find it much more likely a feed is going to offer that it’s updated daily, weekly, monthly than to jump through the skipHours and skipDays hoops.

CONTINUED…

 
Comment by Phil Ringnalda #
2002-11-11 15:17:35

Morten – I probably should have better explained that I didn’t object to the existence of mod_syndication, just that I don’t feel it should be implemented by aggregators while every single feed I could find using it was misusing it.

With enough explanation in the spec or linked to the spec, saying that mod_syndication is only workable if you update by the clock, and if you don’t you shouldn’t provide an updateBase and shouldn’t be surprised when your suggestion about frequency is ignored, and you might be better off using rss091:skipHours instead, I think eventually mod_syndication could be worth supporting. Now, no.

 
Comment by Dave Winer #
2002-11-11 16:43:31

Phil do you have a feed that uses <rss091:skipHours> and <rss091:hour>?

 
Comment by Phil Ringnalda #
2002-11-11 16:47:33

Sure thing, it’s in http://philringnalda.com/index.rdf right now.

 
Comment by Daryl Oidy #
2002-11-12 02:05:07

I’m not sure I follow your interpretation of the syndication spec. It indicates when you can reasonably expect the site to have been updated; it does not forbid you from checking the site at other times. Why should it matter whether or not your aggregator is running at the updateBase time (or any other update time derived from it)? If you shut it down at 11:30am and don’t fire it up again until 5pm, I don’t see any reason it shouldn’t remember that it last checked for an update at 12:01pm yesterday, and determine that you’re due for another update.

(One improvement I would like to see is in the Example section; what’s the point of giving examples of the XML tags without stating what they’re supposed to mean? One sentence explaining what information those example tags are intended to convey would make things a lot clearer I think.)

And I’m really lost regarding that “I know more about how I use my aggregator than you do, so keep your hints to yourself: if I’m only online for an hour and fifteen minutes before work, I want to poll all my subscriptions twice, and you can jolly well take that bandwidth hit” comment. If I only ever update on Sundays and Wednesdays, in what way can your browsing habits ever justify hitting the site twice on a Friday morning? The RSS feed is trying to help you here. It’s saying “save yourself the effort, I’m not going to have any new content”. Why would you want to ignore that?

 
Comment by Dave Winer #
2002-11-12 06:37:49

Bravo Daryl!

The rare weblog post where I actually learned something new.

“If you shut it down at 11:30am and don’t fire it up again until 5pm, I don’t see any reason it shouldn’t remember that it last checked for an update at 12:01pm yesterday, and determine that you’re due for another update.”

That’s a good idea that will appear in my software.

Thank you.

 
Comment by Phil Ringnalda #
2002-11-12 07:35:32

If you only ever update on Wednesday and Sunday (and you can find a way to tell me that with mod_syndication, which I don’t think you can), then there’s no reason to hit you twice Friday morning. However, that’s not the situation I found by looking at the top 80 hits for the namespace URL on Google (the only way I could find to see how people are actually using the module). What I found was two things: people using just an updatePeriod and updateFrequency to say “I update once a day” or “I update every two hours”, when looking at the timestamps on their posts shows that neither one is true, and people also using an updateBase to say “I update once a day, at noon” or “I update four times a day, at noon, 6, midnight, and 6”, which was completely and utterly false.

I think I know a thing or two about irregular and infrequent updating myself, but just because I don’t post very often doesn’t mean I should tell aggregators that I only post once a day, which would result in people who happened to check some time before my one post not getting to see it until the next day. That’s why I think the sort of scheduling hints that mod_syndication offers are unwelcome: my aggregator just ran at 7:10, and will run again at 8:10, and then that will be it until after work tonight. If anyone updated between 7:10 and 8:10, whether they claim to only update twice a day or twice an hour, I want my aggregator to check, because I know how I’m using it and I know that if I don’t catch it now I won’t be looking again for hours.

And certainly my aggregator in my example should check when it restarts at 5 the second day (how else could it work?), but by then it will be picking up day-old news, stuff I’ve long since seen in a dozen weblogs and on Daypop. That’s not why I use an aggregator. I’ve got no problem with you telling my aggregator to stay away at times when you’re asleep/the office is closed, which skipHours does, but the only thing mod_syndication offers to people who don’t update on a precise schedule is a way to avoid a few 200-byte If-Modified-Since requests by denying people fresh content until they’ve waited long enough. No thanks, I’ll just point my browser at your page, and force-refresh at 50KB per until something new pops up.

 
Comment by Morten Frederiksen #
2002-11-12 18:00:26

Sorry Phil, I think you’re making the same mistake again – confusing bad usage with bad module design.

You could equally well have used an example with “invalid” skipHours usage, which would leave you just as news-starved…

That said: People make mistakes, and it could very well be because of bad design or bad documentation – methinks the last option seems to be the first one to investigate and fix.

 
Comment by Phil Ringnalda #
2002-11-12 22:51:48

Damn. I go along thinking that I’m a reasonably capable writer, and then something like this makes it clear to me that I can’t communicate at all.

I haven’t intended to say word one about design. All I wanted to talk about was implementation. Bill said that Radio should implement mod_syndication along with skipHours, so I looked at how people were actually using it, and came to the conclusion that it would be a bad idea to implement it with the way it’s currently being used.

As to design, I think mod_syndication is really quite nice, for sites that should be using it. If they could be bothered with an RSS feed Corante would be perfect for it: they say that they update by 10:30 am EST (well, by 10:00 on interior pages, but by some specific time anyway), so they could have an RSS feed with updatePeriod daily, updateFrequency 1, and updateBase 2001-01-01T10:35:00-05:00, and compliant aggregators would know that they only need to check once a day, on their first scan after 10:35 (or more precisely, they need to check on a given scan if any 10:35 am has passed since their last scan). If you update more (or less) often, but on some regular schedule, it’s great for you. Even if you don’t actually update on every one of the scheduled times, if you update on the hour every three hours during the day, it’s still worthwhile using mod_syndication to put aggregators on a three hour check schedule (especially if you can also use skipHours to tell them when you are closed or asleep).

But, if you don’t update on a regular schedule, but still use mod_syndication to claim that you do, then not only are you lying to your readers, you end up making any aggregator which supports mod_syndication look worse to its users than one which doesn’t. Say Hayseed doesn’t support it, and BarbituateDesk does. Someone like Mark, who claims that he updates every three hours starting at noon UTC, posts something so brilliant/hilarious that you know everybody and his dog will be linking to it, at 12:45 UTC. Hayseed picks it up right away, but BarbituateDesk knows that he won’t update again until 15:00, so it doesn’t even bother squandering a couple hundred bytes on an If-Modified-Since. So BarbituateDesk users see links to his great new post for several hours before they see his post in their aggregator, and eventually they put two and two together, and realize that by switching they can get their news at the same time as everyone else. By supporting misused mod_syndication, BarbituateDesk has made itself not just less useful to its users, but obviously less useful.

Could skipHours be screwed up the same way? Sure, but you have to make an actual effort to do it. Bill linked to someone who did just that: they say to skip the hours when they are likely to be sleeping, and then they have skipHours for two out of every three hours the rest of the day. It’s certainly not something that I would do, but at least they are still making an RSS feed available, and there’s not the slightest question that it’s exactly what they intended to do. With mod_syndication, especially with people using an updateBase, I’m pretty sure they looked at it, said “how often do I update? mostly once a day. updateBase? huh? I’ll put in the one that’s in the example.”

To mess up your feed so that it can take two days before someone sees a new post, in mod_syndication:

<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>

(copied from the spec, changed to daily/1)

To do the same thing in skipHours:

<skipHours>
<hour>0</hour>
<hour>1</hour>
<hour>2</hour>
<hour>3</hour>
<hour>4</hour>
<hour>5</hour>
<hour>6</hour>
<hour>7</hour>
<hour>8</hour>
<hour>9</hour>
<hour>10</hour>
<hour>11</hour>
<hour>13</hour>
<hour>14</hour>
<hour>15</hour>
<hour>16</hour>
<hour>17</hour>
<hour>18</hour>
<hour>19</hour>
<hour>20</hour>
<hour>21</hour>
<hour>22</hour>
<hour>23</hour>
</skipHours>

Seems to me that people are a bit less likely to accidentally do that.

Now that everyone has gotten completely tired of reading, I do actually have some thoughts about design, despite the fact that it’s spilled and long since dried milk: it was a really unfortunate idea to make skipHours and skipDays in UTC/GMT time. The conversion from whatever number is in an hour element to local time will always be done by a computer, and they are good at that sort of thing. However, other than Radio users, skipHours will probably mostly be written by humans, and anyone who has ever called or been called more than three timezones away knows how bad humans are at timezone conversion. If I could just say that I’m in zone -08:00 (or, far better, that I’m in a particular zone with a particular daylight savings regime), then I could be sure that I’m actually saying to skip the hours I intend.
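The conversion a computer could be doing for me is tiny. Here is a minimal Python sketch that turns hours stated in local time into the UTC hours skipHours expects. The function names are made up, it only handles whole-hour offsets, and it ignores daylight saving entirely.

```python
def local_hours_to_utc(local_hours, utc_offset_hours):
    """Convert local skip hours to UTC hours, for a whole-hour UTC offset."""
    return sorted({(hour - utc_offset_hours) % 24 for hour in local_hours})

def render_skip_hours(utc_hours):
    """Render a plain skipHours element, one hour element per line."""
    lines = ["<skipHours>"]
    lines += [f"<hour>{hour}</hour>" for hour in utc_hours]
    lines.append("</skipHours>")
    return "\n".join(lines)

# Asleep from 11 pm to 7 am local time, in zone -08:00.
asleep_local = [23, 0, 1, 2, 3, 4, 5, 6]
asleep_utc = local_hours_to_utc(asleep_local, -8)
print(render_skip_hours(asleep_utc))
```

For that example the UTC hours come out as 7 through 14, which is not a range many people would work out correctly in their heads.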

The situation with skipDays is even worse: it’s moderately usable for Europe, and useless for the rest of the world. If someone in California says to skip Saturday and Sunday, because they have a feed that’s only updated weekdays, they have actually said to skip from 4pm Friday through 4pm Sunday. In parts of Australia, they’ve said to skip from 9am Saturday through 9am Monday. I don’t think it’s fixable: just because we don’t know anyone using it, as a producer or as a consumer, doesn’t mean nobody is, and I don’t think it’s appropriate to change a spec that’s been stable for that long. But if both sides are going to keep being pissy about using the other flavor’s stuff, and someone reinvents skipHours and skipDays in RSS 1.0, I hope they consider doing it in localtime with an offset tag, so that a computer will do the conversion instead of a human.
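The arithmetic behind those windows is easy to check. This Python sketch, with invented names and fixed whole-hour offsets rather than real timezone rules, shows what a UTC-based skipDays range looks like on a local clock.

```python
from datetime import datetime, timedelta, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def local_skip_window(first_day, last_day, utc_offset_hours):
    """Return (start, end) of a UTC skipDays range as local "Day HH:MM" strings."""
    local = timezone(timedelta(hours=utc_offset_hours))
    # Midnight UTC on Monday 2002-11-11, used only as a reference week.
    monday = datetime(2002, 11, 11, tzinfo=timezone.utc)
    start = (monday + timedelta(days=DAYS.index(first_day))).astimezone(local)
    end = (monday + timedelta(days=DAYS.index(last_day) + 1)).astimezone(local)

    def label(moment):
        return f"{DAYS[moment.weekday()]} {moment:%H:%M}"

    return label(start), label(end)

print(local_skip_window("Saturday", "Sunday", -8))
print(local_skip_window("Saturday", "Sunday", 9))
```

An offset of -8 hours gives Friday 16:00 through Sunday 16:00, and an offset of +9 hours gives Saturday 09:00 through Monday 09:00.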

 
Comment by Morten Frederiksen #
2002-11-13 05:05:13

Phil,

Good points about the probability of misuse when copying examples.

Rereading all your posts on this, I think I was a little quick pointing out your “mistake” – you state quite clearly that it’s the current situation that leads to your decision – sorry about that.

Of course, bad current use and implementations shouldn’t discourage us from trying to aim for something better.

Also, fine point about removing timezone issues from “userspace” and the problems with skipDays. If skipDays isn’t usable, how do we handle feeds with less than daily updates? Perhaps just limiting the scope for RSS (it’s supposed to be news, right?), saying “just use skipHours – if you update less often, even only weekly, just indicate all hours except one…”?

 
Comment by Phil Ringnalda #
2002-11-14 00:42:31

If you have a popular but rarely updated feed, I wouldn’t think that skipHours{all but one} would be a very good idea, would it? Even if you’re only returning 304s most days, cramming them all into one hour doesn’t sound like it would make your server very happy.

Depending on how you interpret the meaning of mod_syndication, it might actually work for some irregular but infrequent situations: if you update once, sometime during the day on Monday, Wednesday, and Friday, one interpretation of:

<sy:updatePeriod>weekly</sy:updatePeriod>
<sy:updateFrequency>3</sy:updateFrequency>
<sy:updateBase>2002-11-11T08:00-08:00</sy:updateBase>

would be “start polling at 8 am Monday, keep going at your regular interval until you get an update, then stop until 8 am Wednesday and then 8 am Friday.” Of course, it would be just as reasonable to interpret it as saying “poll me at 8 am Monday, then again every 56 hours after that.” There seem to be cases where it would be more useful to treat the next smaller updatePeriod as atomic, and others where it wouldn’t: months and weeks would be especially troublesome.
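The two readings give different schedules, as a short Python sketch shows. Both functions are hypothetical, and the whole-day version is just one guess at what treating days as atomic could mean (it rounds the spacing down to whole days).

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))
BASE = datetime(2002, 11, 11, 8, 0, tzinfo=PST)  # updateBase: Monday, 8 am

def equal_interval_slots(base, frequency):
    """Split the week into `frequency` equal intervals, starting at base."""
    interval = timedelta(weeks=1) / frequency
    return [base + i * interval for i in range(frequency)]

def whole_day_slots(base, frequency):
    """Treat days as atomic: round the spacing down to whole days."""
    step = timedelta(days=7 // frequency)
    return [base + i * step for i in range(frequency)]

for slot in equal_interval_slots(BASE, 3):
    print("equal intervals:", slot.strftime("%Y-%m-%d %H:%M"))
for slot in whole_day_slots(BASE, 3):
    print("whole days:     ", slot.strftime("%Y-%m-%d %H:%M"))
```

Equal 56-hour intervals land on Monday 08:00, Wednesday 16:00, and midnight at the end of Friday, while the whole-day reading lands on 08:00 Monday, Wednesday, and Friday. An aggregator and a publisher that pick different readings will disagree about when anything is due.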

 
Trackback by metaGarbage #
2002-11-11 09:02:28

Do the RSS

Phil Ringnalda has a nice post about update intervals in RSS feeds. After reading this, I’ve made the following changes

 
Trackback by A young man #
2002-12-07 05:44:54

More RSS junk .. um learning to do

One of these days, I’ll get over my hangup on RSS that I’m dealing with now. Until then, sorry.

 
Trackback by AYM - Links #
2002-12-18 11:55:52

More RSS junk .. um learning to do

One of these days, I’ll get over my hangup on RSS that I’m dealing with now. Until then, sorry.

 