
Re: [syndication] robots.txt and rss



> when i said 'aggregator', i wasn't necessarily talking about end-user
> tools like radio, but more about large-scale tools and sites that collect
> mountains of feeds at once.

Of which there are what, a half-dozen or so?

> this isn't really a short-term thing, but something that will be handy in
> the *long* term.  let's say that i'm using RSS to syndicate indices of
> content, and that the RSS is generated on-the-fly - maybe inside Zope or
> some other environment.  well, i probably DONT want a site like syndic8 to
> re-collect those feeds on a daily or even weekly basis, if there's a lot
> of content - HUGE server load could result.

So if it's not a lightweight newsfeed, don't list it.  Or, as we've recently
implemented for Newsisfree and soon for others, ask Syndic8 to use a different
polling interval.  There's a way to have it listed but never polled.  This is
usually reserved as a temporary setting for feeds in a defective state.  But it
certainly could be utilized.  Mike Krus asked Syndic8 to poll less frequently.
Others can do the same.  At some point (hopefully) the feeds will be able to
embed this in their XML directly.
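
The RSS 1.0 syndication module already offers vocabulary along those lines; as
an illustration (example.org is just a placeholder), a channel asking to be
fetched no more than once a day could carry:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
             xmlns="http://purl.org/rss/1.0/">
      <channel rdf:about="http://example.org/index.rdf">
        <title>Example feed</title>
        <link>http://example.org/</link>
        <!-- hint to pollers: once per day is plenty -->
        <sy:updatePeriod>daily</sy:updatePeriod>
        <sy:updateFrequency>1</sy:updateFrequency>
      </channel>
    </rdf:RDF>

Whether any given aggregator actually honors those hints is, of course, a
separate question.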

> and *who knows* how people might use the feature?  i sure don't.

Implementing something without at least some clear understanding of who might
actually use it is no way to go about inventing functionality.  The code just
isn't going to write itself without at least a use case explaining it.

I have an archiving meta module in development.  That'd be one way to have a
feed indicate what it does or doesn't want done with its content.
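
Purely as a hypothetical sketch (the module isn't published, so every element
name and the namespace below are invented for illustration), a channel might
end up carrying something along these lines:

    <channel xmlns:arch="http://example.org/archiving-module-placeholder/">
      <!-- invented elements, not a real namespace or spec -->
      <arch:allowArchiving>false</arch:allowArchiving>
      <arch:retainFor>P7D</arch:retainFor>
    </channel>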

> maybe i want to block out specific software users or types of users, and
> allow individuals to keep doing what they're doing.

And robots.txt has NO mechanism for doing that at the level you describe.  Its
rules key only off the User-Agent string, so every copy of a given client gets
treated the same way; it can't tell one user of that software from another.
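
For illustration, the finest granularity robots.txt offers is a record per
User-Agent string plus a path; the client name below is a placeholder, not any
real reader:

    # block one named client from the feed, allow everyone else
    User-agent: ExampleFeedReader
    Disallow: /index.rdf

    User-agent: *
    Disallow:

Every copy of that client gets the same treatment; there's no way to wave one
person's copy through while turning another's away.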

> maybe i don't want to block the whole site, just an RSS feed - maybe
> because of some peculiar interaction between a particular site (or
> something exposed on it - maybe some sort of service?) and an aggregator.
> (i'm making this up, clearly - sometimes that's useful.)

If I'm reading this correctly, you want to avoid the possible controversies of
other sites?  Again, that's not something robots.txt supports.  Using server
directives is THE way to accomplish that.  Just look at how Apache can be
configured to block people from using IMG links on their pages that point to
images on your servers.  That's one way to block things and, actually, would be
a fine way to block the RSS readers that set the Referer header incorrectly.
But, again, that's only valid if you wanted to block ALL access using that
software.
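
As a rough sketch of what that looks like, assuming mod_rewrite is loaded and
with example.org standing in for your own host:

    RewriteEngine On
    # allow empty referrers (direct requests, some proxies strip the header)
    RewriteCond %{HTTP_REFERER} !^$
    # refuse image requests referred from anywhere but your own site
    RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.org/ [NC]
    RewriteRule \.(gif|jpe?g|png)$ - [F]

The same pattern could match on %{HTTP_USER_AGENT} instead, which is exactly
why it only works as an all-or-nothing block on a given piece of software.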

> nah, i just might want to use the per-file granularity that robots.txt
> supports.

Sure, but per-file blocking that an aggregator would notice doesn't address the
fact that it isn't the aggregators wasting the bandwidth.  It's poorly
configured reader programs on desktops and portals that are making this
mistake.

Again, I don't disagree with the idea of managing bandwidth more effectively.
It's just that robots.txt is not the way to do it.

-Bill Kearney