
Re: [syndication] robots.txt and rss



> > opinion:  RSS aggregators probably *should* respect robots.txt.
>
> How so?  How is the robots.txt file germane to a reader's behavior?

when i said 'aggregator', i wasn't necessarily talking about end-user
tools like radio, but more about large-scale tools and sites that collect
mountains of feeds at once.

this isn't really a short-term thing, but something that will be handy in
the *long* term.  let's say that i'm using RSS to syndicate indices of
content, and that the RSS is generated on-the-fly - maybe inside Zope or
some other environment.  well, i probably DON'T want a site like syndic8 to
re-collect those feeds on a daily or even weekly basis, if there's a lot
of content - HUGE server load could result.
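
a rough sketch of what "respecting robots.txt" could look like on the
aggregator side, in python using the standard urllib.robotparser module.
the agent names, paths and urls are all made up for illustration, and the
robots.txt text is parsed from a string so nothing touches the network:

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt from a publisher; agent name and paths are made up
ROBOTS_TXT = """\
User-agent: ExampleAggregator
Disallow: /feeds/

User-agent: *
Disallow:
"""


def collectable_feeds(robots_txt, user_agent, feed_urls):
    """return only the feed urls this user-agent is allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network
    return [url for url in feed_urls if parser.can_fetch(user_agent, url)]


# hypothetical feeds on one site
FEEDS = [
    "http://example.org/feeds/index.rss",
    "http://example.org/news.rss",
]

# the large-scale collector is kept out of /feeds/, the desktop reader is not
bulk = collectable_feeds(ROBOTS_TXT, "ExampleAggregator/1.0", FEEDS)
reader = collectable_feeds(ROBOTS_TXT, "SomeDesktopReader/0.9", FEEDS)
```

a real aggregator would fetch /robots.txt from each host before collecting,
but the filtering step would look much the same.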

and *who knows* how people might use the feature?  i sure don't.

> The only way a robots.txt file is going to have any relevance here is
> that the robots.txt file could indicate that a particular user-agent
> should NOT load from within a given part of the hierarchy.  This would
> be equivalent to not having the feed available.

maybe i want to block out specific software users or types of users, and
allow individuals to keep doing what they're doing.


> If you're interested in blocking the hammering of a feed then you'd need to use
> other means to do so.  Ban the IP address of the offending client machine.  Or
> use the server mechanisms to detect the user-agent and block it that way.

maybe i don't want to block the whole site, just an RSS feed - maybe
because of some peculiar interaction between a particular site (or
something exposed on it - maybe some sort of service?) and an aggregator.
(i'm making this up, clearly - sometimes that's useful.)

> But you're on a slippery slope here if you speak of banning user-agents in a
> wholesale manner and that's all robots.txt would allow.

nah, i just might want to use the per-file granularity that robots.txt
supports.

see the example at this url:

http://www.searchtools.com/robots/robots-txt.html
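
to make the per-file point concrete, here is a small sketch (python again,
standard urllib.robotparser). the agent name, host and file names are
hypothetical. one rule keeps one aggregator away from one feed file and
leaves everything else alone:

```python
from urllib.robotparser import RobotFileParser

# hypothetical rule: block a single feed file for a single user-agent
ROBOTS_TXT = """\
User-agent: ExampleAggregator
Disallow: /rss/full-index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parsed from a string, no network


def allowed(agent, path):
    """ask whether `agent` may fetch `path` on the (made-up) host."""
    return parser.can_fetch(agent, "http://example.org" + path)


AGENTS = ["ExampleAggregator", "SomeDesktopReader"]
PATHS = ["/rss/full-index.xml", "/rss/headlines.xml", "/index.html"]

# every (agent, path) pair mapped to True (allowed) or False (blocked)
results = {(a, p): allowed(a, p) for a in AGENTS for p in PATHS}

for (agent, path), ok in results.items():
    print(f"{agent:20} {path:24} {'allowed' if ok else 'blocked'}")
```

only the ExampleAggregator / full-index.xml pair comes back blocked. note
that Disallow matches by path prefix, so a rule like this would also cover
any path that merely starts with /rss/full-index.xml.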

~elijah