
Re: Google News, Syndication and stuff



S. Mike Dierken wrote:
>    Date: Wed, 25 Sep 2002 13:55:10 -0700
>    From: "S. Mike Dierken" <mdierken@hotmail.com>
> Subject: Re: Google News, Syndication and stuff

[snip]

> I like the 'here's my latest content' approach - but how would they verify
> the authority of the submission? The administrative overhead of passwords
> will probably put a damper on this approach.
> A simple 'site changed' ping that caused them to GET against a registered
> URL ensures them that the content is actually from the site it is supposed
> to be from.
> What if they just kept track of how often a site appeared to change and
> adjusted that site's schedule accordingly?

Mike, I have thought of implementing a similar idea before: have the
scraper keep track of whether or not the page has changed since the last
scrape, determine the approximate average update frequency, and then use
this to limit the number of useless scrapes.  The initial data collection
would involve as much scraping as current scraping systems do, but the
overall benefit in minimizing useless GETs would be phenomenal.  In
conjunction with the update ping, the scraper would also have ideal data
for determining average update frequency.  Which raises the question: if a
site were to start pinging a scraper every time it had been updated, would
the scraper still scrape the site at regular intervals?  Or would it simply
consider that site a "ping-style updating site" and then only GET the
material when it has been pinged?
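
For concreteness, here is a rough Python sketch of the scheduling logic I
mean.  Everything in it is an assumption on my part: the class, the
halving/growing factors, the ten-minute and one-day bounds, and the rule
that a pinging site stops being polled on a timer.

import time

class SiteSchedule:
    """Per-site polling schedule.  A rough sketch: the names, the
    back-off factors, and the bounds are illustrative, not a spec."""

    def __init__(self, url, default_interval=3600.0):
        self.url = url
        self.interval = default_interval  # seconds between scrapes
        self.last_fetch = 0.0
        self.pinged = False        # site announced an update we haven't fetched
        self.ping_capable = False  # site has ever pinged us

    def record_fetch(self, changed, now=None):
        """After a scrape, shrink or grow the interval based on whether
        the content actually changed since last time."""
        now = time.time() if now is None else now
        self.last_fetch = now
        self.pinged = False
        if changed:
            self.interval = max(600.0, self.interval * 0.5)    # poll more often
        else:
            self.interval = min(86400.0, self.interval * 1.5)  # useless GET: back off

    def record_ping(self):
        """The site pinged us: fetch on the next pass, and remember
        that this site is capable of pinging."""
        self.pinged = True
        self.ping_capable = True

    def due(self, now=None):
        now = time.time() if now is None else now
        if self.pinged:
            return True
        if self.ping_capable:
            # Treat it as a "ping-style updating site": trust the pings
            # and stop polling on a timer.  (One possible answer to the
            # question above.)
            return False
        return now - self.last_fetch >= self.interval

# e.g.:
# site = SiteSchedule("http://example.org/index.rdf")
# site.record_fetch(changed=False)  # nothing new: interval grows to 5400s
# site.record_ping()                # site announces an update
# assert site.due()                 # fetch immediately, ignore the timer

The ping_capable flag is where my question bites: once it is set, this
sketch trusts pings exclusively, though a cautious scraper might keep a
slow safety-net poll in case pings get lost.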

Gordon Bonnar
gordonbonnar@hotmail.com