
Scraping sites with code

To: syndication@yahoogroups.com
Subject: Scraping sites with code
From: Julian Bond <julian_bond@voidstar.com>
Date: Mon, 8 Sep 2003 08:29:18 +0100
User-agent: Turnpike/6.02-U (<NihPaT1eq7QPEjqHWIMIZRyCL3>)

Somebody has just told me that Google News looks for "PHP" in the User-Agent header and returns 403 Forbidden if it finds it. It seems likely that they also look for other auto-generated code names.

FWIW http://www.google.com/robots.txt also bans robots from all the main directories.

I wonder how many other sites that routinely get scraped to produce RSS do the same thing.

For those of you using gnews2rss.php, coding round this is easy and left as an exercise for the PHP programmer. Using Curl or looking at the documentation for fopen() will solve the problem.
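
For anyone who wants a starting point, here is a minimal sketch of the two routes mentioned above. It is not the actual gnews2rss.php code. The function names and the User-Agent string are made up for illustration, and the curl version assumes the curl extension is available in your PHP build.

```php
<?php
// Assumption: this User-Agent string is invented for the example. Use one
// that honestly identifies your script and says how to contact you.
const SCRAPER_USER_AGENT = 'ExampleFeedFetcher/0.1 (+http://example.com/contact)';

// Route 1: the fopen() wrappers. PHP's "user_agent" ini setting controls the
// User-Agent header sent when fopen() is used on an http:// URL.
function fetch_with_fopen($url, $userAgent = SCRAPER_USER_AGENT)
{
    ini_set('user_agent', $userAgent);
    $handle = @fopen($url, 'r');
    if ($handle === false) {
        return null; // could not open (for example, the server answered 403)
    }
    $body = stream_get_contents($handle);
    fclose($handle);
    return $body === false ? null : $body;
}

// Route 2: the curl extension.
function fetch_with_curl($url, $userAgent = SCRAPER_USER_AGENT)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status >= 400 as a failure
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

Both functions return the page body as a string, or null if the request failed, so a script could call either one with the URL it already scrapes and carry on parsing as before. Whether a site that bans robots in robots.txt ought to be fetched this way at all is a separate question.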

Meanwhile, isn't it about time Google produced RSS themselves as an alternate output format from Search and News Search? Way back in June a senior Google person told me it was coming soon.


--
Julian Bond Email&MSM: julian.bond@voidstar.com
Webmaster:              http://www.ecademy.com/
Personal WebLog:       http://www.voidstar.com/
M: +44 (0)77 5907 2173   T: +44 (0)192 0412 433
