
Scraping sites with code



Somebody has just told me that Google News looks for "PHP" in the User-Agent header and returns 403 Forbidden if it finds it. It seems likely that they also look for the default names that other scripting tools and libraries send.

FWIW http://www.google.com/robots.txt also bans robots from all the main directories.

I wonder how many other sites that routinely get scraped to produce RSS do the same thing.

For those of you using gnews2rss.php, coding round this is easy and is left as an exercise for the PHP programmer. Sending a different User-Agent string, either by using cURL or by following the documentation for fopen(), will solve the problem.
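
For anyone who wants a starting point, here is a rough sketch of both routes. It is not taken from gnews2rss.php; the function names, the FEED_USER_AGENT constant and its value are all made up for illustration, and the 10 second timeout is an arbitrary choice. It only shows where the User-Agent string gets set: in an HTTP stream context for the fopen() family, and through CURLOPT_USERAGENT for cURL.

```php
<?php
// Hypothetical placeholder value: pick a string that identifies your own script.
const FEED_USER_AGENT = 'ExampleFeedFetcher/0.1 (+http://example.com/contact)';

// Route 1: fopen()/file_get_contents() with an HTTP stream context.
// The 'user_agent' context option sets the User-Agent request header.
function build_http_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds, arbitrary
        ],
    ]);
}

// Returns the response body as a string, or false on failure
// (by default an HTTP error status such as 403 also yields false).
function fetch_with_stream(string $url, string $userAgent)
{
    return @file_get_contents($url, false, build_http_context($userAgent));
}

// Route 2: the cURL extension.
// Returns the response body as a string, or false on failure.
function fetch_with_curl(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status >= 400 as a failure
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // seconds, arbitrary
    return curl_exec($ch);
}
```

Either function would be called as, say, fetch_with_curl($url, FEED_USER_AGENT), checking the result against false before parsing it. If you would rather not touch every fopen() call, ini_set('user_agent', ...) changes the string for all the stream functions in one go.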

Meanwhile, isn't it about time Google produced RSS themselves as an alternative output format for Search and News Search? Way back in June a senior Google person told me it was coming soon.

--
Julian Bond Email&MSM: julian.bond@voidstar.com
Webmaster:              http://www.ecademy.com/
Personal WebLog:       http://www.voidstar.com/
M: +44 (0)77 5907 2173   T: +44 (0)192 0412 433