Scraping sites with code
Somebody has just told me that Google News looks for "PHP" in the
User-Agent header and returns 403 Forbidden if it finds it. It seems
likely that they also look for the default agent names sent by other
scripting libraries.
FWIW http://www.google.com/robots.txt also bans robots from all the main
directories.
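For anyone who wants to check what a robots.txt actually permits before
scraping, here is a minimal sketch in Python using the standard
library's urllib.robotparser. The rules, the example.com URLs and the
agent name "gnews2rss" are made up for illustration; they are not
copied from Google's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Google's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /news
Allow: /about
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/news", "/about"):
        url = "http://www.example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "gnews2rss", url))
```

To test a live site instead, RobotFileParser also has set_url() and
read(), which fetch the file over HTTP.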
I wonder how many other sites that routinely get scraped to produce RSS
do the same thing.
For those of you using gnews2rss.php, coding round this is easy and left
as an exercise for the PHP programmer. Using cURL, or looking at the
documentation for fopen(), will solve the problem.
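The general idea is just to send your own User-Agent header instead of
the library default. A rough sketch of that, in Python rather than PHP
so as not to spoil the exercise (in PHP the relevant knobs are
CURLOPT_USERAGENT for cURL and the user_agent setting for fopen()). The
agent string and the example.com address are placeholders I made up,
not anything gnews2rss.php actually uses.

```python
import urllib.request

# Hypothetical identifier; choose one that honestly describes your tool.
CUSTOM_USER_AGENT = "gnews2rss/1.0 (+http://www.example.com/contact)"


def build_request(url: str,
                  user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Build a request that sends user_agent instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch url with the custom User-Agent and return the raw body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

If the site still refuses, urlopen() raises urllib.error.HTTPError with
the 403 status, so a caller can catch that and back off.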
Meanwhile, isn't it about time Google produced RSS themselves as an
alternative output format from Search and News Search? Way back in June
a senior Google person told me it was coming soon.
--
Julian Bond Email&MSM: julian.bond@voidstar.com
Webmaster: http://www.ecademy.com/
Personal WebLog: http://www.voidstar.com/
M: +44 (0)77 5907 2173 T: +44 (0)192 0412 433