RE: [syndication] Re: Translate non-structured documents into Xml RSS format
I think that Ben is asking for an HTML scraper. Scrapers
generally use some obscenely complex Perl regular
expressions to extract the relevant headlines from
a page. The expressions are specific to the page.
I know that Ian over at Internet Alchemy runs one.
I'm not a big fan of scraping -- it seems to be
fragile and error-prone -- if the site changes
its format, the regular expressions could break.
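
To make the fragility concrete, here is a minimal sketch in Python
(not Perl, and not the code Ian or anyone else actually runs). The
page markup, the pattern, the example.com URL and the function names
are all made up for illustration. It pulls headlines out of one
specific page layout, wraps them in a bare-bones RSS-style channel,
and then shows the same pattern matching nothing once the markup
changes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page markup; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
<h2 class="headline"><a href="/news/1.html">First story</a></h2>
<h2 class="headline"><a href="/news/2.html">Second story</a></h2>
</body></html>
"""

# The same page after an imagined redesign: h2 became h3 with a new class.
REDESIGNED_HTML = (
    SAMPLE_HTML
    .replace('<h2 class="headline">', '<h3 class="title">')
    .replace("</h2>", "</h3>")
)

# Assumed, page-specific pattern: it only matches the exact markup above.
HEADLINE_RE = re.compile(
    r'<h2 class="headline"><a href="([^"]+)">([^<]+)</a></h2>'
)


def scrape_headlines(html):
    """Return (link, title) pairs matched by the page-specific pattern."""
    return HEADLINE_RE.findall(html)


def to_rss(items, channel_title, base_url):
    """Build a minimal RSS-style channel from (link, title) pairs."""
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Scraped headlines"
    for link, title in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = base_url + link
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    items = scrape_headlines(SAMPLE_HTML)
    print(to_rss(items, "Example site", "http://example.com"))
    # After the redesign the pattern silently finds nothing.
    print(len(scrape_headlines(REDESIGNED_HTML)))
```

Note that the scraper does not fail loudly after the redesign; it
just produces an empty channel, which is part of why I find this
approach error-prone.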
Jeff;
Jeff Barr - Home: 425-836-5624 Office: 425-936-3098
mailto:jeff@vertexdev.com
http://www.vertexdev.com/~jeff
http://jeffbarr.editthispage.com/
4610 191st Place NE. Redmond, WA
-----Original Message-----
From: Aaron Swartz [mailto:aswartz@swartzfam.com]
Sent: Monday, September 25, 2000 3:17 PM
To: syndication@egroups.com
Subject: [syndication] Re: Translate non-structured documents into Xml
RSS format
ben@ubiquick.com <ben@ubiquick.com> wrote:
> I would like to know if anybody has already worked on a bot that
> could grab unstructured documents and translate them into RSS format.
I'm not quite sure I follow. You mean a spider that would crawl the website
and output a channel with a listing of all the pages on that site? I've
never heard of such a thing; it does sound like an interesting possibility,
however.
What would you use this for, since the site map would rarely change (making
it not very useful for news)?
--
Aaron Swartz |"This information is top security.
<http://swartzfam.com/aaron/>| When you have read it, destroy yourself."
<http://www.theinfo.org/> | - Marshall McLuhan