
RE: [syndication] Re: Translate non-structured documents into Xml RSS format



I think that Ben is asking for an HTML scraper. Scrapers
generally use some obscenely complex Perl regular
expressions to extract the relevant headlines from
a page, and the expressions are specific to that page.
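
To make the idea concrete, here is a minimal sketch of such a
scraper, in Python rather than Perl. The sample markup, the
class="headline" convention, and the function names
scrape_headlines and to_rss are all assumptions made up for
illustration; a real page would need its own pattern, and the
output is only a bare RSS-style channel, not a validated feed.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page markup; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
<a class="headline" href="http://example.com/story1.html">First story</a>
<a class="headline" href="http://example.com/story2.html">Second story</a>
<a href="http://example.com/about.html">About us</a>
</body></html>
"""

# Page-specific pattern: assumes every headline is an <a class="headline">
# link with the attributes in exactly this order.
HEADLINE_RE = re.compile(
    r'<a\s+class="headline"\s+href="([^"]+)">(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(page):
    """Return (link, title) pairs matched by the page-specific pattern."""
    return [(link, title.strip()) for link, title in HEADLINE_RE.findall(page)]


def to_rss(channel_title, channel_link, items):
    """Render (link, title) pairs as a minimal RSS-style channel."""
    lines = [
        '<?xml version="1.0"?>',
        '<rss version="0.91">',
        "  <channel>",
        "    <title>%s</title>" % escape(channel_title),
        "    <link>%s</link>" % escape(channel_link),
        "    <description>Scraped headlines</description>",
    ]
    for link, title in items:
        # Decode HTML entities first, then re-escape the text for XML.
        text = escape(html.unescape(title))
        lines.append("    <item>")
        lines.append("      <title>%s</title>" % text)
        lines.append("      <link>%s</link>" % escape(link))
        lines.append("    </item>")
    lines.append("  </channel>")
    lines.append("</rss>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_rss("Example headlines", "http://example.com/",
                 scrape_headlines(SAMPLE_HTML)))
```

Only the two links marked as headlines end up as items; the
"About us" link is skipped because it does not fit the pattern.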

I know that Ian over at Internet Alchemy runs one.

I'm not a big fan of scraping -- it seems to be
fragile and error-prone. If the site changes
its format, the regular expressions could break.
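
A small sketch of that failure mode, in Python. Both page
snippets and the names PATTERN, extract, and checked_extract
are hypothetical; the point is only that a harmless-looking
redesign makes the pattern match nothing, so a scraper should
at least notice an empty result instead of silently publishing
an empty channel.

```python
import re

# Pattern written against the old layout of a hypothetical page.
PATTERN = re.compile(r'<a class="headline" href="([^"]+)">([^<]+)</a>')

OLD_PAGE = '<a class="headline" href="/story1.html">First story</a>'
# Same content after an assumed redesign: attribute order and class changed.
NEW_PAGE = '<a href="/story1.html" class="title">First story</a>'


def extract(page):
    """Return (link, title) pairs, or an empty list if nothing matches."""
    return PATTERN.findall(page)


def checked_extract(page):
    """Like extract(), but treat zero matches as a probable layout change."""
    items = extract(page)
    if not items:
        raise ValueError("no headlines matched; has the page format changed?")
    return items


if __name__ == "__main__":
    print(extract(OLD_PAGE))  # the old layout still matches
    print(extract(NEW_PAGE))  # the redesigned layout matches nothing
```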

Jeff

Jeff Barr - Home: 425-836-5624 Office: 425-936-3098
mailto:jeff@vertexdev.com
http://www.vertexdev.com/~jeff
http://jeffbarr.editthispage.com/
4610 191st Place NE. Redmond, WA


-----Original Message-----
From: Aaron Swartz [mailto:aswartz@swartzfam.com]
Sent: Monday, September 25, 2000 3:17 PM
To: syndication@egroups.com
Subject: [syndication] Re: Translate non-structured documents into Xml
RSS format


ben@ubiquick.com <ben@ubiquick.com> wrote:

> I would like to know if anybody has already worked on a bot that
> could grab unstructured documents and translate them into RSS format.

I'm not quite sure I follow. You mean a spider that would crawl the website
and output a channel with a listing of all the pages on that site? I've
never heard of such a thing, but it does sound like an interesting
possibility.

What would you use this for, since the site map would rarely change (making
it not very useful for news)?

--
        Aaron Swartz         |"This information is top security.
<http://swartzfam.com/aaron/>|     When you have read it, destroy yourself."
  <http://www.theinfo.org/>  |             - Marshall McLuhan