
RE: [syndication] Re: Translate non-structured documents into Xml RSS format

To: <syndication@egroups.com>
Subject: RE: [syndication] Re: Translate non-structured documents into Xml RSS format
From: "Jeff Barr" <jeff@vertexdev.com>
Date: Mon, 25 Sep 2000 22:45:19 -0700
Importance: Normal
In-reply-to: <B5F538F3.19A5D%aswartz@swartzfam.com>

I think that Ben is asking for an HTML scraper. Scrapers
generally use some obscenely complex Perl regular
expressions to extract the relevant headlines from
a page. Each set of expressions is specific to one page.
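
As a rough illustration of what such a scraper does, here is a
minimal sketch (in Python rather than Perl). Everything specific in
it is an assumption made up for the example: the sample markup, the
class="headline" attribute, the function names, and the example.com
URL. A real scraper would fetch the page over HTTP and would need a
pattern written for that particular page.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Hypothetical page markup; a real scraper would download this.
SAMPLE_HTML = """<html><body>
<a class="headline" href="/news/1.html">First story</a>
<a class="headline" href="/news/2.html">Second story</a>
</body></html>"""

# Page-specific pattern: it only matches this exact attribute layout.
HEADLINE_RE = re.compile(
    r'<a\s+class="headline"\s+href="([^"]+)"\s*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(html):
    """Return a list of (href, title) pairs found in the page."""
    return [
        (unescape(href), unescape(text).strip())
        for href, text in HEADLINE_RE.findall(html)
    ]


def to_rss(headlines, site_title, site_url):
    """Render the headlines as a small RSS 0.91 channel."""
    items = []
    for href, title in headlines:
        items.append(
            "    <item>\n"
            "      <title>%s</title>\n"
            "      <link>%s</link>\n"
            "    </item>" % (escape(title), escape(site_url + href))
        )
    return (
        '<?xml version="1.0"?>\n'
        '<rss version="0.91">\n'
        "  <channel>\n"
        "    <title>%s</title>\n"
        "    <link>%s</link>\n"
        "    <description>%s</description>\n"
        "%s\n"
        "  </channel>\n"
        "</rss>\n"
    ) % (
        escape(site_title),
        escape(site_url),
        escape("Scraped headlines"),
        "\n".join(items),
    )


if __name__ == "__main__":
    found = scrape_headlines(SAMPLE_HTML)
    print(to_rss(found, "Example Site", "http://example.com"))
```

The pattern is the only part tied to the page, so supporting a second
site would mean writing a second expression for it.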

I know that Ian over at Internet Alchemy runs one.

I'm not a big fan of scraping -- it is fragile
and error-prone. If the site changes its format,
the regular expressions can break.
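
To make the fragility concrete, here is a small hedged sketch in
Python. The two page fragments, the pattern, and the function names
are invented for the example. The second fragment carries the same
headline after a hypothetical redesign, and the unchanged pattern
silently finds nothing, so the sketch also raises an error on an
empty result instead of producing an empty feed.

```python
import re

# Pattern written against the original (hypothetical) page layout.
PATTERN = re.compile(r'<a class="headline" href="([^"]+)">([^<]+)</a>')

OLD_PAGE = '<a class="headline" href="/news/1.html">First story</a>'
# Same headline after a made-up redesign: attributes reordered, class renamed.
NEW_PAGE = '<a href="/news/1.html" class="story-title">First story</a>'


def extract(html):
    """Return (href, title) pairs, or an empty list if nothing matches."""
    return PATTERN.findall(html)


def extract_or_fail(html):
    """Like extract(), but treat zero matches as a sign the page changed."""
    matches = extract(html)
    if not matches:
        raise ValueError("no headlines matched; the page format may have changed")
    return matches


if __name__ == "__main__":
    print(extract(OLD_PAGE))  # one match on the old layout
    print(extract(NEW_PAGE))  # no matches on the new layout
```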

Jeff;

Jeff Barr - Home: 425-836-5624 Office: 425-936-3098
mailto:jeff@vertexdev.com
http://www.vertexdev.com/~jeff
http://jeffbarr.editthispage.com/
4610 191st Place NE. Redmond, WA

-----Original Message-----
From: Aaron Swartz [mailto:aswartz@swartzfam.com]
Sent: Monday, September 25, 2000 3:17 PM
To: syndication@egroups.com
Subject: [syndication] Re: Translate non-structured documents into Xml RSS format

ben@ubiquick.com <ben@ubiquick.com> wrote:

> I would like to know if anybody has already worked on a bot that
> could grab unstructured documents and translate them into RSS format.

I'm not quite sure I follow. You mean a spider that would crawl the website
and output a channel with a listing of all the pages on that site? I've
never heard of such a thing. It does sound like an interesting possibility,
however.

What would you use this for, since the site map would rarely change (making
it not very useful for news)?

--
        Aaron Swartz         |"This information is top security.
<http://swartzfam.com/aaron/>|     When you have read it, destroy yourself."
  <http://www.theinfo.org/>  |             - Marshall McLuhan

