Re: [syndication] Re: Translate non-structured documents into Xml RSS format
Many scrapers are more sophisticated than that, and actually try to
parse the HTML structure, looking for patterns in the hierarchical
page structure.
Regardless, it is fragile and error-prone -- simple changes in page
design will break it completely, and it does have to be tailored to
each page design pattern.
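To illustrate the fragility being discussed, here is a minimal sketch of a regex-based headline scraper in Python (the thread mentions Perl, but the idea is the same). The markup and the `class="headline"` attribute are hypothetical; a real scraper's pattern must be tailored to each site's actual page design.

```python
import re

# Hypothetical pattern for one particular site's markup; any scraper
# like this must be written per-site, per-design.
HEADLINE_RE = re.compile(
    r'<a\s+href="(?P<url>[^"]+)"\s+class="headline">(?P<title>[^<]+)</a>'
)

def scrape_headlines(html):
    """Return (url, title) pairs for every matching headline link."""
    return [(m.group("url"), m.group("title"))
            for m in HEADLINE_RE.finditer(html)]

page = '''
<div><a href="/story/1" class="headline">First story</a></div>
<div><a href="/story/2" class="headline">Second story</a></div>
'''
print(scrape_headlines(page))
# A redesign that renames class="headline", reorders the attributes,
# or changes the quoting style silently breaks the pattern.
```

This is the breakage mode described above: the expression is bound to the exact markup, so a cosmetic redesign returns an empty result rather than an error.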
Ian
On Tue, 26 Sep 2000 ben@ubiquick.com wrote:
> Thanks a lot for your answer... Ian, you are talking about
> WebMethods -- do you have any idea of the product name that Moreover
> uses to scrape headlines?
>
>
> --- In syndication@egroups.com, Ian Davis <ian@c...> wrote:
> >
> > On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:
> >
> > > I think that Ben is asking for an HTML scraper. They
> > > generally use some obscenely complex Perl regular
> > > expressions to extract the relevant headlines from
> > > a page. The expressions are specific to the page.
> >
> > > I know that Ian over at Internet Alchemy runs one.
> > I do run one still, although I don't maintain it as much as I
> should.
> >
> > > I'm not a big fan of scraping -- it seems to be
> > > fragile and error-prone -- if the site changes
> > > its format the regular expressions could break.
> > Scrapers can be fragile, but the breakage rate is not as high as
> you might
> > think. Many sites use them. I believe Moreover uses WebMethods to
> > create their large set of feeds.
> >
> > Ian