Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: Translate non-structured documents into Xml RSS format
From: ben@ubiquick.com
Date: Tue, 26 Sep 2000 11:08:24 -0000
In-reply-to: <9872341311.20000926115553@calaba.com>
User-agent: eGroups-EW/0.82

Thanks a lot for your answer. Ian, you mention WebMethods: do you
know the name of the product that Moreover uses to scrape headlines?


--- In syndication@egroups.com, Ian Davis <ian@c...> wrote:
> 
> On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:
> 
> > I think that Ben is asking for an HTML scraper. They
> > generally use some obscenely complex Perl regular
> > expressions to extract the relevant headlines from
> > a page. The expressions are specific to the page.
> 
> > I know that Ian over at Internet Alchemy runs one.
> I do run one still, although I don't maintain it as much as I
> should.
> 
> > I'm not a big fan of scraping -- it seems to be
> > fragile and error-prone -- if the site changes
> > its format the regular expressions could break.
> Scrapers can be fragile, but the breakage is not as high as you might
> think. Many sites use them. I believe Moreover uses WebMethods to
> create their large set of feeds.
> 
> Ian
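
To make the scraping approach Jeff describes concrete, here is a minimal sketch in Python (the thread mentions Perl; Python is used here purely for illustration). It is not how Moreover, WebMethods, or Ian's scraper actually work. The sample HTML, the pattern, the function names, and the example.com base URL are all assumptions invented for the example.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Hypothetical page fragment; every real site would need its own pattern.
SAMPLE_HTML = """
<ul class="news">
  <li><a href="/story/1">First headline</a></li>
  <li><a href="/story/2">Second &amp; third</a></li>
</ul>
"""

# Page-specific expression: it matches only this exact <li><a href="..."> layout.
HEADLINE_RE = re.compile(r'<li><a href="([^"]+)">([^<]+)</a></li>')


def scrape_headlines(html):
    """Return (link, title) pairs found by the page-specific pattern."""
    return HEADLINE_RE.findall(html)


def to_rss_items(headlines, base_url):
    """Render scraped headlines as bare RSS <item> elements."""
    items = []
    for link, title in headlines:
        # Decode HTML entities first, then escape again for XML output.
        items.append(
            "<item><title>%s</title><link>%s</link></item>"
            % (escape(unescape(title)), escape(base_url + link))
        )
    return "\n".join(items)


if __name__ == "__main__":
    print(to_rss_items(scrape_headlines(SAMPLE_HTML), "http://example.com"))
    # A small layout change is enough to defeat the pattern:
    changed = SAMPLE_HTML.replace("<li>", '<li class="x">')
    print(scrape_headlines(changed))
```

The last two lines show the fragility under discussion: once the site adds an attribute to its list items, the expression silently matches nothing, so a scraper like this needs to be checked whenever a page is redesigned.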
