Re: Translate non-structured documents into Xml RSS format



Thanks a lot for your answer. Ian, you mention WebMethods: do you
know the name of the product that Moreover uses to scrape
headlines?


--- In syndication@egroups.com, Ian Davis <ian@c...> wrote:
> 
> On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:
> 
> > I think that Ben is asking for an HTML scraper. They
> > generally use some obscenely complex Perl regular
> > expressions to extract the relevant headlines from
> > a page. The expressions are specific to the page.
> 
> > I know that Ian over at Internet Alchemy runs one.
> I do run one still, although I don't maintain it as much as I
> should.
> 
> > I'm not a big fan of scraping -- it seems to be
> > fragile and error-prone -- if the site changes
> > its format the regular expressions could break.
> Scrapers can be fragile, but the breakage is not as high as you
> might think. Many sites use them. I believe Moreover uses WebMethods
> to create their large set of feeds.
> 
> Ian
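
For anyone following the thread, here is a rough sketch of the kind of
page-specific regular-expression scraper Jeff describes, written in
Python rather than Perl. Everything in it is an assumption made for
illustration: the sample markup, the "headline" class name, the
function names, and the minimal RSS 2.0 skeleton. It is not how
Moreover or Internet Alchemy actually do it.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page fragment; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<div class="news">
  <a class="headline" href="/story/1">First story &amp; more</a>
  <a class="headline" href="/story/2">Second story</a>
</div>
"""

# Page-specific pattern (assumed markup). It stops matching as soon as
# the site renames the class or reorders the attributes, which is the
# fragility discussed in the thread.
HEADLINE_RE = re.compile(
    r'<a class="headline" href="([^"]+)">(.*?)</a>', re.S
)


def scrape_headlines(page):
    """Return a list of (link, title) pairs found in the page text."""
    return [
        (link, html.unescape(title).strip())
        for link, title in HEADLINE_RE.findall(page)
    ]


def to_rss(items, channel_title, base_url):
    """Wrap (link, title) pairs in a minimal RSS skeleton."""
    lines = [
        '<?xml version="1.0"?>',
        '<rss version="2.0">',
        "<channel>",
        "<title>%s</title>" % escape(channel_title),
        "<link>%s</link>" % escape(base_url),
        "<description>%s</description>" % escape(channel_title),
    ]
    for link, title in items:
        lines.append(
            "<item><title>%s</title><link>%s</link></item>"
            % (escape(title), escape(base_url + link))
        )
    lines += ["</channel>", "</rss>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_rss(scrape_headlines(SAMPLE_HTML), "Example", "http://example.com"))
```

If the page layout changes, scrape_headlines simply returns an empty
list, so a scraper like this would probably want to treat "no
headlines found" as a warning sign rather than as a quiet news day.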