Re: [syndication] Re: Translate non-structured documents into Xml RSS format
Many scrapers are more sophisticated than that, and actually try to
parse the HTML structure, looking for patterns in the hierarchical
page structure.
Regardless, it is fragile and error-prone -- simple changes in page
design will break it completely, and it does have to be tailored to
each page design pattern.
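To illustrate the fragility being discussed, here is a minimal sketch of a regex-based headline scraper in Python (the thread mentions Perl, but the idea is the same). The markup and the `class="headline"` attribute are hypothetical; a real scraper's pattern must be tailored to each site's actual page design.

```python
import re

# Hypothetical pattern for one particular site's markup; any scraper
# like this must be written per-site, per-design.
HEADLINE_RE = re.compile(
    r'<a\s+href="(?P<url>[^"]+)"\s+class="headline">(?P<title>[^<]+)</a>'
)

def scrape_headlines(html):
    """Return (url, title) pairs for every matching headline link."""
    return [(m.group("url"), m.group("title"))
            for m in HEADLINE_RE.finditer(html)]

page = '''
<div><a href="/story/1" class="headline">First story</a></div>
<div><a href="/story/2" class="headline">Second story</a></div>
'''
print(scrape_headlines(page))
# A redesign that renames class="headline", reorders the attributes,
# or changes the quoting style silently breaks the pattern.
```

This is the breakage mode described above: the expression is bound to the exact markup, so a cosmetic redesign returns an empty result rather than an error.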
Ian
On Tue, 26 Sep 2000 ben@ubiquick.com wrote:
> Thanks a lot for your answer... Ian, you are talking about
> WebMethods -- do you have any idea of the product name that Moreover
> uses to scrape headlines?
>
>
> --- In syndication@egroups.com, Ian Davis <ian@c...> wrote:
> >
> > On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:
> >
> > > I think that Ben is asking for an HTML scraper. They
> > > generally use some obscenely complex Perl regular
> > > expressions to extract the relevant headlines from
> > > a page. The expressions are specific to the page.
> >
> > > I know that Ian over at Internet Alchemy runs one.
> > I do run one still, although I don't maintain it as much as I
> should.
> >
> > > I'm not a big fan of scraping -- it seems to be
> > > fragile and error-prone -- if the site changes
> > > its format the regular expressions could break.
> > Scrapers can be fragile, but the breakage rate is not as high as
> you might
> > think. Many sites use them. I believe Moreover uses WebMethods to
> > create their large set of feeds.
> >
> > Ian