Re: [syndication] Re: Translate non-structured documents into Xml RSS format
- To: syndication@egroups.com
- Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
- From: Ian Davis <ian@calaba.com>
- Date: Tue, 26 Sep 2000 11:55:53 -0700
- In-reply-to: <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
- Organization: Calaba Ltd.
- References: <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
- Reply-to: Ian Davis <iand@internetalchemy.org>
On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:
> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
> I know that Ian over at Internet Alchemy runs one.
I still run one, although I don't maintain it as much as I should.
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
Scrapers can be fragile, but they break less often than you might
think. Many sites use them. I believe Moreover uses WebMethods to
create its large set of feeds.
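For anyone who hasn't seen one, here is a rough sketch of the kind of
regex scraper being discussed. It is not the code I run, and it is in
Python rather than Perl to keep it short. The sample markup, the
class="headline" convention, the example.com base URL and the function
names are all made up for illustration; a real scraper would fetch the
page over HTTP and need a pattern written for that particular site.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page fragment standing in for a fetched page.
SAMPLE_HTML = """
<div class="news">
  <a class="headline" href="/story/1.html">First story &amp; more</a>
  <a class="headline" href="/story/2.html">Second story</a>
</div>
"""

# Page-specific pattern (an assumption about this made-up markup):
# headlines are <a class="headline" href="...">title</a> links.
HEADLINE_RE = re.compile(
    r'<a class="headline" href="([^"]+)">(.*?)</a>', re.S
)


def scrape_headlines(page, base_url):
    """Return (title, link) pairs matched by the page-specific pattern."""
    return [
        (html.unescape(title.strip()), base_url + href)
        for href, title in HEADLINE_RE.findall(page)
    ]


def to_rss_items(headlines):
    """Render the pairs as bare RSS <item> elements, one per line."""
    return "\n".join(
        f"<item><title>{escape(title)}</title>"
        f"<link>{escape(link)}</link></item>"
        for title, link in headlines
    )


if __name__ == "__main__":
    print(to_rss_items(scrape_headlines(SAMPLE_HTML, "http://example.com")))
```

The fragility Jeff mentions is easy to see here: if the site renames
the class or reorders the attributes, the pattern simply matches
nothing, so a scraper should at least treat an empty result as a
warning sign rather than publish an empty feed.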
Ian