
Re: [syndication] Re: Translate non-structured documents into Xml RSS format



iPal uses a homegrown scraper which handles about 8,000 sites/day. Breakage is on the order of 2-3 sites per month, so it's not the huge number you might expect.

And yes, it's a set of monstrous Perl regexps. We've also built a Java tool that lets non-techs figure out the regexp for a site by selecting portions of a rendered HTML page. This is an internal tool at the moment, however...

-s
--
Steve Dossick
Founder and Chief Architect
iPal, Inc.
310-578-8331 (voice)
310-578-8336 (fax)


On Tuesday, September 26, 2000, at 11:55 AM, Ian Davis wrote:


On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:

> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.

> I know that Ian over at Internet Alchemy runs one.
I do still run one, although I don't maintain it as much as I should.

> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
Scrapers can be fragile, but the breakage is not as high as you might
think. Many sites use them. I believe Moreover uses WebMethods to
create their large set of feeds.
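The site-specific regexp scraping discussed in this thread can be sketched as follows. This is a minimal illustration in Python rather than the Perl the thread mentions, and the page markup and item format are invented for the example; it is not any of the scrapers described above:

```python
# A deliberately site-specific scraper: one regexp per site, which is
# exactly why a layout change on the site breaks it.
import re

# Hypothetical markup this regexp targets:
#   <a class="headline" href="/story/123">Title</a>
HEADLINE_RE = re.compile(
    r'<a class="headline" href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>'
)

def scrape(html, base="http://example.com"):
    """Extract (title, link) pairs and emit minimal RSS-style <item> elements."""
    items = []
    for m in HEADLINE_RE.finditer(html):
        items.append(
            "<item><title>%s</title><link>%s%s</link></item>"
            % (m.group("title"), base, m.group("link"))
        )
    return items

page = '<a class="headline" href="/story/1">Scrapers considered fragile</a>'
print(scrape(page))
```

If the site renames the `headline` class or restructures the anchor, the regexp silently stops matching, which is the fragility (and the 2-3 breakages a month) being discussed.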

Ian