Re: [syndication] Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
From: jm-onelist@jmason.org
Date: Wed, 27 Sep 2000 12:30:02 +0100
In-reply-to: Message from "Jeff Barr" <jeff@vertexdev.com> of "Mon, 25 Sep 2000 22:45:19 PDT." <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
Sender: jm@mail.netnoteinc.com

Jeff Barr said:

> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
> I know that Ian over at Internet Alchemy runs one.
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.

Doesn't Moreover do something similar?

BTW Sitescooper ( http://sitescooper.org/ ) does this, and as Ian D said,
scrapers are not as prone to breakage as you might think -- especially when
you pick the patterns well, and use other techniques to optimise them:

    * notice small <td>'s (width < 150 or thereabouts) and strip them

    * define patterns for story URLs, and if an item does not link
    to a URL matching one of them, strip it from the output

That kind of thing.
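
A rough sketch of those two filters, for illustration only. It is in Python
rather than Sitescooper's Perl and is not Sitescooper's actual code; the
names (StoryLinkParser, extract_stories, NARROW_TD_WIDTH, SAMPLE) and the
sample URL pattern are made up here. It assumes reasonably well-formed HTML
where every <td> is closed, and treats only plain numeric width attributes
as pixel widths.

```python
import re
from html.parser import HTMLParser

# Hypothetical cut-off: "width < 150 or thereabouts".
NARROW_TD_WIDTH = 150


class StoryLinkParser(HTMLParser):
    """Collect (url, text) links, skipping anything inside a narrow <td>."""

    def __init__(self, url_patterns):
        super().__init__()
        self.url_patterns = [re.compile(p) for p in url_patterns]
        self.td_stack = []   # one bool per open <td>: True if it is narrow
        self.href = None     # href of the <a> currently being read, if any
        self.text = []       # text chunks seen inside that <a>
        self.links = []      # accepted (url, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            width = attrs.get("width") or ""
            # Percentage or missing widths are never treated as narrow.
            narrow = width.isdigit() and int(width) < NARROW_TD_WIDTH
            self.td_stack.append(narrow)
        elif tag == "a" and attrs.get("href") and not any(self.td_stack):
            self.href = attrs["href"]
            self.text = []

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()
        elif tag == "a" and self.href is not None:
            # Keep the item only if its link matches an allowed pattern.
            if any(p.search(self.href) for p in self.url_patterns):
                self.links.append((self.href, "".join(self.text).strip()))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def extract_stories(html, url_patterns):
    parser = StoryLinkParser(url_patterns)
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE = """
<table><tr>
<td width="120"><a href="/about.html">About us</a></td>
<td width="480">
  <a href="/stories/2000/09/27/rss.html">RSS gets popular</a>
  <a href="/ads/click?id=7">Buy now</a>
</td>
</tr></table>
"""

stories = extract_stories(SAMPLE, [r"^/stories/\d{4}/"])
```

With the sample page, the sidebar link is dropped because its cell is
narrow, and the ad link is dropped because it matches no story pattern, so
only the story link survives. The per-site part is then just the list of
URL patterns, which tends to change less often than the page layout.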

(It would be cool if sitescooper could output RSS, I agree.  I've had it
on my TODO list for a while, but my current time famine has been a
problem.)

--j.
