Re: [syndication] Re: Translate non-structured documents into Xml RSS format
- To: syndication@egroups.com
- Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
- From: jm-onelist@jmason.org
- Date: Wed, 27 Sep 2000 12:30:02 +0100
- In-reply-to: Message from "Jeff Barr" <jeff@vertexdev.com> of "Mon, 25 Sep 2000 22:45:19 PDT." <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
- Sender: jm@mail.netnoteinc.com
Jeff Barr said:
> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
> I know that Ian over at Internet Alchemy runs one.
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
Doesn't Moreover do something similar?
BTW, Sitescooper ( http://sitescooper.org/ ) does this, and as Ian D said,
scrapers are not as prone to breakage as you might think, especially when
you pick the patterns well and use other techniques to optimise them:
* notice narrow <td> cells (width < 150 or thereabouts) and strip them
* define patterns for story URLs, and if an item does not link to a URL
matching one of them, strip it from the output
That kind of thing.
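
To make the two heuristics above concrete, here is a rough sketch in Python using only the standard library. It is not how Sitescooper itself is implemented; the class name, the function name, and the 150-pixel cutoff constant are made up for illustration, and it assumes the page closes its <td> tags properly (plenty of real pages do not).

```python
import re
from html.parser import HTMLParser

# Hypothetical cutoff, taken from "width < 150 or thereabouts".
NARROW_TD_WIDTH = 150


class StoryLinkExtractor(HTMLParser):
    """Collect (href, title) pairs for links that look like stories."""

    def __init__(self, url_patterns):
        super().__init__()
        self.url_patterns = [re.compile(p) for p in url_patterns]
        self.td_stack = []        # one flag per open <td>: True if narrow
        self.current_href = None  # set while inside an accepted <a>
        self.current_text = []
        self.stories = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Only plain pixel widths count; "20%" or a missing width
            # is treated as "not narrow".
            width = (attrs.get("width") or "").strip()
            is_narrow = bool(re.fullmatch(r"[0-9]+", width)) and \
                int(width) < NARROW_TD_WIDTH
            self.td_stack.append(is_narrow)
        elif tag == "a" and not any(self.td_stack):
            # Heuristic 2: keep the link only if its URL matches a pattern.
            href = attrs.get("href") or ""
            if any(p.search(href) for p in self.url_patterns):
                self.current_href = href
                self.current_text = []

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()
        elif tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self.current_text).split())
            self.stories.append((self.current_href, title))
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def extract_stories(html, url_patterns):
    """Return story links found outside narrow table cells."""
    parser = StoryLinkExtractor(url_patterns)
    parser.feed(html)
    parser.close()
    return parser.stories
```

Calling extract_stories(page_html, [r"^/stories/"]) would then drop anything sitting in a narrow sidebar cell, plus any link (ads, navigation) whose URL does not match the story pattern. The patterns are still specific to each site, but they tend to survive cosmetic redesigns better than a regular expression over the whole page layout.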
(It would be cool if sitescooper could output RSS, I agree. I've had it
on my TODO list for a while, but my current time famine has been a
problem.)
--j.