
Re: [syndication] Re: Translate non-structured documents into Xml RSS format



Jeff Barr said:

> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
> I know that Ian over at Internet Alchemy runs one.
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.

Doesn't Moreover do something similar?

BTW Sitescooper does this ( http://sitescooper.org/ ), and as Ian D said,
scrapers are not as prone to breakage as you might think -- especially when
you pick the patterns well, and use other techniques to optimise them:

    * notice small <td>'s (width < 150 or thereabouts) and strip them

    * allow patterns for story URLs, and if an item does not link to
    a URL matching one of those, strip it from the output

That kind of thing.
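
To make the two tricks above concrete, here is a rough sketch in Python. It
is not how sitescooper itself is written; the class name HeadlineScraper,
the function extract_headlines, the max_narrow_width parameter and the
example URL pattern are all made up for illustration. It assumes narrow
cells carry a plain numeric width attribute, and it leaves percentage
widths alone.

```python
import re
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Illustrative only: collect (text, href) pairs for links that sit
    outside narrow table cells and match one of the allowed URL patterns."""

    def __init__(self, url_patterns, max_narrow_width=150):
        super().__init__()
        self.url_patterns = [re.compile(p) for p in url_patterns]
        self.max_narrow_width = max_narrow_width
        self.td_stack = []   # one flag per open <td>: True if it is narrow
        self.href = None     # href of the link being read, if any
        self.text = []       # text fragments of the link being read
        self.items = []      # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            width = attrs.get("width") or ""
            # Only plain pixel widths count; "20%" and the like are ignored.
            narrow = width.isdecimal() and int(width) < self.max_narrow_width
            self.td_stack.append(narrow)
        elif tag == "a":
            # Links inside any narrow cell (even nested) are skipped.
            self.href = None if any(self.td_stack) else attrs.get("href")
            self.text = []

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()
        elif tag == "a" and self.href is not None:
            # Keep the item only if it links to an allowed story URL.
            if any(p.search(self.href) for p in self.url_patterns):
                title = " ".join("".join(self.text).split())
                self.items.append((title, self.href))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def extract_headlines(html, url_patterns, max_narrow_width=150):
    """Return a list of (headline, url) pairs found in the page."""
    scraper = HeadlineScraper(url_patterns, max_narrow_width)
    scraper.feed(html)
    scraper.close()
    return scraper.items
```

Calling extract_headlines(page, [r"^/story/\d+$"]) on a page with a
120-pixel sidebar cell and a wide main cell would drop every link in the
sidebar, and also drop links in the main cell (an "About us" link, say)
that do not match the story pattern. The per-site part is then mostly the
list of URL patterns, which tends to survive a redesign better than a
regular expression tied to the page layout.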

(It would be cool if sitescooper could output RSS, I agree.  I've had it
on my TODO list for a while, but my current time famine has been a
problem.)

--j.