Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"



burtonator said:

> > However the reason I wrote a scraper is because I wanted to read sites on
> > my Palm handheld. This seems to apply for many of the tools I've found
> > (Plucker, AvantGo, InfoRover, WebFetch, NewsClipper etc.)
> 
> (URL?)

http://jmason.org/scraping/ has links to the lot of them I think.

> > Yep. This is one reason why I wanted to institute a scraping-related
> > mailing list, so people writing HTML scrapers could swap and coordinate
> > site details, to handle changes like this. The bigger the community, the
> > faster the site layout descriptions could be fixed...
> 
> Cleansing and scraping are really different issues.  When you are
> scraping you are trying to use HTML as a protocol layer.  IMO this is
> fragile and really dangerous.  But cleansing just destroys content
> (doesn't make anything new) and gives a subset of the content so should
> theoretically be much safer than scraping.

OK -- clarification here -- I take from this you mean:

  * cleansing == removing HTML content such as <img> tags, colors, etc.,
  reformatting tables, possibly stripping narrow "sidebar" tables, that
  kind of thing, without knowledge of the site in question.

  * scraping == using predefined patterns to select sections of the
  HTML to strip.
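To make the distinction concrete, here's a minimal sketch (in Python rather than sitescooper's Perl, and with invented landmark comments -- nothing here is the actual sitescooper code): cleansing strips markup with no knowledge of the site, while scraping uses a site-specific pattern to select a section first.

```python
import re
from html.parser import HTMLParser

class Cleanser(HTMLParser):
    """Site-agnostic cleansing: keep the text, drop markup like <img> tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

def cleanse(html):
    """Strip all tags; no knowledge of the site's layout is needed."""
    p = Cleanser()
    p.feed(html)
    return p.text()

def scrape(html, start_pat, end_pat):
    """Site-specific scraping: select the span between two known landmarks
    (the patterns are the fragile, per-site part), then cleanse it."""
    m = re.search(start_pat + r"(.*?)" + end_pat, html, re.S)
    return cleanse(m.group(1)) if m else None

page = ('<table><td width="90">sidebar</td></table>'
        '<!-- story --><p>The <b>story</b> text.</p><!-- /story -->')

print(cleanse(page))   # whole page, markup stripped (sidebar text survives)
print(scrape(page, r"<!-- story -->", r"<!-- /story -->"))  # just the story
```

Note the trade-off: `cleanse` never breaks but keeps the sidebar junk, while `scrape` returns only the story but dies silently when the site drops its landmark comments.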

In my experience cleansing only goes so far, while scraping (even given its
fragility) works quite well... granted, it does mean the site data sometimes
needs to be updated when the site changes its layout.

Having said that, I have not yet implemented a full HTML parser in
sitescooper, so the cleansing code isn't as good as it could be.  For
example, I've been considering using an approximation of font rendering to
estimate how wide a given <td> will be, in order to work out whether it's a
sidebar table which should be stripped out.
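That heuristic might look something like the following hedged sketch (Python, not the actual sitescooper code; the per-glyph widths and the 200-unit threshold are invented for illustration): sum approximate character widths for each line in a cell, and flag cells whose longest line would render narrow.

```python
# Crude font-metrics approximation: most glyphs get an average width,
# with adjustments for notably narrow and notably wide characters.
NARROW = set("iljft.,;:'!| ")
WIDE = set("mwMW")

def est_width(text, em=10):
    """Estimate the rendered width of a line of text, in arbitrary units."""
    w = 0.0
    for ch in text:
        if ch in NARROW:
            w += em * 0.4
        elif ch in WIDE:
            w += em * 1.3
        else:
            w += em * 0.8
    return w

def looks_like_sidebar(cell_lines, threshold=200):
    """A <td> whose longest line renders narrower than the threshold is
    probably a navigation sidebar rather than body text."""
    return max(est_width(line) for line in cell_lines) < threshold

print(looks_like_sidebar(["News", "Links", "About"]))
print(looks_like_sidebar(["A long paragraph of body text in the main column."]))
```

The point of approximating font metrics rather than counting characters is that "mmmm" and "iiii" are the same length as strings but render at very different widths, so a character count alone misjudges borderline cells.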

> > Yep, I'm aware of RSS -- sitescooper uses it to find stories ;)
> 
> I might use sitescooper in Jetspeed.  Just black box the whole thing so
> that anyone can swap in a URL filter.

Ah sure, why not -- although beware, it's written in Perl! ;)

>>[rss]
> Check out xmltree.com for about 1700 more of them :)

argh -- RSS overload! ;)

--j.