
Re: [syndication] Re: Translate non-structured documents into Xml RSS format



On Tue, 26 Sep 2000, Dan Lyke wrote:

> Ian Graham writes:
> > Many scrapers are more sophisticated than that, and actually try and
> > parse the HTML structure, looking for patterns in the hierarchical
> > page structure. 
> 
> I've written a couple of scrapers; my experience is that ones that
> parse the HTML structure are both harder to write and more fragile
> than ones that just apply regexps. Most of mine were for mining book
> data from the online stores, and once I found the title it was fairly
> easy to make things that looked for ISBNs and dollar amounts and
> authors and such, and the difficulty was finding the right title.
> 
> People do tweak HTML and appearance; amazingly, they tend to tweak the
> language of the page and the structure of the language less.

My fault: by 'sophisticated' I meant to imply more complex
parsing/structural models, not that they were better at doing the
scraping job. It's interesting to have some evidence that, indeed, the
text is more powerful than the markup ;-)
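
For the record, a minimal sketch of the regexp style of scraper Dan
describes, assuming Python. The sample markup, the field labels, and
the names strip_tags and scrape_book are all made up for illustration
and do not come from any real store. It throws the tags away and
matches on the wording of the page only:

```python
import re

# Hypothetical page fragment; the layout and values are invented.
SAMPLE_HTML = """
<tr><td><b>ISBN:</b> 0-000-00000-0</td></tr>
<tr><td>Our Price: <font color="red">$12.34</font></td></tr>
"""

TAG_RE = re.compile(r"<[^>]+>")                   # any HTML tag
ISBN_RE = re.compile(r"ISBN[:\s]*([\dXx-]{10,13})")
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")          # dollar amount


def strip_tags(html):
    """Replace every tag with a space so only the page text remains."""
    return TAG_RE.sub(" ", html)


def scrape_book(html):
    """Return the ISBN and price found in the text, or None for each."""
    text = strip_tags(html)
    isbn = ISBN_RE.search(text)
    price = PRICE_RE.search(text)
    return {
        "isbn": isbn.group(1).replace("-", "") if isbn else None,
        "price": price.group(1) if price else None,
    }


print(scrape_book(SAMPLE_HTML))
```

Because the patterns key on the words "ISBN" and "$" rather than on
table cells or font tags, a redesign that moves the markup around but
keeps the wording should leave them working, which seems to be Dan's
point.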

Ian