
Re: [syndication] Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
From: Ian Graham <ian.graham@utoronto.ca>
Date: Tue, 26 Sep 2000 16:55:53 -0400 (EDT)
In-reply-to: <14800.50488.780295.290786@wynand.flutterby.com>
Reply-to: Ian Graham <ian.graham@utoronto.ca>

On Tue, 26 Sep 2000, Dan Lyke wrote:

> Ian Graham writes:
> > Many scrapers are more sophisticated than that, and actually try and
> > parse the HTML structure, looking for patterns in the hierarchical
> > page structure. 
> 
> I've written a couple of scrapers, my experience is that ones that
> parse the HTML structure are both harder to write and more fragile
> than ones that just apply regexps. Most of mine were for mining book
> data from the online stores, and once I found the title it was fairly
> easy to make things that looked for ISBNs and dollar amounts and
> authors and such, and the difficulty was finding the right title.
> 
> People do tweak HTML and appearance, amazingly they tend to tweak the
> language of the page and the structure of the language less.

My fault: by 'sophisticated' I meant more complex parsing and structural
models, not that they were better at doing the scraping job. It's
interesting to have some evidence that, indeed, the text is more powerful
than the markup ;-)
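
To make the regexp approach Dan describes a little more concrete, here is
a minimal sketch in Python. Everything in it is an assumption made for
illustration: the patterns, the function name scrape_book, and the sample
markup are hypothetical and are not taken from Dan's scrapers or from any
real store page. The idea is only that the tags are thrown away first and
the patterns are applied to the remaining text.

```python
import re

# Hypothetical patterns; real store pages would need their own tuning.
TAG_RE = re.compile(r"<[^>]+>")
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9-]{8,15}[0-9Xx])")
PRICE_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*\.\d{2})")


def scrape_book(html):
    """Pull an ISBN and a dollar amount out of a page, ignoring its markup."""
    # Discard the tags so that layout changes do not affect the matching.
    text = TAG_RE.sub(" ", html)
    text = re.sub(r"\s+", " ", text)

    isbn = ISBN_RE.search(text)
    price = PRICE_RE.search(text)
    return {
        # Normalise the ISBN by dropping hyphens.
        "isbn": isbn.group(1).replace("-", "") if isbn else None,
        "price": price.group(1) if price else None,
    }


# Two invented layouts carrying the same wording.
table_layout = (
    '<td><b>Some Title</b><br>ISBN: 0-201-48567-2</td>'
    '<td><font color="red">$39.95</font></td>'
)
div_layout = (
    '<div class="price">Our price: <span>$39.95</span></div>'
    '<p>ISBN 0201485672</p>'
)

print(scrape_book(table_layout))
print(scrape_book(div_layout))
```

Both invented layouts yield the same ISBN and price, because the patterns
key on the wording ("ISBN", a dollar sign) rather than on the tag
hierarchy. As Dan notes, finding the right title is the hard part, and
this sketch does not attempt it.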

Ian
