Re: [syndication] Re: Translate non-structured documents into Xml RSS format
On Tue, 26 Sep 2000, Dan Lyke wrote:
> Ian Graham writes:
> > Many scrapers are more sophisticated than that, and actually try to
> > parse the HTML structure, looking for patterns in the hierarchical
> > page structure.
>
> I've written a couple of scrapers, and my experience is that ones
> that parse the HTML structure are both harder to write and more
> fragile than ones that just apply regexps. Most of mine were for
> mining book data from the online stores. Once I found the title, it
> was fairly easy to write things that looked for ISBNs, dollar amounts,
> authors and such; the difficulty was finding the right title.
>
> People do tweak HTML and appearance; amazingly, they tend to tweak the
> language of the page and the structure of that language less.
My fault: by 'sophisticated' I meant to imply more complex
parsing/structural models, not that they were better at doing the
scraping job. It's interesting to have some evidence that, indeed, the
text is more powerful than the markup ;-)
Ian
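
For illustration, the regexp approach Dan describes might look roughly
like the sketch below. None of this comes from his scrapers: the
patterns, the scrape_book name, and the sample pages are all assumptions
made up for the example. It strips the tags, then looks for an ISBN and
a dollar amount in the remaining text.

```python
import re

# Hypothetical patterns, not taken from the thread: a 10-digit ISBN
# (optionally hyphenated) after the word "ISBN", and a US dollar amount.
ISBN_RE = re.compile(r"ISBN[:\s]*([0-9][0-9\-]{8,11}[0-9Xx])")
PRICE_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")
TAG_RE = re.compile(r"<[^>]+>")


def scrape_book(html):
    """Return the first ISBN and price found in the page text, or None for each."""
    # Replace every tag with a space so only the visible text is searched.
    text = TAG_RE.sub(" ", html)
    isbn = ISBN_RE.search(text)
    price = PRICE_RE.search(text)
    return {
        "isbn": isbn.group(1).replace("-", "") if isbn else None,
        "price": price.group(1) if price else None,
    }


# Two made-up pages with the same wording but different markup.
page_a = "<table><tr><td><b>ISBN:</b> 0-123-45678-9</td><td>Price: $29.95</td></tr></table>"
page_b = '<div class="book"><p>ISBN 0123456789</p><span>$29.95</span></div>'

print(scrape_book(page_a))
print(scrape_book(page_b))
```

Both sample pages give the same result even though one uses a table and
the other uses div/span markup, which is the point made above: the
patterns depend on the wording of the page, not on its tag structure.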