
Re: [syndication] Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
From: Dan Lyke <danlyke@flutterby.com>
Date: Tue, 26 Sep 2000 08:48:08 -0700 (PDT)
In-reply-to: <Pine.SOL.4.21.0009260934330.18255-100000@ic-unix.ic.utoronto.ca>
References: <8qq038+sp77@eGroups.com> <Pine.SOL.4.21.0009260934330.18255-100000@ic-unix.ic.utoronto.ca>

Ian Graham writes:
> Many scrapers are more sophisticated than that, and actually try and
> parse the HTML structure, looking for patterns in the hierarchical
> page structure. 

I've written a couple of scrapers, and my experience is that ones that
parse the HTML structure are both harder to write and more fragile
than ones that just apply regexps. Most of mine were for mining book
data from the online stores. Once I found the title, it was fairly
easy to write patterns that looked for ISBNs, dollar amounts, authors,
and such; the hard part was finding the right title.
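
To make the regexp approach concrete, here is a minimal sketch in
Python. It is not the code from my scrapers: the patterns, the
scrape_book_fields name, and the sample markup are all assumptions
made up for illustration, and a real store page would need its own
tuning. The idea is to throw away the tags and match on the visible
text, so the layout around the data matters much less.

```python
import re

# Hypothetical patterns for illustration only; real pages need tuning.
TAG_RE = re.compile(r"<[^>]+>")
# The label "ISBN", then nine digits with optional hyphens or spaces,
# then a final check character (a digit or X).
ISBN_RE = re.compile(r"ISBN:?\s*((?:\d[- ]?){9}[\dXx])", re.IGNORECASE)
# A dollar sign followed by an amount with two decimal places.
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*\.\d{2})")


def scrape_book_fields(html):
    """Pull an ISBN and any dollar amounts out of a page's visible text."""
    # Drop the markup entirely and collapse whitespace; the patterns
    # below depend on the wording of the page, not on its tag structure.
    text = TAG_RE.sub(" ", html)
    text = re.sub(r"\s+", " ", text)

    isbn_match = ISBN_RE.search(text)
    isbn = None
    if isbn_match:
        # Normalize by removing the hyphens and spaces inside the number.
        isbn = re.sub(r"[- ]", "", isbn_match.group(1))

    return {"isbn": isbn, "prices": PRICE_RE.findall(text)}


if __name__ == "__main__":
    sample = (
        '<td><b>ISBN:</b> 0-596-00027-8</td>'
        '<font color="red">Our Price: $27.96</font> List: $34.95'
    )
    print(scrape_book_fields(sample))
```

Moving the same text from a table into a different layout would not
change what this finds, as long as the "ISBN" label and the dollar
signs stay in the wording.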

People do tweak HTML and appearance, but amazingly, they tend to tweak
the language of the page and the structure of that language less.

Dan

Follow-Ups:
- Re: [syndication] Re: Translate non-structured documents into Xml RSS format
  - From: Mark Nottingham <mnot@mnot.net>
- Re: [syndication] Re: Translate non-structured documents into Xml RSS format
  - From: Ian Graham <ian.graham@utoronto.ca>
- Re: Translate non-structured documents into Xml RSS format
  - From: "Stephen Tyler" <tuesday@growinglifestyle.com>

References:
- Re: Translate non-structured documents into Xml RSS format
  - From: ben@ubiquick.com
- Re: [syndication] Re: Translate non-structured documents into Xml RSS format
  - From: Ian Graham <ian.graham@utoronto.ca>
