Re: [syndication] Re: Translate non-structured documents into Xml RSS format
Ian Graham writes:
> Many scrapers are more sophisticated than that, and actually try to
> parse the HTML structure, looking for patterns in the hierarchical
> page structure.
I've written a couple of scrapers, and my experience is that ones that
parse the HTML structure are both harder to write and more fragile
than ones that just apply regexps. Most of mine were for mining book
data from the online stores. Once I found the title, it was fairly
easy to write patterns that looked for ISBNs, dollar amounts, authors,
and such; the difficulty was finding the right title.
People do tweak HTML and appearance; amazingly, they tend to tweak the
language of the page and the structure of that language less.
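
To make the regexp approach concrete, here is a rough sketch in Python.
It is not the code from my scrapers: the patterns, the scrape_book name,
and the sample markup are all assumptions made up for illustration, and
real store pages will need their own patterns. It strips the tags and
matches on the wording of the page ("ISBN", "$"), so the same patterns
survive a change of layout. It does not attempt the hard part, which is
finding the title.

```python
import re

# Hypothetical patterns for illustration; tune them to the pages you scrape.
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9-]{8,15}[0-9Xx])")
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")
TAG_RE = re.compile(r"<[^>]+>")


def scrape_book(html):
    """Pull ISBNs and dollar amounts out of a page, ignoring tag structure."""
    # Replace every tag with a space so only the visible wording remains.
    text = TAG_RE.sub(" ", html)
    # Normalize ISBNs by dropping hyphens and upper-casing a trailing 'x'.
    isbns = [m.replace("-", "").upper() for m in ISBN_RE.findall(text)]
    # Convert "1,299.00" style amounts to floats.
    prices = [float(p.replace(",", "")) for p in PRICE_RE.findall(text)]
    return {"isbns": isbns, "prices": prices}


# Two made-up layouts of the same kind of data: a table and a div/span page.
table_page = ("<table><tr><td><b>ISBN:</b> 0-596-00027-8</td>"
              "<td>Price: <i>$39.95</i></td></tr></table>")
div_page = '<div class="isbn">ISBN 0596000278</div><span>$1,299.00</span>'

print(scrape_book(table_page))
print(scrape_book(div_page))
```

Both layouts give the same ISBN with no change to the patterns, whereas
a scraper that walks the table structure would need rewriting for the
second page.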
Dan