
Re: How to scrape?

To: <syndication@yahoogroups.com>
Subject: Re: How to scrape?
From: Aaron Swartz <aswartz@swartzfam.com>
Date: Thu, 22 Mar 2001 12:51:36 -0600
In-reply-to: <x5lmpz5ehb.fsf@bitsko.slc.ut.us>

Ken MacLeod <ken@bitsko.slc.ut.us> wrote:

> HTML Tidy[1] will also let you work with any XPath implementation.
> Overall, I highly suggest XPath, once you've loaded the HTML document,
> XPath lets you get right to the element value you want:
> /HTML/BODY/TABLE/TR[3]/TD[1]

The problem with this (as someone else may have said) is what happens when
the site, as it inevitably will, decides to have a new layout: it adds another
table, or perhaps sticks another row in the document, and your tr[3] becomes
tr[4]. XPaths (without IDs) cannot be trusted to last very long.
Well-designed regexps can last much longer (though they too can fall apart
when the page changes).
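
A minimal sketch of that failure mode, using only the Python standard
library. The page content, the "headline" id, and the function names are
made up for illustration, and the markup is assumed to be well-formed already
(for example after a pass through HTML Tidy). ElementTree supports only a
subset of XPath, which is enough here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be well-formed markup already.
OLD = """<html><body><table>
<tr><td>Home</td></tr>
<tr><td>News</td></tr>
<tr><td id="headline">RSS 1.0 released</td></tr>
</table></body></html>"""

# The same page after a redesign: one extra row is added at the top.
NEW = OLD.replace("<table>", "<table>\n<tr><td>Banner</td></tr>", 1)

def by_position(doc):
    # Like /HTML/BODY/TABLE/TR[3]/TD[1], written relative to the root element.
    return ET.fromstring(doc).findtext("body/table/tr[3]/td[1]")

def by_id(doc):
    # Anchors on an id attribute instead of on document position.
    return ET.fromstring(doc).findtext(".//td[@id='headline']")

def by_regex(doc):
    # Anchors on the same id and ignores the surrounding structure.
    m = re.search(r'<td[^>]*\bid="headline"[^>]*>(.*?)</td>', doc, re.S)
    return m.group(1) if m else None

for name, fn in [("position", by_position), ("id", by_id), ("regex", by_regex)]:
    print(name, "|", fn(OLD), "|", fn(NEW))
```

Under these assumptions the positional path silently returns the wrong cell
on the new layout, while the two lookups anchored on the id still find the
headline. Both of those break in turn if the site renames or drops the id.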

> Another entirely different solution is Pyxie[2] and its file/stream
> format PYX.  Using any Pyxie HTML parser, you can convert HTML into a
> drop-dead simple, line-oriented format that's really easy to process
> with Unix-like filters, such as grep, awk, sed, Perl, and shell.

Thanks for pointing me towards this -- somehow I had missed it. It's actually
pretty cool.
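
A rough sketch of the line-oriented idea, assuming the usual PYX notation
("(" for a start tag, "A" for an attribute, "-" for text, ")" for an end
tag). It uses Python's standard html.parser rather than Pyxie itself, and the
class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser

class PyxLike(HTMLParser):
    """Emit one PYX-style line per parse event (an approximation, not Pyxie)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("(" + tag)
        for name, value in attrs:
            # Attributes follow their start tag, one per line.
            self.lines.append("A%s %s" % (name, value or ""))

    def handle_endtag(self, tag):
        self.lines.append(")" + tag)

    def handle_data(self, data):
        # Escape backslashes and newlines so each event stays on one line.
        self.lines.append("-" + data.replace("\\", "\\\\").replace("\n", "\\n"))

def to_pyx(html):
    parser = PyxLike()
    parser.feed(html)
    parser.close()
    return parser.lines

sample = '<p><a href="http://example.org/">Example</a> link</p>'
lines = to_pyx(sample)
print("\n".join(lines))

# grep-style filter: keep only the values of href attribute lines.
hrefs = [line[len("Ahref "):] for line in lines if line.startswith("Ahref ")]
print(hrefs)
```

Because every event sits on its own line, the final filter is the kind of
thing grep or awk could do directly on the stream.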

-- 
[ Aaron Swartz | me@aaronsw.com | http://www.aaronsw.com ]
