
Re: How to scrape?



Ken MacLeod <ken@bitsko.slc.ut.us> wrote:

> HTML Tidy[1] will also let you work with any XPath implementation.
> Overall, I highly suggest XPath, once you've loaded the HTML document,
> XPath lets you get right to the element value you want:
> /HTML/BODY/TABLE/TR[3]/TD[1]

The problem with this (as someone else may have said) is what happens when
the site changes its layout, as it inevitably will: they add another table,
or stick another row into the document, and your tr[3] becomes tr[4]. XPath
expressions that rely on position rather than IDs cannot be trusted to last
very long. Well-designed regexps can last much longer (though they too can
fall apart when the page changes).
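
To make the point concrete, here is a rough sketch in Python. The two pages,
the "Price" label, and the helper names by_position and by_label are all made
up for illustration, and the markup is assumed to be well-formed (say, after a
pass through HTML Tidy) so the standard library's ElementTree can parse it. It
is one way the breakage could show up, not a recipe.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page as first scraped: the price sits in the third row.
OLD_PAGE = """<html><body><table>
<tr><td>Name</td><td>Widget</td></tr>
<tr><td>Color</td><td>Blue</td></tr>
<tr><td>Price</td><td>$9.99</td></tr>
</table></body></html>"""

# Same page after a redesign that sticks a new row in above the others.
NEW_PAGE = """<html><body><table>
<tr><td>Name</td><td>Widget</td></tr>
<tr><td>Size</td><td>Large</td></tr>
<tr><td>Color</td><td>Blue</td></tr>
<tr><td>Price</td><td>$9.99</td></tr>
</table></body></html>"""


def by_position(page):
    """Positional lookup: second cell of the third row, wherever that is."""
    root = ET.fromstring(page)
    cell = root.find("body/table/tr[3]/td[2]")
    return cell.text if cell is not None else None


# Regexp anchored on the label text instead of on the row number.
PRICE_RE = re.compile(
    r"<td>\s*Price\s*</td>\s*<td>\s*([^<]+?)\s*</td>", re.IGNORECASE
)


def by_label(page):
    """Label lookup: the cell that follows the one containing 'Price'."""
    match = PRICE_RE.search(page)
    return match.group(1) if match else None


print(by_position(OLD_PAGE))  # $9.99
print(by_position(NEW_PAGE))  # Blue  (silently wrong after the redesign)
print(by_label(OLD_PAGE))     # $9.99
print(by_label(NEW_PAGE))     # $9.99
```

Note that the positional lookup does not fail loudly; it just returns the
wrong cell. The regexp only survives because it hangs on the label, and it
would break just as easily if the site renamed "Price" or added attributes to
the td tags.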

> Another entirely different solution is Pyxie[2] and its file/stream
> format PYX.  Using any Pyxie HTML parser, you can convert HTML into a
> drop-dead simple, line-oriented format that's really easy to process
> with Unix-like filters, such as grep, awk, sed, Perl, and shell.

Thanks for pointing me towards this -- somehow I had missed it. It's actually
pretty cool.
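
For anyone else who missed it, here is a small sketch of the idea in Python.
It does not use Pyxie itself; it is a stand-in built on the standard library's
html.parser, and the class name PyxEmitter and the function to_pyx are my own
inventions. The line prefixes follow the PYX notation as I understand it:
"(" for a start tag, "A" for an attribute, "-" for character data, and ")"
for an end tag. Dropping whitespace-only text is a simplification on my part.

```python
from html.parser import HTMLParser


class PyxEmitter(HTMLParser):
    """Collects PYX-style lines, one parsing event per line."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("(" + tag)
        for name, value in attrs:
            # Attributes without a value are emitted with an empty value.
            self.lines.append("A%s %s" % (name, value if value is not None else ""))

    def handle_endtag(self, tag):
        self.lines.append(")" + tag)

    def handle_data(self, data):
        # Simplification: skip text that is only whitespace.
        if data.strip():
            escaped = data.replace("\\", "\\\\").replace("\n", "\\n")
            self.lines.append("-" + escaped)


def to_pyx(html):
    """Return the document as a list of PYX-style lines."""
    emitter = PyxEmitter()
    emitter.feed(html)
    emitter.close()
    return emitter.lines


lines = to_pyx('<tr><td class="price">$9.99</td></tr>')
print("\n".join(lines))

# The grep-like step: keep only the character-data lines.
text_only = [line[1:] for line in lines if line.startswith("-")]
```

Once every event is on its own line, picking out the text is a one-line
filter, which is the appeal for grep, awk, and sed.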

-- 
[ Aaron Swartz | me@aaronsw.com | http://www.aaronsw.com ]