
Re: How to scrape?



> The way I do it is to load the text of the page into memory, then use
> regular expressions to extract the relevant information. Then I just
> spit it out as RSS.
> 
> If you'd like some Tcl code to do this, I can send you some.

Yet another way is to use an XML parser and XPath.  The Resin servlet
engine (http://www.caucho.com) comes with a parser that can parse HTML
even though HTML isn't proper XML, or SGML for that matter.  The
servlet engine also includes an XPath library.
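
To give a rough idea of the parse-then-XPath approach, here is a minimal
sketch. It is not Resin's API: it assumes the standard JDK classes
(javax.xml.parsers and javax.xml.xpath) and a page that happens to be
well-formed, so for real-world HTML you would swap in a lenient parser
such as the one Resin ships. The class name XPathScrape, the method
extractLinks, the sample markup, and the "news" class are all made up
for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; assumes the input is well-formed markup.
public class XPathScrape {

    // Returns "link text|href" for every <a> inside <div class="news">.
    static List<String> extractLinks(String page) throws Exception {
        // Parse the page text into a DOM tree (strict XML parser from the JDK).
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(page)));

        // Select the links with an XPath expression instead of a regex.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate(
            "//div[@class='news']//a", doc, XPathConstants.NODESET);

        List<String> items = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            Element a = (Element) links.item(i);
            items.add(a.getTextContent().trim() + "|" + a.getAttribute("href"));
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; only the links in the "news" div are wanted.
        String page = "<html><body><div class='news'>"
            + "<a href='/a.html'>First story</a>"
            + "<a href='/b.html'>Second story</a>"
            + "</div><a href='/about.html'>About</a></body></html>";
        for (String item : extractLinks(page)) {
            System.out.println(item);
        }
    }
}
```

Each extracted title and link could then be written out as an RSS item.
Compared with regular expressions, the XPath expression tends to survive
small changes in whitespace or attribute order on the page.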

-- David Smiley