
Re: How to scrape?

To: syndication@yahoogroups.com
Subject: Re: How to scrape?
From: "David Smiley" <dsmiley@mitre.org>
Date: Wed, 21 Mar 2001 13:14:24 -0000
In-reply-to: <B6DD6495.7024%aswartz@swartzfam.com>
User-agent: eGroups-EW/0.82

> The way I do it is that I load the text of the page into memory, then
> use regular expressions to extract the proper information. Then I just
> spit it out in RSS.
> 
> If you'd like some Tcl code to do this, I can send you some.
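
For illustration, the regular-expression approach quoted above might look
roughly like the Python sketch below. Everything specific in it is an
assumption made for the example: the sample page, the class="story"
markup the pattern looks for, and the page_to_rss name. A real scraper
would fetch the page over HTTP and use a pattern written for that site.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page text; a real scraper would download this.
PAGE = """
<html><body>
<a class="story" href="http://example.com/a">First &amp; foremost</a>
<a class="story" href="http://example.com/b">Second story</a>
</body></html>
"""

# Assumed markup: each headline is <a class="story" href="URL">TITLE</a>.
STORY_RE = re.compile(r'<a class="story" href="([^"]+)">(.*?)</a>', re.S)


def page_to_rss(page, channel_title, channel_link):
    """Pull (link, title) pairs out of the page and wrap them in RSS-style XML."""
    items = []
    for link, title in STORY_RE.findall(page):
        # Decode HTML entities first, then escape again for XML output.
        title = escape(html.unescape(title))
        link = escape(html.unescape(link))
        items.append("<item><title>%s</title><link>%s</link></item>" % (title, link))
    return (
        '<?xml version="1.0"?>\n'
        '<rss version="0.91"><channel>'
        "<title>%s</title><link>%s</link>"
        "<description>%s</description><language>en-us</language>"
        "%s</channel></rss>"
    ) % (
        escape(channel_title),
        escape(channel_link),
        escape(channel_title),
        "".join(items),
    )


if __name__ == "__main__":
    print(page_to_rss(PAGE, "Example headlines", "http://example.com/"))
```

The pattern is tied to one exact way of writing the tag, so it tends to
break whenever the site changes its markup.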

Yet another way is to use an XML parser and XPath.  The Resin servlet 
engine (http://www.caucho.com) comes with a parser that can read HTML, 
even though HTML as found on the web is rarely well-formed XML, or 
even valid SGML for that matter.  The servlet engine also includes an 
XPath library.
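
As a rough sketch of the same idea (a lenient parser that turns sloppy
HTML into a tree, followed by an XPath query), here is a version using
only the Python standard library rather than Resin, whose API is not
shown here.  The LenientTreeBuilder class, the extract_links function,
and the sample markup are all invented for the example, and ElementTree
supports only a limited subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Build an ElementTree from HTML that may not be well-formed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, leave it closed.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a closed child belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def extract_links(text):
    """Return (href, link text) pairs for every <a href=...> in the page."""
    builder = LenientTreeBuilder()
    builder.feed(text)
    builder.close()
    return [(a.get("href"), "".join(a.itertext()).strip())
            for a in builder.root.findall(".//a[@href]")]


# Hypothetical page: unclosed <h1> and <li>, an unquoted attribute, mixed case.
SLOPPY = """<html><body>
<h1>News<br>
<ul>
<li><a href=one.html>First story</a>
<li><a href="two.html">Second story</A>
</ul>
</body>"""

if __name__ == "__main__":
    for href, title in extract_links(SLOPPY):
        print(href, title)
```

Unclosed elements are simply left nested inside one another here, so 
the tree may not match what a browser would build, but descendant 
queries such as .//a[@href] still find the links.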

-- David Smiley

Follow-Ups:
- Re: [syndication] Re: How to scrape?
  - From: Ken MacLeod <ken@bitsko.slc.ut.us>

References:
- Re: How to scrape?
  - From: Aaron Swartz <aswartz@swartzfam.com>
