Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"

To: syndication@egroups.com, scraping@jmason.org
Subject: Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"
From: jm-onelist@jmason.org
Date: Mon, 22 May 2000 13:01:15 +0100
In-reply-to: Message from "Carmen" <chv@vertexdev.com> of "Fri, 19 May 2000 22:25:18 PDT." <000201bfc21b$ca1f55e0$91da14d1@vertexdev.com>
Sender: jm@netnoteinc.com

[scraping@jmason.org added to recipients]

Carmen said:

> But consider that scraping is like taking a newspaper, clipping out all
> of the ads, taping the remnants together, and then giving this version
> out to the world. The way I see it, the sites (and the newspapers) are
> giving you the content for free (or at reduced cost in the case of a
> newspaper) because the advertisers are paying for some, most, or all of
> the costs of giving you the info. When you scrape out the goodies you
> are taking the good stuff and ignoring the "bill" for it (the ad). If
> everyone did this then the content provider would realize no revenue for
> their effort.  Please don't take this as a criticism. Getting a scraper
> to work is definitely an accomplishment to be proud of.

I agree... it damages the revenue stream pretty severely and, if it became
widespread, would encourage sites to impose paid subscriptions. :(

I personally think it would be far from acceptable to scrape other news
sites' content, remove source advertising and copyright info, and place it
in HTML on my own site under my own advertising, for example.

However, the reason I wrote a scraper is that I wanted to read sites on
my Palm handheld. The same seems to be true of many of the tools I've found
(Plucker, AvantGo, InfoRover, WebFetch, NewsClipper etc.)

Currently it's impossible to do this (assuming no wireless modems etc.)
*without* using a scraper, and the trimming of images and extraneous HTML is
a definite bonus when you've got 1MB of space and a 2" screen.
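
To give a rough idea of what "trimming images & extraneous HTML" can mean
in practice, here is a minimal sketch in Python. It is only an
illustration, not code from any of the tools named above: the class name
TextOnly, the function trim_html and the set of tags treated as line
breaks are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text; images, scripts and styles are dropped."""

    SKIP = {"script", "style"}  # content of these tags is discarded
    BREAK = {"p", "br", "div", "h1", "h2", "h3", "li", "tr"}  # assumed block tags

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.parts.append("\n")
        # Any other tag (including <img>) contributes nothing to the output.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def trim_html(html):
    """Return the page as plain text, one non-empty line per block."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

A real handheld converter would also need to keep links and cope with
badly formed markup; this only shows the size-saving step.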

I'm sure there are similar situations where scraping technology is a plus,
or a requirement.


> Legality and ethics aside, I am also pretty concerned about 
> the fragility of the scraping process. It seems that the scraper
> can be broken (for a site) if the site makes a simple change
> or a redesign.

Yep. This is one reason why I wanted to institute a scraping-related
mailing list, so people writing HTML scrapers could swap and coordinate
site details, to handle changes like this. The bigger the community, the
faster the site layout descriptions could be fixed...
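
One way to picture a shareable "site layout description" is as a small
piece of data kept apart from the scraper code, so that a redesign means
editing the description only. The sketch below is hypothetical: the
SiteLayout fields, the extract_story function and the marker strings are
invented for illustration and are not the format used by any particular
tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteLayout:
    """Hypothetical per-site description that list members could swap."""
    name: str
    story_start: str  # text that immediately precedes the story body
    story_end: str    # text that immediately follows the story body


def extract_story(page: str, layout: SiteLayout) -> Optional[str]:
    """Return the text between the two markers, or None if either is missing.

    A None result is the signal that the site has probably changed and
    the description needs updating.
    """
    start = page.find(layout.story_start)
    if start == -1:
        return None
    start += len(layout.story_start)
    end = page.find(layout.story_end, start)
    if end == -1:
        return None
    return page[start:end].strip()
```

When a site is redesigned, extract_story starts returning None for the
old description; whoever notices first posts a corrected SiteLayout and
everyone else's scraper works again without a code change.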

> You may want to take a look at the list of providers on
> our site (www.headlineviewer.com). We do no scraping,
> although I know that some of our suppliers do.

Yep, I'm aware of RSS -- sitescooper uses it to find stories ;)

(I should have written up a quick para on the site clarifying that BTW)
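
For anyone wondering what "using RSS to find stories" looks like, here is
a minimal sketch in Python. It is not sitescooper's code: the function
names and the sample feed (with example.com URLs) are made up for the
example, and it assumes a feed whose stories are <item> elements that
each carry a <title> and a <link>.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, used only for illustration.
SAMPLE_FEED = """
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/stories/1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/stories/2.html</link>
    </item>
  </channel>
</rss>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def story_links(rss_text):
    """Return (title, link) pairs for every <item> that has a link."""
    root = ET.fromstring(rss_text.strip())
    stories = []
    for elem in root.iter():
        if local_name(elem.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in elem}
        if fields.get("link"):
            stories.append((fields.get("title", ""), fields["link"]))
    return stories
```

The scraper then only has to fetch the listed URLs, rather than guess
which links on a front page are stories.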

The headline viewer's pretty cool, and you've done a great job of tracking
down those RSS URLs!

--j.

Follow-Ups:
- Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"
  - From: burtonator <burton@relativity.yi.org>

References:
- RE: [syndication] (Random Thoughts) Content syndication and content "cleansing"
  - From: "Carmen" <chv@vertexdev.com>
