Re: [syndication] Automatically Transforming Blog or HTML Content into XML
> i'd argue that a link-crawler is totally the wrong approach to getting
> accurate stats on a mega-site like blogger - it works much better to get a
> manual dump of the blogs hosted there that are updated, as weblo.gs
> does....
I'd argue a combination of the two is an even better idea. It's one thing to
see that a site exists. It's trivial to create a site and participate in a
listing site like blo.gs or syndic8. What's not trivial is to have that
same site appear 'associated' with other sites already in the index. The
combination of link crawling with site cross-linking is probably the best bet.
On Syndic8, we purposely /avoid/ doing standalone crawling for new feeds. This
is to prevent the onslaught of "gee, I have a new feed" submissions. No bias
intended against the new sites, of course. Experience shows it's better to let
a site 'settle into a routine' before picking up its feed. This way existing
users of the catalog aren't buried under an avalanche of new stuff that hasn't
developed a value yet. Yes, there's a chance users might "miss" something.
This can be countered by the submission of a site by an existing catalog user.
We've found it's better to depend on submissions from interested users than it
is to spider around looking for them.
Jeff does this periodically: he scours the various update ping lists and
cross-references them against various blogroll resources. A feed 'seen' in
several places and not already in the catalog gets added. This does manage to
find a few fringe things that haven't yet hit the memespace radar.
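The cross-referencing step above can be sketched roughly like this. Everything
here is hypothetical (the URLs, the source lists, and the threshold of two
sightings are made up for illustration); in practice the ping lists and
blogrolls would be fetched and parsed first.

```python
from collections import Counter

# Hypothetical inputs: feeds mentioned on update-ping lists and in blogrolls.
ping_lists = [
    ["http://a.example/rss", "http://b.example/rss"],
    ["http://b.example/rss", "http://c.example/rss"],
]
blogrolls = [
    ["http://b.example/rss", "http://d.example/rss"],
]

# Feeds already in the catalog.
catalog = {"http://a.example/rss"}

# Count how many independent sources mention each feed.
sightings = Counter(
    url for source in ping_lists + blogrolls for url in set(source)
)

# A feed 'seen' in several places and not already in the catalog gets added.
THRESHOLD = 2
candidates = [url for url, n in sightings.items()
              if n >= THRESHOLD and url not in catalog]
print(candidates)
```

Here only http://b.example/rss clears the bar: it shows up in three separate
sources and isn't cataloged yet, while the one-off mentions are ignored.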
What's *really* been insightful has been making use of a random sites feed.
Instead of just lurking around scouring the new feed lists I've found it a LOT
more useful to subscribe to a feed that picks from existing ones on a random
basis. You'd be surprised what's out there. Had I approached it from just a
'new feeds' perspective I'd get overwhelmed with the barrage of new things that
are /constantly/ outside the scope of my interests. The feed would essentially
fatigue my interest level to such a point that I'd either ignore or unsubscribe
from it. Rather, using randomness provides a certain 'surprise' factor. Thus,
a well-written site and/or item description might pique my interest more
readily. I may well not want the majority of feeds that were picked but there's
enough chance of finding one 'out of left field' that I'd stay interested in the
feed.
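A random-picks feed like the one described is simple to sketch. This is a
minimal illustration, not how any particular catalog implements it; the feed
entries and the `random_picks` helper are invented for the example.

```python
import random

# Hypothetical catalog of feeds already in the index.
catalog = [
    {"url": "http://a.example/rss", "description": "Gardening notes"},
    {"url": "http://b.example/rss", "description": "Compiler internals"},
    {"url": "http://c.example/rss", "description": "Left-field finds"},
]

def random_picks(feeds, k=1, seed=None):
    """Pick k distinct feeds at random from the existing catalog,
    instead of surfacing only the newest submissions."""
    rng = random.Random(seed)
    return rng.sample(feeds, k)

# Each run (without a fixed seed) surfaces a different slice of the catalog.
for feed in random_picks(catalog, k=2, seed=42):
    print(feed["url"], "-", feed["description"])
```

The point of sampling from the whole catalog rather than the new-feeds queue
is exactly the 'surprise' factor above: established feeds get a chance to be
rediscovered instead of being drowned out by the latest arrivals.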
As we build out larger lists of feeds, we stand a better chance of
cross-referencing them effectively. Likewise, as more content comes online,
users will demand more effective ways to refine what's presented to
them. But at this stage of the game there's frankly TOO LITTLE content online
to start thinking about using exclusionary filters.
-Bill Kearney