[no subject]
To me, the RSS standard seems to be an almost perfect match to the
needs of search engine syndication. It is a list of urls with titles
and descriptions. The only thing that has really changed is the
notion of time ordering, something that is not a part of the spec, but
instead only implied by the problems it was designed to solve.
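To make the match concrete, here is a minimal sketch (in Python, purely illustrative and not our production code) of how a set of search results could be written out as an RSS-style channel. The function name results_to_rss, the example.com URLs and the sample results are all assumptions made up for this example. Note that the items are emitted in whatever order the search engine ranks them, with no date attached.

```python
import xml.etree.ElementTree as ET


def results_to_rss(channel_title, channel_link, channel_description, results):
    """Build an RSS-style document from (title, url, description) tuples.

    Hypothetical helper: items are written in the order given (e.g. by
    relevance), so no time ordering is implied.
    """
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for title, url, description in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "description").text = description
    return ET.tostring(rss, encoding="unicode")


# Made-up search results for a hypothetical query, already ranked by relevance.
sample_results = [
    ("Pruning roses", "http://example.com/roses", "When and how to prune."),
    ("Rose diseases", "http://example.com/diseases", "Spotting black spot."),
]
print(results_to_rss(
    "Search results: roses",
    "http://example.com/search?q=roses",
    "Articles matching the query 'roses'",
    sample_results,
))
```

An aggregator reading such a channel would see an ordinary list of titled, described links; the only difference from a news feed is that the first item is the best match rather than the newest one.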
Here is a little more background on what growinglifestyle.com is trying
to achieve in the content syndication/aggregation area.
Firstly, it is a content portal in the home/garden/lifestyle area.
Unlike news/technology, the content is not overly time sensitive.
Last year's article is only slightly less valuable than yesterday's
article. Contrast this with tech news, where there is a desperate
scramble to provide real-time news feeds.
So rather than discard old articles, we continually compile them in a
permanent repository (link hierarchy).
We scrape, harvest and spider over 100 of the top content sites in our
category. Our spider is tuned to return articles (not products,
"about us", chat etc.) and, with human guidance, has a hit rate in the
high 90s (percent).
We title, summarise and classify the articles based on a range of
in-document and off-document meta-data and learning algorithms. And
unlike the real-time news feeds, the results go through a human QA
phase. Productivity is in the hundreds of articles per hour range.
The results are then fed to the live server (currently roughly
weekly). We could technically do it live, but as I said, our subject
area is not incredibly time sensitive, and we do a lot of batch
crunching on the data.
The result is a full-text search engine, but without the:
- spam
- bad titles
- bad descriptions
- lack of classification
that plague most web search engines like AltaVista, Inktomi etc.
And the result is a hierarchically organised set of links (like Yahoo
et al), but with:
- individual articles, not just entire web sites
- greater opportunity for metadata-relationships (like "more like
this")
and without the huge labour costs associated with other
human-categorised directories.
And the result is a whole pile of content aggregation and syndication
channels, to suit every combination and permutation of interests.
It is still very early days (most of the feeds have only been up a few
days, and the content is growing by around 20% per week), but I hope
this is the start of something big in content syndication.
I would be very interested to hear what others working in the area of
content syndication think of search result syndication, and how
other aggregators like userland/xmltree/meerkat etc. could/should work
with syndication channels that are not time ordered.
Cheers,
Steve