
Re: channel classification



Dear Mark and Carmen and Dave (and whoever else is interested in classification),

> So the actual categories that the publisher can choose from, how does that
> get decided? I don't think it's a good idea to allow channel publishers to
> create new categories; that would be a mess. A centrally-managed list would
> be ideal, but probably impractical - do we want to give this a try?

The centrally-managed list is not impractical if it already exists and is well understood
and documented.  Candidates for such a central list might be:

Group I
1) Newsgroups structure (like comp.text.xml)
2) Yahoo
3) Open Directory Project

Group II
4) Library of Congress classification
5) Dewey Decimal System

The following comments are my opinions only.

Group I advantages : Uses easy-to-understand terms and has been modelled on the
categorisation needs of the web.
Group I disadvantages : The categorisation is ad-hoc and has not been refined over many
years of use.  Either the categorisation structure is not published at all (Yahoo; ODP
does publish its structure in RDF), or the logic behind the categorisation is not
explained (Open Directory Project).  No algorithms exist for automated classification (as
far as I know).

Group II advantages : Refined over a hundred years of use by trained categorisation
experts (librarians).  The categorisation structure has been studied and understood by
millions of classifiers, and used by everyone who has visited a library.
Group II disadvantages : The categorisation is rigid and cannot be changed by individual
decisions, but only by peer review. The structure is not (IMO) 'optimised' for the
subjects that people think of when dealing with the web.

My own feeling is that the Group I structures only came into being on an ad-hoc and
reactive basis, and it would be a waste to ignore the collective thought that went into
refining the Group II structures.  Of these, the Dewey System is preferable to the Library
of Congress since the classification is extensible and remains logical:
Quote from David Mundie:
"The fundamental difference is that DDC defines a conceptual space, and
the LOC does not. That is, the decimal nature of Dewey conveys a very rich
set of relationships among categories, and permits "chaining", where the LOC
does not. To take an example chosen completely at random, the Dewey code for
French birds is 598.2944, which is broken down as follows:

500        - Science
590        - Zoological Science
598        - Birds
598.29     - Geographical treatment
598.294    - Europe
598.2944   - France

In this system, the conceptual space is a continuum: it is immediately
apparent from the code that 598.29 is a subdivision of 598, and 598.2944 a
subdivision of 598.29, and so on. Contrast this with the LOC code
"QL683.W4P4", which is an unanalyzable, arbitrary code for "French Birds",
bearing no relationship to "QL682.W4P4" or "QL683.W4P5" or "QK683.W4P4".
That is, the LOC is an enumerated system, and DDC is (largely) faceted - and
it is generally acknowledged that faceted systems are the way to go."
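
To make the "chaining" point concrete, here is a minimal Python sketch of how a program
could read the parent categories straight off a Dewey-style code.  The function names
(dewey_chain, is_subdivision) are my own invention, and the code looks only at the
notation: it has no table of captions, and it lists every prefix, including 598.2, which
the quoted breakdown skips.

```python
def dewey_chain(code):
    """Return the notational ancestors of a Dewey-style code, broadest first.

    Purely string-based: no official schedule is consulted, so every
    prefix is listed whether or not it has its own caption.
    """
    whole, _, fraction = code.partition(".")
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError("expected three digits before the decimal point")
    if fraction and not fraction.isdigit():
        raise ValueError("expected only digits after the decimal point")

    chain = []
    # Main class, division, section: 500, 590, 598
    for i in range(1, 4):
        prefix = whole[:i] + "0" * (3 - i)
        if not chain or chain[-1] != prefix:
            chain.append(prefix)
    # Each further decimal digit narrows the subject by one step
    for i in range(1, len(fraction) + 1):
        chain.append(whole + "." + fraction[:i])
    return chain


def is_subdivision(child, parent):
    """True if `parent` appears in the notational chain of `child`."""
    return parent in dewey_chain(child)


print(dewey_chain("598.2944"))
print(is_subdivision("598.2944", "598.29"))
```

An aggregator could use something like is_subdivision to let a user browse "Birds" (598)
and still find a channel classified under 598.2944, with no lookup table at all.  Nothing
comparable can be done with the LOC code in the quote above.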

> Maybe another way to go about it is to allow the publisher to decide which
> classification system they want to be considered part of. For instance:
>
> <FooFormat>
> <header>
> ...
> <category
> authority="http://bar.com/categorydefinition.xml">Widgets/FrobNobs</category>
> ...
> </header>
> <item>
> ...
>
> This is nice and flexible, but may lead to problems; if an aggregator didn't
> want to support too many categorization methods, or there was a lot of
> overlap between categorisation schemes, there would be trouble.
The danger is that we get a proliferation of schemes, while each "My." portal offers
navigation through only one of them, since offering more risks overwhelming its users.
This would mean that the content provider might classify using scheme "http://bar.com"
but the portal might offer browsing of channels by scheme "http://foo.com".  The portal
would then need to re-categorise to foo.com, either manually or by writing a
schema-mapping algorithm for every combination of bar1.com, bar2.com, bar3.com etc. to
foo.com.
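
As a rough illustration of that burden, here is a hypothetical Python sketch of the lookup
a portal would have to maintain.  The MAPPINGS table, the recategorise function and the
target category "Hardware/Widgets" are all made up for the example; only the authority
URLs and "Widgets/FrobNobs" come from this thread.

```python
# One hand-built table per (publisher scheme, portal scheme) pair.
# Every new publisher scheme means another table for the portal to write.
MAPPINGS = {
    ("http://bar.com", "http://foo.com"): {
        "Widgets/FrobNobs": "Hardware/Widgets",  # hypothetical target category
    },
}


def recategorise(category, source, target, mappings=MAPPINGS):
    """Translate `category` from the source scheme to the target scheme.

    Returns None when no mapping is known, which is the case that
    falls back to manual re-categorisation.
    """
    if source == target:
        return category
    table = mappings.get((source, target))
    if table is None:
        return None
    return table.get(category)


print(recategorise("Widgets/FrobNobs", "http://bar.com", "http://foo.com"))
print(recategorise("Widgets/FrobNobs", "http://bar2.com", "http://foo.com"))
```

The second call returns None because nobody has written a bar2.com table yet, and that is
exactly the gap that grows with each additional scheme.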

> Hopefully, selecting the channel from a large list at the aggregator will be
> only one method of adding channels. Netscape had the right approach in
I agree - there should be alternative ways of finding channels, such as by traditional
free-text search, or by keyword navigation (see an example at
http://www.xmltree.com/metadata/search.cfm and excuse the quality of keywords; inspired by
http://www.aeiwi.com)

Best regards,
James Carlyle

james@xmltree.com
www.xmltree.com - directory of XML content on the web