
RSS 0.91 - Missing Vital Metadata



While RSS 0.91 is extremely powerful, it strikes me as missing two 
vital pieces of metadata:

1) Ordering method

2) Categorisation


Ordering method
---------------

RSS defines a list of items, or more specifically an ordered 
sequence. But what is the ordering criterion?

Weblogs and news are ordered by time. Most current RSS channels fall 
into this category.

Top 10 lists are ordered by a popularity measure. Some examples might 
be "Letterman's top 10 reasons for ...", "Top selling CDs", "most 
popular pages". There is a sprinkling of these channels.

Other lists are ordered by degree of match. For example the results 
of a search might be presented in this manner.

To allow the encoding of this data, I propose the following:

<ordering>time</ordering>
Other values: none, top, match

A simple example: I gather several RSS streams about computer books. 
Using this new <ordering> element, I can automatically distinguish 
"top books" from "new books". I can merge multiple "new books" 
streams together, removing duplicates. On the other hand, I can merge 
"top books" streams together, weighting elements by duplication and 
by order within each stream.
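By way of a rough sketch (Python; the function names and data shapes are invented for illustration, not part of any spec), an aggregator might treat the two orderings like this:

```python
# Sketch: merge item lists according to a hypothetical <ordering> value.
# "time" streams are deduplicated and sorted newest-first; "top" streams
# are combined by weighting each item by its rank, with duplicates across
# streams accumulating weight. Purely illustrative.

def merge_time_streams(streams):
    """streams: lists of (timestamp, link, title) tuples."""
    seen = {}
    for stream in streams:
        for ts, link, title in stream:
            # keep the newest sighting of each link
            if link not in seen or ts > seen[link][0]:
                seen[link] = (ts, link, title)
    return sorted(seen.values(), reverse=True)

def merge_top_streams(streams):
    """streams: rank-ordered lists of (link, title); best item first."""
    scores = {}
    for stream in streams:
        n = len(stream)
        for rank, (link, title) in enumerate(stream):
            # higher placement earns more weight; duplicates add up
            scores[link] = scores.get(link, 0) + (n - rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The point is only that the merging strategy can be chosen mechanically once the `<ordering>` value is known.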

Categorisation
--------------

Content aggregators need to be able to categorise their content, or 
risk presenting extremely long lists of channels (like userland's :-
( ). How is a new user meant to select from a flat list of 2500 
channels?

Unfortunately categorisation is EXTREMELY hard to do across a broad 
range of subjects, in a way that suits most people.

Rather than define the one true categorisation schema and taxonomy, I 
think we should permit the channel author some flexibility, but still 
allow content aggregators some real meat to kickstart their 
categorisation.

I propose the following new element, by way of an example for an RSS 
channel associated with a book on encryption software:

<category>
   <method>yahoo.com</method>
   <value>Computers_and_Internet/Internet/World_Wide_Web/Security_and_Encryption</value>
   <value>Business_and_Economy/Shopping_and_Services/Books/Booksellers/Computers/Internet/Titles/World_Wide_Web</value>
</category>
<category>
   <method>dmoz.org</method>
   <value>Computers/Security/Products_and_Tools/Cryptography/</value>
   <value>Business/Industries/Publishing/Publishers/Nonfiction/Computers</value>
</category>

notes:

1) You can have multiple <value> items in each category.

2) You can have multiple <category> items.

3) Users can define their own methods. Yahoo and DMOZ are 
recommended, with DMOZ the more strongly recommended of the two.

4) The <value> string is a list of "/"-separated values, from 
broadest to most specific.
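To show the shape of the data, here is a rough Python sketch of pulling the proposed <category> blocks out of a channel. The element names follow the proposal above; the parsing approach is just one possibility:

```python
# Sketch: extract the proposed <category>/<method>/<value> elements.
# Element names come from the proposal; everything else is illustrative.
import xml.etree.ElementTree as ET

def parse_categories(xml_text):
    """Return a list of (method, [list of path-segment lists]) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for cat in root.iter("category"):
        method = cat.findtext("method", "").strip()
        # split each "/"-separated value into segments, broadest first
        values = [v.text.strip("/").split("/") for v in cat.findall("value")]
        result.append((method, values))
    return result
```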

Now the pedantic among you will probably disagree with me about the 
ideal place in Yahoo and DMOZ to categorise this content. But that 
misses the point. The point is that, armed with the above data, the 
job of classifying this RSS document in any new category tree is made 
vastly simpler.

Even if my category tree does not align precisely with Yahoo or DMOZ, 
there is going to be some overlap. And the <value> string contains 
some good keywords, which I can disambiguate using WordNet or similar 
to automatically align with my own arbitrary hierarchy.

For aggregation portals targeting narrow niches, it is a simple job 
to find relevant RSS channels using a hand-compiled list of relevant 
paths on Yahoo and DMOZ.
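That hand-compiled list boils down to simple prefix matching on the category paths. A minimal sketch (all data here is made up):

```python
# Sketch: keep only channels whose category paths fall under a
# hand-compiled list of watched subtrees. Paths are segment lists,
# broadest segment first, as produced by splitting a <value> string.

def matches_niche(channel_paths, watched_prefixes):
    """True if any channel path starts with any watched prefix."""
    return any(
        path[:len(prefix)] == prefix
        for path in channel_paths
        for prefix in watched_prefixes
    )
```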

The presence of these new items should not upset existing RSS clients 
(I hope they have been coded to ignore unknown elements).



Perhaps it would be clearer with an explanation of what I am trying 
to do with RSS, and where I have been struggling to apply it.

I publish a vertical search portal. 
http://www.growinglifestyle.com/h/garden/index.html

Currently it covers 2 topics (more are coming), one of which is 
gardening.

I scrape the top gardening web sites for articles (and only the 
articles), and assemble them into a categorised hierarchy. So the 
user can browse the hierarchy (like Yahoo) or do a full-text search 
(like Altavista). But no matter which way they look, they will only 
get quality articles on gardening.

What I have just done is add an RSS file at each node (well, a few 
thousand nodes anyway) of this hierarchy. For example, there is an 
RSS file for "gardening", for "plants", for "bulbs" and for "tulips" 
(progressively narrower topics). Each of these RSS files is a weblog, 
displaying a time-ordered sequence of articles being added to the 
tree. I'm adding about 1000 articles a week at the top level, so as 
you go down the tree the RSS files get quieter, until the final nodes 
may only gain one article every month or two.

Why have I created so many RSS files? Well, not everybody is 
interested in everything. You, in effect, customise the RSS feed to 
suit your needs. If all you are interested in is "Dahlias", then that 
is all you will get. And publishing it in RSS makes re-purposing the 
content so much easier.

Actually, I am even thinking of adding an RSS file for every possible 
search phrase. In this case, the RSS file would be ordered by rank 
rather than time. Would you subscribe to such an RSS channel? 
Probably not, as it would not change very often, but you might want 
to fetch it on demand. For example, a shopping site might want to 
display articles about each of its products. It could create a unique 
URL containing the keywords and phrases, grab the RSS file 
corresponding to that search, and display it using an RSS-reading 
content module.
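The on-demand part is just URL construction. A sketch, assuming an entirely invented query scheme (no real endpoint is implied):

```python
# Sketch: compose an on-demand, rank-ordered RSS URL for a search phrase.
# The base URL and parameter names are hypothetical.
from urllib.parse import urlencode

def search_feed_url(base, phrase, ordering="match"):
    """Build a feed URL whose items are rank-ordered, not time-ordered."""
    query = urlencode({"q": phrase, "format": "rss", "order": ordering})
    return f"{base}?{query}"
```

A shopping site would build one such URL per product and fetch it only when the product page is rendered (or cached), rather than subscribing.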

So what is the problem?

Well how can I let other sites know the RSS channels I offer?

It is not really a good idea to add several thousand narrow channels 
to userland, and then have userland hammer my site every hour.

I could (and am) creating an OCS description, but this does not 
describe the categorisation or even the hierarchy. And I run the risk 
of having aggregators blindly add every single available RSS channel, 
and fetch them all hourly.

I could (and am) entering some of the more generally useful channels 
into xmltree and userland. But unless people actually wander all over 
my site, they will not be aware of the RSS customisation 
possibilities available.

Another problem with content syndication via RSS is as follows. I am 
adding around 1000 items per week, in around one update event per 
week. Ideally it would be better to release the articles in real 
time, but alas it is computationally (and mentally) much less 
burdensome to do my processing in batches.

RSS has an implied length of circa 15 items. I know this is not 
fixed, but an RSS file with 1000 items is definitely considered 
unfriendly (My.Netscape requests file sizes below 8 kB). The problem 
is that 990 of these new items will never make it onto my 10-item 
RSS file, and so will miss out on the opportunities for content 
syndication and repurposing that RSS allows.

I am still thinking about the best way to solve this last problem. 
Some possibilities:

a) Trickle-feed the RSS file. Instead of instantly acknowledging the 
1000 new articles, the RSS generator could be spoon-fed a steady 
dribble of articles (say 6 per hour, roughly 1000/week). Clients 
reading the RSS file every hour would then get a chance to see all 
the new articles.
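Option (a) amounts to a queue drained at a fixed rate. A minimal sketch, with invented class and rate values:

```python
# Sketch of option (a): hold the weekly batch in a pending queue and
# promote a fixed trickle into the visible feed each hour, so hourly
# readers eventually see every item. Rates and names are illustrative.
from collections import deque

class TrickleFeed:
    def __init__(self, per_hour=6, window=10):
        self.pending = deque()                 # articles not yet published
        self.feed = deque(maxlen=window)       # the visible RSS items
        self.per_hour = per_hour

    def add_batch(self, articles):
        self.pending.extend(articles)

    def tick(self):
        """Called once per hour: promote a few pending articles."""
        for _ in range(min(self.per_hour, len(self.pending))):
            # newest promoted article goes to the top; maxlen evicts the oldest
            self.feed.appendleft(self.pending.popleft())
        return list(self.feed)
```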

b) Track the IP address of clients reading the RSS file. Feed each IP 
address all the articles added since the last time it read the file, 
subject to some upper limit. After a few big gulps of new articles, 
the RSS file for that client settles back down to a list of 10.
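Option (b) is a per-client cursor into the article history. A sketch, with made-up numbers (a real server would also need expiry, and this breaks behind shared proxies, which is part of why it feels unclean):

```python
# Sketch of option (b): remember, per client IP, the index of the next
# article that client has not seen, and serve everything newer (capped).
# Illustrative only; no expiry or proxy handling.

class PerClientFeed:
    def __init__(self, cap=50):
        self.articles = []   # all articles, oldest first
        self.last_seen = {}  # ip -> index of next unseen article
        self.cap = cap

    def publish(self, articles):
        self.articles.extend(articles)

    def feed_for(self, ip):
        # unknown clients start from the usual ~10 most recent items
        start = self.last_seen.get(ip, max(0, len(self.articles) - 10))
        items = self.articles[start:start + self.cap]
        self.last_seen[ip] = start + len(items)
        return items
```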

Neither of these approaches strikes me as being particularly clean.

I hope this stimulates some discussion about the application of RSS 
to search engines, instead of just the traditional areas of blogs and 
news feeds.

Steve