RSS Referencing
Stephen Downes
July 27, 2005

0. Disclaimers

- Maybe this discussion has already taken place somewhere. But if so,
I haven't seen it, and pointers would be appreciated. 

- Terminology used below is RSS terminology; however, the same points
are intended to apply to Atom.

1. The Disappearing Reference

Back at the beginning, an RSS item had three major elements: title,
description and link. What to put in the first two of these was always
reasonably clear. However, an ambiguity existed with respect to the third.

The link element was taken to contain a URL for the item being
described in the title and the description. This left two
possibilities: the link could point to one's own article, or it could
point to an article written by a third party.

For example, the RSS for the New York Times might contain a list of
articles, and the link for each item would point to the URL of the New
York Times article, everything being in the nytimes.com domain.
However, the RSS for Fred's Big Links might point to articles from
many newspapers, the link in one pointing to nytimes.com and another
to wapost.com and so on.

When software such as Blogger and LiveJournal embraced RSS, they
embraced the first model. Thus, every link in every item in the feed
generated by halfanhour.blogspot.com pointed to an article within the
halfanhour.blogspot.com domain. RSS, therefore, was thought of as a
site summary document, rather than a linking document.

Over time this has become the dominant model for RSS; my own RSS feed
- http://www.downes.ca/news/OLDaily.xml - is one of the very few feeds
left in the world listing links outside the feed domain. Almost all
feeds point to a page within their own domain. However, RSS
aggregators, such as Daypop, PubSub and Syndicate, provide RSS feeds
with external links in the link element (as one would expect).

But what if we want to do both? What if we want to, say, create a post
in Blogger that talks about an external resource, such as an article
in the NY Times? It seems that we must pick one of the two possible
links - the blogspot.com link or the nytimes.com link - to put into
the link element. Blogger, of course, makes the choice for us, placing
the blogspot.com link into the link element. But now, crucially, the
nytimes.com link disappears from the RSS (or the Atom, as both have
this problem).

2. The Need for Reference

But who cares, right? After all, if we have the link to the content
document on Blogger, then we have all the information we need. The
author can simply write about the NY Times as part of the post
content, and embed a link into what will eventually become the
description. Problem solved.

But - not really. For one thing, in order to obtain that link to
nytimes.com it is necessary to do extra parsing to extract the href
from anchors embedded in the description HTML. For another, various
URLs may be embedded in the HTML, some of which may not actually be
references but merely helpful links added to make navigation easier.
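
The extra parsing step can be sketched as follows. This is only an
illustration, using Python's standard html.parser; the description
text, its URLs, and the names HrefCollector and extract_hrefs are
made up for the example. It shows that the parser recovers every href,
with nothing to say which one is the actual reference.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(description_html):
    """Return all link targets found in an item's description HTML."""
    collector = HrefCollector()
    collector.feed(description_html)
    collector.close()
    return collector.hrefs


# Hypothetical description: one genuine reference, one navigation link.
description = (
    '<p>The Times has <a href="http://nytimes.com/example-article">a piece</a> '
    'on this. <a href="http://halfanhour.blogspot.com/archive">Archive</a></p>'
)

for href in extract_hrefs(description):
    print(href)
```

Both URLs come back on equal footing; any rule for deciding which is
the reference has to be guessed by the aggregator.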

The point is, in order to achieve any expressive power beyond merely
replicating the content of an HTML page in another format, RSS (and
Atom) needs some sort of reference element.

For example:

- a discussion list is expressed as a series of RSS items. Reference
is used to keep track of which comment replies to which.

- a conference organizer divides the conference into themes, each of
which is represented as an RSS item, and in addition lists each
conference presentation as an item. Reference is used to associate
each presentation with a theme.

- a person presents a paper at a conference and this presentation is
listed as an item in an RSS feed. Another person blogs about that
presentation. Reference is used to associate the blog commentary with
the original paper.

- a person blogs about an article in the New York Times. Reference is
used to associate the blog post with the NY Times article. An
aggregator uses these references to create a collection of blog posts
about this particular article.

- a taxonomy is created as an RSS feed. Reference is used to associate
items at lower levels in the taxonomy with items at higher levels of
the taxonomy.

- a large document is split into several parts, each of which is
described as a separate item. Reference is used to associate each of
those parts with a common title and table of contents page.

3. Alternatives

RSS referencing essentially creates distributed structured metadata.
Because of the desirability of this, various alternatives are already
available. Each alternative, however, has limited applicability and
therefore does not offer a consistent approach to RSS referencing.

- RDF

Several RDF data elements can be used to accomplish some functions of
referencing. For example, RDF subClassOf
http://www.w3.org/TR/rdf-schema/#ch_subclassof can be used to
represent taxonomical relationships. However, no RDF data element
implies referencing specifically.

Referencing may also be accomplished using the rdf:about attribute
inside the item tag, as demonstrated in this column by Mark Pilgrim.
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html However, this
mechanism is available only to RSS 1.0 and derivatives. Moreover, it
merely relocates the original problem; the example just cited uses the
rdf:about attribute to replicate the contents of the link element.

Dublin Core offers several alternatives, including isPartOf,
references, relation and others.
http://dublincore.org/documents/dc-citation-guidelines/ While
extremely useful, these tags do not specify links to external
resources as the link element does, but rather contain citation
information, such as (say) a bibliographical element.

- Tagging

Various aggregators have attempted to create RSS structure through the
use of tag or category elements. For example, authors blogging a
particular conference, say, NECC, are encouraged to use a given tag,
say NECC. http://technorati.com/tag/NECC 

Tagging is not an instance of additional metadata, but rather, the
placement of specific HTML code within the content description (or the
body of a blog post). The NECC tag is created using 'a
href="http://technorati.com/tag/NECC" rel="tag"'. As such, tagging is
an instance of the original problem, wherein the extraction of
structure information requires specialized parsing of the description.

The RSS 2.0 category element is more useful in the sense that it is an
actual XML element, and therefore does not require separate parsing.
http://blogs.law.harvard.edu/tech/rss However, this element is used
specifically for the purpose of categorization, and although a link
reference could, in theory, be placed inside a category element, most
aggregators are not going to expect to process this link.

- Conversion of RSS items to channels

Some blog engines have enabled comment RSS feeds. This is typically
accomplished by creating a separate channel for comments; David
Phillips' Movable Type comment feed template provides a good example.
http://tweezersedge.com/archives/2003/10/000157.html

What has happened here is that the original blog post, which began as
an item in another RSS feed, is now represented as a channel. The
value of the link element in the original item is now the value of the
link element in the channel.

This allows association between comments and posts, but at the cost
of multiplying channels and duplicating post information (once in
the original item element, and once in the channel element).

Moreover, the creation of a separate RSS channel presupposes that all
comments are known, or are located in the same place. Where comments
are distributed - as in, say, blog posts responding to other people's
blog posts - the requisite channel might never be created.

4. Specific Mechanisms

The precise mechanism settled on by the RSS community may vary;
however, in this section I propose a specific mechanism as a template.

Essentially, in order to encode reference in XML, one or more RSS (or
Atom) elements need to be created. These may be core elements, or they
may be extensions.

For now, I will treat these elements as extensions. Accordingly, they
are prefaced with 'ssn' (which stands for 'Semantic Social Network').

- parent - this tag, encoded 'ssn:parent', is a generic parentage
relation. That is to say, when placed inside an RSS item, it refers
the reader (or aggregator) to a higher level entity. For example, a
chapter in a book would use 'ssn:parent' to point to the book home
page; a comment in a discussion would use 'ssn:parent' to point to the
comment it is replying to. Strictly speaking, only 'ssn:parent' is
required to satisfy the requirements outlined above.

- replyto - this tag, encoded 'ssn:replyto', is used specifically in
the context of discussion lists, and is used to point to the comment
to which the given comment replies.

- reference - this tag, encoded 'ssn:reference', is used by a blog
post or similar piece of writing to point to an external resource
being described or referenced by the blog post.

Inside the tag, RSS content is displayed as though in a typical RSS
item. This allows a content provider to (optionally) include
information over and above link information, for example, the title of
the resource.

For example:

<channel>
...
<item>
   <title>My Reply</title>
   <link>http://myreplylink</link>
   <description>Blah blah blah</description>
   <ssn:reference>
      <link>http://www.originalpost</link>
      <title>Original Post</title>
   </ssn:reference>
</item>
</channel>
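
Reading such an item back out is straightforward with a namespace-aware
parser. The following is a minimal sketch using Python's standard
xml.etree.ElementTree. Note the assumptions: no namespace URI has been
defined for 'ssn', so http://example.org/ssn is a placeholder, and the
rss wrapper element and the function name read_references are added
only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI; no real namespace URI exists for 'ssn' yet.
SSN_NS = "http://example.org/ssn"

FEED = f"""<rss version="2.0" xmlns:ssn="{SSN_NS}">
<channel>
<item>
   <title>My Reply</title>
   <link>http://myreplylink</link>
   <description>Blah blah blah</description>
   <ssn:reference>
      <link>http://www.originalpost</link>
      <title>Original Post</title>
   </ssn:reference>
</item>
</channel>
</rss>"""


def read_references(feed_xml):
    """Return (item link, reference link, reference title) tuples.

    The reference title is None when the feed supplies only the link.
    """
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        # findtext looks only at direct children, so this is the item's
        # own link, not the link nested inside ssn:reference.
        item_link = item.findtext("link")
        for ref in item.findall(f"{{{SSN_NS}}}reference"):
            results.append(
                (item_link, ref.findtext("link"), ref.findtext("title"))
            )
    return results


for item_link, ref_link, ref_title in read_references(FEED):
    print(item_link, "refers to", ref_link, "titled", ref_title)
```

No HTML parsing is involved; the reference is an ordinary XML element
that any aggregator can pick up or safely ignore.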

5. Using References

The intent of a reference is to provide information about an external
entity within the context of the current entity. For example, the
intent of a 'replyto' element in a comment item is to provide
information about a different comment, specifically, the comment being
replied to.
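
As an illustration of how an aggregator might use 'replyto', here is a
small sketch that rebuilds a discussion thread from harvested comments.
It assumes the aggregator has already reduced each comment item to a
pair of (its own URL, the URL in its replyto element, or None); the
example.org URLs and the function names are hypothetical, and the
sketch assumes the reply links contain no cycles.

```python
from collections import defaultdict


def build_threads(comments):
    """Map each comment URL to the URLs of the comments replying to it.

    `comments` is a list of (url, replyto_url) pairs; replyto_url is
    None for a comment that starts a thread.
    """
    replies = defaultdict(list)
    for url, replyto in comments:
        if replyto is not None:
            replies[replyto].append(url)
    return dict(replies)


def render_thread(url, replies, depth=0):
    """Return the thread under `url` as a list of indented lines."""
    lines = ["  " * depth + url]
    for child in replies.get(url, []):
        lines.extend(render_thread(child, replies, depth + 1))
    return lines


# Hypothetical harvested comments, possibly from several different feeds.
comments = [
    ("http://example.org/c/1", None),
    ("http://example.org/c/2", "http://example.org/c/1"),
    ("http://example.org/c/3", "http://example.org/c/1"),
    ("http://example.org/c/4", "http://example.org/c/2"),
]

replies = build_threads(comments)
print("\n".join(render_thread("http://example.org/c/1", replies)))
```

Because the URL is the only key, the comments need not come from the
same channel, or even the same site.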

The reference element itself must contain one specific piece of
information, the URL of the external entity. The intent here is that
the URL serves double duty, both as an indication of the location of
the external entity, and as an identifier for the external entity. It
may also include additional information, such as the title.

In a typical use, when additional information is not provided, it is
anticipated that an aggregator will have the rest of the information
about the external entity - the title, description, and the like -
already harvested and in the database. Therefore, the URL of the
external entity serves as a search parameter, allowing this information
to be retrieved and displayed with the current resource.

In other cases, however, this information will not be available - for
example, a person using Blogger does not have access to this data, nor
does a service that harvests from only a few content feeds. In this
case, the reference, as described above, provides *only* the external
link.

If the service displaying the resource does not have a database of
links, several options remain open:

- to use a generic link title, such as 'Reference', and provide the
URL to the viewer as a link

- use the link to access the HTML page and scrape the title from the
page, then display the title

- use the link to access the HTML page and scrape the 'link rel' tag
to obtain RSS for the page

- (best) use the link as a search term to use at an aggregator that
does have the full RSS or Atom description and will return that XML to you
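
The fallback logic described above might look something like the
following sketch. It is an assumption-laden illustration, not a
prescribed algorithm: the reference is modelled as a dict with a
required 'link' and an optional 'title', the aggregator's database as a
dict keyed by URL, and the name display_title is invented. The
page-scraping and remote-aggregator options are left out because they
need network access.

```python
def display_title(reference, harvested):
    """Pick a title to show for a reference.

    `reference` is a dict with a required 'link' and an optional
    'title'; `harvested` maps URLs to item dicts already collected.
    """
    # 1. Use the title supplied inside the reference element, if any.
    if reference.get("title"):
        return reference["title"]
    # 2. Otherwise use the URL as a search key into harvested data.
    known = harvested.get(reference["link"])
    if known and known.get("title"):
        return known["title"]
    # 3. Otherwise fall back to a generic link title.
    return "Reference"


# Hypothetical harvested data keyed by URL.
harvested = {"http://www.originalpost": {"title": "Original Post"}}

print(display_title({"link": "http://www.originalpost"}, harvested))
print(display_title({"link": "http://unknown.example"}, harvested))
```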

But that said, the reference element is best used in an environment
that is both a writing and reading environment; for example, it is
better used in an environment like Bloglines, which connects a
blog-authoring service to an aggregator, than in Blogger, which does
not offer a blog aggregation service.

Alternatively, it may be worth considering the embedding of external
information locally. 

6. Expanding Reference

The types of reference described in this document are for the most
part content-specific. They describe relations between one type of
content and another.

Not all entities are content entities. Other entities include people,
events, companies and locations.

We have already begun working with reference to some of these other
entities. For example, longitude and latitude data in RSS feeds
http://geourl.org/news/2005/04/26/rssplus.html and GeoURL
http://geourl.org/ convert a place-specific RSS element into a
reference to an external resource.

Many entities can be described using the simple syntax of RSS with a
minimum of extension. An event, for example, can be described in RSS
with the addition of date and location elements. An organization can
be described in RSS with the addition of (say) contact information and
(say) references to organization staff.

Developers in the RSS community (and, for that matter, in other XML
communities) have for the most part not considered seriously the
utility of linkages between entities, being instead focused on
describing the current entity. This focus should, over time, change.

-- Stephen