RE: [syndication] Digest Number 368



Hi,
Mark wrote:

> If we're expecting a publisher to mark up their page, why not give
> them better control over what becomes the title, the link, etc.,
> rather than playing guesswork?

Here is a draft of a more generalized approach to inline metadata that I've
been working on; comments appreciated.
Aaron's excellent script takes a link mentioned in a weblog posting and
assumes that the surrounding commentary is about that link (if there is more
than one link in a posting, you have to choose which one the commentary is
about). In other words, the XML it produces is metadata about a page other
than the one in which the metadata is published within span tags.
The more generalized approach assumes that you want to automatically
generate metadata about the page that you are running the script over,
generate richer metadata, and allow for multiple links.
In addition, you will want to use namespaces defined by URIs to avoid
collisions, and it may be useful to borrow the weblog convention of
permanent archiving to attach the default namespaces at the level of
individual postings rather than web pages, which are transient. In other
words, metadata should be attached to a piece of content (a posting, which
may be a paragraph or several pages) as it was authored, as opposed to a
page, which is merely a rendering of part of some content or of a collection
of different pieces of content.
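
As a minimal sketch of the idea that metadata hangs off a posting's permanent
link rather than off the page, the snippet below builds one identifier per
posting from an archive page URL and the posting's anchor. The function name
item_uri is hypothetical; the URL and anchor values are the ones used in this
draft.

```python
from urllib.parse import urldefrag


def item_uri(archive_url, anchor):
    """Build a permanent identifier for one posting (hypothetical helper)."""
    # Drop any fragment already present so the anchor is not doubled.
    base = urldefrag(archive_url).url
    return f"{base}#{anchor}"


ARCHIVE = "http://www.theliar.com/oldlies/2001_05_01_oldlies.html"
print(item_uri(ARCHIVE, "3748433"))
```

The same posting keeps this identifier no matter which page happens to render
it, which is what lets the metadata outlive any particular view.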

Below is a rough draft of what a marked-up HTML page would look like (the
markup is called swml, 'semantic web markup language', for want of a better
description):

<html>
<head>
<title>The Liar : !</title>
<!-- Because we are using the weblog method of defining permanent links to
items as they appear in an archive, the metadata can be extracted from
individual nuggets of information on the page. The page is merely a
temporary view of the data. This allows for permanent extraction of
information from dynamic websites and is analogous to a view of records from
a database and the like. -->
<!-- Because we are using swml strict, the following stylesheet reference can
define layout which will not be extracted as metadata. -->
<link rel="stylesheet" type="text/css" href="/stylesheets/pretty_things.css">
<!-- The mention of swml strict tells parsers to extract only metadata
explicitly defined using swml class attributes. -->
<META NAME="Keywords" CONTENT="weblog stuff, swml, strict">
<META NAME="Description" CONTENT="swml strict">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<span class="swml:rdf:item">
<!-- The following line is a blank link so that references to this item
point to its top (this is common weblog usage and has nothing to do with
swml per se). -->
<a name="3748433"></a>
<span class="swml:blog:date_posted">22.5.01...</span>
<blockquote>
<span class="swml:rss1:description">
<!-- The following swml class is a link plus content. After extraction, the
contents of the link are wrapped in a tag using the name of the class plus
_text, and the link itself is given the name of the class plus _link. -->
<a class="swml:blog:headline"
href="http://c.moreover.com/click/here.pl?x19439590">Fears of a clown</a>
<br>Things look bleak for <span class="swml:name">Bob Manion</span>,
aka "Flasher the Clown". For the past two decades, Bob has
<br>graced public festivals near his home in Clayton, California.
<br>WHAT? :!
</span><br>
<!-- We can use the same class name more than once; the order in which the
elements are grouped determines what metadata belongs to what link. -->
<a class="swml:blog:headline"
href="http://c.moreover.com/click/here.pl?x19439590">
Things still look bleak.</a><br>
<!-- Since we are using swml class notation on this page, the following item
is ignored as metadata (although it is still part of the parent metadata
class, item), and items like it can be rendered using CSS stylesheets
without affecting metadata extraction. -->
<span class="presentation_item">Pretty thing.</span><br>
<span class="swml:blog:time_posted">12:10 PM</span>
<!-- The following line is commented out for non-displayed inline metadata,
but the swml metadata can still be extracted. -->
<!-- <span class="swml:blog:author">David Galbraith</span> -->
<!-- The following line is what determines the rdf:about attribute for the
item; it also specifies the actual link that points to the archived
version. -->
<a class="rdf:about-3748433"
href="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3748433">:.</a>
</blockquote>
<p>
</span>
<span class="swml:rdf:item">
<a name="3554025"></a>
<span class="swml:blog:date_posted">8.5.01...</span>
<blockquote>
<span class="swml:rss1:description"><a class="swml:blog:headline"
href="http://c.moreover.com/click/here.pl?x18749952">Ugly males are better
partners</a>The less attractive
make much better fathers because they don't go around chasing attractive
females.<br>
:!</span><br>
<span class="swml:blog:time_posted">2:28 PM</span>
<!-- <span class="swml:blog:author">David Galbraith</span> -->
<a class="rdf:about-3554025"
href="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3554025">:.</a>
</blockquote>
<p>
</span>
</body>
</html>
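
To make the strict extraction rule concrete, here is a rough sketch of a
parser pass over such a page, using Python's standard html.parser. It is only
an illustration under stated assumptions: the parts of a class name are taken
to be separated by a colon (the separator character did not survive in the
archived copy of this mail), the names SwmlExtractor and extract are made up,
and metadata hidden inside HTML comments is not handled. Elements without an
swml class are skipped, and a classed link yields a _link and a _text entry
as the markup comments describe.

```python
from html.parser import HTMLParser

SEP = ":"                    # assumed separator between class parts
ITEM = "rdf" + SEP + "item"  # class suffix that opens a new item
TRACKED = {"span", "a"}      # only these elements carry swml classes here


class SwmlExtractor(HTMLParser):
    """Hypothetical strict-mode extractor: only swml-classed elements count."""

    def __init__(self):
        super().__init__()
        self.items = []   # one dict of property -> list of values per item
        self._open = []   # key being collected (or None) per open span/a

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED:
            return
        attrs = dict(attrs)
        parts = (attrs.get("class") or "").split(SEP)
        key = None
        if parts[0] == "swml" and len(parts) > 1:
            name = SEP.join(parts[1:])
            if name == ITEM:
                self.items.append({})
            elif self.items:
                item = self.items[-1]
                if tag == "a":
                    # Link plus content: store the href under name_link
                    # and collect the link text under name_text.
                    href = (attrs.get("href") or "").strip()
                    item.setdefault(name + "_link", []).append(href)
                    key = name + "_text"
                else:
                    key = name
                item.setdefault(key, []).append("")
        self._open.append(key)

    def handle_endtag(self, tag):
        if tag in TRACKED and self._open:
            self._open.pop()

    def handle_data(self, data):
        if not self.items:
            return
        # Text belongs to every swml property that is currently open.
        for key in self._open:
            values = self.items[-1].get(key) if key else None
            if values:
                values[-1] += data


def extract(html):
    """Return one dict per item, with whitespace in values collapsed."""
    parser = SwmlExtractor()
    parser.feed(html)
    parser.close()
    return [
        {key: [" ".join(v.split()) for v in values]
         for key, values in item.items()}
        for item in parser.items
    ]


SAMPLE = """
<span class="swml:rdf:item">
<span class="swml:blog:date_posted">22.5.01...</span>
<span class="swml:rss1:description">
<a class="swml:blog:headline"
href="http://c.moreover.com/click/here.pl?x19439590">Fears of a clown</a>
<br>Things look bleak for <span class="swml:name">Bob Manion</span>.
</span>
<span class="presentation_item">Pretty thing.</span>
</span>
"""

for item in extract(SAMPLE):
    for key, values in item.items():
        print(key, values)
```

Because text is credited to every open swml element, the description picks up
the headline and the name as well, which matches the way the description is
meant to contain the whole commentary.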


The following shows the RSS 1.0 output produced by parsing the above SWML.
Since RSS is chosen as the default namespace, any span class attribute
without an explicitly declared namespace is presumed to belong to the
namespace of the archive URI, which uses the alias "my:" by default.
For rendering in other vocabularies, it may be desirable to set the
namespace aliased by "my:" as the default.

<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:my="http://www.theliar.com/oldlies/2001_05_01_oldlies.html">
<channel rdf:about="http://www.theliar.com/oldlies/2001_05_01_oldlies.html">
<title>Permanent RDF-Ready Archive</title>
<link>http://www.theliar.com/oldlies/2001_05_01_oldlies.html</link>
<description>Weblog archive</description>
<sy:updatePeriod>daily</sy:updatePeriod>
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3748433" />
<rdf:li rdf:resource="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3554025" />
</rdf:Seq>
</items>
</channel>
<item rdf:about="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3748433">
<link>http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3748433</link>
<description>Fears of a clown. Things look bleak for Bob Manion, aka
"Flasher the Clown". For the past two decades, Bob has graced public
festivals near his home in Clayton, California. WHAT? :!.</description>
<my:name>Bob Manion</my:name>
<blog:headline>
<blog:headline_link>http://c.moreover.com/click/here.pl?x19439590</blog:headline_link>
<blog:headline_text>Fears of a clown.</blog:headline_text>
</blog:headline>
<blog:headline>
<blog:headline_link>http://c.moreover.com/click/here.pl?x19439590</blog:headline_link>
<blog:headline_text>Things still look bleak.</blog:headline_text>
</blog:headline>
<blog:date_posted>22.5.01...</blog:date_posted>
<blog:time_posted>12:10 PM</blog:time_posted>
<blog:author>David Galbraith</blog:author>
</item>
<item rdf:about="http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3554025">
<link>http://www.theliar.com/oldlies/2001_05_01_oldlies.html#3554025</link>
<description>Ugly males are better partners. The less attractive make much
better fathers because they don't go around chasing attractive
females.</description>
<blog:headline>
<blog:headline_link>http://c.moreover.com/click/here.pl?x18749952</blog:headline_link>
<blog:headline_text>Ugly males are better partners.</blog:headline_text>
</blog:headline>
<blog:date_posted>8.5.01...</blog:date_posted>
<blog:time_posted>2:28 PM</blog:time_posted>
<blog:author>David Galbraith</blog:author>
</item>
</rdf:RDF>
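
The prefixing rule stated before the RSS output (class names with no declared
namespace fall under the "my" alias, while the default vocabulary is written
without a prefix) could be sketched roughly as follows. The function name
qualified_name, the "rss1" label for the default vocabulary, and the colon
separator are assumptions made for illustration. Special cases such as the
item wrapper and the rdf about link are left out.

```python
SEP = ":"                # assumed separator between class parts
DEFAULT_VOCAB = "rss1"   # assumed label for the default (RSS 1.0) vocabulary
FALLBACK_ALIAS = "my"    # alias of the archive URI namespace


def qualified_name(class_value):
    """Map an swml class value to an output element name, or None."""
    parts = class_value.split(SEP)
    if parts[0] != "swml" or len(parts) < 2:
        return None  # not swml markup, so strict mode ignores it
    if len(parts) == 2:
        # No vocabulary given: fall back to the archive namespace alias.
        return f"{FALLBACK_ALIAS}:{parts[1]}"
    vocab, local = parts[1], SEP.join(parts[2:])
    if vocab == DEFAULT_VOCAB:
        return local  # default namespace, written without a prefix
    return f"{vocab}:{local}"


for cls in ("swml:name", "swml:blog:headline",
            "swml:rss1:description", "presentation_item"):
    print(cls, "->", qualified_name(cls))
```

Switching the default to another vocabulary would then only mean changing
DEFAULT_VOCAB, which is the kind of re-rendering the note about other
vocabularies has in mind.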

...................................
David Galbraith - Chief Architect, founder
Moreover Technologies, Inc.
http://www.moreover.com
mailto:david@moreover.com
415-577-8828 (US)
0777-565-8880 (UK)
...................................

>
> Message: 1
>    Date: Mon, 3 Sep 2001 10:12:24 -0700
>    From: Mark Nottingham <mnot@mnot.net>
> Subject: Re: RSSify your web page
>
>
> Before this approach explodes too much, it seemed like the RSSify
> engine was *only* basing items on a <span class="rss:item"> tag, and
> using heuristics to discover the rest.
>
> If we're expecting a publisher to mark up their page, why not give
> them better control over what becomes the title, the link, etc.,
> rather than playing guesswork?
>
> I thought that Aaron's original engine, and the W3C version [1] did
> this...
>
>
> [1] http://www.w3.org/2000/08/w3c-synd/