
Re: [syndication] site-wide metadata discovery



You'd have to provide a way to indicate the format type in there, and then
convince whatever 'powers that be' are in charge of robots.txt to go along.
It'd also put a burden on RSS aggregators, as most do not (and need not) pay
any attention to the robots.txt file.  Here you'd be asking them to break out
of using an XML parser and use a text parser as well.  Not impossible, but
not automagic either.
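
The parsing itself wouldn't be much code, though.  A rough Python sketch,
assuming the Site-Index/Public-Feeds lines proposed below (those names come
from the proposal, not from any standard):

import urllib.request

def read_feed_headers(host):
    """Fetch robots.txt and pull out the proposed Site-Index /
    Public-Feeds lines, ignoring everything else."""
    found = {}
    with urllib.request.urlopen("http://%s/robots.txt" % host) as f:
        for line in f.read().decode("utf-8", "replace").splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments
            if ":" not in line:
                continue
            key, value = (p.strip() for p in line.split(":", 1))
            if key.lower() in ("site-index", "public-feeds"):
                found[key.lower()] = value
    return found

# read_feed_headers("example.com") might return
#  {'site-index': '', 'public-feeds': 'myPublicFeeds.opml'}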

I'd venture that using an XPath string here would be a /very/ interesting
way to augment this in much more sophisticated ways.  But I'm not sure the
world's ready for that yet.
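
Purely as an illustration (the Feeds-XPath name is invented here, not
proposed anywhere), a record might carry the expression a client should
evaluate against the discovered file:

Site-Index:
Public-Feeds: myPublicFeeds.opml
Feeds-XPath: //outline[@type='rss']/@xmlUrl

A client with an XML toolkit handy (lxml, in this Python sketch) could then
apply it directly:

from lxml import etree

doc = etree.parse("myPublicFeeds.opml")   # the discovered feeds file
feed_urls = doc.xpath("//outline[@type='rss']/@xmlUrl")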

-Bill Kearney
Syndic8.com

----- Original Message -----
From: "Chad Everett" <yahoogroups@jayseae.cxliv.org>
To: <syndication@yahoogroups.com>
Sent: Wednesday, October 15, 2003 5:13 PM
Subject: [syndication] site-wide metadata discovery


> There have been some nice comments on discoverability in general being
> broken, but they don't really help solve the problem.  Here's an
> off-the-wall idea.  What about adding functionality to a file that's
> already present, namely the robots.txt file?  Since it's already tolerated
> in many cases, let's make it useful.
>
> In addition to the usual user-agent/disallow recordsets, it could carry
> something like:
>
> Site-Index:
> Public-Feeds: myPublicFeeds.opml
>
> According to the standard, unrecognized headers should be ignored, so this
> shouldn't affect any "normal" robot/spider/crawler.  But when an app came
> along that did recognize this recordset, it could get the data it needs.  No
> new file name clutter, no link clutter.  You could still use those if you
> want, of course.  :)
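>
> For what it's worth, Python's bundled robots.txt parser seems to bear
> this out - it simply skips lines it doesn't recognize (a quick sketch,
> not a survey of every crawler out there):
>
> from urllib.robotparser import RobotFileParser
>
> rp = RobotFileParser()
> rp.parse([
>     "User-agent: *",
>     "Disallow: /private/",
>     "Site-Index:",                       # unknown line, ignored
>     "Public-Feeds: myPublicFeeds.opml",  # likewise ignored
> ])
> print(rp.can_fetch("*", "http://example.com/"))          # True
> print(rp.can_fetch("*", "http://example.com/private/"))  # False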
>
> Of course, this doesn't help much if you're talking about folder-level
> data, since robots.txt exists only at the root of the domain.  But at the
> very least, an app could read the root to determine the file name, then
> look for that file in the current folder.
>
> For instance, if browsing example.com/folder, your browsing application of
> choice reads example.com/robots.txt and finds that the public feeds are
> stored in myPublicFeeds.opml, so it looks in
> example.com/folder/myPublicFeeds.opml for the data.  If you want to get data
> below or above the current location, apply the same logic - traverse the
> folder structure and get the named file.
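>
> In code, that lookup might go something like this (a rough sketch; the
> walk-up traversal is just the logic described above, and the file name
> would really come out of robots.txt):
>
> import urllib.error
> import urllib.request
>
> def find_public_feeds(host, folder, feeds_name="myPublicFeeds.opml"):
>     """Try the named feeds file in the given folder, walking up
>     toward the root until one turns up (or nothing does)."""
>     folder = folder.rstrip("/")
>     while True:
>         candidate = "http://%s%s/%s" % (host, folder, feeds_name)
>         try:
>             return urllib.request.urlopen(candidate)  # found one
>         except urllib.error.HTTPError:
>             if not folder:
>                 return None         # reached the root, no luck
>             folder = folder.rsplit("/", 1)[0]  # up one level
>
> # find_public_feeds("example.com", "/folder") tries
> #   http://example.com/folder/myPublicFeeds.opml, then
> #   http://example.com/myPublicFeeds.opml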
>
> This might be preferred in some cases, where file names should be
> standardized across the domain.  In other situations, an alternative form
> could allow for differences between folders:
>
> Site-Index: folder
> Public-Feeds: myOtherFeeds.opml
>
> Site-Index: another
> Public-Feeds: evenMoreFeeds.opml
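>
> Parsing those grouped records into a simple map would be easy enough;
> roughly (same hypothetical header names as above):
>
> def parse_site_index(lines):
>     """Turn Site-Index / Public-Feeds records into a dict mapping
>     folder -> feeds file ('' meaning the whole site)."""
>     feeds_by_folder, current = {}, ""
>     for line in lines:
>         line = line.split("#", 1)[0].strip()
>         if ":" not in line:
>             continue
>         key, value = (p.strip() for p in line.split(":", 1))
>         if key.lower() == "site-index":
>             current = value          # a new record begins
>         elif key.lower() == "public-feeds":
>             feeds_by_folder[current] = value
>     return feeds_by_folder
>
> # parse_site_index(["Site-Index: folder",
> #                   "Public-Feeds: myOtherFeeds.opml",
> #                   "Site-Index: another",
> #                   "Public-Feeds: evenMoreFeeds.opml"])
> #  -> {'folder': 'myOtherFeeds.opml', 'another': 'evenMoreFeeds.opml'}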
>
> Or even add include functionality to the file:
>
> Site-Index: include folder/robots.txt
>
> And let each subdivision of the domain create its own file, which gets
> read into the "master" as the data is parsed.  Naturally this could create
> a whole bunch of crawling to get all the data, so this last idea might not
> be the best - but it could be there for those who want the functionality
> at the cost of the bandwidth/resources required.  What's more, the include
> allows for different file names in different folders.  Only the top-level
> robots.txt is "standardized", and that file is already there in most cases.
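>
> Resolving the include might look like this (hypothetical again; the
> visited set is there so a circular include can't send a client around
> forever):
>
> import urllib.request
>
> def load_index(host, path="/robots.txt", seen=None):
>     """Fetch a robots.txt-style file, splice in anything named by a
>     'Site-Index: include <path>' line, and return the combined
>     lines."""
>     seen = seen if seen is not None else set()
>     if path in seen:
>         return []                    # already pulled this one in
>     seen.add(path)
>     with urllib.request.urlopen("http://%s%s" % (host, path)) as f:
>         lines = f.read().decode("utf-8", "replace").splitlines()
>     combined = []
>     for line in lines:
>         value = line.split(":", 1)[-1].strip()
>         if (line.lower().startswith("site-index:")
>                 and value.startswith("include ")):
>             included = "/" + value[len("include "):]
>             combined += load_index(host, included, seen)
>         else:
>             combined.append(line)
>     return combined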
>
> If these are all really bad ideas, I blame Mark's medicine.  :)