
RE: [syndication] robots.txt links



I updated my robots.txt file, since I don't get many hits anyway, and
that let me test it with the validator.  It returns an error saying the
values are invalid.  However...

This thread [1] seems to indicate that an empty file will return an error
as well, but it's been mentioned [2] that creating an empty robots.txt
file is an acceptable solution.  So perhaps the validator isn't really
validating the file itself, but the directives within it, and since it
doesn't know how to handle them, it reports that they won't work as
expected?

On the other hand, according to the W3C's notes in the HTML 4.0 document
[3], at least one "Disallow" line must exist in each record, meaning that
an empty file isn't valid.  That suggests the "empty file" approach may
not be safe, and that the validator may be correct here after all, even
if it doesn't seem so at first.  Whew.
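
For reference, the minimal record that rule seems to require would look
like this; an empty Disallow value means nothing is disallowed, so this
is the spec-legal way of saying "everyone may crawl everything":

  User-agent: *
  Disallow: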

I also found another parser [4] that validates the file successfully.
However, it has notes on the "allow" directive, so it's possible that it
doesn't adhere to the "standard".
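
For anyone who hasn't seen it, an "allow" line looks something like the
following.  It isn't part of the original 1994 exclusion standard, so
whether a given crawler honors it, ignores it, or trips over it seems to
vary:

  User-agent: *
  Allow: /articles/index.html
  Disallow: /articles/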

Here's a thread [5] from 1999 about search engines stopping if they
encounter an invalid file.  There's no mention of what counts as
"invalid", though, or whether this has changed since.  It also mentions
that some search engines may not use the robots.txt file at all, or
perhaps not on every visit.

I found two proposals [6, 7] for extending the standard.  The first is
fairly in-depth, including the use of regular expressions in the file,
and it also appears to have been updated more recently (2002) than the
other documents.  The second mentions that search engines may want to
place a size limit on the robots.txt file, but doesn't offer much in the
way of actual extensions.
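
Purely as an illustration (I'm not claiming this is the actual syntax
from either proposal), a pattern-based rule might look something like
this; a strictly standard parser would presumably treat the pattern as a
literal path and match nothing:

  User-agent: *
  Disallow: /cgi-bin/*.cgi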

It seems that the available documents and resources don't really agree on
much of anything, at least not well enough to put any of it into practice
without some level of risk.  Is there much chance of getting a response
from those running the spiders, to find out not what might happen, but
what will happen?
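
Short of asking them directly, one way to pin down a concrete answer for
at least one implementation is to probe a parser you can run locally.
Here's a minimal sketch using Python's standard-library parser
(urllib.robotparser; plain robotparser on older versions).  Note that it
applies rules in file order rather than by longest match, which is
exactly the kind of disagreement between implementations described
above:

  from urllib import robotparser  # plain "import robotparser" on Python 2

  rp = robotparser.RobotFileParser()
  # Feed the parser a candidate robots.txt body directly instead of
  # fetching it over HTTP, so the test is repeatable offline.
  rp.parse([
      "User-agent: *",
      "Allow: /articles/index.html",  # nonstandard directive, see above
      "Disallow: /articles/",
  ])
  # This parser checks rules in file order, so the Allow line wins for
  # the index page but nothing else under /articles/ gets through.
  print(rp.can_fetch("*", "http://example.com/articles/index.html"))  # True
  print(rp.can_fetch("*", "http://example.com/articles/other.html"))  # False
  print(rp.can_fetch("*", "http://example.com/"))                     # True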

[1] http://www.webmasterworld.com/forum48/389.htm
[2] http://www.robotstxt.org/wc/exclusion-admin.html
[3] http://www.w3.org/TR/html40/appendix/notes.html#h-B.4.1.1
[4] http://www.webmasterworld.com/forum27/239.htm
[5] http://www.ukoln.ac.uk/web-focus/webwatch/services/robots-txt/
[6] http://www.conman.org/people/spc/robots2.html
[7] http://www.kollar.com/robots.html

Chad.

-----Original Message-----
From: Bill Kearney [mailto:wkearney@syndic8.com] 
Sent: Thursday, October 16, 2003 9:01 AM
To: syndication@yahoogroups.com
Subject: [syndication] robots.txt links


http://www.robotstxt.org/wc/robots.html

And don't forget, the archive can help if old pages have gone offline, or to
find past revisions:
http://web.archive.org/web/*/http://info.webcrawler.com/mak/projects/robots/
http://web.archive.org/web/20010711193857/www.robotstxt.org/wc/

It would be /very/ interesting to see how various tools handled
unexpected data showing up in the robots.txt file.  Let's get a grip on
the "but we'd break legacy hacks" risk before this goes too far.

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

-Bill Kearney