[extropy-chat] Re: Identity Erasure? or faulty platforms?

Alan Eliasen eliasen at mindspring.com
Mon Nov 15 02:21:02 UTC 2004


Amara Graps wrote:
>>On 13 Nov 2004, at 16:08, Amara Graps wrote:
>>
>>>True, they should. I am sufficiently annoyed with companies that don't
>>>adhere to netiquette's courtesies (i.e. respecting the wishes of the
>>>web site's robots.txt file), that I usually block their future access
>>>to my web site (using .htaccess) if I think that they've behaved badly.
>>
>>Amara:
>>
>>How often does this happen?
> 
> I don't know the true frequency because I check my logs randomly,
> and in the cases when a person downloaded my web site (~1000 files!),
> I don't always check whether there was a request for the robots.txt
> file first. Both situations will cause me to bar future entry to my web
> site via .htaccess (the reason for the second is that I think it's rude;
> the reason for the first is that I think that they are spammers, and
> spammers have caused me no end of grief these last years).
> 
> So my best guess for how often a robot ignores my robots.txt file is a
> couple of times per month.

   Thankfully, I don't see anything like this.  I wonder if there's a problem
in the syntax of your robots.txt file?  That's usually the root of the
problem.  You can check your file in several places, including:

http://tool.motoricerca.info/robots-checker.phtml
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
http://www.sxw.org.uk/computing/robots/check.html

   From a quick look, your robots.txt file has several errors, including
the use of wildcards in Disallow fields, and Disallow paths that do not begin
with a slash.
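
   As a rough illustration of those two checks, here is a minimal Python
sketch.  The function name lint_robots_txt and the sample file are
hypothetical (the sample is not the actual file under discussion), and the
sketch assumes the original robots exclusion convention, in which "*" is only
meaningful in the User-agent field.  Some crawlers accept wildcards in
Disallow as an extension, so treat a hit as a warning rather than a verdict.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for suspicious Disallow lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() != "disallow":
            continue
        value = value.strip()
        if not value:
            continue  # an empty Disallow means "nothing is disallowed"
        if "*" in value:
            problems.append((number, "wildcard in Disallow value"))
        if not value.startswith("/"):
            problems.append((number, "Disallow value does not begin with a slash"))
    return problems


# Hypothetical sample file, made up for this example.
sample = """User-agent: *
Disallow: /private/
Disallow: *.jpg
Disallow: docs/
"""

for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
```

   On the sample above, line 3 is flagged for both problems and line 4 for
the missing leading slash.  The online checkers listed earlier do a far more
thorough job; this only shows the idea.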

   URLs that are not sufficiently canonicalized are another problem (that is,
the same file is referred to in different ways).
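
   To make "canonicalized" concrete, here is a small Python sketch that
reduces a few common spelling variants of a URL to one form.  The function
name canonicalize and the example.com URLs are hypothetical, and the set of
rules (lowercase scheme and host, drop the default port, collapse "./", "../"
and doubled slashes, drop the fragment) is my assumption about what matters
here, not a complete normalizer.  It deliberately leaves percent-escapes and
the query string alone.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Return a simplified canonical form of an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops any user:password

    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"

    # Force exactly one leading slash, then resolve "." and ".." segments
    # and doubled slashes.  normpath strips a trailing slash, so restore it.
    path = "/" + parts.path.lstrip("/")
    keep_slash = path.endswith("/")
    path = posixpath.normpath(path)
    if keep_slash and path != "/":
        path += "/"

    return urlunsplit((scheme, host, path, parts.query, ""))


variants = [
    "HTTP://Example.COM:80/docs/./a/../index.html",
    "http://example.com//docs//index.html",
    "http://example.com/docs/index.html#top",
]
print({canonicalize(u) for u in variants})
```

   All three variants collapse to http://example.com/docs/index.html, so a
rule written against that one path covers every spelling.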

   I have several hundred thousand files accessible from my web server (many
of them documentation for other people's software), blocked by a robots.txt
file.  Over the past 2 years, since I got my robots.txt file right, I've not
had a single robot try to index all of this content.  Robots that try to index
gigabytes of documentation that's available in a lot of places don't tend to
crawl too far.
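
   For reference, this is roughly what a well-behaved robot does before it
fetches anything, sketched with Python's standard urllib.robotparser module.
The rules, the robot name "ExampleBot" and the example.com URLs are all
hypothetical and are not taken from my actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every robot out of the /docs/ tree.
rules = """User-agent: *
Disallow: /docs/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse from memory instead of fetching over the network

blocked = parser.can_fetch("ExampleBot", "http://example.com/docs/manual.html")
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
print(blocked, allowed)
```

   Here the first check comes back False and the second True, because the
Disallow value is a plain path prefix that begins with a slash.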

-- 
  Alan Eliasen                 | "You cannot reason a person out of a
  eliasen at mindspring.com       |  position he did not reason himself
  http://futureboy.homeip.net/ |  into in the first place."
                               |     --Jonathan Swift


