I was thinking this morning of an application that, amongst other things, would have to visit and parse web pages from links submitted by (or indirectly referenced by) the general public. Yes, I know, not a terribly original idea. This led me to wonder if I'd annoy anyone in the process of sending the program out to visit their sites.
There's this nifty thing called the Robot Exclusion Protocol that allows site administrators and authors to tell web robots not to visit their pages, or if they do visit, not to index them. What the standards seem to be missing is a clear definition of a robot. They were written very much with search-engine spiders in mind, and the existence of automated agents that perform some other function seems to have been forgotten. Reading the various documents, you know for sure that GoogleBot is a robot, and you know for sure that you, sitting in front of Safari and clicking links, are not. Everything in between seems to be a grey area.
Or maybe I'm just thinking too hard, and making a problem where none exists. Is the general consensus that "Robot" as defined by the exclusion protocol means "spider", and not "Automated agent that doesn't spider"?
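For anyone who does decide their agent counts as a robot, here is a minimal sketch of the check itself, using Python's standard `urllib.robotparser`. The robots.txt content, the `LinkTitleBot` user-agent name and the `may_fetch` helper are all made up for illustration; a real agent would fetch the file from the site rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, embedded so the example runs offline.
# A real agent would download it from the site's /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /search

User-agent: LinkTitleBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("SomeOtherBot", "http://example.com/index.html"),
        ("SomeOtherBot", "http://example.com/cgi-bin/report"),
        ("LinkTitleBot", "http://example.com/index.html"),
    ]:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

Note that this only answers "what does the file say about this user-agent?". Whether a one-off title fetcher ought to ask the question at all is exactly the grey area above.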
Having written the odd crawlbot and the like, I can say there are some trade-offs. Firstly, back in the day people used the exclusion protocol to keep spiders from winding up in spider traps. They also used it to keep spiders away from computationally expensive parts of the site, where an over-eager bot pummelling them would amount to a denial-of-service attack.
But, as things have grown, it has become fairly common for people to hide vast chunks of their sites behind the exclusion protocol. Just have a bo-peek at the chain of robots.txt files on eBay, for instance. The various auction comparison sites that hit the large number of auction sites out there fairly obviously ignore the REP, as do all the new search engines that claim to search the deep web.
So my personal opinion on the matter has basically come down to the old adage: don't do too much damage. Follow the spec and cap it at two connections to any site. Cap the number of requests per time period at something vaguely human-like. Also be aware that spider traps mean you really need to evaluate each page as you get it, as opposed to doing it in some post-processing step.
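A rough sketch of those three rules in Python, in case it helps. Everything named here (`PoliteGate`, `looks_like_trap`, and the numeric limits other than the two-connection cap) is an assumption made for illustration, not a standard; tune the numbers to taste.

```python
import threading
import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_CONNECTIONS_PER_HOST = 2   # the "two connections to any site" cap
MIN_SECONDS_BETWEEN = 1.0      # assumed stand-in for "vaguely human-like"
MAX_PATH_SEGMENTS = 12         # assumed trap heuristic limit
MAX_PAGES_PER_HOST = 500       # assumed per-host page budget


class PoliteGate:
    """Limits concurrent connections and request rate per host."""

    def __init__(self, min_interval=MIN_SECONDS_BETWEEN,
                 max_connections=MAX_CONNECTIONS_PER_HOST,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._slots = defaultdict(
            lambda: threading.BoundedSemaphore(max_connections))
        self._next_allowed = defaultdict(float)

    def fetch(self, url, fetch_fn):
        """Call fetch_fn(url) once the host's slot and time window allow."""
        host = urlparse(url).netloc
        with self._lock:
            slots = self._slots[host]
        with slots:  # at most max_connections in flight per host
            with self._lock:
                now = self._clock()
                start = max(now, self._next_allowed[host])
                self._next_allowed[host] = start + self._min_interval
            if start > now:
                self._sleep(start - now)
            return fetch_fn(url)


def looks_like_trap(url, pages_seen_on_host):
    """Cheap check to run on each URL as it is discovered, not afterwards."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > MAX_PATH_SEGMENTS:
        return True
    # Paths that keep repeating a segment (/cal/2004/cal/2004/...) are a
    # common sign of an endlessly self-linking page.
    if any(segments.count(s) >= 3 for s in segments):
        return True
    return pages_seen_on_host >= MAX_PAGES_PER_HOST
```

The point of running `looks_like_trap` inline is that a trap generates new URLs as fast as you fetch them, so a check that only runs in post-processing never gets the chance to stop the crawl.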
hth...
"Makes repeated, automated requests" seems a good determinant of robotness. For what it's worth, RSS scrapers generally should honor robots.txt. I couldn't see doing that when fetching HTML documents' titles for link text; that's not a repeated (or, arguably, automated) request.
What about a pre-fetching proxy that loads the links on the page you're currently reading into a cache, and then serves up the cached pages when you click on one of the links? I don't know whether you'd consider it either automated or making repeated requests... (I suppose for functionality's sake it shouldn't honour robots.txt, but that same argument could be made for many other programs, most of which would definitely be considered robots.)