I was thinking this morning of an application that, amongst other things, would have to visit and parse web pages from links submitted by (or indirectly referenced by) the general public. Yes, I know, not a terribly original idea. This led me to wonder whether I'd annoy anyone by sending the program out to visit their sites.
There's this nifty thing called the Robot Exclusion Protocol that allows site administrators and authors to tell web robots not to visit their pages, or, if they do visit, not to index them. What the standards seem to be missing is a clear definition of a robot. They were written very much with search-engine spiders in mind, and the existence of automated agents that perform some other function seems to have been forgotten. Reading the various documents, you know for sure that GoogleBot is a robot, and you know for sure that you, sitting in front of Safari and clicking links, are definitely not. Everything in between seems to be a grey area.
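
For what it's worth, actually honouring the protocol from code is easy enough. Here's a minimal sketch (in Python, using the standard library's urllib.robotparser; the "LinkVisitor" user-agent name and the example URL are just placeholders I've made up) of checking a site's robots.txt before fetching a submitted link:

    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed_to_fetch(url, user_agent="LinkVisitor"):
        """Return True if the site's robots.txt permits user_agent to fetch url."""
        parts = urlparse(url)
        rp = robotparser.RobotFileParser()
        # Point the parser at the robots.txt for this host and download it.
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        # A missing robots.txt is treated as "everything allowed".
        return rp.can_fetch(user_agent, url)

    if __name__ == "__main__":
        print(allowed_to_fetch("http://example.com/some/page.html"))

Of course, that only answers the mechanical question, not the definitional one: should an agent like this be consulting robots.txt at all, when it isn't spidering anything?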
Or maybe I'm just thinking too hard, and making a problem where none exists. Is the general consensus that "Robot" as defined by the exclusion protocol means "spider", and not "Automated agent that doesn't spider"?