As part of my early morning routine at work, I scan a daily report on our web site's visitor patterns, looking specifically for abusive or abnormal page requests in order to identify rogue hosts. I wrote the program years ago, and it has become one of my most effective tools for identifying abuse.
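The kind of scan described above can be sketched in a few lines: tally requests per remote host from a Combined Log Format access log and flag the heavy hitters for review. This is a hypothetical reconstruction, not the author's actual program; the threshold and field layout are assumptions.

```python
import re
from collections import Counter

# Illustrative sketch of a daily abuse scan (not the original program):
# parse Combined Log Format lines and count requests per remote host.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def scan(lines, threshold=100):
    """Return (host, count) pairs, busiest first, for hosts at or above
    the request threshold -- candidates for a closer look."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:  # silently skip malformed lines
            hits[m.group("host")] += 1
    return [(h, n) for h, n in hits.most_common() if n >= threshold]
```

Run daily over the previous day's log, a report like this surfaces both abusive hosts and the crawler volumes discussed below.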
It's no surprise that search engine robots (spiders, or crawlers) always top the list. Spiders are a necessary evil: you want them crawling and indexing all your pages frequently, but that frequency comes with voracious appetites that can rob your infrastructure of performance.
Speaking of voracious appetites: for years Googlebot topped our crawl-volume list, far ahead of the second position, which was usually (but not always) occupied by Yahoo Slurp, Yahoo's spider. A few weeks ago I started noticing Yahoo Slurp's crawl volume surpassing Googlebot's. For the first few days I considered it an anomaly, but the levels have remained consistent, knocking Googlebot out of first place. Googlebot still registers more or less the same numbers, but Yahoo Slurp has been consistently beating them, sometimes by a factor of two. That's a considerable spike in Yahoo's crawl activity.
I find it interesting that Yahoo's crawl volume started to rise in the midst of its takeover battle with Microsoft. Search engines have always competed on who has the most indexed pages. It's a bragging right more than a practical or useful measure of search accuracy. But accuracy in search results is subjective, while page volumes are concrete numbers, which is why they are sometimes used as key differentiators between competing search engines. Still, the timing of the jump in Yahoo's crawling activity is intriguing. Could the motive be to enlarge its index (considered a key asset), thereby adding to the company's value? One wonders.
Meanwhile, sites need to consider the repercussions of this battle between the search engine titans, as increased crawl rates can put additional pressure on their infrastructures. One way to mitigate the effect is to set up mirror servers and deflect the spider/robot traffic to them using application-aware appliances, but there's some effort involved and room for error, as the servers need to be synchronized in real time, especially for sites, such as news sites, that continuously publish new pages and fresh content. The other consideration is bandwidth management to accommodate the increased traffic. At the least, faster servers and fatter pipes are in order. As a reference, here are the current signatures left behind by the Google and Yahoo crawlers:
Yahoo Slurp's user-agent:
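Once crawler signatures like these are in hand, the deflection approach described above can be sketched as a simple User-Agent classifier. In practice this logic would live in a load balancer or application-aware appliance; the pool names and substring patterns below are illustrative assumptions, not a definitive implementation.

```python
# Hypothetical sketch of user-agent-based deflection: route known
# crawlers to a mirror pool so regular visitors stay on the primary
# servers. Patterns and pool names are illustrative assumptions.
CRAWLER_PATTERNS = ("googlebot", "yahoo! slurp", "msnbot")

def choose_backend(user_agent, primary="www-pool", mirror="mirror-pool"):
    """Return the backend pool a request should be sent to, based on a
    case-insensitive substring match against known crawler signatures."""
    ua = user_agent.lower()
    if any(pattern in ua for pattern in CRAWLER_PATTERNS):
        return mirror
    return primary
```

The same matching, of course, is what makes real-time mirror synchronization matter: a crawler deflected to a stale mirror indexes stale pages.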