When you use Google to search the Web, Google doesn’t rush out and look for what you need after you ask. It’s already done its research using web crawlers. These are programs that ‘crawl’ the web, following links, and recording the contents of the web pages they discover, as well as how they’re all connected. Crawlers scour the web for changes and new pages all day, every day.
Much like when you visit a web page with your web browser, a crawler sends a request to the server, and the server delivers the page for the crawler’s use. What happens next is different. You read the page, and then decide what to do. You may close it or click a link to go to a new page, whether on the same site or a new one. Crawlers digest page contents far more quickly than people do, so well-behaved ones limit how often they load pages from a site in a given period. Otherwise, they could swamp the server, and while that’s going on, no one else can load any pages.
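To make the self-limiting idea concrete, here is a minimal sketch in Python of a per-site throttle a crawler might call before each request. The class name `HostThrottle`, its parameters, and the five-second interval are my own illustrative assumptions, not anything a particular search engine is known to use.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum gap between requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on the same host
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = {}           # host -> time of the last permitted request

    def wait(self, url):
        """Pause if this host was hit too recently; return the seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._next_allowed.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[host] = now + delay
        return delay


# A crawler would call wait() right before fetching each page.
throttle = HostThrottle(min_interval=5.0)
throttle.wait("https://example.org/index.html")  # first hit on this host: no pause
```

The throttle is keyed by host, so a crawler can keep working on other sites while it waits its turn on a slow one.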
Server administrators have a tool at their disposal called the ‘robots.txt’ file. This is a simple text file that crawlers read before getting to work on a site. Using the file, the server admin can restrict crawlers to only certain parts of the server, or even block them entirely. They can also set different limits for each crawler, and specify a maximum crawl rate, to ensure people can still get through.
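Here is a rough sketch of what such a file can look like and how a well-behaved crawler might consult it, using Python's standard `urllib.robotparser` module. The crawler names `examplebot` and `otherbot`, the `/private/` path, and the ten-second delay are made up for illustration. Note too that `Crawl-delay` is a widely used extension rather than something every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one crawler outright, and ask everyone else
# to stay out of /private/ and to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# The banned crawler may not fetch anything.
print(rules.can_fetch("examplebot", "https://example.org/index.html"))
# Any other crawler may fetch public pages, but not the restricted area.
print(rules.can_fetch("otherbot", "https://example.org/index.html"))
print(rules.can_fetch("otherbot", "https://example.org/private/notes.html"))
# The requested pause between requests, in seconds.
print(rules.crawl_delay("otherbot"))
```

A real crawler would download the file from the site's `/robots.txt` first. Parsing a string here just keeps the example self-contained.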
The weakness of the robots.txt file is that it’s not technically enforced. If you can program a crawler, you can program a crawler to ignore the robots.txt file. Most crawlers are well-behaved, however. Site owners want search engines to bring them traffic, and search engines are nothing without large indexes, so it’s a symbiotic arrangement. The site administrators hold the upper hand because they can take further measures to block a crawler. It’s a little more trouble, but it’s not difficult. In this way, the search engines generally respect the robots.txt file because they’ll be locked out entirely if they don’t behave themselves.
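One of those further measures is to have the server itself refuse requests from a crawler, so that compliance is no longer voluntary. The sketch below does this as a small piece of Python WSGI middleware that checks the User-Agent header. This is only one possible approach, and the function names and blocked tokens are my own assumptions. Admins often block by IP address range at the firewall instead, since a crawler can lie about its User-Agent.

```python
# Assumption: substrings that identify the crawlers we want to turn away.
BLOCKED_TOKENS = ("bingbot", "msnbot")


def block_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so matching crawlers get 403 instead of content."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling is not permitted.\n"]
        # Everyone else is passed through to the real site.
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Stand-in for the real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome.\n"]


guarded_site = block_crawlers(site)
```

Unlike robots.txt, this check runs on the server, so the crawler has no say in whether it is applied.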
With that backgrounder, you’ll understand my amusement at the article, “Microsoft Bots Effectively DDoSing Perl CPAN Testers,” which I read on Slashdot earlier today. CPAN, the Comprehensive Perl Archive Network, is run by a group of Perl users and developers who maintain a number of servers both for testing and for hosting their code.
A post on the CPAN Testers blog describes an attack that brought down some of their servers by overwhelming them with what they describe as a distributed denial of service. Abbreviated DDoS, this attack is simply a large influx of traffic from multiple sources. Although it took down servers and botched things up for them, it appears not to have been intentional. It was a bunch of crawlers harvesting content for Microsoft’s Bing search engine. Amazon or Google servers wouldn’t notice 20 to 30 crawlers doing their work simultaneously, but CPAN’s servers don’t get that kind of traffic, so they don’t maintain the hardware to handle it. Further, the Bing crawlers ignored the robots.txt file that CPAN has in place to limit crawler activity. As I suggested, this is extremely bad form.
So what did the CPAN admin do? He simply blocked Microsoft’s crawlers in a way they cannot ignore. Done. And when admins set these blocks, they typically leave them in place until it’s necessary to remove them, which is usually never.
To his credit, the program manager of Microsoft’s crawler wrote a reply to the blog post:
I am a Program Manager on the Bing team at Microsoft, thanks for bringing this issue to our attention. I have sent an email to firstname.lastname@example.org as we need additional information to be able to track down the problem. If you have not received the email please contact us through the Bing webmaster center at email@example.com.
That credit evaporated when he failed to come out from behind his title, gave only a generic email address, and made no apology. If the program manager were serious, he’d make it as easy as possible for the CPAN admin to contact him directly, acknowledge that the problem is Microsoft’s, and ask for the information they need to fix it. Simply saying they need the information is a demand, and despite their obviously inflated feelings of self-worth, they are in no position to demand anything.
I doubt anything will come of it. The CPAN admin has solved the problem. The crawlers will no longer harass his systems. Why would he lift a finger to help dumb-ass Microsoft fix their problem and make more money? Maybe he would if the program manager expressed believable concern and contrition, but the program manager wrote to cover his own ass. He can now say that he tried working with the CPAN people, despite the gesture being transparently hollow.