
because search bots have feelings too

For reasons passing understanding, in the last couple of weeks I've developed a curiosity for the topology of both content and links in certain groups of webpages.

So today I sat down and wrote an extremely simple bot/parser to get some data. I was done in about an hour, tested a bit, fiddled, and it started to dawn on me just how hard it is to build a good search bot.

We hear (or read) to no end about the algorithms that provide search results, most notably Google's. There's a vast number of articles about Google that can be summarized as follows: "PageRank! PageRank! PageRank is the code that Rules the Known Universe! All bow before PageRank! Booo!" (insert "blah blah" at your leisure instead of spaces).

But what's barely mentioned is how complex the Bots (for Google, Yahoo!, Feedster, etc) must be at this point (I bet the parsers aren't a walk in the park either, but that's another story). You see, the algorithm (PageRank! PageRank! Booo!) counts on data already processed in some form. Analyzing the wonderful mess that is the web ain't easy, but the "messiness" that it has to deal with is inherent to its task.

But the task of a Bot, strictly speaking, is to download pages and store them (or maybe pass them on to the parser, but I assume that no one in their right mind would tie in parsing with crawling--it seems obvious to me that you'd want to do that in parallel and through separate processes, using the DB as common point). And yet, even though the task of the Bot is just to download pages, it has to deal with a huge amount of, um, "externalities."
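
To make the "DB as common point" idea a bit more concrete, here is a minimal sketch, assuming SQLite as the store. The table layout and the function names (open_store, store_page, claim_unparsed) are made up for illustration and are not how any real search engine does it. The crawler only ever writes raw responses; the parser only ever reads rows it hasn't seen yet.

```python
import sqlite3

# Hypothetical schema and helper names, for illustration only.

def open_store(path=":memory:"):
    """Create (if needed) the shared table the crawler writes and the parser reads."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " status INTEGER,"
        " body BLOB,"
        " parsed INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def store_page(db, url, status, body):
    """Crawler side: save the raw response and flag it as not yet parsed."""
    db.execute(
        "INSERT OR REPLACE INTO pages (url, status, body, parsed) VALUES (?, ?, ?, 0)",
        (url, status, body),
    )
    db.commit()

def claim_unparsed(db, limit=10):
    """Parser side: take a batch of unparsed pages and flag them as parsed.

    This assumes a single parser process; several parsers would need a
    proper claim/lock step so two of them don't grab the same rows.
    """
    rows = db.execute(
        "SELECT url, body FROM pages WHERE parsed = 0 ORDER BY url LIMIT ?",
        (limit,),
    ).fetchall()
    db.executemany("UPDATE pages SET parsed = 1 WHERE url = ?", [(u,) for u, _ in rows])
    db.commit()
    return rows
```

With a file path instead of ":memory:", the crawler and the parser could run as separate processes and never talk to each other directly, which is the whole point.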

In other words, the bot is the one that has to deal with the planet (i.e., reality), while the ranking algorithm (PageRank! PageRank! Booo!) sits happily analyzing data, lost in its own little world of abstractions.

Consider: some sites might hold on to the socket and not let go for long periods. Tons of links are invalid, and yet the Bot has to test each one. There are 301s, 404s, 403s, 500s, and the rest of the lot of HTTP return codes. Compressed streams using various algorithms (gzip, zlib...). Authentication of various sorts. Dynamic pages that are a little too dynamic. Encoding issues. Content types that don't match the content. Pages that quite simply return garbage. And on and on.
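
Just two of those items, status codes and compressed streams, already turn into a pile of special cases. The following is a rough sketch using only the Python standard library; the function names and the action labels ("store", "follow", and so on) are my own invention, and the grouping of status codes is one plausible policy, not a standard.

```python
import gzip
import zlib

def classify_status(status):
    """Map an HTTP status code to what the bot should do next (hypothetical labels)."""
    if 200 <= status < 300:
        return "store"
    if status in (301, 302, 303, 307, 308):
        return "follow"   # redirect: fetch the Location header instead
    if status in (401, 403, 404, 410):
        return "drop"     # not for us, or simply gone
    if status == 429 or 500 <= status < 600:
        return "retry"    # server trouble or rate limiting: try again later
    return "report"       # anything else goes to a human

def decode_body(raw, content_encoding):
    """Undo the Content-Encoding of a body; return None if it can't be decoded."""
    encoding = (content_encoding or "identity").strip().lower()
    try:
        if encoding in ("gzip", "x-gzip"):
            return gzip.decompress(raw)
        if encoding == "deflate":
            try:
                return zlib.decompress(raw)  # zlib-wrapped stream
            except zlib.error:
                # Some servers send a raw deflate stream with no zlib header.
                return zlib.decompress(raw, -zlib.MAX_WBITS)
        if encoding == "identity":
            return raw
    except (OSError, EOFError, zlib.error):
        return None  # garbage, or a truncated stream
    return None      # unsupported encoding: report it rather than guess
```

Note that a None result is itself useful information: it is exactly the kind of case that should be logged with the URL and headers so someone can go back and add support later.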

What makes it even harder is that the chaotic nature of the Internet forces the Bot (and those in charge of it) to go down many routes to try to get the content. A Bot has to be:

  • extremely flexible, able to deal with a variety of response codes, encodings, content types, etc.
  • extremely lax in its error management (being able to recover from various types of catastrophic failures).
  • extremely good at reporting those errors with enough information so that the developers can go back and make fixes as appropriate (to deal with some kind of unsupported encoding, for example).
  • as fast as possible, minimizing bandwidth usage.
  • respectful of all sorts of external factors: sites that don't want to be crawled, crawling fast but not too fast (or webmasters get angry), robots.txt and meta-tag restrictions, etc.
  • massively distributed (with all that it entails).

As well as any number of things that I probably can't think of right now.
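
The "respectful" item in the list above is the easiest one to sketch. Here is a minimal version, assuming Python's standard urllib.robotparser for the robots.txt rules plus a per-host delay. The PolitenessGate class, its method names, and the one-second default are hypothetical; a real bot would also need to fetch robots.txt itself and handle meta tags.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class PolitenessGate:
    """Hypothetical helper: checks robots.txt rules and spaces out requests per host."""

    def __init__(self, user_agent, min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay  # assumed floor, in seconds, between hits to one host
        self.clock = clock
        self.rules = {}             # host -> RobotFileParser
        self.last_fetch = {}        # host -> clock value of the last request

    def load_robots(self, host, robots_txt):
        """Register the text of a host's robots.txt (fetching it is left out here)."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True if robots.txt permits the URL; hosts with no rules loaded are allowed."""
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before hitting this URL's host again."""
        host = urlsplit(url).netloc
        delay = self.min_delay
        parser = self.rules.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(self.user_agent)
            if crawl_delay is not None:
                delay = max(delay, crawl_delay)  # honor Crawl-delay if it is stricter
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + delay - self.clock())

    def record_fetch(self, url):
        """Note that the host was just contacted."""
        self.last_fetch[urlsplit(url).netloc] = self.clock()
```

The fetch loop would call allowed() before queuing a URL, sleep for wait_time() before each request, and call record_fetch() right after it. The clock parameter is only there so the timing can be tested without actually sleeping.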

Bots are like plumbing: you only think about them when they don't work. Of course, the algorithm is crucial, but the brave souls that develop and maintain the bots deserve some recognition. (The parser people too :)).

Don't you think?

PS: (tangentially related) Yahoo! should get a cool name for its algorithm, at least for press purposes. (Does it even have a name? I couldn't find it). Otherwise referring to it simply as a "ranking algorithm" --or something-- is kind of lame, and journalists steer towards PageRank and we end up with "PageRank! PageRank! Booo!". :)

Posted by diego on November 20, 2004 at 4:04 PM


Russ will be consulting for Yahoo! starting Monday. Congratz!

Between Russ in mobile stuff and Jeremy in search (another cool move) you've got two big areas of Y! covered by great bloggers. (I wonder if someone is blogging from the services side...Y!Mail, etc.). Most excellent.

Categories: technology
Posted by diego on November 20, 2004 at 3:55 PM

Copyright © Diego Doval 2002-2011.