conversation finder, part 2

Okay, so going forward with the conversation finder thingy, I think I'm pretty much done with the bot and DB layer for it (nothing to go crazy about, just a few classes, MySQL statements, and such). My recent musings on search bots have been helpful, since I had already thought through a number of the problems that showed up.

DB-related, I simply created a few tables in MySQL (4.x) to deal with the basic data I know I'll need, and then I'll add more as it becomes necessary. To start with, I've got:

  • Site, including main site URL, last spidered date, and robots.txt content.
  • Page, which includes content, URL of a page, pointer to a Site, parsed state, last spidered.
  • PageToPageLinks, just two fields, source PageID and target PageID.
Add to that some methods to create, delete, and query and we're in business.
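To make the shape of that data concrete, here is a rough sketch of the three tables as plain Java classes with an in-memory store standing in for MySQL. The class, field, and method names (`CrawlStore`, `createSite`, `linksFrom`, and so on) are made up for illustration and are not the actual code; the real layer would issue SQL statements instead of touching maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the Site, Page and PageToPageLinks tables (names are hypothetical). */
final class CrawlStore {
    static final class Site {
        final int id;
        final String url;
        long lastSpidered;   // epoch millis, 0 = never spidered
        String robotsTxt;    // raw robots.txt content, null until fetched
        Site(int id, String url) { this.id = id; this.url = url; }
    }

    static final class Page {
        final int id;
        final int siteId;    // pointer to the owning Site
        final String url;
        String content;
        boolean parsed;
        long lastSpidered;
        Page(int id, int siteId, String url) { this.id = id; this.siteId = siteId; this.url = url; }
    }

    /** One row of PageToPageLinks: just source and target page IDs. */
    static final class Link {
        final int sourcePageId;
        final int targetPageId;
        Link(int source, int target) { this.sourcePageId = source; this.targetPageId = target; }
    }

    private final Map<Integer, Site> sites = new HashMap<>();
    private final Map<Integer, Page> pages = new HashMap<>();
    private final List<Link> links = new ArrayList<>();
    private int nextId = 1;

    Site createSite(String url) {
        Site s = new Site(nextId++, url);
        sites.put(s.id, s);
        return s;
    }

    Page createPage(Site site, String url) {
        Page p = new Page(nextId++, site.id, url);
        pages.put(p.id, p);
        return p;
    }

    void link(Page source, Page target) {
        links.add(new Link(source.id, target.id));
    }

    /** Query: all pages that the given page links to. */
    List<Page> linksFrom(int pageId) {
        List<Page> out = new ArrayList<>();
        for (Link l : links) {
            if (l.sourcePageId == pageId) out.add(pages.get(l.targetPageId));
        }
        return out;
    }

    /** Delete a page together with every link row that mentions it. */
    void deletePage(int pageId) {
        pages.remove(pageId);
        links.removeIf(l -> l.sourcePageId == pageId || l.targetPageId == pageId);
    }
}
```

Keeping the link table to two IDs means deleting a page is just a matter of dropping every row that mentions it, which is what `deletePage` does here.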

Bot-related, a simple bot using the Jakarta HttpClient. Why that and not the standard HttpURLConnection from the JDK? Because HttpURLConnection (at least through JDK 1.4) doesn't allow you to set timeouts. That simple. And when you're downloading potentially tens of thousands of links, you need tight timeouts; otherwise a single slow (or worse, non-responsive) site can throw a wrench in the works. Even if you use thread pools to process pages simultaneously, threads can stay blocked for way too long, which diminishes the usefulness of pooling.
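For anyone reading this on a newer JDK: since Java 5, `URLConnection` has `setConnectTimeout` and `setReadTimeout`, so the tight-timeout idea can be sketched without any extra jars. This is not the bot's code (that uses Jakarta HttpClient); the class name `TimeoutFetch`, the example URL, and the timeout values are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;

final class TimeoutFetch {
    /** Prepares (but does not yet connect) an HTTP connection with explicit timeouts. */
    static HttpURLConnection open(String url, int connectMillis, int readMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectMillis); // give up if the TCP connect takes too long
            conn.setReadTimeout(readMillis);       // give up if the server stalls mid-response
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open("http://example.org/", 2000, 5000);
        try {
            System.out.println("status: " + conn.getResponseCode());
        } catch (SocketTimeoutException e) {
            // A slow or dead site costs a few seconds at most, then the thread moves on.
            System.out.println("timed out, skipping: " + e.getMessage());
        } finally {
            conn.disconnect();
        }
    }
}
```

With both limits set, a worker thread in a pool is tied up for a bounded time per URL, so one unresponsive site can't hold it indefinitely.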

Anyway, with the basics out of the way: the bot records ETag and Last-Modified values (which sounds like a good idea but maybe won't always work; we'll see later) so that it downloads only things that have changed. It performs a HEAD request first, and then a GET only if necessary.
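The HEAD-then-GET decision could look roughly like the sketch below. It uses the JDK's `HttpURLConnection` rather than Jakarta HttpClient so that it stands alone, and the names `ChangeCheck`, `Validators`, `head`, and `needsDownload` are invented for the example. The comparison rule (prefer ETag, fall back to Last-Modified, download when in doubt) is one reasonable guess, not necessarily what the bot does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class ChangeCheck {
    /** Validators recorded for a page; either field may be null if the server omitted it. */
    static final class Validators {
        final String etag;
        final String lastModified;
        Validators(String etag, String lastModified) {
            this.etag = etag;
            this.lastModified = lastModified;
        }
    }

    /** Issues a HEAD request and returns whatever validators the server sent. */
    static Validators head(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            conn.getResponseCode(); // forces the request to be sent
            return new Validators(conn.getHeaderField("ETag"),
                                  conn.getHeaderField("Last-Modified"));
        } finally {
            conn.disconnect();
        }
    }

    /** True if the page should be fetched with a GET. */
    static boolean needsDownload(Validators stored, Validators fresh) {
        if (stored == null) return true; // never fetched before
        if (stored.etag != null && fresh.etag != null) {
            return !stored.etag.equals(fresh.etag);
        }
        if (stored.lastModified != null && fresh.lastModified != null) {
            return !stored.lastModified.equals(fresh.lastModified);
        }
        return true; // nothing comparable, so download to be safe
    }
}
```

The last branch covers the "maybe won't always work" case: when a server sends neither header, the only safe choice is to download the page again.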

But, since there's no parser yet, I can only download single pages I specify myself.

So, coming up: the parser. :)

PS: just to clarify again, I am doing this as a way to relax a bit. This might exist in other forms; in fact, James pointed out in a comment on the previous entry that BottomFeeder already supports this, which is very cool. I think it doesn't exist in this form, but even if it did, it would be good exercise for the brain anyway.

Posted by diego on December 1 2004 at 5:10 PM

Copyright © Diego Doval 2002-2011.