Now blogging at diego's weblog. See you over there!

humbled...

how I feel, btw, after the response to my post on clevercactus. I want to say thanks to everybody. Those that have posted about it offering help or referrals, like Erik, Russ, Scoble, Dave, Om, Frank, James, Cristian, Volker, Daniel and many others, and those that have sent emails or put me in touch with people. I don't know what to say (aside from this). Thanks again all!

Categories: personal
Posted by diego on December 1, 2004 at 5:31 PM

conversation finder, part 2

Okay, so going forward with the conversation finder thingy, I think I'm pretty much done with the bot and DB layer for it (nothing to go crazy about, just a few classes, MySQL statements, and such). My recent musings on search bots have been helpful, since I had already considered a number of the problems that showed up.

DB-related, I simply created a few tables in MySQL (4.x) to deal with the basic data I know I'll need, and then I'll add more as it becomes necessary. To start with, I've got:

  • Site, including main site URL, last spidered date, and robots.txt content.
  • Page, which includes content, URL of a page, pointer to a Site, parsed state, last spidered.
  • PageToPageLinks, just two fields, source PageID and target PageID.
Add to that some methods to create, delete, and query and we're in business.
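
To make the shape of that data a bit more concrete, here is a minimal in-memory sketch of the three tables and of the create/delete/query methods for the link table. The class and field names (SiteRow, PageRow, PageLinkRow, PageLinkStore) are made up for illustration, and the real thing would sit on top of MySQL rather than a list:

```java
import java.util.ArrayList;
import java.util.List;

/** Mirrors the Site table: main URL, last spidered date, robots.txt content. */
final class SiteRow {
    long id;
    String mainUrl;
    long lastSpideredMillis; // 0 means "never spidered" (assumed convention)
    String robotsTxt;
}

/** Mirrors the Page table: content, URL, owning Site, parsed state, last spidered. */
final class PageRow {
    long id;
    long siteId; // pointer to the owning Site
    String url;
    String content;
    boolean parsed;
    long lastSpideredMillis;
}

/** Mirrors PageToPageLinks: just a source PageID and a target PageID. */
final class PageLinkRow {
    final long sourcePageId;
    final long targetPageId;

    PageLinkRow(long sourcePageId, long targetPageId) {
        this.sourcePageId = sourcePageId;
        this.targetPageId = targetPageId;
    }
}

/** Hypothetical in-memory stand-in for the create, delete, and query methods. */
final class PageLinkStore {
    private final List<PageLinkRow> links = new ArrayList<PageLinkRow>();

    void create(long sourcePageId, long targetPageId) {
        links.add(new PageLinkRow(sourcePageId, targetPageId));
    }

    /** Removes the first matching link; returns false if there was none. */
    boolean delete(long sourcePageId, long targetPageId) {
        for (int i = 0; i < links.size(); i++) {
            PageLinkRow link = links.get(i);
            if (link.sourcePageId == sourcePageId && link.targetPageId == targetPageId) {
                links.remove(i);
                return true;
            }
        }
        return false;
    }

    /** Returns the target PageIDs linked from the given page, in insertion order. */
    List<Long> targetsOf(long sourcePageId) {
        List<Long> targets = new ArrayList<Long>();
        for (PageLinkRow link : links) {
            if (link.sourcePageId == sourcePageId) {
                targets.add(link.targetPageId);
            }
        }
        return targets;
    }
}
```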

Bot-related, a simple bot using the Jakarta HttpClient. Why that and not the standard HttpURLConnection from the JDK? Because HttpURLConnection (at least in JDK 1.4) doesn't allow you to set timeouts. That simple. And when you're downloading potentially tens of thousands of links, you need tight timeouts; otherwise a single slow (or worse, non-responsive) site can throw a wrench in the works. Even if you use thread pools to do simultaneous parsing, threads can stay blocked for way too long and diminish the usefulness of pooling.
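
For what it's worth, JDK 5 and later did add setConnectTimeout and setReadTimeout to URLConnection, so on a newer JDK the same "tight timeouts" idea can be sketched without HttpClient. This is not the bot's actual code, just an illustration; the class name and the timeout values are assumptions:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class TimeoutSketch {
    // Assumed values; a real crawler would tune these.
    static final int CONNECT_TIMEOUT_MS = 5000;
    static final int READ_TIMEOUT_MS = 10000;

    /**
     * Prepares a connection with tight timeouts. Nothing is sent over the
     * network here; that only happens once the caller connects or reads.
     */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // give up if the site never answers
            conn.setReadTimeout(READ_TIMEOUT_MS);       // give up if the site stalls mid-response
            return conn;
        } catch (IOException e) {
            // Covers malformed URLs as well; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }
}
```

With both timeouts set, a stalled site costs a pooled thread a bounded amount of time instead of holding it indefinitely.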

Anyway, so the basics out of the way, the bot records ETag and Last-Modified values (which sounds like a good idea but maybe won't always work--we'll see later) to download only changed things. It performs a HEAD request and then a GET if necessary.
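
The "GET only if necessary" decision can be sketched roughly like this. It is a simplification under a few assumptions: the class and parameter names are made up, the stored values are whatever was recorded on the previous crawl, and Last-Modified is compared as a plain string rather than parsed as a date:

```java
final class ChangeCheck {
    /**
     * Decides whether a full GET is needed after a HEAD request.
     *
     * @param storedEtag         ETag recorded on the previous crawl, or null
     * @param storedLastModified Last-Modified recorded on the previous crawl, or null
     * @param headEtag           ETag from the HEAD response, or null
     * @param headLastModified   Last-Modified from the HEAD response, or null
     */
    static boolean needsGet(String storedEtag, String storedLastModified,
                            String headEtag, String headLastModified) {
        // Prefer the ETag when both sides have one.
        if (storedEtag != null && headEtag != null) {
            return !storedEtag.equals(headEtag);
        }
        // Fall back to Last-Modified when both sides have one.
        if (storedLastModified != null && headLastModified != null) {
            return !storedLastModified.equals(headLastModified);
        }
        // No usable validator (first crawl, or the server sends neither header):
        // download to be safe.
        return true;
    }
}
```

The last branch is where the "maybe won't always work" worry shows up: servers that send neither header force a full download every time.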

But, since there's no parser yet, I can only download single pages I specify myself.

So, coming up: the parser. :)

PS: just to clarify again, I am doing this as a way to relax a bit. This might exist in other forms; in fact, James pointed out in a comment on the previous entry that BottomFeeder already supports this--very cool. I think it doesn't exist in this form, but even if it did, it would be good exercise for the brain anyway.

Posted by diego on December 1, 2004 at 5:10 PM

conversation finder

My next step is to write something different to get the gears moving in my head again, and Don's conversation category idea is appealing. So, let's do that. :)

Don described it as:

A conversation aggregator subscribes to the category feed of all the participants and merge them into a single feed and publishes a mini-website dedicated to the conversation. The 'referee' of the debate or the conversation moderator gets editorial rights over the merged feed and the mini-website. Hmm. This stuff is very close to what I am currently working on so I think I'll slip this feature in while I am at it.
Since this is probably too big to do in quick & dirty fashion, I was a little worried. But last night I thought of a different approach. How about something that finds conversations, rather than is subscribed to certain categories?

After all, we already have a mechanism to define a conversation thread: the permalink. Generally when you're in a cross-blog thread, you point back at the last "reply" from the other person. A cross-blog thread also has the advantage of being a directed graph, with a definite starting point. So permalinks and some kind of graph traversal thingamagic could be used to find the threads that exist, and maybe are in progress.
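
As a rough sketch of what that traversal might look like (one possible approach, not a settled design): treat each permalink as a node, treat "points back at" as a directed edge, call a post a starting point if others link to it but it links to nothing we know about, and walk the replies breadth-first from there. The class name, method name, and input shape are all assumptions:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ConversationFinder {
    /**
     * @param links permalink -> the permalinks that post points back at
     * @return starting permalink -> posts in that thread, in breadth-first order
     */
    static Map<String, List<String>> findThreads(Map<String, List<String>> links) {
        // Invert the edges: for each post, which posts reply to it?
        Map<String, List<String>> replies = new TreeMap<String, List<String>>();
        Set<String> allPosts = new TreeSet<String>();
        for (Map.Entry<String, List<String>> entry : links.entrySet()) {
            allPosts.add(entry.getKey());
            for (String target : entry.getValue()) {
                allPosts.add(target);
                replies.computeIfAbsent(target, k -> new ArrayList<String>()).add(entry.getKey());
            }
        }

        Map<String, List<String>> threads = new TreeMap<String, List<String>>();
        for (String post : allPosts) {
            List<String> outgoing = links.get(post);
            boolean pointsNowhere = outgoing == null || outgoing.isEmpty();
            if (!pointsNowhere || !replies.containsKey(post)) {
                continue; // not a starting point
            }
            // Breadth-first walk over the replies, guarding against revisits.
            List<String> thread = new ArrayList<String>();
            Set<String> seen = new HashSet<String>();
            Deque<String> queue = new ArrayDeque<String>();
            queue.add(post);
            seen.add(post);
            while (!queue.isEmpty()) {
                String current = queue.remove();
                thread.add(current);
                List<String> next = replies.get(current);
                if (next == null) {
                    continue;
                }
                for (String reply : next) {
                    if (seen.add(reply)) {
                        queue.add(reply);
                    }
                }
            }
            threads.put(post, thread);
        }
        return threads;
    }
}
```

One known gap in this sketch: a set of posts that only link to each other in a cycle has no starting point by this definition, so it would not be reported as a thread.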

As Don notes, sometimes you might refer to the other party by name, or make oblique references. That could be step two, using text-based search to add some more information to the graph formation. But let's say we start with permalinks only.

Hm. Okay. So what do I need for this? First things that come to mind:

  • A crawler
  • A DB (the tables, I mean)
  • A parser (to find the set of links)
  • The algorithm to find the conversations
  • Some kind of web front end to make it more usable?
Neither the crawler nor the parser has to be super-sophisticated, so maybe they are doable in a few hours. Or a couple of days?

This sounds like a good starting point. First step should be DB & crawler. More later!

Posted by diego on December 1, 2004 at 6:52 AM

Copyright © Diego Doval 2002-2011.