conversation engine, v0.2 (enter feedster)

I've been busy with other things, but I had a couple of hours this morning to make some mods to the conversation finder. First, I changed its name. It is now the conversation engine. This is a name that Don used and that I thought was much better than mine, so there it is.

The main change in this version is that I am now using Feedster's search results, combined with my own spidering, to find conversations. This has two consequences, one good and one bad.

The good consequence is that I can now use Feedster's stored metadata for the post, which is excellent. Check out the "canonical" :) search for conversations between Don and Tim. Now you've got pictures and everything! Great. But you'll notice there's one fewer conversation than there used to be, which brings me to the "bad" consequence.

The bad consequence is that, because I am at the moment using only Feedster's first 100 results (the maximum Feedster allows per query) as a filter, I am losing the earlier data I had. The data is still in my DB, since it was spidered, but the engine is no longer finding it. This is an easy fix though: just iterating through more Feedster results will do the trick, "activating" the older spidered pages in my DB. (BTW, this is also why some conversations that showed up before aren't showing up anymore. I'll be sure to post once I have made this fix, since it's pretty crucial.)
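
A rough sketch of what "iterating through more results" could look like, in Python. This is not the engine's actual code: `fetch_page` is a hypothetical callable standing in for whatever issues the search query, and its `(offset, limit)` signature is an assumption, not Feedster's real interface. Only the page size of 100 comes from the limit mentioned above.

```python
from typing import Callable, List

# Hypothetical: fetch_page(offset, limit) returns a list of result URLs.
# It stands in for the real search call, whose interface is not shown here.
FetchPage = Callable[[int, int], List[str]]


def collect_result_urls(fetch_page: FetchPage,
                        page_size: int = 100,
                        max_results: int = 500) -> List[str]:
    """Walk successive result pages until they run out or max_results is hit."""
    urls: List[str] = []
    seen = set()
    offset = 0
    while len(urls) < max_results:
        page = fetch_page(offset, page_size)
        if not page:
            break  # no more results at all
        for url in page:
            if url not in seen:  # skip duplicates across pages
                seen.add(url)
                urls.append(url)
        if len(page) < page_size:
            break  # a short page means this was the last one
        offset += page_size
    return urls[:max_results]
```

The `max_results` cap is there so that a very popular query doesn't turn into an unbounded number of requests.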

Why am I using both the Feedster DB and my own results, you ask? Because Feedster is, at the moment, returning only a parsed version of the RSS feed for that site: just HTML. And that means that there are no links in the entry. And that means that I can't create the conversation graph, since there are no links to follow.
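
To make the "links to follow" part concrete, here is a minimal sketch of pulling links out of crawled pages and turning them into graph edges. It uses only Python's standard library. The names `LinkExtractor`, `extract_links` and `build_graph` are made up for illustration, and treating "post A links to known post B" as an edge is my assumption about what a conversation graph needs, not a description of the engine's internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def extract_links(page_url, html):
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links


def build_graph(pages):
    """pages maps post URL -> raw HTML.

    Returns post URL -> set of other known posts it links to.
    """
    graph = {}
    for url, html in pages.items():
        links = extract_links(url, html)
        graph[url] = {t for t in links if t in pages and t != url}
    return graph
```

If the HTML has had its anchors stripped, `extract_links` returns an empty set and the graph has no edges, which is exactly the problem described above.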

Even if Feedster were providing the raw entries, it would still be a problem. The reason is summaries: many RSS feeds provide summaries, not the whole content, so it's not guaranteed that you'll be able to extract links from a feed's description element. That means that you absolutely need to crawl the entire site, and then use the combination of Feedster's results and the crawling ("intersecting" them) to come up with the list of links you're interested in.
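
The "intersecting" step could be sketched like this, assuming the search results are a list of post URLs and the crawler has already produced a mapping from each crawled URL to its outgoing links. Both data shapes and the function name `links_of_interest` are assumptions for the sake of the example.

```python
def links_of_interest(result_urls, crawled_links):
    """Keep only crawled pages that also appear in the search results.

    result_urls:   iterable of post URLs returned by the search.
    crawled_links: dict mapping crawled URL -> set of outgoing links.
    """
    wanted = set(result_urls)
    return {url: links
            for url, links in crawled_links.items()
            if url in wanted}
```

Pages that were crawled but don't show up in the results simply stay dormant in the DB until a later query includes them.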

Anyway, looks much better now, doesn't it? :)

Categories: soft.dev
Posted by diego on December 5, 2004 at 6:40 PM

Copyright © Diego Doval 2002-2011.