
conversation engine: the next step


Since the recent integration of Feedster results into the conversation engine, I stopped coding for a bit, and while doing other stuff I've been thinking about how to make it more scalable: covering more weblogs and not wasting resources on pages with no meaning (read: make it more useful). In short, how to solve the problems I mentioned in that entry.

The crucial problem is that Feedster provides only part of the picture. Scott Rafer (Feedster CEO) mentioned in the comments that I could use the Feedster links output, which provides a list of the references to a particular weblog. This doesn't quite do what I need, however. The reason is simple: Feedster indexes RSS feeds, not entire sites, so if someone provides summary feeds, Feedster will not be able to find links between weblogs, even if they exist. Because many, many weblogs provide summary feeds, it is clear that the only way to get the links between entries is to get the actual contents of the HTML page. But.

But what I can do is use Feedster as the source point for the list of pages to index. Right now I am indexing everything on a given website. This has two drawbacks. First, I am forced to download, store, and analyze waaay more content than I need (which accounts for the small number of sites the bot is crawling at the moment), particularly when weblogs point to other parts of a site, including wikis, dynamic apps, etc. Second, it slows down the processing for conversations, which depends on walking the link graph between two sites. This is a problem now, but if I move in the direction of adding multiple-participant conversations (as Don suggests in a comment to my previous conversation engine post, linked above), it will become even more important.
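To make the "walking the link graph between two sites" part concrete, here is a rough sketch in Python. It is not the engine's actual code. The `link_index` mapping (page URL to the list of URLs that page links to), the function names, and the rule that a "conversation" means links going in both directions are all assumptions made up for illustration.

```python
from urllib.parse import urlparse


def host(url):
    """Host part of a URL, lowercased so comparisons are case-insensitive."""
    return urlparse(url).netloc.lower()


def cross_links(link_index, host_a, host_b):
    """Collect (source, target) links that go from one of the two sites to the other.

    link_index is a hypothetical dict: page URL -> list of URLs linked from that page.
    """
    pair = {host_a.lower(), host_b.lower()}
    edges = []
    for source, targets in link_index.items():
        for target in targets:
            ends = {host(source), host(target)}
            # Keep only links whose two ends sit on the two different sites.
            if len(ends) == 2 and ends == pair:
                edges.append((source, target))
    return edges


def is_conversation(edges):
    """Assumed rule: it is a conversation only if links flow in both directions."""
    directions = {(host(s), host(t)) for s, t in edges}
    return any((t, s) in directions for s, t in directions)
```

The smaller `link_index` is, the cheaper this walk gets, which is the whole point of indexing only entry pages instead of entire sites.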

So.

Next step, then, is to use Feedster as the data source for the entries of a given weblog. Then download/process the pages for each entry's permalink. Then analyze that and combine the results with the Feedster information.
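Roughly, the plan looks like the sketch below. This is only an illustration under assumptions: it treats the entry list as a plain RSS 2.0 document (I'm not reproducing Feedster's actual output format here), the `fetch` callable that downloads a page is left to the caller, and all function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def entry_permalinks(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def outbound_links(page_url, html):
    """Links on the page that point to a different host than the page itself."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    return [u for u in collector.links
            if urlparse(u).netloc not in ("", own_host)]


def index_weblog(rss_xml, fetch):
    """Map each entry permalink to its outbound links.

    fetch is a caller-supplied function: URL -> HTML text (hypothetical).
    """
    return {url: outbound_links(url, fetch(url))
            for url in entry_permalinks(rss_xml)}
```

Only the permalink pages get downloaded, so wikis, dynamic apps, and the rest of the site never enter the picture, and a summary-only feed is no longer a problem because the links come from the HTML itself.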

Stay tuned! More in the next few days.

Categories: soft.dev
Posted by diego on December 7 2004 at 10:36 AM

Copyright © Diego Doval 2002-2011.