d2r diego's weblog
conversation finder v0.1

Okay, so I actually should start using conversation engine, the name Don suggested, which I think sounds cooler. But for the moment it's still the Conversation Finder.

The first version is now live! This is a very limited version. Only a few sites are being indexed, mostly out of concerns for speed, bandwidth and such. I'll see about expanding it later.

First thing to look at is this result, the conversations the engine (finder?) discovers between Don and Tim Bray. Interestingly enough, it finds one more aside from their recent Atom conversation, something about flowers :).

This is great! It is finding actual conversations!

But... the results are just a little bit off. I keep looking at what it finds and thinking, "come on, you're so close!". Some links are loops. Some links point to index pages (which might have the content, but...). Some of the text extracts are not relevant (look at the "conversations" between the other sites that it's indexing).

I think a big factor here is that the engine knows nothing of archives, or of the people who run these blogs. Archives duplicate a lot of information, and the engine gets a little confused by that. So maybe the next step is to fiddle around with some of the metadata present in pages for weblogs (the metadata on RSS would be great, particularly the dates, to infer sequencing; however, RSS feeds only go as far back as a few days or posts, so all that's left is parsing for the different types of metadata embedded in HTML).

Anyway, not bad for a few hours of work and a 0.1 version. Looks promising! Now if I can just find a way of letting others enable spidering of their sites without killing my server's bandwidth... :))

PS: I wasted a couple of hours on Tomcat setup. Why? Because the JARs I was deploying in WEB-INF/lib didn't have write privileges. Tomcat wants them writable! And it was failing without any error messages, simply not loading the classes in the JARs (and yes, I tried common/lib). Anyway, all is well that ends well.

Update (5/12/2004): The Conversation Finder is now the Conversation Engine.

Categories: soft.dev
Posted by diego on December 3 2004 at 8:13 PM

Comments (please see the comments & trackback policy).
very impressive! "(the metadata on RSS would be great, particularly the dates, to infer sequencing; however, RSS feeds only go as far back as a few days or posts, so all that's left is parsing for the different types of metadata embedded in HTML)" ...you're going to get the true XHTML geeks going with that one. As for me, for now, all my weblog data is stored as RSS 1.0... http://danielsjourney.com/blog/data/
Posted by: daniel at December 3, 2004 11:26 PM

man, that writable jar problem bites!
Posted by: brock at December 4, 2004 12:47 AM

Gosh. I forgot all about our chest beatings gardener posts.
Posted by: Don Park at December 4, 2004 1:26 AM

I think it's the right time to introduce the concept of making past conversations combustible. By combustible, in terms of conversation, I mean bringing a past cluster of interactions to the present when and where it's relevant. It's like Dave's occasional 'what happened X years ago on this day' on steroids. You don't have to chase this one but it's rather juicy, I think.
Posted by: Don Park at December 4, 2004 6:37 AM

Daniel: Thanks. XHTML would be good, the problem is who would adopt it! Some weblogs (including mine) add metadata-of-sorts to the page using HTML comments, which would help. But the more I think about it, the more I'm convinced that the only way to give nicer results is to base it entirely on RSS. The metadata is just too important, and only a feed has a common structure. The problem, of course, is that you have to keep up with the feeds or risk losing pieces of the conversation. Will have to think more about that.

Brock: yeah :)

Don: Funny that it found the flower thing, no? I was pleasantly surprised by that. The idea you mention of "combustible conversations" (I see that you're full of engine metaphors these days :)) is very, very cool. I'll think about that...
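
On the point about basing things on RSS and using the dates to infer sequencing, here is a minimal sketch of what that could look like. It is not the engine's code. It assumes RSS 2.0 feeds whose items carry an RFC 822 `pubDate` with a timezone; RSS 1.0 feeds, which typically use `dc:date`, are not handled. The names `dated_items` and `sequence` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def dated_items(feed_xml, blog):
    """Return (date, blog, title, link) for each RSS 2.0 <item> that has a pubDate."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            continue  # no date, so the post cannot be placed in a sequence
        items.append((
            parsedate_to_datetime(pub.strip()),
            blog,
            item.findtext("title", default=""),
            item.findtext("link", default=""),
        ))
    return items


def sequence(feeds):
    """Merge items from several feeds ({blog name: feed XML}) into one timeline.

    Assumes every pubDate includes a timezone; mixing dates with and
    without one would make the sort fail.
    """
    merged = []
    for blog, feed_xml in feeds.items():
        merged.extend(dated_items(feed_xml, blog))
    return sorted(merged, key=lambda entry: entry[0])
```

Calling `sequence({"don": don_xml, "tim": tim_xml})` would give the posts from both blogs interleaved by date, which is roughly the ordering a conversation needs. It only sees whatever the feeds still contain, so the keeping-up problem remains.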
Posted by: Diego at December 4, 2004 3:42 PM

Another way of getting to the RSS would be to go through the Bloglines web service (http://www.oreillynet.com/pub/a/network/2004/09/28/bloglines.html), where you can get the RSS for all posts.
Posted by: Hadley at December 4, 2004 6:08 PM

I'm working on something similar. The way I detect changes is by fetching a page and logging a hash of it; perhaps you could do the same? Send me email if you'd like my code.
Posted by: Hasan at December 6, 2004 7:08 PM

Copyright © Diego Doval 2002-2007.
