
conversations, metadata, and the URL disambiguation problem

Since my first attempt at the conversation finder experiment looks promising, I've been looking at what didn't work and thinking about how it could be improved.

The first point that was clear was that metadata is crucial for this--exactly the kind of metadata that is present in RSS feeds: author, top-level link, post dates, etc. While it seems possible to infer a lot of this from the raw HTML, one key component, the dates, can't be inferred. The Last-Modified header in HTTP responses would be mildly useful, but it doesn't actually help much because the page can be rebuilt as a result of actions by both the author and others (e.g., when readers post comments).

The date, however, is important but not crucial. The sequencing of the "conversation" is determined by the direction of the link-graph, not by dates.
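To make the point about sequencing concrete, here is a minimal sketch (not the actual code of the conversation finder) that orders posts using only the direction of the links. It assumes that a post linking to another post is a reply to it; the function name `conversation_order` and the shape of the `links` mapping are made up for illustration.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+


def conversation_order(links):
    """Order posts so that every post appears after the posts it links to.

    `links` maps a post URL to the set of post URLs it links to
    (hypothetical input shape). No dates are used anywhere.
    """
    # A linked-to post is treated as a predecessor of the post that links
    # to it, which is the mapping shape TopologicalSorter expects.
    sorter = TopologicalSorter(links)
    # static_order() raises CycleError if the link graph contains a loop.
    return list(sorter.static_order())


if __name__ == "__main__":
    links = {"c": {"b"}, "b": {"a"}}  # c replies to b, b replies to a
    print(conversation_order(links))
    try:
        conversation_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("loop in the link graph")
```

A loop in the graph surfaces as a `CycleError`, so the ordering step is also a cheap way to notice that something upstream (duplicate pages, for instance) has produced an impossible conversation.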

What is crucial, though, is solving the ambiguity/duplication of URLs. Most weblogs have archives, which repeat information already present elsewhere. The result is that the same post appears many times within the index of a single site. A generic algorithm can't simply skip archive pages, because their "shape" varies greatly from site to site. So you end up with many pages that have the same content and that even appear to create loops within the conversation, particularly when a single archive page contains two posts that belong to the conversation. Right now, I am doing some analysis on the text that surrounds the link to determine whether this is a "duplicate" (see for example the "Other pages from this site with the same reference include" in this conversation list). But while putting duplicates together is clearly possible, I still don't know which was the actual original post and which are archive duplicates--and you'd want the original post, of course.
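As a rough illustration of the duplicate grouping described above, the sketch below groups pages from the same host that link to the same target with the same surrounding text. The exact-match rule after whitespace and case normalization is an assumption made to keep the example short (the real analysis may well be fuzzier), and the names `normalize` and `group_duplicates` are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(text):
    """Collapse whitespace and lowercase, so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def group_duplicates(references):
    """Group pages that carry what looks like the same reference.

    `references` is an iterable of (page_url, target_url, context_text)
    tuples, where context_text is the text surrounding the link
    (hypothetical input shape).
    """
    groups = defaultdict(list)
    for page_url, target_url, context in references:
        host = urlsplit(page_url).hostname
        groups[(host, target_url, normalize(context))].append(page_url)
    # Only groups with more than one page are duplicates of each other.
    return [pages for pages in groups.values() if len(pages) > 1]
```

Note that this only puts the duplicates together. It says nothing about which page in a group is the original post and which are archive copies, which is exactly the open problem.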

The second problem with URLs is that multiple, completely different URLs can point to the same content. This is common in weblogs: sometimes a blog has moved providers over time, or chosen a better URL, and both URLs are maintained. Scoble is a good example of this, having both http://scoble.weblogs.com/ and http://radio.weblogs.com/0001011/ pointing to exactly the same content. The only way for the software to "know" that this is actually the same site is by looking at the metadata in the feed; comparing the content is not foolproof, because the sites could be slightly out of sync.
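One way the feed metadata could be used for this, sketched under the assumption that each crawled site URL has a feed and that the feed declares a single top-level link: sites whose feeds declare the same link are treated as the same site. The function name and the example URLs are hypothetical, and what any real feed declares would have to be checked.

```python
from collections import defaultdict


def group_by_feed_link(feeds):
    """Group crawled site URLs by the top-level link their feed declares.

    `feeds` maps each crawled site URL to the top-level link found in
    that site's feed (hypothetical input shape).
    """
    sites = defaultdict(set)
    for crawled_url, declared_link in feeds.items():
        # Ignore a trailing slash so "http://x/" and "http://x" match.
        sites[declared_link.rstrip("/")].add(crawled_url)
    return dict(sites)


if __name__ == "__main__":
    feeds = {
        "http://old.example.com/0001/": "http://blog.example.com/",
        "http://blog.example.com/": "http://blog.example.com/",
    }
    # Both crawled URLs end up under the same declared link.
    print(group_by_feed_link(feeds))
```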

Then there's the simpler issue of the host (rather than the full URL) being different in different links. The simplest case is of course something.com and www.something.com pointing to the same thing. But Rui, for example, has equivalents in www.taoofmac.com and the.taoofmac.com. Again, the feed would provide a lot of information to resolve the ambiguity and realize that these two seemingly different sites are actually the same. Other approaches are possible here as well (IP checks, content checks, etc.), but metadata seems to me a simpler and more effective solution.
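Once equivalent hosts are known, by whatever means, links can be rewritten to a single host before they go into the index. The sketch below assumes a hand-built or feed-derived alias table; `canonicalize_host` and `host_aliases` are hypothetical names, and any user:password part of a URL is dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize_host(url, host_aliases):
    """Rewrite the host of `url` to its canonical form.

    `host_aliases` maps an alias host to the canonical host
    (hypothetical table). Hosts without an entry are only lowercased.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    canonical = host_aliases.get(host, host)
    # Preserve an explicit port if the original URL had one.
    netloc = canonical if parts.port is None else f"{canonical}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    aliases = {"something.com": "www.something.com"}
    print(canonicalize_host("http://something.com/archives/1.html", aliases))
```

With the hosts unified, two links that differ only in their host compare equal as plain strings, so the rest of the link-graph code doesn't need to know about aliases at all.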

I have other things to do today, but this is definitely something interesting to keep thinking about in the background.

Bonus: In the comments to my previous post, Don mentioned the idea of "combustible conversations," which means, as he describes it, "bringing past cluster of interactions to the present when and where it's relevant." This is a great idea! Also something to think about.

Categories: soft.dev
Posted by diego on December 4, 2004 at 3:36 PM

Copyright © Diego Doval 2002-2011.