Now blogging at diego's weblog. See you over there!

conversation finder v0.1

Okay, so I actually should start using conversation engine, the the name Don suggested and which is think sounds cooler. But for the moment it's still the Conversation Finder. The first version is now live!

This is a very limited version. Only a few sites are being indexed, mostly out of concerns for speed, bandwidth and such. I'll see about expanding it later.

First thing to look at is this result, the conversations the engine (finder?) discovers between Don and Tim Bray.

Interestingly enough, it finds one more aside from their recent Atom conversation, something about flowers :). This is great! It is finding actual conversations!

But... the results are just a just little bit off. I keep seeing what it finds and thinking, "come on, you're so close!". Some links are loops. Some links are pointing to index pages (which might have the content, but...). Some of the text extracts are not relevant (look at "conversations" between the other sites that it's indexing).

I think a big factor here is the fact that the engine knows nothing of archives, or people that run these blogs. Archives duplicate a lot of information, and the engine gets a little confused by that. So maybe the next step is to fiddle around with some of the metadata present in pages for weblogs (the metadata on RSS would be great, particularly the dates, to infer sequencing, however, RSS feeds only go as far back as a few days or posts, so all that's left is parsing for the different types of metadata embedding in HTML).

Anyway, not bad for a few hours of work and a 0.1 version. Looks promising! Now if I just find a way of letting others enabling spidering of their sites without killing my server's bandwith... :))

PS: I wasted a couple of hours on Tomcat setup. Why? Because the JARs I was deploying in WEB-INF/lib didn't have write privileges. Tomcat wants them writable! And it was failing without any error messages, simply not loading the classes in the JARs (and yes, I tried common/lib). Anyway, all is well that ends well.

Update (5/12/2004): The Conversation Finder is now the Conversation Engine.

Posted by diego on December 3, 2004 at 8:13 PM

manifold, the 30,000 ft. view

As a follow-up to my thesis abstract, I wanted to add a sort of introduction-ish post to explain a couple of things in more detail. People have asked for the PDF of the thesis, which I haven't published yet, for a simple reason: everything is ready, everything's approved, and I have four copies nicely bound (two to submit to TCD) but... there's a signature missing somewhere in one of the documents, and they're trying to fix that. Bureaucracy. Yikes. Hopefully that will be fixed by next week. When that is done, right after I've submitted it, I'll post it here (or, more likely, I'll create a site for it... I want to maintain some coherency on the posts and here it gets mixed up with everything else).

Anyway, I was saying. Here's a short intro.

Resource Location, Resource Discovery

In essence, Resource Location creates a level of indirection, and therefore a decoupling, between a resource (which can be a person, a machine, a software services or agents, etc.) and its location. This decoupling can then be used for various things: mapping human-readable names to machine names, obtaining related information, autoconfiguration, supporting mobility, load balancing, etc.

Resource discovery, on the other hand, facilitates search for resources that match certain characteristics, allowing then to perform a location request or to use the resulting data set directly.

The canonical example of Resource Location is DNS, while Resource Discovery is what we do with search engines. Sometimes, Resource Discovery will involve a Location step afterwards. Web search is an example of this as well. Other times, discovery on its own will give you what you need, particularly if the result of the query contains enough metadata and what you're looking for is related information.

RLD always involves search, but the lines seemed a bit blurry. When was something one and not the other? What defines it? My answer was to look at usage patterns.

It's all about the user

It's the user's needs that determine what will be used, how. The user isn't necessarily a person: more often than not, RLD happens between systems, at the lower levels of applications. So, I settled on the usage patterns according to two main categories: locality of the (local/global) search, and whether the search was exact or inexact. I use the term "search" as an abstract action, the action of locating something. "Finding a book I might like to read" and "Finding my copy of Neuromancer among my books" and "Finding reviews of a book on the web" are all examples of search as I'm using it here.

Local/Global, defining at a high level the "depth" that the search will have. This means, for the current search action, the context of the user in relation to what they are trying to find.

Exact/Inexact, defining the "fuziness" of the search. Inexact searches will generally return one or more matches; Exact searches identify a single, unique, item or set.

These categories combined define four main types of RLD.

Examples: DNS is Global/Exact. Google is Global/Inexact. Looking up my own printer on the network is Local/Exact. Looking up any available printer on the network is Local/Inexact.

Now, none of these concepts will come as a shock to anybody. But writing them down, clearly identifying them, was useful to define what I was after, served as a way to categorize when a system did one but not the other, and to know the limits of what I was trying to achieve.

The Manifold Algorithms

With the usage patterns in hand, I looked at how to solve one or more of the problems, considering that my goal was to have something where absolutely no servers of any kind would be involved.

Local RLD is comparatively simple, since the size of the search space is going to be limited, and I had already looked at that part of the problem with my Nom system for ad hoc wireless networks. Looking at the state of the art, one thing that was clear was that every one of the systems currently existing or proposed for global RLD depends on infrastructure of some kind. In some of them, the infrastructure is self-organizing to a large degree, one of the best examples of this being the Internet Indirection Infrastructure (i3). So I set about to design an algorithm that would would work at global scales with guaranteed upper time bounds, which later turned out to be an overlay network algorithm (which ended up being based on a hypercube virtual topology), as opposed to the broadcast type that Nom was. For a bit more on overlays vs. broadcast networks, check out my IEEE article on the topic.

Then the question was whether to use one or the other, and it occurred to me that there was no reason I couldn't use both. It is possible to to embed a multicast tree in an overlay and thus use a single network, but there are other advantages to the broadcast algorithm that were pretty important in completely "disconnected" environments such as wireless ad hoc networks.

So Nom became the local component, Manifold-b, and the second algorithm became Manifold-g.

So that's about it for the intro. I know that the algorithms are pretty crucial but I want to take some time to explain them properly, and their implications, so I'll leave that for later.

As usual, comments welcome!

Categories: science,, technology
Posted by diego on December 3, 2004 at 12:41 PM

IBM puts the PC business on the block

It's all over the wire.

Aside from the historical significance of this, my first feeling about was a bit of sadness, and thinking "No!". My second feeling was surprise at feeling anything about a corporate acquisition! My third :) was a realization that if this happened Apple would be alone, as far as I'm concerned, in moving the ball forward on laptops. IBM was the king of laptop evolution in the PC side, and even though Dell and HP are respectable, they've never shown huge amounts of initiative there (HP has been much better in terms of its PocketPC work, or rather, the iPaq team from Compaq was and HP has kept the ball rolling).

So my thought had a lot less to do with the PC business itself than with the Thinkpad business. IBM desktops were always a bit clunky IMO, and, excellent keyboards aside, they didn't do much for me. No doubt the failed PS/2-OS/2 experiment in the late 80s (remember Microchannel?) had a huge effect on the vitality of the IBM desktop PC line for a long time.

But the Thinkpads! Almost all of my laptops have been Thinkpads. Best keyboards of any laptop. Simple, good design. Amazing reliability. I got a 560e in 1998 and in 2000 I gave it to my parents so they could keep using it. And it still works fine. The battery started failing two years ago. 4 years. No doubt we take good care of the machines, but still, that's quite a long time for these things.

More than anything, what will be missed the most will be the innovation that Thinkpads championed. While a bit dry in terms of design (certainly Apple has been far ahead of everyone else on that count, as usual), they've always moved forward in terms of functionality. Thinkpads where the first to include 10.4" color TFTs, introduced the trackpoint device (invented by IBM, the trackpad was invented by Apple), first notebook to include a CD reader, ultraportables (the 560), first to integrate DVD, and small but useful things like the "ThinkLight" the little light at the top of the LCD that illuminates the keyboard. That, plus what didn't make it in the long such as the amazing Butterfly keyboard or LCD projection that was done by removing the back cover on the display and then resting the now see through LCD on top of an overhead projector, and other things like the Thinkpad Transnote.

Anyway. Maybe IBM will keep part of research focused on that (I hope so!). But it seems more likely that from the point that this transaction happens (assuming it does) then Apple will be the main flag-bearer for innovation on notebooks.

Categories: technology
Posted by diego on December 3, 2004 at 10:11 AM

Copyright © Diego Doval 2002-2011.