Now blogging at diego's weblog. See you over there!

social software: automatic relationship clustering


[See also this post and a follow-up here]

Regarding my post on tuesday on social software: representing relationships, my mind kept coming back to one of the things I wrote:

People don't always agree on what the relationship means to each other. This to me points to the need to let each person define their own relationship/trust structures and then let the software mesh them seamlessly if possible.
The reason I kept thinking about it is that I didn't really explain properly what I meant by that.

So what did I have in mind when I said that? Well, lots of things :), but let's start with the basics.

The first thing that the software should be able to do is infer what groups are there, rather than be told what the groups are. With this in hand, if you simply define relationships to your friends, and you take into account their friends and how they relate to you, you should be able to create a graph of probable relationship clusters, that is, groups formed around strong interpersonal relationships. Sounds farfetched? Read on...

Well, as it happens while I was at Drexel I did research on automatic graph clustering, applying genetic algorithms to techniques developed by my advisor at the time, Spiros Mancoridis, except back then it was applied to software systems. But what I realized a couple of days ago is that the same technique, maybe with a few mods, it would work just as well (if not better, because the graphs are smaller). The technique is described here, but to make a long story short, there's basically a set of equations that can be used to provide an objective measure of how well the clustering is done on a certain graph. The graph must define clusters to which nodes belong, along with the relationships between the nodes. The equation system favors loosely coupled clusters with dense inter-cluster relationships between the nodes (cluster coupling is determined by the edges that connect different nodes across clusters). The objective value is called the modularization quality of the graph, which is calculated by using an inter-cluster connectivity measure and an intra-cluster connectivity measure .

To make things more concrete, let's look at the simplest type of MQ, one for a directed graph with unweighted edges. Don't panic, it's not as complex as it looks at first glance! :) The intra-cluster measure is calculated as follows:

clustering-ai.gif

Where Ai is the intra-connectivity measure for cluster i, Ni is the number of nodes in the cluster, and mi is the number of edges in the cluster.

The inter-connectivity measure Eij is calculated as:

clustering-eij.gif
with i and j the cluster numbers (or IDs) to calculate interdependency for, N the number of nodes in each and eij the number of edges between them.

These two measures are combined to calculate the MQ of the whole graph:

clustering-mq.gif

With k the number of clusters in the graph. (Btw, this is all much better explained in the paper, but this is good enough to get an idea of what's going on).

Now, the problem with this calculation is that it depends on a particular graph clustering, which is precisely what we want to find out since we are assuming a set of relationships with no clustering. We have one advantage though, we know that MQ function has the property of being higher the "better" the clustering is (according to the measures just described).

So what we need to do is treat this as an optimization problem.

There are a number of sub-optimal techniques to traverse a space of values, including hill climbing, genetic algorithms, and so on. We just need to choose one, with the caveat that the larger the space the larger the probability that we are hitting a local maximum (rather than the overall maximum) of the space. This is not a problem for relatively small graphs (<50 nodes), with that size we can even do an exhaustive (ie., optimal) search on the space.

If all of this sounds a bit iffy, let me demonstrate with an example. Let's say that we've got the following relationship set from the, um, "real world" (heh):

  • Homer and Marge are married, and they have three children, Bart, Lisa and Maggie. Grandpa is Homer's Dad.
  • Homer works with Carl and Lenny.
  • Homer, Barnie and Moe are friends.
  • Bart, Milhaus, and Lisa know Otto.
The relationship graph file can be represented simply with a text file, relationship per line. (Click here to see the file). Note: because all the relationships described are bidirectional, we need to describe the connection between each "node" twice, i.e., both "Marge Homer" and "Homer Marge" are necessary to describe the relationship.

With the relationship file ready, but no clusters defined in the file, we can now process the graph and see what the optimization process discovers as "clusters". Here is the result (graph visualized using AT&T's dotty tool--ignore the labels for each cluster, they're automatically generated IDs):

cluster-rel-graph.gif

It is easy to underestimate the significance of obtaining this graph automatically, since the "clusters" that we see are for us obvious (if you've watched The Simpsons, that is :)), but keep in mind: the software has no knowledge of the actual groups, just of the relationships between nodes, ie., individuals. Additionally, there are more complex MQ calculations that involves weights on the node relationships; using different dimensions for different target groups allows creating different clustered views based on them. It wouldn't be hard to adapt this to parse, say, FOAF files and do some pretty interesting things.

This is clearly only a first step, but once reasonable clusters are obtained the software can begin doing more interesting things, such as suggesting which people you should meet (e.g., when someone belongs to the same cluster as you but you don't know them), defining levels of warning for requests (exchanges between individuals in the same cluster would have less friction), etc.

Cool eh? :)

PS: Speaking of clustering, check this out. The clustering in this case is fairly obvious, but for more complicated sets of relationships the technique I describe would also apply in this space. [via Political Wire]

Categories: soft.dev
Posted by diego on February 5 2004 at 1:53 AM

Copyright © Diego Doval 2002-2011.
Powered by
Movable Type 4.37