
social software: representing relationships, part 3

[See parts one and two]

Anne linked to my first entry on representing relationships and Alex (sorry, no link) posted a comment there that centered around the following point:

it's interesting that dimensions are here thought of in terms of as static. That the space of visual representation is either 2d/3d. I was under the impression that interactive real-estate is multi-dimensional. I suppose if the design of such renderings is informed by scientific or mathematical diagrams then you are bound - to some degree - to the constraints of such formulations.
I started to reply in a comment there and I just kept typing and typing, so I came to the conclusion it'd be better to post it here...

I noted in my post that relationship patterns are actually n-dimensional, that is, I agree with Alex's comment in that sense. My reasons for looking at 2D/3D formulations are, however, less abstract than Alex implies. Plus, I'll go a bit further (since I don't think that Alex was suggesting that we should all suddenly move to n-dimensional maps) in analyzing why there is a tendency to go after linear, planar (and eventually, at most, volumetric) representations for data.

The 2D/3D representation "lock-in" that we see in UIs today actually has a solid basis in reality. Beyond the physiological limitations that our neural structure and binocular vision create, the laws of physics (as we understand them today :)) dictate that we'll never go beyond 3D visualization. Additionally, our current technology makes it impractical to design everything around a 3D display. (Sorry if this seems a bit too obvious, I just want to clarify in which context I'm looking at things.)

From that it follows that, if we represent n-dimensional data structures, we'll have to create projections. Projections are easy stuff, mathematically speaking (i.e., they involve fairly simple vector math). Visualizing them is not too difficult either. Consider hypercubes, which are one of the easiest cases because they're fully symmetrical graphs. This, for example, is what projections of hypercubes of dimensions n > 3 onto 2D look like [source]:

hypercubes.gif

A 2D projection of a, say, 12D space might be pretty to look at, but I think most users would avoid that kind of complexity and its consequent cognitive overload.
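(Just to back up the claim that projections are mathematically simple, here's a minimal sketch, in Python, of projecting the vertices of an n-dimensional hypercube onto a plane. The choice of projection axes and all the names are mine, purely for illustration; spreading the axes around a circle is what produces the familiar "polygon" look of the diagrams above.)

```python
# Sketch: orthographic projection of n-dimensional hypercube vertices onto 2D.
# Names and the choice of projection vectors are illustrative, not from the post.
from itertools import product
import math

def hypercube_vertices(n):
    """All 2^n vertices of the n-dimensional unit hypercube."""
    return [list(bits) for bits in product((0.0, 1.0), repeat=n)]

def projection_axes(n):
    """Two fixed n-dimensional direction vectors; spreading the coordinate
    axes around a circle gives the familiar 'polygon' look of hypercube drawings."""
    u = [math.cos(2 * math.pi * i / n) for i in range(n)]
    v = [math.sin(2 * math.pi * i / n) for i in range(n)]
    return u, v

def project_to_2d(point, u, v):
    """Dot the n-D point against the two axes to get an (x, y) pair."""
    x = sum(p * a for p, a in zip(point, u))
    y = sum(p * b for p, b in zip(point, v))
    return x, y

if __name__ == "__main__":
    n = 4  # a tesseract
    u, v = projection_axes(n)
    for vertex in hypercube_vertices(n):
        print(vertex, "->", project_to_2d(vertex, u, v))
```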

My point (I do have a point apparently) is that if we agree that we are bound by 2D (eventually 3D) displays, and that n-dimensional projections onto 2D/3D spaces are confusing to navigate for the majority of users, then we should try to use, as much as possible, actual 2D or 3D representations, simply because they are in their "native" form and can be properly optimized for easy, useful tasks that users might have to perform. The data is "transparent"; there are no abstractions to understand to make use of it (which would be necessary for higher-order spaces).

Additionally, those diagrams (while plausible UIs) are in my view more useful as tools for visualizing what is important about the data we're trying to represent (and allow it to be manipulated/analyzed). And while they might be "overused" in the research sense, they haven't been used that much in actual software products. Part of the reason is that they feel "alien" as a way of manipulating data. Products like The Brain have been around for a long time, and yet they haven't taken over the world. Clearly, that's not something that can simply be attributed to, say, a failure of marketing. People like linearity; they are more comfortable with it, and in a pure design sense, the less data there is to deal with, the more users can focus on their work instead of focusing on how to navigate around the damn thing. All the major interfaces today are linear: the most complicated data structures people usually deal with (in filesystems, email programs, etc.) are linearized hierarchies, where they can deal with one linear subspace at a time (the current folder). Yes, there are historical reasons for this, but I also think there's a strong component of user preference in it.

So, if we could pull off a switch from linear to 2D interfaces, even if they are a bit ancient as far as research is concerned, it would be a good step forward, always with the goal of providing the most accurate representation of the complexity we know is there, within the constraints we've got. After all, this is about making it easy for the majority of users, not for people who will read something like this and not run away from the room in a panic. :)

Categories: soft.dev
Posted by diego on February 5, 2004 at 2:14 PM

social software: automatic relationship clustering

[See also this post and a follow-up here]

Regarding my post on Tuesday on social software: representing relationships, my mind kept coming back to one of the things I wrote:

People don't always agree on what the relationship means to each other. This to me points to the need to let each person define their own relationship/trust structures and then let the software mesh them seamlessly if possible.
The reason I kept thinking about it is that I didn't really explain properly what I meant by that.

So what did I have in mind when I said that? Well, lots of things :), but let's start with the basics.

The first thing the software should be able to do is infer what groups are there, rather than be told what the groups are. With that in place, if you simply define relationships to your friends, and you take into account their friends and how they relate to you, you should be able to create a graph of probable relationship clusters, that is, groups formed around strong interpersonal relationships. Sounds far-fetched? Read on...

Well, as it happens, while I was at Drexel I did research on automatic graph clustering, applying genetic algorithms to techniques developed by my advisor at the time, Spiros Mancoridis, except back then it was applied to software systems. But what I realized a couple of days ago is that the same technique, maybe with a few mods, would work just as well here (if not better, because the graphs are smaller). The technique is described here, but to make a long story short, there's basically a set of equations that can be used to provide an objective measure of how good a given clustering of a graph is. The graph must define clusters to which nodes belong, along with the relationships between the nodes. The equation system favors clusters that are densely connected internally and loosely coupled to each other (cluster coupling is determined by the edges that connect nodes across clusters). The objective value is called the modularization quality (MQ) of the graph, which is calculated from an inter-cluster connectivity measure and an intra-cluster connectivity measure.

To make things more concrete, let's look at the simplest type of MQ, one for a directed graph with unweighted edges. Don't panic, it's not as complex as it looks at first glance! :) The intra-cluster measure is calculated as follows:

clustering-ai.gif

Where Ai is the intra-connectivity measure for cluster i, Ni is the number of nodes in the cluster, and mi is the number of edges within it.

The inter-connectivity measure Eij is calculated as:

clustering-eij.gif
with i and j the clusters (or cluster IDs) whose interdependency we are calculating, Ni and Nj the number of nodes in each, and eij the number of edges between them.

These two measures are combined to calculate the MQ of the whole graph:

clustering-mq.gif

With k the number of clusters in the graph. (Btw, this is all much better explained in the paper, but this is good enough to get an idea of what's going on).
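To make this even more concrete, here's a minimal sketch in Python of the MQ calculation for a directed graph with unweighted edges, following my reading of the definitions above. The function and variable names are mine, and a clustering is represented simply as a dict mapping each node to a cluster ID:

```python
# Sketch of the modularization-quality (MQ) measure described above, for a
# directed graph with unweighted edges. The formulas follow the definitions
# given in the text; names and data structures are illustrative.
from itertools import combinations

def mq(edges, assignment):
    """edges: iterable of (source, target) pairs.
    assignment: dict mapping each node to a cluster id.
    Returns the modularization quality of that clustering."""
    clusters = {}
    for node, cid in assignment.items():
        clusters.setdefault(cid, set()).add(node)

    # Count intra-cluster edges and edges between each pair of clusters.
    intra = {cid: 0 for cid in clusters}
    inter = {}
    for src, dst in edges:
        ci, cj = assignment[src], assignment[dst]
        if ci == cj:
            intra[ci] += 1
        else:
            key = frozenset((ci, cj))
            inter[key] = inter.get(key, 0) + 1

    # Intra-connectivity: A_i = m_i / N_i^2
    a = {cid: intra[cid] / (len(nodes) ** 2) for cid, nodes in clusters.items()}

    k = len(clusters)
    if k == 1:
        return next(iter(a.values()))

    # Inter-connectivity: E_ij = e_ij / (2 * N_i * N_j), summed over cluster pairs.
    e_sum = 0.0
    for ci, cj in combinations(clusters, 2):
        e_ij = inter.get(frozenset((ci, cj)), 0)
        e_sum += e_ij / (2 * len(clusters[ci]) * len(clusters[cj]))

    # MQ rewards dense clusters and penalizes coupling between clusters.
    return sum(a.values()) / k - e_sum / (k * (k - 1) / 2)
```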

Now, the problem with this calculation is that it depends on a particular graph clustering, which is precisely what we want to find out, since we are starting from a set of relationships with no clustering. We have one advantage, though: we know that the MQ function has the property of being higher the "better" the clustering is (according to the measures just described).

So what we need to do is treat this as an optimization problem.

There are a number of sub-optimal techniques for traversing a space of values, including hill climbing, genetic algorithms, and so on. We just need to choose one, with the caveat that the larger the space, the larger the probability that we end up at a local maximum (rather than the global maximum) of the space. This is not a problem for relatively small graphs (<50 nodes); at that size we can even do an exhaustive (i.e., optimal) search of the space.
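For small graphs, even a naive hill climb over cluster assignments does the job. Here's a rough sketch that reuses the mq() function from the previous snippet; a real implementation (like Bunch) uses much smarter neighbor moves and restarts, this is just to show the shape of the search:

```python
# Sketch of a simple hill-climbing search over clusterings, maximizing the
# mq() measure sketched earlier. Parameters and names are illustrative.
import random

def hill_climb(nodes, edges, max_clusters=None, iterations=10000, seed=0):
    rng = random.Random(seed)
    nodes = list(nodes)
    max_clusters = max_clusters or len(nodes)

    # Start from a random assignment of nodes to clusters.
    assignment = {n: rng.randrange(max_clusters) for n in nodes}
    best = mq(edges, assignment)

    for _ in range(iterations):
        node = rng.choice(nodes)
        old = assignment[node]
        assignment[node] = rng.randrange(max_clusters)  # try moving one node
        score = mq(edges, assignment)
        if score > best:
            best = score             # keep the improving move
        else:
            assignment[node] = old   # undo a non-improving move
    return assignment, best
```

Because it only ever keeps improving moves, this can get stuck on a local maximum; random restarts (or a genetic algorithm, as in the original research) are the usual way around that.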

If all of this sounds a bit iffy, let me demonstrate with an example. Let's say that we've got the following relationship set from the, um, "real world" (heh):

  • Homer and Marge are married, and they have three children, Bart, Lisa and Maggie. Grandpa is Homer's Dad.
  • Homer works with Carl and Lenny.
  • Homer, Barnie and Moe are friends.
  • Bart, Milhaus, and Lisa know Otto.
The relationship graph can be represented simply with a text file, one relationship per line. (Click here to see the file.) Note: because all the relationships described are bidirectional, we need to describe the connection between each "node" twice, i.e., both "Marge Homer" and "Homer Marge" are necessary to describe the relationship.
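For illustration, here's a sketch of how such a file could be loaded (assuming one whitespace-separated pair of names per line, as just described; the exact format of the linked file may differ slightly, and the file name below is made up):

```python
# Sketch of loading a relationship file of the kind described above: one
# whitespace-separated pair of names per line, with bidirectional
# relationships listed in both directions ("Marge Homer" and "Homer Marge").

def load_relationships(path):
    edges, nodes = [], set()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            a, b = parts
            edges.append((a, b))
            nodes.update((a, b))
    return nodes, edges

# Together with the earlier sketches, finding the clusters is then just:
#   nodes, edges = load_relationships("simpsons.txt")
#   assignment, score = hill_climb(nodes, edges)
```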

With the relationship file ready, but no clusters defined in the file, we can now process the graph and see what the optimization process discovers as "clusters". Here is the result (graph visualized using AT&T's dotty tool--ignore the labels for each cluster, they're automatically generated IDs):

cluster-rel-graph.gif

It is easy to underestimate the significance of obtaining this graph automatically, since the "clusters" that we see are obvious to us (if you've watched The Simpsons, that is :)), but keep in mind: the software has no knowledge of the actual groups, just of the relationships between nodes, i.e., individuals. Additionally, there are more complex MQ calculations that involve weights on the node relationships; using different dimensions for different target groups allows creating different clustered views based on them. It wouldn't be hard to adapt this to parse, say, FOAF files and do some pretty interesting things.

This is clearly only a first step, but once reasonable clusters are obtained the software can begin doing more interesting things, such as suggesting which people you should meet (e.g., when someone belongs to the same cluster as you but you don't know them), defining levels of warning for requests (exchanges between individuals in the same cluster would have less friction), etc.
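The "people you should meet" idea, for instance, falls straight out of the clustering: members of your own cluster that you have no direct edge to yet. A rough sketch (names are mine, and it assumes the same data structures as the snippets above):

```python
# Sketch of the "people you should meet" suggestion: members of your own
# cluster that you have no direct relationship with yet.

def suggest_introductions(person, assignment, edges):
    """assignment: dict node -> cluster id; edges: iterable of (a, b) pairs."""
    known = {b for a, b in edges if a == person} | {a for a, b in edges if b == person}
    cluster = assignment[person]
    return [other for other, cid in assignment.items()
            if cid == cluster and other != person and other not in known]
```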

Cool eh? :)

PS: Speaking of clustering, check this out. The clustering in this case is fairly obvious, but for more complicated sets of relationships the technique I describe would also apply in this space. [via Political Wire]

Categories: soft.dev
Posted by diego on February 5, 2004 at 1:53 AM

Copyright © Diego Doval 2002-2011.