December 22, 2004
a bit more on manifold-g
Since the Manifold-B algorithm is relatively straightforward, I thought I'd spend a bit more time on Manifold-G. Plus Manifolg-G is a bit cooler and has potentially more applications than this one.
Revisiting: Manifold-G creates a virtual n-cube topology on top of the network, to make search paths predictable (on the virtual topology at least).
The core idea is that any name is in essence a bit-string. This bit string can be obtained directly by either converting the ASCII or Unicode into its bit values, or by hashing (which can cause overlap, but that's not a big problem). Normally you'd use hashing anyway, to compress the string space a bit since lots of strings are not valid names and would never be used.
While a lot of what relates to Manifold-G in the dissertation is treated with set theory, it's useful to visualize it in an euclidean space. Here, the bit strings represent the value for the axis in a particular dimension. Thus a 3-bit string would create a three dimensional cube, a 4-bit string a 4-cube, and so on). In this virtual space the strings are connected by the (imaginary) edges of the n-cube in question, which in the real network become node connections.
Now, a cube can be "navigated" along its edges, and the path between any two nodes is always predictable (and easy to calculate). This is why, for example, hypercube topologies where used in parallel systems like the Connection Machine. The problem is that a network will never be a complete hypercube, it will always be incomplete.
So the question was, how to make it behave as complete, even when it wasn't? The key was to create a "shadow" structure that would also be built with predictable rules. This shadow structure would be self-organizing (and self-balancing), distributing load between nodes evenly. Thus the concept of shadow nodes. Shadow nodes are "representatives" of a node that should be present in the network but isn't. At its simplest, the shadow nodes are actual memory structures (as in the sample code at the end), but because the balancing is mathematically predictable, they can largely be implicit by the values on the nodes. Thus the shadow nodes can be used to "complete" the topology and retain the bounded properties of a hypercube even when not all the vertices for it are present.
I imagine this explanation is a bit confusing, so here is a sample presentation (HTML) that shows how an incomplete 3-cube being built. Hopefully that'll clarify things. :)
Finally, some code. Here's a ZIP file containing a sample manifold-g simulator that runs locally (ie., no network). Crucially though, the information used by the nodes is all local. A small manifold-g network (say, max 5,012 nodes) could be created using exactly this technique (which is easier to see in code, but has the downside of average memory usage = [max number of nodes - actual number of nodes / actual number of nodes]) by simply replacing node-to-node calls with network calls.
PS: The code is GPL'ed, for simplicity.
PS2: yes, there are security issues with this, but they are actually not all that different from DNS itself or any other P2P algorithm for that matter. But that's a topic for another post. :)
PS3: as usual, comments and questions most welcome!
Posted by diego at 12:09 PM | Comments (0)
the manifold algorithms
Okay, so, roughly, what are the Manifold algorithms? They are two different self-organizing algorithms. The first, Manifold-B (Manifold-Broadcast) is used for local/inexact searches, while Manifold-G (Manifold-Global) is used for global self-organizing exact resolution.
Manifold-B
Manifold-B is a typical broadcast-style (or flooding-style) p2p system, similar to gnutella. Basically, on startup a node looks for one or more nodes already in the network (using a list of previously known nodes if available) and establishes connections to some or all of them (depending on individual node load, etc). Once at least one connection is complete queries can begin to propagate. Queries are constrained by a time-to-live (TTL) parameter, propagating across all connections of each node as long as the maximum number of hops has not been reached, as is common in these types of networks.
Some advantages of Manifold-B are:
- Low consistency requirements (ie., no topology structural requirements).
- Allows substring match and multiple key insertion per node.
- Fast in its local environment, with network load controlled by number of nodes.
- Extremely well-adapted to ad hoc wireless environments.
Some disadvantages:
- Local only (not global), and thus limited in reach.
- More overhead than Manifold-G for exact queries.
Manifold-G
Manifold-G is an Overlay, basically a virtual topology built on top of an actual network. Why an overlay? Because when building a virtual topology, we can give it a well-defined structure, that is predictable, with lookup path lengths bounded by a certain limit.
In the case of Manifold-G, the topology used is that of an n-dimensional hypercube (or n-cube, or l-cube, depending on the letter used to specify dimensions :)). Manifold-G assigns each name a particular bit-string, which is controlled by the node with that name (local control was an important component of name resolution, other overlays don't necessarily guarantee locality in that sense, and although you could force them to, they are mostly designed as distributed hashmaps, which maintain a set of keys on each node--which has other great uses!).
The tricky part is that a hypercube can only be navigated predictably when it is complete, i.e., when all its nodes (vertices) are present. So Manifold-G uses the predictability of a structure to its advantage to virtualize those missing nodes. More on the specifics in a later post (if you can't wait, you can also check the text of the dissertation).
Manifold-G advantages:
- Fast! Number of steps to target (or "not found" reported) is a maximum of O(log N) with N the number of nodes in the network.
- There is no penalty for more nodes joining the network, in fact the network works better with more nodes because there are less virtual nodes to take care of. (This is weird, I know, but true).
Disadvantages:
- High consistency requirements. This is probably the weakest point of the algorithm, very clear during join and leave operations of a node, which require a multi-node transactional status to keep the hypercube topology intact and not break searches. This can be minimized by pre-loading information.
- A "neighbor" in Manifold-G can actually be on the other side of the Internet, which can be a problem when slow connections are involved. In small networks (e.g., a few dozen nodes), Manifold-G actually has more overhead than Manifold-B, because the paths in the hypercube do not necessarily adjust to the actual topology. This advantage of Manifold-B is most notable in small-medium ad hoc wireless networks, which do broadcast at the physical layer level. In larger networks, some geolocation-based optimizations as well as proxying are possible, as mentioned in the dissertation.
A couple of things
Aside from serving well for the two main usage modes that I was targeting, the algorithms complement each other. For example, Manifold-B has much lower consistency requirements and so can operate when Manifold-G might not be available, and thus either partially cover some queries, or help Manifold-G in building the network.
It has been shown that it is possible to embed broadcast trees within overlays, but in my opinion maintaining two networks is a small price to pay for the advantages of both, particularly on wireless ad hoc (small-to-medium), where Manifold-B has a clear advantage over the overlay.
Posted by diego at 11:27 AM | Comments (0)
December 19, 2004
the 30,000 ft. view
(cross-posted from this entry on my personal weblog)
As a follow-up to my thesis abstract, I wanted to add a sort of introduction-ish post to explain a couple of things in more detail. People have asked for the PDF of the thesis, which I haven't published yet, for a simple reason: everything is ready, everything's approved, and I have four copies nicely bound (two to submit to TCD) but... there's a signature missing somewhere in one of the documents, and they're trying to fix that. Bureaucracy. Yikes. Hopefully that will be fixed by next week. When that is done, right after I've submitted it, I'll post it here (or, more likely, I'll create a site for it... I want to maintain some coherency on the posts and here it gets mixed up with everything else).
Anyway, I was saying. Here's a short intro.
Resource Location, Resource Discovery
In essence, Resource Location creates a level of indirection, and therefore a decoupling, between a resource (which can be a person, a machine, a software services or agents, etc.) and its location. This decoupling can then be used for various things: mapping human-readable names to machine names, obtaining related information, autoconfiguration, supporting mobility, load balancing, etc.
Resource discovery, on the other hand, facilitates search for resources that match certain characteristics, allowing then to perform a location request or to use the resulting data set directly.
The canonical example of Resource Location is DNS, while Resource Discovery is what we do with search engines. Sometimes, Resource Discovery will involve a Location step afterwards. Web search is an example of this as well. Other times, discovery on its own will give you what you need, particularly if the result of the query contains enough metadata and what you're looking for is related information.
RLD always involves search, but the lines seemed a bit blurry. When was something one and not the other? What defines it? My answer was to look at usage patterns.
It's all about the user
It's the user's needs that determine what will be used, how. The user isn't necessarily a person: more often than not, RLD happens between systems, at the lower levels of applications. So, I settled on the usage patterns according to two main categories: locality of the (local/global) search, and whether the search was exact or inexact. I use the term "search" as an abstract action, the action of locating something. "Finding a book I might like to read" and "Finding my copy of Neuromancer among my books" and "Finding reviews of a book on the web" are all examples of search as I'm using it here.
Local/Global, defining at a high level the "depth" that the search will have. This means, for the current search action, the context of the user in relation to what they are trying to find.
Exact/Inexact, defining the "fuziness" of the search. Inexact searches will generally return one or more matches; Exact searches identify a single, unique, item or set.
These categories combined define four main types of RLD.
Examples: DNS is Global/Exact. Google is Global/Inexact. Looking up my own printer on the network is Local/Exact. Looking up any available printer on the network is Local/Inexact.
Now, none of these concepts will come as a shock to anybody. But writing them down, clearly identifying them, was useful to define what I was after, served as a way to categorize when a system did one but not the other, and to know the limits of what I was trying to achieve.
The Manifold Algorithms
With the usage patterns in hand, I looked at how to solve one or more of the problems, considering that my goal was to have something where absolutely no servers of any kind would be involved.
Local RLD is comparatively simple, since the size of the search space is going to be limited, and I had already looked at that part of the problem with my Nom system for ad hoc wireless networks. Looking at the state of the art, one thing that was clear was that every one of the systems currently existing or proposed for global RLD depends on infrastructure of some kind. In some of them, the infrastructure is self-organizing to a large degree, one of the best examples of this being the Internet Indirection Infrastructure (i3). So I set about to design an algorithm that would would work at global scales with guaranteed upper time bounds, which later turned out to be an overlay network algorithm (which ended up being based on a hypercube virtual topology), as opposed to the broadcast type that Nom was. For a bit more on overlays vs. broadcast networks, check out my IEEE article on the topic.
Then the question was whether to use one or the other, and it occurred to me that there was no reason I couldn't use both. It is possible to to embed a multicast tree in an overlay and thus use a single network, but there are other advantages to the broadcast algorithm that were pretty important in completely "disconnected" environments such as wireless ad hoc networks.
So Nom became the local component, Manifold-b, and the second algorithm became Manifold-g.
So that's about it for the intro. I know that the algorithms are pretty crucial but I want to take some time to explain them properly, and their implications, so I'll leave that for later.
As usual, comments welcome!
Posted by diego at 01:10 PM | Comments (0)