
((no single == many) && (no single != no)) point(s) of failure


So today an outage of some sort at Akamai's distributed DNS service brought down access to some major sites from various parts of the world, including Google, Yahoo, and Microsoft. Pretty quickly, as evidenced by this Slashdot thread, questions about whether the days of "no single point of failure" are over started to pop up.

The myth of the Internet being so resilient that it would never fail is an interesting one. More accurately, it's a set of layered myths that go back to the often-repeated idea that "the Internet was designed to survive a nuclear attack".

One of the crucial ideas of ARPAnet was that it would be packet-switched, rather than circuit-switched. With packet-based communications, the packets will attempt to reach their destination regardless of the circuit used, and there is no question that packet-based networks are much more resilient to failures than circuit-switched networks.

Let me be clear: part of my argument is semantic. That is, the fact that packet-switching means "no single point of failure" doesn't mean that there are no points of failure at all. The problem, however, is that we end up ignoring the word "point" and reading "no failure". The idea of "no single point of failure" eventually ends up implying "failure proof". Which is why we are so surprised when a systemic failure does occur.
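To make the distinction concrete, here is a minimal sketch of routing around failures on a toy network. The four-node topology, the node names, and the `find_route` helper are all made up for illustration; real routing protocols are far more involved. The point is only that a network can survive any one node going down and still fail when several go down at once.

```python
from collections import deque


def find_route(links, source, destination, failed=frozenset()):
    """Breadth-first search for a path that avoids failed nodes.

    Returns the path as a list of node names, or None if no path exists.
    """
    if source in failed or destination in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbor in links.get(node, ()):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical topology: A reaches D through either B or C.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

print(find_route(LINKS, "A", "D"))                      # normal operation
print(find_route(LINKS, "A", "D", failed={"B"}))        # one failure: rerouted
print(find_route(LINKS, "A", "D", failed={"B", "C"}))   # two failures: no route
```

Losing B alone is harmless because traffic goes through C. Losing B and C together cuts A off from D entirely, even though neither one was a "single point of failure".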

ARPAnet, however, never qualified as a failure-proof network, and the points of failure were few enough that "no single point of failure" had little meaning. In the early days you could literally take out most of the Internet by cutting a bunch of cables in certain areas of Boston and California. With time, yes, more lines of communication became available, reducing the probability of failure even further, but even today the amount of trans-continental and intercontinental bandwidth is certainly not infinite.

But, ok. Let's concede the point that a systemic failure at the packet-switching level is of very low probability in today's Internet. What about the services?

Because it is the services that create today's Internet. And many of the services that the Internet depends on are centralized.

Take DNS. Originally, name resolution occurred by matching names against the contents of the local hosts table (stored in /etc/hosts), and when a new host was added, a new hosts table was propagated across the participating hosts. Eventually, this process became impossible, since hosts were being added too fast. This led, in the 80s, to the development of DNS, which eventually became the standard.
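For a feel of how simple the hosts-table approach was, here is a rough sketch of a lookup against a table in the usual /etc/hosts layout (an address followed by one or more names per line). The sample entries and the `parse_hosts` and `resolve` helpers are invented for illustration, not taken from any real resolver.

```python
def parse_hosts(text):
    """Parse hosts-file text into a {name: address} table."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table[name.lower()] = address  # host names are case-insensitive
    return table


def resolve(name, table):
    """Return the address for name, or None if the table has no entry."""
    return table.get(name.lower())


# Hypothetical entries, for illustration only.
SAMPLE_HOSTS = """
# address    names
10.0.0.1     alpha  alpha.example
10.0.0.2     beta
"""

table = parse_hosts(SAMPLE_HOSTS)
print(resolve("alpha", table))
print(resolve("gamma", table))  # unknown until a new table is distributed
```

Every host has to hold the full table, so a name that is missing from the local copy simply does not resolve until the next copy arrives. That is the part that stopped scaling.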

DNS, however, is a highly centralized system, and it was designed for a network a couple of orders of magnitude smaller than what we have today. The fact that it does work today is more a credit to sheer engineering prowess in implementation, rather than design, although the design was clearly excellent for its time.

Even today, if the root Internet clusters (those that serve the root domains) were to be seriously compromised, the Internet would last about a week until most of the cached DNS mappings expired. And then we'd all be back to typing IP numbers.

And it doesn't stop with DNS. What if Yahoo! were to go offline? What if Google vanished for a week? What if someone devised a worm that flooded, say, 70% of the world's email servers?

For users, the Internet has now become its applications and services rather than its protocols. And the applications and services leave a lot to be desired.

What's missing is a shift at the service and application level in all fields: routing, security, and so on (spam is just the tip of the iceberg). Something that brings the higher levels of networking in line with the ideas of packet switching.

So, today, Akamai sneezes and the rest of the world gets a cold. Tomorrow, it will be someone else. This will keep happening until the high-level infrastructure we use every day becomes decentralized itself. Only then will the probability of systemic failure be low enough. Low enough, mind you, not non-existent: biomimetism and self-organization, after all, don't guarantee eternity. :)

Categories: soft.dev, technology
Posted by diego on June 15 2004 at 7:41 PM

Copyright © Diego Doval 2002-2011.