Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!cs.utexas.edu!asuvax!ncar!hsdndev!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: Is the Internet usable for wide-area interactive conversations?
Message-ID: <17719.Jun1904.23.0991@kramden.acf.nyu.edu>
Date: 19 Jun 91 04:23:09 GMT
References: <2039.Jun1803.33.1391@kramden.acf.nyu.edu> <35911@ucsd.Edu>
Organization: IR
Lines: 74

In article <35911@ucsd.Edu> brian@ucsd.Edu (Brian Kantor) writes:
> The real solution is to fix telnet and its ilk so that it doesn't kill
> the connection when it gets what could well be a temporary error like
> network unreachable.  It's not at all unusual to get temporary errors
> like that whilst rerouting is taking place.

I agree, that would keep connections alive. Provided, that is, that you
convince Sun (among others) to change this behavior, and replace all the
old machines out there. But you missed my point.

Just because service isn't interrupted doesn't mean it's usable. Folks,
an average of 1 second round trip time, with 1.8 seconds standard
deviation, is just abominable on a route that could easily handle
several times more data with a round trip time under a quarter second.

Let me explain what's really happening on the NYU-Berkeley connection.
On the average there's not too much data on each link: some segments of
the ``optimal'' route are at a mere 50% or 90% capacity, with slightly
sub-``optimal'' routes nearby lying unused. So I see round trip times of
well under a second.

Suddenly a few too many people start ftp requests in the same second.
The ``optimal'' route is quickly overwhelmed, packets die like flies,
and my round trip time goes through the roof. The sub-``optimal'' routes
are still carrying almost no traffic. Our Super Duper Dynamic Routing
Protocols see the disaster and respond, bravely throwing packets to
those no longer suboptimal routes until a permanent lifeline has been
established. In the meantime there's been a service interruption or
delay lasting anywhere from several seconds to a few minutes.

Soon the same thing happens again. Again the route is flooded. Again
service disappears. Again the routers intercede and revert to their
original routing decisions. And so it goes, on and on through the night.

At higher loads, a funny thing happens. The load regularly bursts over
the top of what the current route can handle. Within seconds a router
changes its decisions---but the other end simultaneously comes to the
opposite conclusions. By the time each burst of packets has made its
round trip, the routers have changed their decisions again, feeding
their already obsolete data back into the loop. And so the routes
rapidly flap. Down goes the network.
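
To make the feedback loop concrete, here is a toy sketch in Python. It
is not a model of any real routing protocol: it assumes a single router,
two routes, and an all-or-nothing rule that acts on the previous step's
load. The function name and the numbers are made up for illustration.

```python
def simulate(load, capacity, steps=6):
    """Return which of two routes (0 or 1) carries the traffic at each step.

    Toy rule: the router only knows the load from the previous step, and
    it moves *all* traffic to the other route whenever the route in use
    was over capacity.
    """
    route, history = 0, []
    for _ in range(steps):
        history.append(route)
        if load > capacity:      # stale measurement says "overloaded"
            route = 1 - route    # everything moves, overloading the other route
    return history


print(simulate(load=0.8, capacity=1.0))  # under capacity: the route never changes
print(simulate(load=1.2, capacity=1.0))  # over capacity: the route flips every step
```

Below capacity the route stays put; above it, the all-or-nothing rule
flips the route on every step without ever relieving the overload.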

In the meantime, any dolt can see that the network backbone is
multiply connected. While one route degenerates, several parallel routes
cruise along at 1% or 3% capacity. Sure, they didn't look ``optimal''
five minutes before, because they meant some extra T1 or even 56 kbit/s hops.
But if every router simply split its data between the three best routes,
the whole network would be able to handle a far higher load before
*anything* crashed.
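
A rough sketch of the arithmetic, in Python. The route names and spare
capacities are hypothetical, and splitting in proportion to spare
capacity is only one possible rule, assumed here for illustration. The
point is just to compare how much load fits before some route saturates.

```python
def split_weights(spare, k=3):
    """Share traffic among the k routes with the most spare capacity,
    in proportion to that spare capacity."""
    best = sorted(spare, key=spare.get, reverse=True)[:k]
    total = sum(spare[r] for r in best)
    return {r: spare[r] / total for r in best}


def max_load(spare, weights):
    """Largest offered load before any route in use saturates."""
    return min(spare[r] / w for r, w in weights.items() if w > 0)


# Hypothetical spare capacities in kbit/s; the route names are made up.
spare = {"direct": 700, "via_a": 1500, "via_b": 1500, "slow": 50}

single = {"direct": 1.0}             # everything on the one "optimal" route
split = split_weights(spare, k=3)    # shared among the three best routes

print(max_load(spare, single))
print(max_load(spare, split))
```

With these made-up numbers the single route tops out at its own 700
kbit/s of headroom, while the three-way split can absorb roughly the sum
of the three routes' headroom before anything saturates.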

A funny thing happens, by the way, when you start using split routes. It
no longer matters much whether you dynamically optimize or not. If your
optimal link goes down, who cares? You're already sending most of your
packets along the three or four slightly suboptimal links. Think of it
as a backup battery system. Not just a backup battery, but a constantly
online backup battery---an uninterruptible power supply, in fact a
supply with three or four big backup batteries that will keep you alive
just as well as the power company.

So there's no point in rushing to react to every little problem. That
way lies inefficiency, route flapping, and madness. You might as well
leave routes constant for a while---a day, say. Just keep track of how
well the routes worked, and the next day adjust the packet flow by a
little bit on each line, making sure never to overload one sensible
route or to ignore another.
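
One way the daily adjustment might look, sketched in Python. The step
size, the floor, and the use of average utilization as the measure of
"how well the routes worked" are all assumptions made for the example,
not part of the proposal above.

```python
def adjust_weights(weights, utilization, step=0.05, floor=0.05):
    """Nudge each route's share of traffic once per day.

    weights      -- current fraction of traffic per route (sums to 1)
    utilization  -- yesterday's average utilization per route, 0.0 to 1.0
    step         -- assumed cap on how far one share moves before renormalizing
    floor        -- assumed minimum share, so no sensible route is ignored
    """
    mean_util = sum(utilization) / len(utilization)
    adjusted = []
    for w, u in zip(weights, utilization):
        # Shift a little traffic away from busier-than-average routes
        # and toward quieter ones.
        delta = max(-step, min(step, step * (mean_util - u)))
        adjusted.append(max(floor, w + delta))
    total = sum(adjusted)
    return [w / total for w in adjusted]


# Hypothetical figures: route 0 ran hot yesterday, routes 1 and 2 did not.
today = adjust_weights([0.6, 0.2, 0.2], [0.9, 0.1, 0.2])
print(today)
```

Each share moves by a small amount per day, so a single bad day cannot
swing all the traffic onto one line.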

I've left out of this story any notes on why NYU-Berkeley was so slow---
why the ``optimal'' routes were so close to capacity that they kept
getting pushed over the edge. Suffice it to say that the entire net,
rather than just isolated pockets, will be seeing similar loads within
two or three years, unless we act now to split packets across every
available line.

---Dan