Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!cs.utexas.edu!asuvax!ncar!hsdndev!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: Is the Internet usable for wide-area interactive conversations?
Message-ID: <17719.Jun1904.23.0991@kramden.acf.nyu.edu>
Date: 19 Jun 91 04:23:09 GMT
References: <2039.Jun1803.33.1391@kramden.acf.nyu.edu> <35911@ucsd.Edu>
Organization: IR
Lines: 74

In article <35911@ucsd.Edu> brian@ucsd.Edu (Brian Kantor) writes:
> The real solution is to fix telnet and its ilk so that it doesn't kill
> the connection when it gets what could well be a temporary error like
> network unreachable. It's not at all unusual to get temporary errors
> like that whilst rerouting is taking place.

I agree, that would keep connections alive. Provided, that is, that you
convince Sun (among others) to change this behavior, and replace all the
old machines out there.

But you missed my point. Just because service isn't interrupted doesn't
mean it's usable. Folks, an average of 1 second round trip time, with
1.8 seconds standard deviation, is just abominable on a route that could
easily handle several times more data with a round trip time under a
quarter second.

Let me explain what's really happening on the NYU-Berkeley connection.

On the average there's not too much data on each link: some segments of
the ``optimal'' route are at a mere 50% or 90% capacity, with slightly
sub-``optimal'' routes nearby lying unused. So I see round trip times of
well under a second.

Suddenly a few too many people start ftp requests in the same second.
The ``optimal'' route is quickly overwhelmed, packets die like flies,
and my round trip time goes down the drain. The sub-``optimal'' routes
are still carrying almost no traffic.
Our Super Duper Dynamic Routing Protocols see the disaster and respond,
bravely throwing packets to those no longer suboptimal routes until a
permanent lifeline has been established. In the meantime there's been a
service interruption or delay of several seconds up to a few minutes.

Soon the same thing happens again. Again the route is flooded. Again
service disappears. Again the routers intercede and revert to their
original routing decisions. And so it goes, on and on through the night.

At higher loads, a funny thing happens. The load regularly bursts over
the top of what the current route can handle. Within seconds a router
changes its decisions---but the other end simultaneously comes to the
opposite conclusions. By the time each burst of packets has made its
round trip, the routers have changed their decisions again, feeding
their already obsolete data back into the loop. And so the routes
rapidly flap. Down goes the network.

In the meantime, any dolt can see that the network backbone is multiply
connected. While one route degenerates, several parallel routes cruise
along at 1% or 3% capacity. Sure, they didn't look ``optimal'' five
minutes before, because they meant some extra T1 or even 56kb hops. But
if every router simply split its data between the three best routes, the
whole network would be able to handle a far higher load before
*anything* crashed.

A funny thing happens, by the way, when you start using split routes. It
no longer matters much whether you dynamically optimize or not. If your
optimal link goes down, who cares? You're already sending most of your
packets along the three or four slightly suboptimal links. Think of it
as a backup battery system. Not just a backup battery, but a constantly
online backup battery---an uninterruptible power supply, in fact a
supply with three or four big backup batteries that will keep you alive
just as well as the power company.

So there's no point in rushing to react to every little problem.
That way lies inefficiency, route flapping, and madness. You might as
well leave routes constant for a while---a day, say. Just keep track of
how well the routes worked, and the next day adjust the packet flow by a
little bit on each line, making sure never to overload one sensible
route or to ignore another.

I've left out of this story any notes on why NYU-Berkeley was so slow---
why the ``optimal'' routes were so close to capacity that they kept
getting pushed over the edge. Suffice it to say that the entire net,
rather than just isolated pockets, will be seeing similar loads within
two or three years, unless we act now to split packets across every
available line.

---Dan
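
For concreteness, the split-route idea above (spread packets over the
three best routes, then nudge the split a little once a day) might be
sketched roughly as follows. This is only an illustration in Python, not
any real routing protocol: the function names, the even initial split,
the 5% daily step, and the 10% floor are all assumptions made up for the
example.

```python
def initial_shares(route_costs, k=3):
    """Split traffic evenly across the k lowest-cost routes.

    route_costs maps a (hypothetical) route name to its cost metric.
    """
    best = sorted(route_costs, key=route_costs.get)[:k]
    return {route: 1.0 / len(best) for route in best}


def daily_adjust(shares, utilization, step=0.05, floor=0.10):
    """Nudge each route's share once per day.

    Routes that ran hotter than the average lose `step` of share, cooler
    ones gain it. Shares are clamped at `floor` before renormalising, so
    no sensible route is ever ignored outright (the floor is therefore
    approximate after renormalisation).
    """
    mean = sum(utilization[r] for r in shares) / len(shares)
    raw = {}
    for route, share in shares.items():
        if utilization[route] > mean:
            share -= step      # ran hot yesterday: back off a little
        elif utilization[route] < mean:
            share += step      # ran cool yesterday: send a little more
        raw[route] = max(floor, share)
    total = sum(raw.values())
    return {route: value / total for route, value in raw.items()}


# Four candidate routes; the expensive one ("d") is left out of the split.
shares = initial_shares({"a": 1, "b": 2, "c": 3, "d": 10}, k=3)
# Yesterday route "a" ran at 90% while "b" and "c" sat at 30%.
new_shares = daily_adjust(shares, {"a": 0.9, "b": 0.3, "c": 0.3})
```

The point of the small step and the floor is that nothing reacts within
seconds, so there is no feedback loop for the routes to flap on.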