Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!ihnp4!think!bradley
From: bradley@think.ARPA (Bradley C. Kuszmaul)
Newsgroups: net.arch,net.math
Subject: Re: topology of highly connected computer systems
Message-ID: <1483@think.ARPA>
Date: Tue, 30-Apr-85 10:01:13 EDT
Article-I.D.: think.1483
Posted: Tue Apr 30 10:01:13 1985
Date-Received: Wed, 1-May-85 03:45:10 EDT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA>
Reply-To: bradley@think.UUCP (Bradley C. Kuszmaul)
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 205
Xref: watmath net.arch:1132 net.math:1966
to: bradley, brooks@lll-crg.ARPA
Summary: 

I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it
personally.  It sounds like Mr. Brooks has given the topic a lot of
thought, and most of our differences seem to come from different
assumptions.  Once we start with different assumptions, its hard to get
the same conclusions.  Some of my assumptions are:
  - Lots and lots and lots of small processors are better than fewer big
processors.
  - Any particular connection topology will be not quite right for most
applications, so we might be willing to sacrifice the best case behavior
for more generality or better average case behavior (for example nearest
neighbor communication in a 2 dimensional grid is very good, but not
very general.  A cube is well connected, but may use too many wires to
achieve its bandwidth)

In article <551@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes:
>> out, but for something like the cosmic cube (64 processors?) we are only
>> talking about 192 wires, (probably expensive coaxial wires), which is
>> not too much of a pain to actually construct in 3-space.
>
>The 64 processor was wired in a single backplane with wire wrap wire and
>not coaxial cable.  The wiring was not even twisted pairs.  Of course it
>probably would have been a lot less flakey with twisted pairs and appropo
>drivers.

Sorry about that.  I was under the impression from the discussion in
this group that the cosmic cube's connections are implemented with
ethernet controllers.  My main point was that a 6-cube is easy to wire
up, and it only gets easier by using smaller wires.

>> I am much more interested in what happens as $n$ becomes much larger
>> (e.g. $n=10^6$).>
>A million processors not not a problem at all (Well perhaps I am stretching it
>a bit.  I don't actually advocate a million processors.  I thing more along the
>lines of a few hundred 20 MIP/MFLOP processors as being interesting.).

I beg to differ.  I think that a million processors is not nearly
enough.  I believe that our differences probably stem from different
approaches to mapping applications onto the parallel processors.  If I
were doing something like electrical simulation of a VLSI
circuit, I would like to have each processor simulate a single
transistor.  If I am doing finite element analysis, each processor might
simulate one of the elements.  While it is true that by speeding up the
processors, and using fewer of them, we should be able to get the same
performance on some applications, I tend to think that it is possible to
get more MIPS per dollar by using smaller, cheaper processing elements.

>>  The wiring problem becomes very hard again.  The
>> Connection Machine with 64K processors in the prototype faces an
>> engineering challenge on this front.
>How to
>construct a very large N cube can easily be worked out on the back of an
>envelope.  You accomplish the feat by building and wiring the machine in 3
>dimensions.  32k processors is a piece of cake (its only 32 processors on a
>side) and one million processors is within the realm of possibility.

I don't see how wiring a 3d cube with 32 processors on a side gives us a
15 cube.  It is important to realize that if you bisect an N cube
"through" one of its dimensions (i.e. pick some $0<=i<N$, put all the
processors with 0 in the $i$th bit of on one side of the cut, and all
the other processors on the other side) that every processor has one
wire crossing the bisection.  (e.g. for a 32K machine, there are 16K
processors on each side, and each processor can communicate with one
processor on the other side of the bisection directly, so there are 16K
wires crossing the bisection.).  For a million processor machine, there
are half a million wires crossing the bisection.  I still claim that it
is hard to lay it out in three space.

>> if some node or wire fails, the machine might be able to keep running.
>> (fault tolerence is important for a million processor machine)).
>> Redundancy is also nice because it means that if lots of messages want
>> to go from A to B, then we may be able to send more than one at a time
>> through the net.
>Deadlock free routing requires that a specific path be used for each
>sender/reciver pair.
We probably are thinking of routing from different points of view.  I
don't believe that deadlock free routing requires a specific path be
used for each pair.  proof: Suppose we have a cube connected packet
switched network with enough storage at each vertice to store on message
from every other vertice (usenet in the year 2010?).  Suppose that we
further place the requirement that no pe (= vertice) is allowed to
inject a message into the network unless its previously injected
messages have already been delivered.  (We might require that every
processor which wants to send a message do it at the same time, and then
wait for the network to empty out before we send any more messages).
There will be no deadlock, since even if all the messages want to go to
the same place, there will be enough storage for them.


>> One bad thing about cube topologies (at least for cubes bigger than
>> about 10 dimensions) is that they do not use their bandwidth very
>> effectively.  It has been claimed (with proof - references may be
>> available on demand) that a 14 cube sending an "average" set of messages
>> only uses about one fourth of the bandwidth because of collisions.

>I would like to see the reference list, please send it to me via email.
Ok, this may take a second...

>By the same token you might send my your SNAIL MAIL address if you are
>interested in a paper describing a solution to the bandwidth problem.
>Request the paper titled "Shared Memory Hypercube".

Please send the paper to

  Bradley C. Kuszmaul
  Thinking Machines Corporation
  245 First Street
  Cambridge, MA 02142

>The switching technque gives NlogN bandwidth, tested by simulation to N=1024,
>and it adaptively removes conflicts.
Ah, but there are on the order of 2^N wires in the machine.  Sounds like
the wires are not being used very effectively.

> My use of the network is to create a
>shared memory server for a hypercube machine with CRAY style vector processors.
>With simultaneous vector access by all processors the packet conflict problem
>is as severe as it can be.  The network provides FULL bandwidth to all
>processors simultaneously.

It sounds to me like the packet conflict problem is as nice as it could
possibly be for that problem.  (Presumably processor number $i$ is
accessing memory $f(i)$ where $f$ is one-to-one.  What happens if $f$ is
not one-to-one (e.g. if $f(i)=c$ for some constant $c$.)  If $f$ is
always the same function, I would just use local memory and flush the
topology.  Thus, for your application, $f$ must be a different function
on different memory accesses.  How many different functions $f$ are
there that you can get the bandwidth?  I doubt that you can do it for
every one-to-one function.

Random thoughts and questions:
Why do you have shared memory?  Presumably this is so that processors
can communicate:  Why don't you just use the cube network to communicate
directly?  What is the start up time for your vectors (i.e. how big does
a vector have to be before the vector processing part wins over the
scalar processing part.)  Typical vector processors are limited in their
speed by lack of memory bandwidth (this is true for a single processor
with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved
memory in order to try to increase the memory bandwidth).  I don't see
how putting a cube topology between the memory and the processor, and
adding processors helps this situation.  Is it the case that your vector
processors are an order of magnitude slower than a CRAY (You mention
20MFLOPS earlier in the article)


>> It has been claimed, in this discussion, that cube topologies are nice
>> because they can simulate other topologies easily.  Actually, this is
>> not really a good argument, since most of the other seriously considered
>> topologies are equally good at simulating each other
>The cube topology is nice as it has other useful topologies IMBEDDED within it.
>We are not talking about easy simulation here, efficient execution is the
>target.  The wires is there! 

Too many wires, and not very effectivley used.  Suppose you spent the
same amount of money on a different network which gave you twice the
bandwidth for most of your applications.  Would the N-cube still make
sense?  The nice thing about the TOTALLY connected topology is that
EVERY other topology is embedded within it, but no one seriously
considers wiring up a totally conneccted network with a million
processors in it.  (10^12 wires?)

>> I suspect that the real reason that people are building N-cubes instead
>> of skewed rings, minimum spanning trees, or some other topology, is that
>> computer scientists have a bias towards binary representations of
>> things, and the cube topology is nice and binary, and because routing is
>> so easy.
>Admittedly the routing is easy, this is one of the reasons for the cubes
>attractiveness.  The real reason for the interest in it however is it has
>a genuine bandwidth edge over the other topologies.

That is not really true.  One of the topologies mentioned in this
discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
network achieves better bandwidth by decreasing the distance between
nodes in the graph, by using the same degree of vertices, and
using about the same number of wires as a cube.  $k$-shuffles are just
harder to understand:

  Write your address in base $k$.

  Now take this address and shift it left one digit (i.e. $\log k$ bits)
Now add any number from $0$ to $k-1$ to the number, and your original
processor is connected to that processor.

  If there are $n$ processors in the system, then the maximum distance
between two processors is $log n / log k$.  For 64K procesors in a
16-shuffle (which uses the same number of wires as a 64K 16-cube), the
maximum distance between two processors is 4 jumps (compared to 16 jumps
for the cube).  This means that there are likey to be only a quarter as
many conflicts, which increases the bandwidth.  If you actually had some
problem which only did nearest neighbor communication in an n-cube, we
could run your program almost as quickly on the k-shuffle topology,
because at worst there would be a four processor jump to perform your
communication (instead of one jump).  Very few programs actually use the
n-cube directly (although there are a few (e.g. subcube induction to
count the number of processors in a given state - [D.P. Christman,
"Programming the Connection Machine", MIT masters thesis, January,
1984])) And for most programs the $k$-shuffle will be at least as good.
Routing can be tricky for a $k$-shuffle.  There are topologies with even
better bandwidth which do have easy routing.


Happy connections.
 -Bradley
-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA