Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!ihnp4!think!bradley
From: bradley@think.ARPA (Bradley C. Kuszmaul)
Newsgroups: net.arch,net.math
Subject: Re: "The Shared Memory Hypercube"  Do you smell any smoke?
Message-ID: <1523@think.ARPA>
Date: Fri, 3-May-85 11:35:03 EDT
Article-I.D.: think.1523
Posted: Fri May  3 11:35:03 1985
Date-Received: Sat, 4-May-85 01:21:50 EDT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <1483@think.ARPA> <560@lll-crg.ARPA>
Reply-To: bradley@think.UUCP (Bradley C. Kuszmaul)
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 203
Xref: watmath net.arch:1144 net.math:1973
Summary: 


In this posting, my latest entry in the "dualing hypercubists munch
Cray" story, I am going to "almost concede" several of Dr. Brooks' points.

For example, the assertion that a 5-10 MFLOP processor is more cost
effective than a smaller one may very well be true (one of the
percieved problems of the Connection Machine (CM) is that its
floating point performance (~60 to 600 MFLOPS, depending on how you
measure it) is not as good as one might expect from a machine with 4K
chips and 64K processors (four square feet of special purspose
silicon, plus memory and glue) If Thinking Machines (TMC) had used 4K
chips to maximize floating point speed (with, perhaps, 2K 5 Mflop
chips, and 2K network chips), the CM might have a 4 GFLOP
performance.

 The truth is that connection machine was not designed to be a high
speed number cruncher, but instead was designed to do something
called "logical inferences" (no one really knows how to measure
logical inferences per second, but everyone at TMC seems to believe
that the CM will do them faster than anything else).

 My research (apparently, with my ongoing master's research, I lose
in the "battle of the titles" to Brooks' completed doctorate)
involves trying to run applicative languages (languages without side
effects) on the CM.  One of the big arguments for running applicative
languages is that the parallelism is easy to exploit, and thus
programs running on an architecture well suited for such languages
should be able to run very quickly (hah!).  In order to protect my
reputation when the CM does not run my programs very quickly, I claim
to be "simulating applicative architectures".  :-)

 Sometimes, I believe that the CM was designed correctly after all,
since all the programs I have written are dominated by communication
time, so what's the point of spending more on processors if you can't
communicate fast enough to keep them busy.

 Sometimes, I believe that the CM was not designed to be fast enough
(it was conservatively designed so that it would work the first
time).  If the processors were bigger, maybe there would be less of a
communications bottleneck (hopefully by initiating less of it) .  If
the router were faster, or of a different topology, maybe the
communications traffic would run faster.  If we upped the clock to
10ns, it might be enough faster to make everyone happy.  (When I told
one of the designers that I want the model two to have a million
processors and a 10ns clock, he said I was on drugs, and elaborated
with "I don't want to design with ECL, if you look at it the wrong
way it fails.  How do you distribute and use a 10ns clock with CMOS?
How do you get bits across four meter wires at 10ns?  You design it,
we'll build it")


In article <560@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes:

>> I don't see how wiring a 3d cube with 32 processors on a side gives us a
>> 15 cube.
>Consider a n=5 hypercube with the processors aligned along a line spaced at say
>one foot intervals.  Each processor is labeled by a 5 bit number.  These 5 bits
>form the first 5 bits of node addresses of the n=15 cube we are constructing.

A machine in a box 32 feet on a side.  Wow!  I've been considering
building big machines (I've been thinking about machines the size of a
small building too), and I'm glad someone else thinks the way I do :-)

Of course the million processor machine will be something like 100
feet on a side.  (Hopefully that foot includes allowances for
accessing the machine to service it, since I can't imagine why else a
processor would want to be that far away from its neighbor :-)

Ok, so you are wiring up a 5-cube in a linear region, and the wiring
bundle will be about 16 wires at the thickest point.  That is not too
tough (in fact TMC wired up a 5-cube on one printed circuit board - no
doubt with an infinite number of layers :-)  To get up to the 2M
processor machine (sorry about jumping up to 2M from 1M, but 2M is a
cube in base 2) you need 128 processors lined up in a row, and the
wiring bundle will be 64 wires thick at the thickest point.  While it
may be possible to do,  I still claim that the packaging is very hard.
(I suppose it would help to have some custom wires too.  The CM uses
standard ribbon cable for its longer connections (i.e. the connections
along the highest dimensions), and printed circuit board for the lower
connectinos.  I guess pcb counts as custom wire.  (And to think that
TMC is getting a 64K machine into a 5 foot cube (apples and oranges of
course - 64K of really small processors is not the same as 32K of one
foot cubed processors).  And the speed of light across a 32
foot region is going to be fun to deal with (I know I am looking
forward to it:  Lets see now, what time is it anyway)  The speed of
light problem, can of course, be dealt with in lots of ways (decrease
the clock speed of the network, but keep the processors going faster,
etc.)

>I would be quite happy with a 1024 processor cube of 10 MFLOP nodes.
>Will any one sell me one?  I'll settle for 128 nodes as a start.

I understand that qualified buyers can obtain information about buying
a CM from TMC.  (Don't look at me that way... I only use TMC's
computers (at least during the school year).  The prototype is 64K
processors with from 1 to 10 Kflops per processor.  That's the best I
can offer :-)

>> >Deadlock free routing requires that a specific path be used for each
>> >sender/reciver pair.
>Sorry, I should have been more precise.  The only known proof for deadlock
>free routing, without assuming infinite diskpacks at each node or any
>other unacceptable bandwidth reducer like inhibiting a PE from sending
>another message until a return receipt is received

Such a bandwidth reducer is not neccessarily unacceptable.

> (see the further caviat
>below), requires that a specific path be chosen for each sender/receiver pair.
>This proof is documented in Dick Lang's 81(I think) Caltech Thesis.  There is
>a caviat not clearly documented in Langs thesis as well.  If each message is
>acknowledged with a response indicating the sucessful reception of the message
>(as part of the reception action) then you need two seperate networks for the
>messages and responses or infinite disk packs at each node.

I still am not sure I understand.  Suppose that each message is
acknowledge with a reciept before the next message is allowed to be
sent.  Then with an N processor machine, there can be at most N
messages in the network at one time, so you only need that much memory
at each node.  Thus I conclude that you are acknowledging
messages, but that you are allowed to send more messages before the
acknowledgement comes back.  I don't know what the acknowledgements do
for you, but it seems like you could create exactly the same deadlock
by using messages with no acknowledgement (by "simulating" the
acknowledgement in software:  Whenever a processor would have sent an
acknowledgement, it can just send a regular message.)


>>   Thinking Machines Corporation
>Have you guys over there taught any of your machines to "Think" yet?

Let's play the Turing test:  Can you prove that I am not one of those
machines (please don't respond by saying that my postings display a
lack of thought) :-

>Sorry, I know its old and you have heard it before but I couldn't resist. :-)

Could be worse.  It used to be International Thinking Machines.
(Imagine large blue letters with horizontal stripes spelling out ITM.)

>> Ah, but there are on the order of 2^N wires in the machine.
>There are NlogN/2 wires, there logN ports on each network node, there are
>N nodes and each wire has two ends.  Where does the 2^N come from?

My N was the number of dimensions, while your N is the number of
nodes.  Sorry about the confusion.  Still seems like a lot of wires
to me.

>The utilization of the wires is around 50% as packets travel half the maximal
>logN distance on average.

How big a cube is that?  (I am still trying to dig up the reference about
14 cubes only using 25% of their potential bandwidth...)  A handwaving
argument that larger cubes make less effective use of their wires than
smaller cubes goes like this.  There are NlogN/2 wires, but only N
messages could possibly be inserted into the net at a time, so there
is logN factor that is being wasted.  Furthermore, as the number of
processors grows th neumber of possible collisions increases, further
decreasing the percentage of used bandwidth.  (You don't have to
believe the argument if you don't want to...)

>> Why do you have shared memory?
>A shared memory machine can do anything a message passing machine can do.
>You can even efficiently run message passing software.  The shared memory
>machine is then more flexible and useful on a wider range of problems.

Conversely, a message passing machine can do anything a shared memory
machine can do.  You can even efficiently run shared memory software.
(Apparently that is how you are implementing shared memory) The
message passing machine is then more flexible and useful on a wider
range of problems.

>> Too many wires, and not very effectivley used.
>Wrong on both counts, see the comments made above.
Too many wires, and not very effectively used.  (Boring aren't I)

>> discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
>> network achieves better bandwidth by decreasing the distance between
>The k shuffle does not achieve better bandwidth, it achieves a lower latency.
>Bandwith is bandwidth, latency is latency and parts is parts. :-)  The lower
>latency could translate to shorter vector lengths but as is shown in my paper
>the vector lengths required for the hypercube are fine.  The worry is that
>the k shuffle will have the start address pickyness of the butterfly which
>is unacceptible.  The answer to this question is not known at present.

But because the messages spend less time floating around in the
network, the network may be able to achieve higher throughput in the
$k$ shuffle than in the cube.

>> Very few programs actually use the n-cube directly
>Take a look at the FFT,  its an important problem, it blows smoke on the
>hypercube.  As I have said before, the wires is there.  The FFT uses all of
>them.
I was afraid you might bring up FFT.  Just as a matter of curiosity,
what percentag of total computing time do you use up on FFT's?

>Speaking of smoke, when it clears read the papers I sent you via SNAIL MAIL.

Oh well.  My response is wimpy this time.  I'll read the papers, and
then flame on...
 -Brad-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA