Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!ihnp4!think!bradley
From: bradley@think.ARPA (Bradley C. Kuszmaul)
Newsgroups: net.arch
Subject: Re: "The Shared Memory Hypercube"  Do you smell any smoke?
Message-ID: <1590@think.ARPA>
Date: Fri, 10-May-85 09:54:20 EDT
Article-I.D.: think.1590
Posted: Fri May 10 09:54:20 1985
Date-Received: Sat, 11-May-85 02:26:34 EDT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <973@ames.UUCP>
Reply-To: bradley@think.UUCP (Bradley C. Kuszmaul)
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 153
Summary: 


    Path: think!mit-eddie!genrad!panda!talcott!harvard!seismo!hao!ames!eugene
    From: eugene@ames.UUCP (Eugene Miya)
    Newsgroups: net.arch
    Subject: Re: "The Shared Memory Hypercube"  Do you smell any smoke?
    Date: 6 May 85 23:05:16 GMT
    Date-Received: 8 May 85 08:09:55 GMT
    References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <1483@think.ARPA> <560@lll-crg.ARPA>

    > >Some of my assumptions are:
    > >   - Lots and lots and lots of small processors are better than fewer big
    > > processors.
    > A very bad assumption,  you want as many of the most COST EFFECTIVE
    > processors
    > . . .
    > > I tend to think that it is possible to
    > > get more MIPS per dollar by using smaller, cheaper processing elements.
    > You get the most MIPS for your dollar by using the most COST EFFECTIVE
    > processing elements.  These do not happen to be the smallest and cheapest.

    We just had a SIG meeting with Joe Oliger (the CS chair at Stanford)
    recently.  Joe came to the conclusion [during the course of thinking] that
    "fewer, high performance CPU" proponents of multiprocessing had the
    advantage in being able to fit reasonable portions of problems into
    individual processor/memories.

The claimed advantage is predicated on the idea that each processor
solving more of the problem is better, either in programming ease,
communications overhead, or some other measure of cost.  This advantage
may go away again if small portions of the problems are the RIGHT thing
for each processor to handle.  Such a case might be some finite element
analysis problem, in which it might be easy to program each
processor to simulate exactly one element, and in which there is not a
lot of communication between different parts of the problem (e.g. if all
the interaction is with the nearest neighbors of the finite element).  (Of
course, if it is too hard to write the program, then I lose, and if
every element wants to talk to every other element in every discrete
time unit, then I lose.  I think it is actually easier to program lots
and lots of processors than "just" several processors, and I think that
most simulations have strong locality of communication properties.)

    Do not forget! We are not developing these machines in a vacuum.
    We have to look at the applications which may be run on these machines.
    Consider a 100 x 100 x 100 array with 30 variables (and increasing
    as our knowledge of the natural sciences increases).

I claim that what you really want is 100^3 processors each with only
enough memory for the 30 variables that you need.  I think that the
problems that you are addressing are due to the fact that the natural
granularity of the problem does not match the natural granularity of the
machine you are using.

If you are iteratively solving PDE's on a 100^3 matrix, then there is
not that much communication between different parts of the matrix (there
is some local communication), so you might want to put each element of
the matrix in a different processor, where the processors are (at least
logically, if not physically) arranged in a 3d grid network.
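
The locality claim above can be sketched in a few lines.  This is only
an illustration under stated assumptions: Jacobi averaging stands in for
"iteratively solving PDE's" (no particular method is named here), and
the names make_grid, jacobi_step and OFFSETS are hypothetical.  The
point it shows is that each cell, treated as its own tiny processor,
reads exactly six neighbors per step and nothing else.

```python
# Illustrative only: Jacobi averaging is an assumed stand-in for the
# iterative PDE solver; the post does not name a particular method.

# Offsets to the six nearest neighbors in a 3d grid network.
OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def make_grid(n, boundary=1.0, interior=0.0):
    """Build an n x n x n grid as a dict keyed by (i, j, k)."""
    grid = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                on_edge = 0 in (i, j, k) or (n - 1) in (i, j, k)
                grid[(i, j, k)] = boundary if on_edge else interior
    return grid


def jacobi_step(grid, n):
    """One sweep: each interior cell becomes the mean of its six neighbors.

    Returns (new_grid, messages), where messages counts neighbor reads,
    i.e. the values a one-cell-per-processor machine would have to send.
    Boundary cells are held fixed.
    """
    new_grid = {}
    messages = 0
    for (i, j, k), value in grid.items():
        if 0 in (i, j, k) or (n - 1) in (i, j, k):
            new_grid[(i, j, k)] = value
            continue
        total = 0.0
        for di, dj, dk in OFFSETS:
            total += grid[(i + di, j + dj, k + dk)]
            messages += 1
        new_grid[(i, j, k)] = total / 6.0
    return new_grid, messages


if __name__ == "__main__":
    n = 4
    grid, messages = jacobi_step(make_grid(n), n)
    print(messages, "neighbor reads for", (n - 2) ** 3, "interior cells")
```

The message count grows as six per interior cell, independent of the
size of the grid, which is the sense in which the communication is
local.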

(Handwaving cost argument:  If adding a processor to every couple of
thousand bits of memory increases the memory cost by only a few percent
(including all the costs of power supplies, boards, air conditioners
etc.), and furthermore, you can use an otherwise cheaper memory system
because you don't need all the interleave hardware, and the memory does
not have to be that fast, and you don't need caches between the
processor(s) and the memory, then it might very well be the case that
for essentially the cost of a large (but relatively low cost per bit)
memory you can have a supercomputer.  Why put the central processor(s)
in?)
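
A rough check of how the 100 x 100 x 100 x 30 example lines up with
"a processor to every couple of thousand bits" can be sketched as
follows.  The 64-bit word size is an assumption (it is not stated in
this thread), and the function names are hypothetical.

```python
# Back-of-envelope sizing for one processor per grid point.
# WORD_BITS is an assumption; the thread does not state a word size.

GRID_SIDE = 100            # 100 x 100 x 100 array, from the quoted example
VARIABLES_PER_POINT = 30   # from the quoted example
WORD_BITS = 64             # assumed word size


def bits_per_processor(variables, word_bits):
    """Memory each processor needs if it holds exactly one grid point."""
    return variables * word_bits


def total_bits(side, variables, word_bits):
    """Memory for the whole problem, summed over side**3 processors."""
    return side ** 3 * bits_per_processor(variables, word_bits)


if __name__ == "__main__":
    per_proc = bits_per_processor(VARIABLES_PER_POINT, WORD_BITS)
    total = total_bits(GRID_SIDE, VARIABLES_PER_POINT, WORD_BITS)
    print("processors:         ", GRID_SIDE ** 3)
    print("bits per processor: ", per_proc)
    print("total bytes:        ", total // 8)
```

Under that assumption each processor holds 1920 bits, which is in the
"couple of thousand bits" range, and the whole problem is 240 million
bytes of memory spread over a million processors.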

    I can barely fit a fluid dynamics code on a 32-node Hypercube
    because the storage requirements per CPU are fierce [a different
    example, not the 100^3x30 example].

Details?

    Jack Dennis had to revise the way he thought dataflow machines need
    to be built after two weeks here: he needs more memory and faster
    I/O.

That is not quite how I interpreted the report of the RIACS study that
he gave to his research group.  He mostly seemed pleased that FORTRAN
programmers could actually learn to program in an applicative language.
Adding memory and faster I/O may be just as expensive as just adding
more processors.  (Of course, Jack thinks I am an extremist for
believing that a million processors is not enough, so my interpretation
of what he said is also subject to correction (e.g. I would not want to
try to explain to him how I ever got the idea that he said or meant the
things I am attributing to him. :-))

    If there is any one thing multiprocessors allow us to do, it is add yet
    more memory.  Hum? :-)

If there is one thing that adding more memory does, it is increase the
cost of the system, especially when you have to do more hacking to get
the processors to talk to the memory (e.g. shared memory,
interleaving...)

    > > What is the start up time for your vectors (i.e. how big does
    > > a vector have to be before the vector processing part wins over the
    > > scalar processing part.)

    I have just tested this recently on four different CRAY architectures.
    Short vector startup time is very good.

I am aware that the Cray is good at short vectors (that's one of the
reasons that the Cray is so popular).  The Cray is one of the few vector
machine architectures which is good at short vectors.  My question had
to do with how good Dr. Brooks's computer is at vectors (he said that he
was connecting several "Cray class" processors together, which is not
the same thing as several Cray processors).

    > > Typical vector processors are limited in their
    > > speed by lack of memory bandwidth (this is true for a single processor
    > > with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved

    Our Cray has a much higher degree of interleave.  The new C-2 will have
    128-way.

  Which only reinforces the idea that vector processors are often
limited by their memory bandwidth.  I understand that the C-2 (Cray 2?)
is running a faster clock (~ 4ns) and slower memory (>100ns) than the
Cray 1 (12.5 (or 9) ns and 50 (or faster) ns respectively).  This means
that an interleave of something like 25 is the minimum that you can get
away with without loss of performance, just as the Cray 1 "has to have"
an interleave of four to keep the processor busy on memory operations.
There are a number of reasons for increasing the interleave far beyond
the minimum (e.g. the processor hitting the memory interleave, multiple
processors with one memory, independent I/O processors).  I have also
heard rumors to the effect that Cray has broken down and put multiport
memory in.
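
The interleave figures above follow from dividing the memory cycle time
by the processor clock period.  A minimal sketch of that arithmetic,
using only the clock and memory numbers quoted in this article (the
function name min_interleave is hypothetical, and one memory reference
per clock is an assumption):

```python
import math


def min_interleave(memory_cycle_ns, clock_ns):
    """Smallest number of banks that lets the processor issue one memory
    reference per clock without waiting on a busy bank.

    Assumes one reference per clock, spread evenly across the banks.
    """
    return math.ceil(memory_cycle_ns / clock_ns)


if __name__ == "__main__":
    # Cray 1 figures as quoted: 12.5 ns clock, 50 ns memory.
    print("Cray 1:", min_interleave(50, 12.5))
    # C-2 figures as quoted: ~4 ns clock, >100 ns memory (a lower bound).
    print("C-2 (at least):", min_interleave(100, 4))
```

With the quoted numbers this gives 4 for the Cray 1 and 25 for the C-2.
Since the C-2 memory is given only as ">100ns", 25 is a lower bound.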
  These high bandwidth memories are very expensive, and I think there is
an argument for avoiding them.  Note that this argument, by itself,
supports anything from the 10Kflop processor that the Connection
Machine uses up to the 10MIPS processors that Brooks advocates, but
it becomes hard to justify a 1Gflop processor simply because of the
high cost of its memory system.

  I have some specific questions for Dr. Brooks (which I suspect he has
good answers for): 

  In the discussion of the architecture which you advocate, we have been
concentrating on the bandwidth considerations.  I am now wondering about
latency issues.

    What is the latency of your shared memory?  (If you are running a
10MIPS processor, does that mean that a memory operation on shared
memory can complete in time for the next memory operation to continue?)
Do you use pipelining to help deal with any such latency problems?  If
you do, how "full" and "deep" does the pipeline have to be to keep the
processor busy?  Is there local memory for each processor, and does it
have lower latency than the global memory?
-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA