Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 8/28/84; site lll-crg.ARPA
Path: utzoo!linus!philabs!cmcl2!seismo!umcp-cs!gymble!lll-crg!brooks
From: brooks@lll-crg.ARPA (Eugene D. Brooks III)
Newsgroups: net.arch,net.math
Subject: "The Shared Memory Hypercube"  Do you smell any smoke?
Message-ID: <560@lll-crg.ARPA>
Date: Wed, 1-May-85 04:04:41 EDT
Article-I.D.: lll-crg.560
Posted: Wed May  1 04:04:41 1985
Date-Received: Fri, 3-May-85 08:06:20 EDT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <1483@think.ARPA>
Distribution: net
Organization: Lawrence Livermore Labs, CRG group
Lines: 197
Xref: linus net.arch:942 net.math:1601

> I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it
> personally.  It sounds like Mr. Brooks has given the topic a lot of thought
Not to mention a lot of work,
its Dr. Brooks. :-)

>Some of my assumptions are:
>   - Lots and lots and lots of small processors are better than fewer big
> processors.
A very bad assumption,  you want as many of the most COST EFFECTIVE processors
as you can afford.  To do anything else is to waste your computational dollar.
At the moment the most cost effective processor for vectorizable scientific
work is a VLSI vector processor of about 5-10 MFLOP capability.  The speed
figure is climbing steadly and will probably saturate at not too many factors of
two greater.  In the serial area the current 32bit micros are certainly very
cost effective.  Would you pay a dollar for an apple if you could buy it for
50 cents?

>   - Any particular connection topology will be not quite right for most
> applications, so we might be willing to sacrifice the best case behavior
> for more generality or better average case behavior (for example nearest
> neighbor communication in a 2 dimensional grid is very good, but not
> very general.  A cube is well connected, but may use too many wires to
> achieve its bandwidth)
This is points for the cube, it has very general utility.  It is one of the most
general topologies due to the many topologies imbedded in it.  Yes, it has
a lot of wires.  Whether a k connected network (4x4, 8x8, ... switching nodes
instead of 2x2 switch nodes) is better is an open research question.  You trade
wires for silicon.  You have to do the relevant simulations to find the answers
to this question.  The switching technique described in my recent work applies
to these systems as well.  You might consider writing the simulations and
publishing the papers for these!


> I tend to think that it is possible to
> get more MIPS per dollar by using smaller, cheaper processing elements.
You get the most MIPS for your dollar by using the most COST EFFECTIVE
processing elements.  These do not happen to be the smallest and cheapest.

> I don't see how wiring a 3d cube with 32 processors on a side gives us a
> 15 cube.
Consider a n=5 hypercube with the processors aligned along a line spaced at say
one foot intervals.  Each processor is labeled by a 5 bit number.  These 5 bits
form the first 5 bits of node addresses of the n=15 cube we are constructing.
Take 32 of these linear arrays. Place them side by side forming a plane.
Connect all of the 0 processors together in a line with the same wiring pattern
as used for the linear array.  Do the same for all of the other processors.
You now have a n=10 hypercube with 1024 processors.  The row index forms the
next 5 bits in the now 10 bit processor enumeration. All wires run in either
the x or y directions and the machine is a physical plane.  Now take 32 of these
planes and stack them in the z direction one foot apart.  Connect the processors
in each vertical line (there are 32x32 of them) with the same wiring pattern
as for the original line of 32 processors.  You now have a n=15 hypercube.
It is physically constructed in a 3 dimensional box 32'x32'x32'.  All wires
run in the x,y or z directions with a very regular pattern.  No wires run
diagonally.  The 1 meg processor is within reach but as I noted earlier
I would be quite happy with a 1024 processor cube of 10 MFLOP nodes.
Will any one sell me one?  I'll settle for 128 nodes as a start.
 
> >Deadlock free routing requires that a specific path be used for each
> >sender/reciver pair.
Sorry, I should have been more precise.  The only known proof for deadlock
free routing, without assuming infinite diskpacks at each node or any
other unacceptable bandwidth reducer like inhibiting a PE from sending
another message until a return receipt is received (see the further caviat
below), requires that a specific path be chosen for each sender/receiver pair.
This proof is documented in Dick Lang's 81(I think) Caltech Thesis.  There is
a caviat not clearly documented in Langs thesis as well.  If each message is
acknowledged with a response indicating the sucessful reception of the message
(as part of the reception action) then you need two seperate networks for the
messages and responses or infinite disk packs at each node.  I ran into a
deadlock condition in one of my hypercube simulations.  The "cure" was to use
seperate networks for the messages and return receipts.  Of course there may
be a better routing algorithm.  If you find it publish it and send me a copy
of the paper.  In fact send me a copy of the paper before you publish it,
I would like to take advantage of the algorithm as soon as possible in my
simulations and wouldn't want to wait for the publication lead time.


>   Thinking Machines Corporation
Have you guys over there taught any of your machines to "Think" yet?

Sorry, I know its old and you have heard it before but I couldn't resist. :-)


> Ah, but there are on the order of 2^N wires in the machine.
There are NlogN/2 wires, there logN ports on each network node, there are
N nodes and each wire has two ends.  Where does the 2^N come from?
The utilization of the wires is around 50% as packets travel half the maximal
logN distance on average.

> > My use of the network is to create a
> >shared memory server for a hypercube machine with CRAY style vector processors.
> >With simultaneous vector access by all processors the packet conflict problem
> >is as severe as it can be.  The network provides FULL bandwidth to all
> >processors simultaneously.
> 
> It sounds to me like the packet conflict problem is as nice as it could
> possibly be for that problem.  (Presumably processor number $i$ is
> accessing memory $f(i)$ where $f$ is one-to-one.
There are several addressing modes,  random start addresses with a common
stride among the processors, random scatter/gather and random start with
random stride.  Here the starting f(i) is not one to one as the start addresses
are unconstrained and may overlap.  The magic of the network is that it moves
toward an attractor in its motion, falls into lock step making f(i) one to one
and achieves full bandwidth.  The vector lengths required to do this are of
course a function of the machine size N and are documented in the papers
"The butterfly processor-memory interconnection in a vector processing
environment" and "The shared memory hypercube".  The Hypercube does well on
all of the addressing patterns, the butterfly does well on some of them.

> local memory
The shared memory hypercube has local memory.  Whether a segment is local or
global is useage determines how you arrange the addressing for the segment.
There are N local segments and one global segment in the simplest arrangement.
The ith processor has absolute priority and low latency for the ith local
segment.  It can still access all of the other local segments with lower
priority and higher latency.  This is one of the BIG advantages of the cube.
It can be N scalar machines, 2 N/2 parallel machines, ... or one big parallel
machine at the whim of the software.  You put the local scalar variables of
a task in a local segment.  Its very flexible!

> Why do you have shared memory?
A shared memory machine can do anything a message passing machine can do.
You can even efficiently run message passing software.  The shared memory
machine is then more flexible and useful on a wider range of problems.
It turns out that the type of codes run at LLNL are very difficult to run
on anything but a shared memory machine.  Message passing machines, to the
extent that they can live with a lower message bandwidth than aggregate memory
bandwidth , ie the processes are loosely coupled, are of course cheaper as the
network is cheaper.  If we could use one at LLNL it would have been bought a 
long time ago.


> What is the start up time for your vectors (i.e. how big does
> a vector have to be before the vector processing part wins over the
> scalar processing part.)
As you might guess it gets longer as the number of processors used in the
problem gets bigger.  Notice that I say the number of processors used
in the PROBLEM here.  For the butterfly its the number of processors in the
MACHINE which is potentially a different animal.  Luckily its a very slowly
growing function (like (logN)^2 with a suitably small constant out front).
   You would like it, its a large N winner.

> Typical vector processors are limited in their
> speed by lack of memory bandwidth (this is true for a single processor
> with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved
> memory in order to try to increase the memory bandwidth).
Your choice is as to whether you want to interleave several hypercube topology
memory servers, interleave the memory modules or accept a 10-20 mip instruction
rate.  It is not yet known whether the interleaving will break the adaptive
nature of the network that is so crucial to vector performance.  This simulation
has not yet been done but if you know me, and I am assuming that you have
learned a little about me by now, you can be sure the simulation will be done
soon.  If the interleave does not break the adaptive nature of the network
then interleave, if it does break the adaptive nature stay with 10MFLOP
processors and 100ns ram.  Remember that we are following a line of maximum
cost effectivness and not maximum absolute speed per node.  A 10MFLOP node
is maximally cost effective at the present time and is what you would use
anyway.  Yes, the processor node is an order of magnitude slower than a Cray.
It is also an order of magnitude more cost effective.  A 128 node machine
seems like a good start.  It would cost the same amount as a Cray and be 10
times faster in terms of aggregate performance.

> Too many wires, and not very effectivley used.
Wrong on both counts, see the comments made above.

> Suppose you spent the
> same amount of money on a different network which gave you twice the
> bandwidth for most of your applications.
Find it, prove its better than the cube for most of LLNL's applications,
build it and we will buy it.  Thats the bottom line.  If after spending
some time studying the matter you don't find something better, build a
shared memory hypercube to the proper specs.  We would have bought one
of those yesterday.  Now isn't that better than Russian roulette!

> discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
> network achieves better bandwidth by decreasing the distance between
The k shuffle does not achieve better bandwidth, it achieves a lower latency.
Bandwith is bandwidth, latency is latency and parts is parts. :-)  The lower
latency could translate to shorter vector lengths but as is shown in my paper
the vector lengths required for the hypercube are fine.  The worry is that
the k shuffle will have the start address pickyness of the butterfly which
is unacceptible.  The answer to this question is not known at present.

> Very few programs actually use the > n-cube directly (although there are
> a few (e.g. subcube induction to
> count the number of processors in a given state - [D.P. Christman,
> "Programming the Connection Machine", MIT masters thesis, January,
> 1984])) And for most programs the $k$-shuffle will be at least as good.
Take a look at the FFT,  its an important problem, it blows smoke on the
hypercube.  As I have said before, the wires is there.  The FFT uses all of
them.

Speaking of smoke, when it clears read the papers I sent you via SNAIL MAIL.

Any other takers for those papers? Pun intended. :-)
Send me your SNAIL MAIL address!