Path: utzoo!utgpu!water!watmath!clyde!rutgers!husc6!bloom-beacon!gatech!hubcap!"Robert
From: rmw6x@hudson.acc.virginia.edu (Robert M. Wise)
Newsgroups: comp.hypercube
Subject: Re: bandwidth balance
Message-ID: <975@hubcap.UUCP>
Date: 15 Feb 88 13:15:08 GMT
Sender: fpst@hubcap.UUCP
Lines: 72
Approved: hypercube@hubcap.clemson.edu

In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>When building a parallel machine, a designer chooses the balance between
>computational resources, and memory bandwidth. For example, both Intel and 
>Thinking Machines recently announced new hypercubes, which had about the 
>same memory bandwidth as the previous models, but with vector arithmetic 
>units spread through the cube.  
> 

Could you define what you mean by memory bandwidth?  Do you mean the
width of the data bus, or do you mean the overall throughput (for lack
of a better term)?  Is the memory bandwidth on a hypercube found by
multiplying the bandwidth of one node by the number of nodes in use?

>In general, today's hypercubes are bandwidth-heavy compared to conventional
>machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
>know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
>memory.  A 64K-processor Connection Machine has a 64K-bit-wide path. This is
>somewhat more than any Cray - regardless of where in the Cray you choose to
>measure.
>

By the way, do you know anyone that HAS a 1024-node NCUBE?  I think
the largest one in use (at this time, and last I heard, etc etc
standard disclaimer...) is a 512-node version at NCUBE, kind of their
in-house development machine.

>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>considerable success in getting near-linear scaleup on his NCUBE/10. In
>particular, he mentioned a Laplacian solver which was deliberately memory
>intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
>times less computation. He claimed his time-to-result was dramatically
>better than on the Sandia Cray, even though the Cray is the superior in
>MFLOPS.
>
How many nodes does his NCUBE/10 have, and how much memory per node?

>This raises several interesting questions.
>- Could this algorithm work on the Cray, or is the massive memory bandwidth
>  the whole secret ?
>- Is a 64-processor Cray-4 going to compare more favorably with the (bigger)
>  cubes it will compete with ?
>- Can we find other problems that fall to such attacks ?
>
>I'd call this good news.
>
>

With architectures like the hypercube, you can often get speedup
by using more memory.  Consider the matrix multiplication problem
(AB=C).  If every node computes a subset of the final matrix elements
by means of the standard inner product, then to compute an N x N
portion of the final matrix it must look at (at minimum) N rows of A
and N columns of B.  If the problem is divided among four nodes, each
node has to look at half of the elements of A and half of the
elements of B.  If memory were very restricted, the problem could
be solved by "pipelining" the matrix rows/columns (or even individual
elements) to each node in a ring or mesh topology.  However, if every
node starts with all the elements it needs, then no communication
between nodes is ever needed; all that remains is to put the results
somewhere.
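
To make the no-communication case concrete, here is a minimal sketch in
Python.  It only simulates the nodes one after another on a single
machine; the function name block_matmul and the grid parameter are
made up for illustration and do not correspond to any hypercube
library.  It assumes square matrices whose size divides evenly by the
grid dimension.

```python
def block_matmul(A, B, grid=2):
    """Compute C = A*B as grid*grid independent "node" tasks.

    Each simulated node owns one square block of C and holds, up front,
    exactly the rows of A and columns of B that block needs, so no data
    is exchanged between nodes while computing.
    """
    n = len(A)
    assert n % grid == 0, "sketch assumes n divisible by grid"
    size = n // grid
    C = [[0] * n for _ in range(n)]
    held = {}  # node -> (elements of A held, elements of B held)

    for bi in range(grid):
        for bj in range(grid):
            rows = range(bi * size, (bi + 1) * size)
            cols = range(bj * size, (bj + 1) * size)

            # Local memory of this node: its rows of A, its columns of B.
            local_a = {i: A[i] for i in rows}
            local_b = {j: [B[k][j] for k in range(n)] for j in cols}

            # Standard inner product, using local data only.
            for i in rows:
                for j in cols:
                    C[i][j] = sum(a * b for a, b in zip(local_a[i], local_b[j]))

            held[(bi, bj)] = (len(local_a) * n, len(local_b) * n)

    return C, held
```

With grid=2 (four nodes) and 4 x 4 inputs, each node holds 8 of the 16
elements of A and 8 of the 16 elements of B, which is the "half of A
and half of B" figure above.  The price is that every row and column
is stored on two nodes instead of one.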

I suspect that there are a lot of algorithms which benefit from this
approach, although not as much as the matrix multiplication kind of
thing.  Any thoughts on this?  Might make an interesting paper.
Hmmmmm.  Never mind, I didn't say that...


-Bob Wise

bitnet: rmw6x@virginia
internet: rmw6x@hudson.acc.virginia.edu