Path: utzoo!utgpu!water!watmath!clyde!rutgers!husc6!bloom-beacon!gatech!hubcap!"Robert
From: rmw6x@hudson.acc.virginia.edu (Robert M. Wise)
Newsgroups: comp.hypercube
Subject: Re: bandwidth balance
Message-ID: <975@hubcap.UUCP>
Date: 15 Feb 88 13:15:08 GMT
Sender: fpst@hubcap.UUCP
Lines: 72
Approved: hypercube@hubcap.clemson.edu

In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>
>When building a parallel machine, a designer chooses the balance between
>computational resources, and memory bandwidth. For example, both Intel and
>Thinking Machines recently announced new hypercubes, which had about the
>same memory bandwidth as the previous models, but with vector arithmetic
>units spread through the cube.
>

Could you define what you mean by memory bandwidth? Do you mean the width
of the data bus, or do you mean the overall throughput (for lack of a
better term)? Is the memory bandwidth on a hypercube found by multiplying
the bandwidth of one node by the number of nodes in use?

>In general, today's hypercubes are bandwidth-heavy compared to conventional
>machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
>know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
>memory. A 64K-processor Connection Machine has a 64K-bit-wide path. This is
>somewhat more than any Cray - regardless of where in the Cray you choose to
>measure.
>

By the way, do you know anyone who HAS a 1024-node Ncube? I think the
largest one in use (at this time, and last I heard, etc., etc., standard
disclaimer...) is a 512-node version at Ncube. It is kind of their in-house
development machine.

>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>considerable success in getting near-linear scaleup on his NCUBE/10. In
>particular, he mentioned a Laplacian solver which was deliberately memory
>intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
>times less computation.
>He claimed his time-to-result was dramatically
>better than on the Sandia Cray, even though the Cray is the superior in
>MFLOPS.
>

How many nodes is his Ncube-10, and how much memory per node?

>This raises several interesting questions.
>- Could this algorithm work on the Cray, or is the massive memory bandwidth
>  the whole secret ?
>- Is a 64-processor Cray-4 going to compare more favorably with the (bigger)
>  cubes it will compete with ?
>- Can we find other problems that fall to such attacks ?
>
>I'd call this good news.
>
>

With architectures like the hypercube, you can often get speedup by using
more memory. Consider the matrix multiplication problem (AB=C). If every
node computes a subset of the final matrix elements by means of the
standard inner product, then for it to compute a portion of the final
matrix that is N x N, it must look at (minimum) N rows and N columns. If
the problem is divided among four nodes, each node has to look at half of
the elements of A and half of the elements of B.

If memory were very restricted, then the problem could be solved by
"pipelining" the matrix rows/columns (even elements) to each node in a
ring or mesh topology. However, if every node has all the elements that it
needs to start with, then no communication between nodes is ever done; all
that need be done is to put the results somewhere.

I suspect that there are a lot of algorithms which benefit from this
approach, although not as much as the matrix multiplication kind of thing.
Any thoughts on this? Might make an interesting paper. Hmmmmm. Never mind,
I didn't say that...

-Bob Wise
bitnet: rmw6x@virginia
internet: rmw6x@hudson.acc.virginia.edu
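
A rough sketch of the matrix multiplication point above, for anyone who
wants to play with it. Everything here is an illustrative assumption and
not taken from any real hypercube library: the function names, the 2 x 2
grid of four "nodes", and the fact that the nodes are simulated by a plain
loop on one machine. Each simulated node is handed its rows of A and its
columns of B up front, computes its block of C with the standard inner
product, and never talks to another node.

```python
# Hypothetical sketch: the "nodes" are simulated by loops on one machine.
# All names and the 2 x 2 node grid are assumptions made for illustration.


def matmul(a, b):
    """Standard inner-product multiply of two matrices (lists of rows)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def node_inputs(a, b, grid, node_row, node_col):
    """Return the full rows of A and full columns of B one node must hold."""
    block = len(a) // grid  # assumes grid divides the matrix size evenly
    r0 = node_row * block
    c0 = node_col * block
    a_rows = a[r0:r0 + block]                    # 'block' complete rows of A
    b_cols = [row[c0:c0 + block] for row in b]   # 'block' complete columns of B
    return a_rows, b_cols


def replicated_multiply(a, b, grid=2):
    """Compute C = AB block by block with no exchange between nodes.

    Returns C and the number of matrix elements each node had to store.
    """
    n = len(a)
    block = n // grid
    c = [[0] * n for _ in range(n)]
    words_per_node = 0
    for node_row in range(grid):
        for node_col in range(grid):
            a_rows, b_cols = node_inputs(a, b, grid, node_row, node_col)
            local = matmul(a_rows, b_cols)  # purely local work on this node
            words_per_node = (sum(len(r) for r in a_rows)
                              + sum(len(r) for r in b_cols))
            # "Put the results somewhere": copy the block into C.
            for i in range(block):
                for j in range(block):
                    c[node_row * block + i][node_col * block + j] = local[i][j]
    return c, words_per_node


if __name__ == "__main__":
    size = 4
    a = [[i * size + j for j in range(size)] for i in range(size)]
    b = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    c, words = replicated_multiply(a, b, grid=2)
    print("matches plain multiply:", c == matmul(a, b))
    print("elements held per node:", words, "of", 2 * size * size)
```

With 4 x 4 matrices on four nodes, each node holds 16 of the 32 input
elements (half of A plus half of B), so the four nodes together store twice
the original data. That doubling is the memory paid to avoid all
communication between nodes.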