Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!ihnp4!think!bradley
From: bradley@think.ARPA (Bradley C. Kuszmaul)
Newsgroups: net.arch
Subject: Re: "The Shared Memory Hypercube" Do you smell any smoke?
Message-ID: <1590@think.ARPA>
Date: Fri, 10-May-85 09:54:20 EDT
Article-I.D.: think.1590
Posted: Fri May 10 09:54:20 1985
Date-Received: Sat, 11-May-85 02:26:34 EDT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <973@ames.UUCP>
Reply-To: bradley@think.UUCP (Bradley C. Kuszmaul)
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 153
Summary:

Path: think!mit-eddie!genrad!panda!talcott!harvard!seismo!hao!ames!eugene
From: eugene@ames.UUCP (Eugene Miya)
Newsgroups: net.arch
Subject: Re: "The Shared Memory Hypercube" Do you smell any smoke?
Date: 6 May 85 23:05:16 GMT
Date-Received: 8 May 85 08:09:55 GMT
References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> <1483@think.ARPA> <560@lll-crg.ARPA>

> >Some of my assumptions are:
> >  - Lots and lots and lots of small processors are better than fewer big
> >    processors.
> A very bad assumption, you want as many of the most COST EFFECTIVE
> processors
> . . .
> > I tend to think that it is possible to
> > get more MIPS per dollar by using smaller, cheaper processing elements.
> You get the most MIPS for your dollar by using the most COST EFFECTIVE
> processing elements.  These do not happen to be the smallest and cheapest.

We just had a SIG meeting with Joe Oliger (the CS chair at Stanford) recently. Joe came to the conclusion [during the course of thinking] that "fewer, high performance CPU" proponents of multiprocessing had the advantage in being able to fit reasonable portions of problems into individual processor/memories.

The claimed advantage is predicated on the idea that each processor solving more of the problem is better, whether in programming ease, communications overhead, or some other measure of cost. This advantage may go away again if small portions of the problem are the RIGHT thing for each processor to handle. Such a case might be a finite element analysis problem, in which it is easy to program each processor to simulate exactly one element and there is not a lot of communication between different parts of the problem (e.g. if all the interaction is with the nearest neighbors of the finite element). (Of course, if it is too hard to write the program, then I lose, and if every element wants to talk to every other element in every discrete time unit, then I lose. I think it is actually easier to program lots and lots of processors than "just" several processors, and I think that most simulations have strong locality-of-communication properties.)

Do not forget! We are not developing these machines in a vacuum. We have to look at the applications which may be run on these machines. Consider a 100 x 100 x 100 array with 30 variables (and increasing as our knowledge of the natural sciences increases).

I claim that what you really want is 100^3 processors, each with only enough memory for the 30 variables that you need. I think that the problems you are addressing arise because the natural granularity of the problem does not match the natural granularity of the machine you are using. If you are iteratively solving PDEs on a 100^3 matrix, then there is not that much communication between different parts of the matrix (there is some local communication), so you might want to put each element of the matrix in a different processor, where the processors are (at least logically, if not physically) arranged as a 3-D grid network.

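The locality claim above can be made concrete with a small counting sketch. This is only an illustration under assumptions that are not in the post: one element per processor, a grid with no wraparound at the edges, and one message per neighbor per time step. The function names are hypothetical.

```python
# Sketch only: counts messages per time step for an n x n x n grid,
# assuming one element per processor and no wraparound (not a torus).

def neighbors(i, j, k, n):
    """Return the nearest neighbors of element (i, j, k) in an n^3 grid."""
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1))
    out = []
    for di, dj, dk in steps:
        a, b, c = i + di, j + dj, k + dk
        # Elements on a face, edge, or corner have fewer than six neighbors.
        if 0 <= a < n and 0 <= b < n and 0 <= c < n:
            out.append((a, b, c))
    return out

def local_messages(n):
    """Directed messages per step when each element talks only to neighbors."""
    # Along each of the 3 axes there are (n - 1) * n^2 adjacent pairs,
    # and each pair exchanges a message in both directions.
    return 6 * (n - 1) * n ** 2

def all_to_all_messages(n):
    """Directed messages per step when every element talks to every other."""
    p = n ** 3
    return p * (p - 1)

if __name__ == "__main__":
    n = 100
    print(local_messages(n), all_to_all_messages(n))
```

For the 100^3 case this gives 5,940,000 messages per step with nearest-neighbor communication against 999,999,000,000 for all-to-all, which is the gap between the case where the fine-grained approach works and the case where it loses.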
(Handwaving cost argument: If adding a processor to every couple of thousand bits of memory increases the memory cost by only a few percent (including all the costs of power supplies, boards, air conditioners etc.), and furthermore, you can use an otherwise cheaper memory system because you don't need all the interleave hardware, and the memory does not have to be that fast, and you don't need caches between the processor(s) and the memory, then it might very well be the case that for essentially the cost of a large (but relatively low cost per bit) memory you can have a supercomputer. Why put the central processor(s) in?)

I can barely fit a fluid dynamics code on a 32-node Hypercube because the storage requirements per CPU are fierce [a different example, not the 100^3x30 example].

Details?

Jack Dennis had to revise the way he thought dataflow machines need to be built after two weeks here: he needs more memory and faster I/O.

That is not quite how I interpreted the report of the RIACS study that he gave to his research group. He mostly seemed pleased that FORTRAN programmers could actually learn to program in an applicative language. Adding memory and faster I/O may be just as expensive as just adding more processors. (Of course, Jack thinks I am an extremist for believing that a million processors is not enough, so my interpretation of what he said is also subject to correction (e.g. I would not want to try to explain to him how I ever got the idea that he said or meant the things I am attributing to him. :-))

If there is any one thing multiprocessors allow us to do, it is add yet more memory.

Hum? :-) If there is one thing that adding more memory does, it is increase the cost of the system, especially when you have to do more hacking to get the processors to talk to the memory (e.g. shared memory, interleaving...)

> > What is the start up time for your vectors (i.e. how big does
> > a vector have to be before the vector processing part wins over the
> > scalar processing part.)

I have just tested this recently on four different CRAY architectures. Short vector startup time is very good.

I am aware that the Cray is good at short vectors (that's one of the reasons that the Cray is so popular). The Cray is one of the few vector machine architectures which is good at short vectors. My question had to do with how good Dr. Brooks's computer is at vectors (he said that he was connecting several "Cray class" processors together, which is not the same thing as several Cray processors).

> > Typical vector processors are limited in their
> > speed by lack of memory bandwidth (this is true for a single processor
> > with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved

Our Cray has a much higher degree of interleave. The new C-2 will have 128-way.

Which only reinforces the idea that vector processors are often limited by their bandwidth. I understand that the C-2 (Cray 2?) is running a faster clock (~ 4ns) and slower memory (>100ns) than the Cray 1 (12.5 (or 9) ns and 50 (or faster) ns respectively). This means that an interleave of something like 25 is the minimum that you can get away with without loss of performance. And just as the Cray 1 "has to have" an interleave of four to keep the processor busy on memory operations, there are a number of reasons for increasing the interleave far beyond the minimum (e.g. the processor hitting the memory interleave, multiple processors with one memory, independent I/O processors). I have also heard rumors to the effect that Cray has broken down and put multiport memory in. These high bandwidth memories are very expensive, and I think there is an argument for avoiding them.

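The interleave figures above follow from dividing the memory cycle time by the processor clock period. The sketch below just repeats that arithmetic using the clock and memory numbers quoted in the post (which are themselves hedged there). The function name and the `refs_per_clock` parameter are hypothetical, added to show how extra demand such as multiple processors on one memory pushes the minimum up.

```python
import math

def min_interleave(memory_cycle_ns, clock_ns, refs_per_clock=1):
    """Smallest number of memory banks that can accept `refs_per_clock`
    references every clock without a bank being asked again before it
    has finished its previous cycle (assumes perfectly spread addresses)."""
    return math.ceil(refs_per_clock * memory_cycle_ns / clock_ns)

if __name__ == "__main__":
    # Figures as quoted in the post, not independently checked.
    print("Cray 1:", min_interleave(50, 12.5))
    print("C-2   :", min_interleave(100, 4))
```

With the quoted figures this gives 4 for the Cray 1 and 25 for the C-2. It is a lower bound that assumes references are spread evenly over the banks, which is one reason real machines use a much higher degree of interleave.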
Note that this argument, by itself, is good for arguing for anything from the 10Kflop processor that the Connection Machine uses up to the 10MIPS processors that Brooks advocates, but that it becomes hard to justify a 1Gflop processor simply because of the high cost of the memory system.

I have some specific questions for Dr. Brooks (which I suspect he has good answers for): In the discussion of the architecture which you advocate, we have been concentrating on the bandwidth considerations. I am now wondering about latency issues. What is the latency of your shared memory? (If you are running a 10MIPS processor, does that mean that a memory operation on shared memory can complete in time for the next memory operation to continue?) Do you use pipelining to help deal with any such latency problems? If you do, how "full" and "deep" does the pipeline have to be to keep the processor busy? Is there local memory for each processor, and does it have lower latency than the global memory?

--
bradley@think.uucp (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA