Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!philabs!cmcl2!gottlieb
From: gottlieb@cmcl2.UUCP (Allan Gottlieb)
Newsgroups: net.arch
Subject: Re: How Many ... (really NYU Ultracomputer)
Message-ID: <5653@cmcl2.UUCP>
Date: Wed, 30-Apr-86 10:31:02 EDT
Article-I.D.: cmcl2.5653
Posted: Wed Apr 30 10:31:02 1986
Date-Received: Sat, 3-May-86 01:20:04 EDT
References: <2089@peora.UUCP> <5100058@ccvaxa> <2120@peora.UUCP>
Reply-To: gottlieb@cmcl2.UUCP (Allan Gottlieb)
Organization: New York University, Ultracomputer project
Lines: 58

In article <2120@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>On the other hand, you can eliminate this problem by putting the translation
>hardware out at the memory (which I believe is what was done by
>Gottlieb et. al. in their supercomputer project, along with also putting
>some adders and so on out there), but then you only have one of them,
>which means it has to be very fast to avoid a bottleneck.  I recall reading
>a comment by Gottlieb about that fairly recently, where he was saying he
>wished he'd put his memory management at the processors instead.
>

This is not quite right.  It is certainly correct that the NYU
Ultracomputer architecture specifies adders at the MMs (memory
modules).  This is done to provide an atomic implementation of our
fetch-and-add operation (which we believe is an important coordination
primitive).  We are very concerned about bottlenecks and thus specify
that the number of MMs grows with the number of PEs (processing
elements).  In the nominal design an omega network is used to connect
2^D PEs to 2^D MMs.  The network is enhanced with VLSI switches we
have designed that combine simultaneous references (including
fetch-and-adds) directed at the same memory location.  We have
implemented bottleneck free algorithms for task management, memory
management, and parts of the I/O system calls.
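
The fetch-and-add primitive and the combining of simultaneous
references described above can be illustrated with a small Python
sketch.  This is a toy model under stated assumptions, not the
Ultracomputer hardware or its software: the names MemoryModule,
combined_fetch_and_add and claim_slots are hypothetical, a lock stands
in for the adder at the MM, and combining is shown for a whole batch of
requests at once instead of inside individual switches.

```python
import threading


class MemoryModule:
    """Toy memory module (MM): the adder lives here, so the update is atomic."""

    def __init__(self):
        self._cells = {}
        # The lock stands in for the MM serving one request at a time.
        self._lock = threading.Lock()

    def fetch_and_add(self, addr, delta):
        """Atomically add delta to the cell at addr and return the OLD value."""
        with self._lock:
            old = self._cells.get(addr, 0)
            self._cells[addr] = old + delta
            return old

    def read(self, addr):
        with self._lock:
            return self._cells.get(addr, 0)


def combined_fetch_and_add(mm, addr, deltas):
    """Hypothetical combining step for requests that meet in the network.

    All requests for the same address are merged into ONE memory request.
    The single reply is then split so that each requester sees the value
    it would have seen had the requests been served one after another.
    """
    base = mm.fetch_and_add(addr, sum(deltas))
    replies = []
    running = base
    for delta in deltas:
        replies.append(running)
        running += delta
    return replies


def claim_slots(num_pes=4, per_pe=100):
    """Each simulated PE claims unique slot numbers from one shared counter."""
    mm = MemoryModule()
    claimed = [[] for _ in range(num_pes)]

    def worker(pe):
        for _ in range(per_pe):
            claimed[pe].append(mm.fetch_and_add(0, 1))

    threads = [threading.Thread(target=worker, args=(pe,))
               for pe in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    all_slots = sorted(slot for slots in claimed for slot in slots)
    return all_slots, mm.read(0)
```

In this sketch every simulated PE gets a distinct slot number without
any PE holding a lock of its own, which is the property that makes
fetch-and-add useful for coordination.  The combining function shows
why merging costs nothing semantically: for increments 1, 2 and 3
against a cell holding 0, one memory request is issued and the three
replies are 0, 1 and 3, the same as some serial order would produce.
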

Since the specified network is buffered, circuit switched, and
pipelined, the bandwidth grows linearly with the number of PEs and thus
will not prove to be limiting.  However, the latency grows as D (i.e.
log #PE), and it is not trivial to do enough prefetching to mask the
latency.  For this reason it is essential to put the cache on the PE
side of the network.  Moreover, it is important to minimize network
traffic, as otherwise queues in the switches become nonempty, further
increasing the latency.

The memory management of our current prototype (8 PEs, bus based) is on
the PE side of the network.  Note that we do not support demand
paging.  A soon-to-be-completed thesis by Pat Teller studies the demand
paging issue, especially the problem of evicting shared pages.  Since
we consider it important to have the TLB on the PE side of the network,
the design "locks" TLB-resident pages in the MMs.

Finally, let me add that IBM Research plans to build a 512 PE RP3
(Research Parallel Processor Prototype) whose architecture includes
all the (older) Ultracomputer architecture as well as significant IBM
enhancements, especially in memory management.  In the RP3, memory
management is also done on the PE side for those requests that traverse
the network (their memory management enhancement permits certain cache
misses to go directly to the MM physically associated with the issuing
PE without using the network).  RP3 does not specify demand paging
either.

The real problem here is that neither the RP3 nor Ultra projects have
produced the I/O systems needed.  That is, RP3 is 1000 MIPS for good
(but reasonable) cache hit ratios but does not have anything close to
1000 times the I/O of a Vax-780.
-- 
Allan Gottlieb
GOTTLIEB@NYU
{floyd,ihnp4}!cmcl2!gottlieb   <---the character before the 2 is an el