Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!ucbvax!iWarp.intel.com!news
From: jsutton@iWarp.intel.com (Jim Sutton)
Newsgroups: comp.arch
Subject: Re: iWarp Architecture Overview (LONG)
Keywords: iWarp, systolic, parallel
Message-ID: <1991Jun4.204251.4070@iWarp.intel.com>
Date: 4 Jun 91 20:42:51 GMT
References: <1991Jun3.172230.6901@iWarp.intel.com>
Sender: news@iWarp.intel.com
Organization: Intel iWarp, Beaverton, Oregon, USA
Lines: 83
Nntp-Posting-Host: r3.iwarp.intel.com

rfrench@neon.Stanford.EDU (Robert S. French) writes:
> ... In addition,
> considering that throughput is a major goal, it seems unfortunate that
> the FP multiplier isn't pipelined.

At the time the decision was made (mid '87?), pipelining the FP units
presented some severe challenges:

(1) Prior pipelined FP architectures (and ongoing work at that time)
    emphasized vector performance, but usually at the expense of scalar
    performance. We wanted high scalar performance as well.

(2) Providing seamless send/receive constructs with full *invisible*
    synchronization was an imposing challenge even in scalar instructions.
    Meshing that into a pipelined FP architecture would have added massive
    complications.

(3) The compiler development required to handle the integrated I/O would
    be enough of a challenge, without adding the complexity of pipelined
    manipulation.

Note that in early '87 the entire iWarp FP design team was only 3 engineers!

Given the knowledge and experience we have *today*, and given the proven
send/receive interface mechanisms we have *today*, I would *now* be
comfortable in specifying a pipelined FP. But that was not a viable choice
at the time.

> The arithmetic unit does most operations in 1 cycle, except that it
> doesn't support integer multiply or divide. You have to use the FP
> multiplier for integer multiply (3 cycles), and there doesn't appear
> to be any way to do an integer divide at all (convert to FP, divide,
> convert back?).
> This has the added problem that you can't do an
> integer multiply (such as for a multi-dimensional array access) and an
> FP multiply or divide at the same time, which I think severely limits
> the applicability of the compute&access instruction.

We found that virtually all of the multiplies required for multi-dimensional
array accesses occur outside the innermost loop(s). As a consequence, for
large data sets (which is iWarp's target), integer multiplies occur
infrequently enough that the cost of adding a dedicated integer multiply
unit could not be justified. Instead, we added a small amount of hardware
to the FP multiplier to allow direct multiplication of integers.

Integer divide is indeed implemented by converting to FP. Integer divide
was found to occur so infrequently in our target applications that no
special hardware cost could be justified.

One point to keep in mind when examining the iWarp architecture is that all
tradeoffs and optimizations center around the following target:

  * A tight loop (frequently a single C&A instruction) performing SP floating
  * point adds and multiplies, with 1-2 memory accesses, 1-2 sends and 1-2
  * receives per iteration.

> The iWarp has more support for byte and bit-level operations than any
> processor I've seen in a long time. For example, ...
> ... It seems to me that the effort and chip area
> devoted to these functions would have been better used building an
> integer multiplier, integer divider, and pipelining the FPM unit.

The only bit-level instructions (other than ordinary logical operations)
are bit-test/set/clear instructions. These were included to reduce the
cycles required in manipulating the communication control and status
registers. This (slightly) reduces the software overhead associated with
communications, at minimal silicon cost.

The byte and half-word operations were provided to allow efficient support
of C.
Without these operations, we face two unpleasant alternatives:

(1) Software must "promote" operands to 32-bit fields, perform the desired
    function, then "demote" the result. This adds substantial additional
    cycles, particularly if exact results are to be maintained.

(2) Define the char/short/int data types as 32-bit fields. This consumes
    substantially more memory. In ordinary systems with large amounts of
    DRAM, this may not be an issue, but iWarp's design goals required a
    very-fast (and expensive) all-SRAM memory system, which means that
    efficient memory utilization is essential.

----------------------------------------------------------------------------
Jim Sutton, Sr Staff Engineer, intel/iWarp Program    jsutton@iWarp.intel.com
5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124    (503)629-6345
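
To make alternative (1) above concrete, here is a rough sketch in plain C of
what "promote, operate, demote" costs when only 32-bit ALU operations are
available. This is not iWarp code: the function names add_u8_via_32 and
add_s8_via_32 are hypothetical, and the exact cycle counts on any real
machine would depend on its instruction set. The point is only that a
single byte add turns into several 32-bit operations once the result has to
wrap exactly as an 8-bit value would.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned 8-bit add using only 32-bit operations.
 * The operands are assumed to arrive in 32-bit registers with possibly
 * stale upper bits, so they are masked before and after the add. */
uint32_t add_u8_via_32(uint32_t a, uint32_t b)
{
    uint32_t wide_a = a & 0xFFu;     /* promote: zero-extend operand 1 */
    uint32_t wide_b = b & 0xFFu;     /* promote: zero-extend operand 2 */
    uint32_t sum = wide_a + wide_b;  /* the one "real" operation       */
    return sum & 0xFFu;              /* demote: wrap to 8 bits         */
}

/* Hypothetical helper: signed 8-bit add with two's-complement wraparound.
 * The inputs are assumed to be already sign-extended 32-bit values in the
 * range -128..127; the demote step must re-create the sign extension. */
int32_t add_s8_via_32(int32_t a, int32_t b)
{
    int32_t sum = a + b;                    /* 32-bit add, cannot overflow */
    uint32_t low = (uint32_t)sum & 0xFFu;   /* keep the low 8 bits         */
    return (int32_t)(low ^ 0x80u) - 0x80;   /* sign-extend bit 7           */
}
```

For example, add_u8_via_32(200, 100) wraps to 44 and add_s8_via_32(127, 1)
wraps to -128, matching what native byte arithmetic would give, but each
call spends three or four operations where a byte-wide add would spend one.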