Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uunet!stanford.edu!neon.Stanford.EDU!rfrench
From: rfrench@neon.Stanford.EDU (Robert S. French)
Subject: Re: iWarp Architecture Overview (LONG)
Message-ID: <1991Jun3.222759.19132@neon.Stanford.EDU>
Keywords: iWarp, systolic, parallel
Organization: Computer Science Department, Stanford University, Ca , USA
References: <1991Jun3.172230.6901@iWarp.intel.com>
Date: Mon, 3 Jun 1991 22:27:59 GMT
Lines: 51

First of all, let me thank Jim for his (long) overview of the iWarp
architecture.  I'm sure it will help many people who don't know what
in the heck we're talking about :-)

There are some questions I have about the iWarp component, though,
specifically about performance.  The iWarp was designed as a
high-powered systolic processor, and thus provides all sorts of neat
communications capabilities.  However, it also needs good integer and
FP support in order to sustain high processing rates.  There are a
number of oddities that I noticed in the iWarp specs:

The FP adder takes 2 cycles (SP) or 4 cycles (DP) for all operations
and isn't pipelined, which is pretty much OK considering the short
cycle times.  The FP multiplier takes the same number of cycles for
multiplication, but performance isn't nearly as impressive on
operations such as division.  For example, an SP division takes 15-16
clocks, and a DP division takes 31 clocks.  If you'll forgive me for
comparing apples and oranges, a MIPS R3010 can do the same in 12 and
19 cycles, respectively, and can maintain a higher clock rate.
Likewise, an SP remainder takes "no more than 162 clocks", and a DP
remainder takes "no more than 1,087 clocks", an incredibly long time,
although I must admit I've never personally seen an application that
uses FP remainder.  In addition, considering that throughput is a
major goal, it seems unfortunate that the FP multiplier isn't
pipelined.
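To put a rough number on the pipelining point, here is a
back-of-the-envelope sketch.  The 2- and 4-cycle multiply latencies
are the figures quoted above; everything else is an assumption for
illustration: the function names are made up, the "pipelined" case is
a hypothetical unit that accepts one new multiply per cycle, and all
operands are assumed independent so that nothing ever stalls.

```python
# Rough cycle-count model, not a description of any real iWarp unit.
# Assumption: independent operations, no stalls, no other bottlenecks.

def unpipelined_cycles(n_ops, latency):
    """Each operation must finish before the next one can start."""
    return n_ops * latency


def pipelined_cycles(n_ops, latency):
    """Hypothetical unit that can start one new operation per cycle."""
    if n_ops == 0:
        return 0
    # The first result appears after `latency` cycles, then one per cycle.
    return latency + (n_ops - 1)


if __name__ == "__main__":
    n = 100  # arbitrary number of back-to-back multiplies
    for name, latency in (("SP", 2), ("DP", 4)):
        print(f"{name}: {unpipelined_cycles(n, latency)} cycles unpipelined, "
              f"{pipelined_cycles(n, latency)} cycles if pipelined")
```

Under those assumptions, 100 back-to-back SP multiplies cost 200
cycles as specified versus 101 with a pipelined unit, and the DP case
is 400 versus 103.  Real code with dependent operations would see a
smaller gap.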
The arithmetic unit does most operations in 1 cycle, except that it
doesn't support integer multiply or divide.  You have to use the FP
multiplier for integer multiply (3 cycles), and there doesn't appear
to be any way to do an integer divide at all (convert to FP, divide,
convert back?).  This has the added problem that you can't do an
integer multiply (such as for a multi-dimensional array access) and an
FP multiply or divide at the same time, which I think severely limits
the applicability of the compute&access instruction.

The iWarp has more support for byte and bit-level operations than any
processor I've seen in a long time.  For example, you can reference
the individual bytes of a register as the source or destination for
any arithmetic operation, and you can count bits, set/reset bits, find
the first set bit, etc.  These operations seem odd in a processor
designed for high-powered floating point performance (this is, after
all, why the C&A instruction can do one FPA and one FPM instruction
and two memory ops).  It seems to me that the effort and chip area
devoted to these functions would have been better spent on an integer
multiplier, an integer divider, and a pipelined FPM unit.

Just some thoughts...

		Rob