Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uunet!stanford.edu!neon.Stanford.EDU!rfrench
From: rfrench@neon.Stanford.EDU (Robert S. French)
Subject: Re: iWarp Architecture Overview (LONG)
Message-ID: <1991Jun3.222759.19132@neon.Stanford.EDU>
Keywords: iWarp, systolic, parallel
Organization: Computer Science Department, Stanford University, Ca , USA
References: <1991Jun3.172230.6901@iWarp.intel.com>
Date: Mon, 3 Jun 1991 22:27:59 GMT
Lines: 51


First of all, let me thank Jim for his (long) overview of the iWarp
architecture.  I'm sure it will help many people who don't know what
in the heck we're talking about :-)

There are some questions I have about the iWarp component, though,
specifically about performance.  The iWarp was designed as a
high-powered systolic processor, and thus provides all sorts of neat
communications capabilities.  However, it also needs good integer and
FP support in order to sustain processing rates.  There are a number
of oddities that I noticed in the iWarp specs:

The FP adder takes 2 cycles (SP) or 4 cycles (DP) for all operations
and isn't pipelined, which is pretty much OK considering the short
cycle times.

The FP multiplier takes the same 2 (SP) or 4 (DP) cycles for
multiplication, but performance isn't nearly as impressive on
operations such as division.  For
example, an SP division takes 15-16 clocks, and a DP division takes 31
clocks.  If you'll forgive me for comparing apples and oranges, a MIPS
R3010 can do the same in 12 and 19 cycles, respectively, and can
maintain a higher clock rate.  Likewise, an SP remainder takes "no more
than 162 clocks", and a DP remainder takes "no more than 1,087
clocks", an incredibly long time, although I must admit I've never
personally seen an application that uses FP remainder.  In addition,
considering that throughput is a major goal, it seems unfortunate that
the FP multiplier isn't pipelined.
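
To put rough numbers on the pipelining point, here is a quick C sketch
of the cycle counts for a run of independent operations.  The
"pipelined" case is purely hypothetical: it assumes a unit that can
start a new operation every cycle with no dependencies between
operations, which is not something the iWarp specs describe.  The
function names are mine.

```c
#include <assert.h>

/* Total cycles for n back-to-back independent operations on a unit
   that is NOT pipelined: each op must finish before the next starts. */
long cycles_unpipelined(long n, long latency)
{
    assert(n >= 0 && latency > 0);
    return n * latency;
}

/* Same work on a HYPOTHETICAL fully pipelined unit (assumption: one
   new op can issue every cycle, and no op depends on another). */
long cycles_pipelined(long n, long latency)
{
    assert(n >= 0 && latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}
```

With the 2-cycle SP latency quoted above, 1000 independent multiplies
come to 2000 cycles unpipelined versus 1001 in the idealized pipelined
case; at the 4-cycle DP latency it is 4000 versus 1003.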

The arithmetic unit does most operations in 1 cycle, except that it
doesn't support integer multiply or divide.  You have to use the FP
multiplier for integer multiply (3 cycles), and there doesn't appear
to be any way to do an integer divide at all (convert to FP, divide,
convert back?).  This has the added problem that you can't do an
integer multiply (such as for a multi-dimensional array access) and an
FP multiply or divide at the same time, which I think severely limits
the applicability of the compute&access (C&A) instruction.
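
For what it's worth, here is roughly what I mean by "convert to FP,
divide, convert back", written as plain C rather than iWarp code.  It
is only a sketch: it assumes 32-bit integers and IEEE double
precision (a single-precision significand is too short to hold every
32-bit integer exactly), and the function names are made up for
illustration.  The second function is the kind of index arithmetic
that needs the integer multiply in the first place.

```c
#include <assert.h>
#include <stdint.h>

/* Integer divide by way of the FP unit: convert, divide, convert back.
   Assumes IEEE double precision, which represents every 32-bit
   integer exactly. */
int32_t idiv_via_fp(int32_t a, int32_t b)
{
    assert(b != 0);                       /* no divide by zero */
    assert(!(a == INT32_MIN && b == -1)); /* quotient would overflow */
    double q = (double)a / (double)b;
    return (int32_t)q;                    /* conversion truncates toward zero */
}

/* Row-major offset of element [i][j] in an array with ncols columns.
   This is the integer multiply that competes with FP work for the
   multiplier. */
int32_t index2d(int32_t i, int32_t j, int32_t ncols)
{
    return i * ncols + j;
}
```

For example, idiv_via_fp(-7, 2) gives -3, matching C's truncating
integer division, and index2d(3, 4, 10) gives 34.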

The iWarp has more support for byte and bit-level operations than any
processor I've seen in a long time.  For example, you can reference
the individual bytes of a register as the source or destination for
any arithmetic operation, and you can count bits, set/reset bits, find
the first set bit, etc.  These operations seem odd in a processor
designed for high-powered floating point performance (this is, after
all, why the C&A instruction can do one FPA and one FPM instruction
and two memory ops).  It seems to me that the effort and chip area
devoted to these functions would have been better spent on an integer
multiplier, an integer divider, and a pipelined FPM unit.
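
For anyone unsure which operations I mean, here are plain C
equivalents of the bit and byte functions mentioned above.  These are
not iWarp instructions or mnemonics, just a sketch of what each one
computes on a 32-bit word.  The names are mine, and I am assuming
"first set bit" counts from the least significant end and that byte 0
is the low-order byte.

```c
#include <assert.h>
#include <stdint.h>

/* Count the bits that are set in x. */
unsigned count_bits(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the lowest set bit, or -1 if x is zero. */
int first_set_bit(uint32_t x)
{
    for (int i = 0; i < 32; i++) {
        if (x & (UINT32_C(1) << i))
            return i;
    }
    return -1;
}

/* Set or reset a single bit (bit 0 is least significant). */
uint32_t set_bit(uint32_t x, unsigned bit)
{
    assert(bit < 32);
    return x | (UINT32_C(1) << bit);
}

uint32_t reset_bit(uint32_t x, unsigned bit)
{
    assert(bit < 32);
    return x & ~(UINT32_C(1) << bit);
}

/* Extract byte n of a 32-bit word (byte 0 is the low-order byte). */
uint8_t get_byte(uint32_t x, unsigned n)
{
    assert(n < 4);
    return (uint8_t)(x >> (8 * n));
}
```

Each of these takes a loop or a shift-and-mask sequence on a machine
without direct support, which is the work the dedicated operations
save.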

Just some thoughts...

			Rob