Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!ucbvax!iWarp.intel.com!news
From: jsutton@iWarp.intel.com (Jim Sutton)
Newsgroups: comp.arch
Subject: Re: iWarp Architecture Overview (LONG)
Keywords: iWarp, systolic, parallel
Message-ID: <1991Jun4.204251.4070@iWarp.intel.com>
Date: 4 Jun 91 20:42:51 GMT
References: <1991Jun3.172230.6901@iWarp.intel.com>
Sender: news@iWarp.intel.com
Organization: Intel iWarp, Beaverton, Oregon, USA
Lines: 83
Nntp-Posting-Host: r3.iwarp.intel.com

rfrench@neon.Stanford.EDU (Robert S. French) writes:
>                                                    ...  In addition,
> considering that throughput is a major goal, it seems unfortunate that
> the FP multiplier isn't pipelined.

At the time the decision was made (mid '87?), pipelining the FP units
presented some severe challenges:
(1) Prior pipelined FP architectures (and ongoing work at that time)
    emphasized vector performance, but usually at the expense of scalar
    performance.  We wanted high scalar performance as well.
(2) Providing seamless send/receive constructs with full *invisible*
    synchronization was an imposing challenge even in scalar instructions.
    Meshing that into a pipelined FP architecture would have added
    massive complications.
(3) The compiler development required to handle the integrated I/O would
    be enough of a challenge, without adding the complexity of pipelined
    manipulation.
Note that in early '87 the entire iWarp FP design team was only 3 engineers!

Given the knowledge and experience we have *today*, and given the proven
send/receive interface mechanisms we have *today*, I would *now* be
comfortable in specifying a pipelined FP.  But that was not a viable choice
at the time.


> The arithmetic unit does most operations in 1 cycle, except that it
> doesn't support integer multiply or divide.  You have to use the FP
> multiplier for integer multiply (3 cycles), and there doesn't appear
> to be any way to do an integer divide at all (convert to FP, divide,
> convert back?).  This has the added problem that you can't do an
> integer multiply (such as for a multi-dimensional array access) and an
> FP multiply or divide at the same time, which I think severely limits
> the applicability of the compute&access instruction.

We found that virtually all of the multiplies required for multi-dimensional
array accesses occur outside the innermost loop(s).  As a consequence, for
large data sets (which is iWarp's target), integer multiplies occur
infrequently enough that the cost of adding a dedicated integer multiply
unit could not be justified.  Instead, we added a small amount of hardware
to the FP multiplier to allow direct multiplication of integers.
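
As an illustration of the point above, here is a minimal sketch in ordinary
C (not iWarp code; the function name sum_matrix is hypothetical).  The
row-offset multiply is done once per row, so the innermost loop contains
only adds and index increments:

```c
#include <stddef.h>

/* Sum a rows x cols matrix stored row-major in a flat array.
   The index multiply (i * cols) is executed once per row, outside
   the innermost loop. */
float sum_matrix(const float *a, size_t rows, size_t cols)
{
    float total = 0.0f;
    size_t i, j;

    for (i = 0; i < rows; i++) {
        const float *row = a + i * cols;  /* one integer multiply per row */
        for (j = 0; j < cols; j++) {
            total += row[j];              /* no multiply in the inner loop */
        }
    }
    return total;
}
```

For a 1000 x 1000 array this is 1000 integer multiplies against 1,000,000
inner-loop iterations.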

Integer divide is indeed implemented by converting to FP.  Integer divide
was found to occur so infrequently in our target applications that no special
hardware cost could be justified.
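
The convert/divide/convert-back idea can be sketched in portable C as
follows.  This is an assumption-laden illustration, not the iWarp
instruction sequence: the function name idiv_via_fp is hypothetical, and
the use of double precision is my choice so that every 32-bit operand
converts exactly (a 24-bit single-precision significand would not
guarantee that).

```c
#include <stdint.h>

/* Signed 32-bit integer divide done through floating point.
   Assumes b != 0 and excludes the overflow case a == INT32_MIN, b == -1.
   A double has a 53-bit significand, so both operands convert exactly,
   and truncating the rounded quotient matches C integer division. */
int32_t idiv_via_fp(int32_t a, int32_t b)
{
    double q = (double)a / (double)b;  /* convert, then FP divide */
    return (int32_t)q;                 /* convert back, truncating toward zero */
}
```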

One point to keep in mind when examining the iWarp architecture is that
all tradeoffs and optimizations center around the following target:
* A tight loop (frequently a single C&A instruction) performing SP floating
* point adds and multiplies, with 1-2 memory accesses, 1-2 sends and 1-2
* receives per iteration.
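
To make the shape of that target loop concrete, here is a rough sketch in
plain C.  Everything in it is hypothetical: the channel type and the
chan_receive/chan_send helpers are simple array-backed stand-ins for the
communication pathways, and mac_cell is an invented name.  It only shows
the per-iteration mix of operations, not how iWarp encodes them.

```c
#include <stddef.h>

/* Hypothetical stand-in for a communication pathway: a fixed-size,
   array-backed FIFO with no bounds checking (keep n <= 64). */
typedef struct {
    float  data[64];
    size_t head, tail;
} channel;

static float chan_receive(channel *c)          { return c->data[c->head++]; }
static void  chan_send(channel *c, float v)    { c->data[c->tail++] = v; }

/* One cell of a multiply-accumulate chain.  Per iteration:
   2 receives, 1 memory read, 1 SP multiply, 1 SP add, 2 sends. */
void mac_cell(channel *x_in, channel *sum_in,
              channel *x_out, channel *sum_out,
              const float *coeff, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        float x = chan_receive(x_in);          /* receive 1 */
        float s = chan_receive(sum_in);        /* receive 2 */
        chan_send(x_out, x);                   /* send 1: pass operand on */
        chan_send(sum_out, s + coeff[i] * x);  /* memory read, mul, add, send 2 */
    }
}
```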


> The iWarp has more support for byte and bit-level operations than any
> processor I've seen in a long time.  For example, ...
>                  ...  It seems to me that the effort and chip area
> devoted to these functions would have been better used building an
> integer multiplier, integer divider, and pipelining the FPM unit.

The only bit-level instructions (other than ordinary logical operations)
are bit-test/set/clear instructions. These were included to reduce the
cycles required in manipulating the communication control and status
registers.  This (slightly) reduces the software overhead associated
with communications, at minimal silicon cost.
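
For comparison, this is roughly what bit test/set/clear look like when
built from ordinary shifts and logical operations in C.  The helper names
and the idea of passing the register value as a plain 32-bit word are
assumptions for illustration; no actual iWarp register layout is implied.

```c
#include <stdint.h>

/* Each helper needs a shift to build or apply the mask plus a logical
   operation; a dedicated instruction folds that into one step. */
int bit_test(uint32_t reg, unsigned bit)
{
    return (int)((reg >> bit) & 1u);          /* 1 if the bit is set, else 0 */
}

uint32_t bit_set(uint32_t reg, unsigned bit)
{
    return reg | (UINT32_C(1) << bit);        /* force the bit to 1 */
}

uint32_t bit_clear(uint32_t reg, unsigned bit)
{
    return reg & ~(UINT32_C(1) << bit);       /* force the bit to 0 */
}
```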

The byte and half-word operations were provided to allow efficient support
of C.  Without these operations, we face two unpleasant alternatives:
(1) Software must "promote" operands to 32-bit fields, perform the desired
    function, then "demote" the result.  This adds substantial additional
    cycles, particularly if exact results are to be maintained.
(2) Define the char/short/int data types as 32-bit fields.
    This consumes substantially more memory: 4x for char data and 2x for
    half-word (short) data.  In ordinary systems with large amounts of
    DRAM, this may not be an issue, but iWarp's design goals required a
    very fast (and expensive) all-SRAM memory system, which means that
    efficient memory utilization is essential.
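
One reading of alternative (1), sketched in C under the assumption that
only 32-bit operations are available (the function name
add_to_packed_byte is hypothetical): updating a single byte packed in a
word takes an extract, the operation itself, a mask to keep the result
exact, and a re-insert.

```c
#include <stdint.h>

/* Add delta to byte number index (0 = least significant) of a 32-bit
   word using only 32-bit operations.  The result wraps modulo 256, as
   unsigned char arithmetic would, and never carries into a neighbor. */
uint32_t add_to_packed_byte(uint32_t word, unsigned index, uint8_t delta)
{
    unsigned shift = index * 8u;
    uint32_t value = (word >> shift) & 0xFFu;   /* "promote": extract byte */
    uint32_t sum   = (value + delta) & 0xFFu;   /* operate, keep 8 exact bits */

    word &= ~(UINT32_C(0xFF) << shift);         /* clear the old byte */
    return word | (sum << shift);               /* "demote": re-insert */
}
```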

 ----------------------------------------------------------------------------
 Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
 5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124             (503)629-6345