Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.arch
Subject: Re: Integer multiply and killer micros
Message-ID: <2810@yogi.oakhill.UUCP>
Date: 10 Jan 90 23:52:40 GMT
References: <158@csinc.UUCP> <787@stat.fsu.edu> <42701@lll-winken.LLNL.GOV> <5842@ncar.ucar.edu> <490@qusunl.queensu.CA> <sZcrr1G00hMNI3EF5i@cs.cmu.edu> <34259@mips.mips.COM> <USER.90Jan10090700@pmax27.osf.org>
Reply-To: cs.utexas.edu!oakhill!marvin (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 49

In article <USER.90Jan10090700@pmax27.osf.org> (Michael Meissner) writes:
>In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>|        R6000:     16 cycles      32x32 -> 64
>|        R3000:     12 cycles      32x32 -> 64
>|        M88000:     4 cycles      32x32 -> 32  **
>| 
>|   **88k computes the 32 lsb's of the 64b product (upper bits are discarded).
>
>Actually if I remember chapter 7 of the 88100 user's manual, a
>multiply 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST,
>and 1 writeback).  Logically, the writeback phase should be available
>to be feed forward, which logically shaves off 1 cycle.  However,
>since non of the floating point operations do feed forwarding, I
>wouldn't be surprised if integer multiply/divide don't feed forward
>either.  As alluded to in an earlier article, multiple multiplications
>can be done in parallel, since each cycle, the multiplier advances the
>pipeline.  Floating point adds can similarly be pipelined.

I think you may have a slight parity error in your memory.  Integer multiply
on the 88100 does take 4 cycles of latency, just as Mark Johnson stated.
Single precision floating point multiply, however, takes 6 cycles of latency.

Your comments about feed forward confuse me slightly.  The latency numbers
that Motorola quotes are all in terms of when another instruction can use
the result, so they already account for feed forward.  Perhaps the source of
your confusion is that double precision results have one extra clock of
latency when feeding forward into store double instructions, because of
implementation considerations.  Double precision results do feed forward
with no delay to other double precision operations.

Pipelining was a key point of the 88100 design.  Single precision multiply
and add, along with integer multiply, are fully pipelined.  Double precision
operations are pipelined except for stalling for one clock on either end of
the pipe.  This allows very high throughput on well-scheduled code.  Current
compilers do very little code scheduling for floating point (at least in the
code I have looked at).  When compilers begin to take advantage of the 88100
floating point pipelines there will be a jump in performance.  The current
compilers are already making progress, which will be seen in the newest
release of the SPEC benchmark numbers when they are published.  There is
still much room for improvement.
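
To make the difference between latency and throughput concrete, here is a
rough back-of-the-envelope model in Python.  It is only a sketch, not a
description of the actual 88100 hardware: it assumes one instruction issued
per clock, ignores everything except the multiplier, and leaves out the
double precision stalls mentioned above.  The function and constant names
are made up for illustration; only the 4 and 6 cycle latencies come from
the numbers quoted in this thread.

```python
# Latency figures quoted in this thread (in clocks).
INT_MUL_LATENCY = 4      # 88100 integer multiply
SP_FMUL_LATENCY = 6      # 88100 single precision floating point multiply


def cycles_dependent(n_ops, latency):
    """Clocks for a chain where each multiply needs the previous result.

    Nothing can overlap, so the full latency is paid every time.
    """
    return n_ops * latency


def cycles_pipelined(n_ops, latency, issue_interval=1):
    """Clocks for independent multiplies on a pipelined unit.

    Assumes (hypothetically) that a new operation can enter the pipe every
    issue_interval clocks.  The first result appears after `latency` clocks
    and each later result follows issue_interval clocks behind it.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, "integer multiplies:",
              cycles_dependent(n, INT_MUL_LATENCY), "clocks if dependent,",
              cycles_pipelined(n, INT_MUL_LATENCY), "clocks if independent")
```

Under these assumptions a well scheduled stream of independent multiplies
approaches one result per clock no matter what the latency is, which is why
code scheduling matters so much more than the raw 4 versus 6 cycle figures.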

Marvin Denman
Motorola 88000 Design

cs.utexas.edu!oakhill!marvin