Path: utzoo!attcan!uunet!wuarchive!udel!rochester!pt.cs.cmu.edu!f.gp.cs.cmu.edu!ali
From: ali@f.gp.cs.cmu.edu (Ali-Reza Adl-Tabatabai)
Newsgroups: comp.arch
Subject: Re: standard extensions
Keywords: divide, multiply
Message-ID: <12180@pt.cs.cmu.edu>
Date: 2 Mar 91 15:43:49 GMT
References: <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> <2124@cluster.cs.su.oz.au> <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu> <1991Feb27.183718.638@jetsun.weitek.COM> <1991Feb27.220856.4067@oakhill.sps.mot.com
Organization: Carnegie-Mellon University, CS/RI
Lines: 74

In article <1991Feb27.220856.4067@oakhill.sps.mot.com> wca@oakhill.sps.mot.com (William Anderson) writes:
>weaver@jetsun.weitek.COM (Mike Weaver) writes:
>
>>Does anyone know of a pipelined divider that gives a result every cycle?
>
>I was under the impression that HP had designed and implemented a multi-
>chip FPU with a (partially?) pipelined divider.  It used a model division
>(similar to SRT) on every row.  See the proceedings of the 1986 ISSCC (p. 34)
>for a description (and die photo) of this divider as well as a clever tree
>multiplier.  Judging by this article, HP could cycle the divider only
>half as fast (420 nS) as the fully pipelined multiplier (210 nS).  My,
>how the clockrates have changed in 5 years....
>
>> ... Also, the actual number of transistors is horrendous -- it would
>>take perhaps ten times as many transistors as an array multiplier, which
>>is a large thing.
>
>The numbers listed for the (NMOS) parts mentioned above were ~153K
>transistors for the multiplier and ~160K for the divider - not an order
>of magnitude by any meams.

Let us compare a radix-4 division algorithm with residuals (partial 
remainders) in carry save form with a radix-4 multiplication algorithm 
that keeps it's partial products in CS form.  A division step consists
of 
	(1) quotient digit selection
	(2) divisor multiple selection
	(2) carry save addition to produce next partial remainder

A multiplication step consists of

	(1) multiplication of recoded multiplier digit with multiplicand
	(2) carry save addition to produce next partial  product

The only difference in logic between the two is the quotient
digit selection, which can be done based on an estimate of
the partial remainder.  This can be accomplished using a limited
precision adder (5-6 bits, independant of divider's full precision)
and a PLA.  Therefore if we unwind the steps to produce an array
implementation, there should hardly be a 10 times difference
in the number of transistors.  As the precision becomes larger,
the difference diminishes since the size of the quotient digit
selection logic will remain the same.

The speed, however, is a different story, since in the muliplier
the recoding and multiplicand selection can be done in parallel
followed by a propogation through the adder array.  In the divider,
each step depends on the previous partial remainder, so the path
length is longer.  Therefore, in a pipelined implementation, for
a given pipeline step time, you may do much more of the multiplication
than of the division.  Hence you may require more latches for
the divider.

Note that I did not take into account the last carry-assimilation
step for the muliplier and divider, and I did not discuss
recoding of the redundant quotient digits into conventional
form.  After the last multiplication step, the most significant
half of the product is in carry-save form and needs to be 
passed through a carry-assimilate adder.  The last partial 
remainder of the divider will also be in carry-save form.  This
introduces an error in the last bit of the quotient, so that
it must also be assimilated for precise rounding.  The conversion
of the redundant quotient to conventional form may be done
on-the-fly as the digits are produced.  See 'On-the-fly Conversion
of redundant into conventional representations', IEEE Trans. on
comp. July 1987 by Ercegovac and Lang.

>
>William Anderson
>Motorola 88K Design Group
>Motorola MMTG
>Austin, TX

Ali-Reza Adl-Tabatabai	(ali@cs.cmu.edu)