Path: utzoo!attcan!uunet!wuarchive!udel!rochester!pt.cs.cmu.edu!f.gp.cs.cmu.edu!ali From: ali@f.gp.cs.cmu.edu (Ali-Reza Adl-Tabatabai) Newsgroups: comp.arch Subject: Re: standard extensions Keywords: divide, multiply Message-ID: <12180@pt.cs.cmu.edu> Date: 2 Mar 91 15:43:49 GMT References: <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> <2124@cluster.cs.su.oz.au> <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu> <1991Feb27.183718.638@jetsun.weitek.COM> <1991Feb27.220856.4067@oakhill.sps.mot.com Organization: Carnegie-Mellon University, CS/RI Lines: 74 In article <1991Feb27.220856.4067@oakhill.sps.mot.com> wca@oakhill.sps.mot.com (William Anderson) writes: >weaver@jetsun.weitek.COM (Mike Weaver) writes: > >>Does anyone know of a pipelined divider that gives a result every cycle? > >I was under the impression that HP had designed and implemented a multi- >chip FPU with a (partially?) pipelined divider. It used a model division >(similar to SRT) on every row. See the proceedings of the 1986 ISSCC (p. 34) >for a description (and die photo) of this divider as well as a clever tree >multiplier. Judging by this article, HP could cycle the divider only >half as fast (420 nS) as the fully pipelined multiplier (210 nS). My, >how the clockrates have changed in 5 years.... > >> ... Also, the actual number of transistors is horrendous -- it would >>take perhaps ten times as many transistors as an array multiplier, which >>is a large thing. > >The numbers listed for the (NMOS) parts mentioned above were ~153K >transistors for the multiplier and ~160K for the divider - not an order >of magnitude by any meams. Let us compare a radix-4 division algorithm with residuals (partial remainders) in carry save form with a radix-4 multiplication algorithm that keeps it's partial products in CS form. A division step consists of (1) quotient digit selection (2) divisor multiple selection (2) carry save addition to produce next partial remainder A multiplication step consists of (1) multiplication of recoded multiplier digit with multiplicand (2) carry save addition to produce next partial product The only difference in logic between the two is the quotient digit selection, which can be done based on an estimate of the partial remainder. This can be accomplished using a limited precision adder (5-6 bits, independant of divider's full precision) and a PLA. Therefore if we unwind the steps to produce an array implementation, there should hardly be a 10 times difference in the number of transistors. As the precision becomes larger, the difference diminishes since the size of the quotient digit selection logic will remain the same. The speed, however, is a different story, since in the muliplier the recoding and multiplicand selection can be done in parallel followed by a propogation through the adder array. In the divider, each step depends on the previous partial remainder, so the path length is longer. Therefore, in a pipelined implementation, for a given pipeline step time, you may do much more of the multiplication than of the division. Hence you may require more latches for the divider. Note that I did not take into account the last carry-assimilation step for the muliplier and divider, and I did not discuss recoding of the redundant quotient digits into conventional form. After the last multiplication step, the most significant half of the product is in carry-save form and needs to be passed through a carry-assimilate adder. The last partial remainder of the divider will also be in carry-save form. This introduces an error in the last bit of the quotient, so that it must also be assimilated for precise rounding. The conversion of the redundant quotient to conventional form may be done on-the-fly as the digits are produced. See 'On-the-fly Conversion of redundant into conventional representations', IEEE Trans. on comp. July 1987 by Ercegovac and Lang. > >William Anderson >Motorola 88K Design Group >Motorola MMTG >Austin, TX Ali-Reza Adl-Tabatabai (ali@cs.cmu.edu)