Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!uxa.cso.uiuc.edu!afgg6490 From: afgg6490@uxa.cso.uiuc.edu Newsgroups: comp.arch Subject: Cyrix - Fast Divide, etc. Message-ID: <112400012@uxa.cso.uiuc.edu> Date: 12 Dec 89 10:38:37 GMT Lines: 67 Nf-ID: #N:uxa.cso.uiuc.edu:112400012:000:3454 Nf-From: uxa.cso.uiuc.edu!afgg6490 Dec 11 11:40:00 1989 comp.arch might be interested in a few details about the Cyrix math chip. This is a 387 / Weitek compatible chip, boasting significant speedups. (Biggest speedup, of course, is avoiding the coprocessor interface). FP add features a normalization estimator, that seems to be a signed-digit adder (carry free) followed by a leading zero counter. This is done in parallel with the carry propagate mantissa addition. The normalization estimate yields a shift count that is immediately used to set up a variable shifter for the mantissa sum, and gets it within 1 bit shift of the correctly normalized result. The biggest (literally) feature of the chip is a 69x17 multiplier array, Internally this uses signed digit representation; the article also implies that the 17 bits are signed, but a call to Cyrix did not confirm this. Using this array IEEE extended multiplication is done in 4 cycles, but quite a few more cycles are required to fetch and store data, denorm check, etc. Overhead almost dominates actual computation. Most interesting is Cyrix's divide algorithm. Instead of a 2 or 4 bit digit selection, such as have been described extensively elsewhere (usually with the 4 bit, radix 16, implemented as two cascaded radix 4 stages with differing amounts of overlap), Cyrix uses the 69x17 multiplier for a 17 bit (radix 128K) digit selection. To do this, they use a 20 bit (I think it's really only 19) bit approximation to 1/X. They multiply the partial remainder at each cycle to estimate the next 17 bit quotient digit Q=R*(1/X). Then they back-multiply to obtain a true partial remainder R' = R-X*Q (I conjecture that last step). The 20 bit approximation to 1/X is obtained by Newton Raphson, beginning with an 8 bit lookup value, and 3 iterations. Signed digit multiplication allows negative quotient bits to be used. The quotient digits are stored in redundant form, and subsequently converted to non-redundant form; Cyrix does not appear to use the on-the-fly conversion to non-redundant form that Lang, Ercegovac describe, and Fandrianto uses in his dividers. I think that Cyrix's use of the multiplier will soon be emulated (although not necessarily the same algorithm, which Matula is patenting). Hybrid Newton-Raphson/Digit Selectoiion algorithms may well be used, with NR up to the multiplier width, followed by a multiplicative digit selection thereafter. Alternatively, Lang and Ercegovac's division that uses rounding to select quotient digits, after a normalization process that is implicitly multiplication by a 1/X approximation, can be used. Or, with full multiplier arrays, NR to compute 1/X, followed by an evaluation of the remainder using a full multiplier, in order to get IEEE exact rounding, could be used. IE. the return to "long division" methods was in part motivated by IEEE exact rounding, which full multipliers make possible for faster division algorithms. Reference: "Advancing the Standard in Floating Point Performance" Tom Brightman, Cyrix Corp, Richardson Texas, High Performance Systems, November 1989 Cyrix makes reports on accuracy, etc. available on request; I am awaiting mine... --- PS. I have finally finished my "survey" of computer arithmetic. I'll put the bibliography up here over Xmas, after I've cleaned it up a bit. The report is very long and verbose, I haven't had time to clean it up, so I'm going to hang onto it for a while.