Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!uxa.cso.uiuc.edu!afgg6490
From: afgg6490@uxa.cso.uiuc.edu
Newsgroups: comp.arch
Subject: Cyrix - Fast Divide, etc.
Message-ID: <112400012@uxa.cso.uiuc.edu>
Date: 12 Dec 89 10:38:37 GMT
Lines: 67
Nf-ID: #N:uxa.cso.uiuc.edu:112400012:000:3454
Nf-From: uxa.cso.uiuc.edu!afgg6490    Dec 11 11:40:00 1989


comp.arch might be interested in a few details about the
Cyrix math chip.  This is a 387 / Weitek compatible chip,
boasting significant speedups.  (Biggest speedup, of course,
is avoiding the coprocessor interface).

FP add features a normalization estimator, that seems to
  be a signed-digit adder (carry free) followed by a 
  leading zero counter.  This is done in parallel with the 
  carry propagate mantissa addition. The normalization estimate
  yields a shift count that is immediately used to set up a
  variable shifter for the mantissa sum, and gets it within
  1 bit shift of the correctly normalized result.

The biggest (literally) feature of the chip is a 69x17 multiplier
  array,  Internally this uses signed digit representation;
  the article also implies that the 17 bits are signed, but a
  call to Cyrix did not confirm this.  Using this array IEEE
  extended multiplication is done in 4 cycles, but quite a few more
  cycles are required to fetch and store data, denorm check, etc.
  Overhead almost dominates actual computation.

Most interesting is Cyrix's divide algorithm.  Instead of a 2 or 4 bit
digit selection, such as have been described extensively elsewhere
(usually with the 4 bit, radix 16, implemented as two cascaded radix 4
stages with differing amounts of overlap), Cyrix uses the 69x17 multiplier
for a 17 bit (radix 128K) digit selection.   To do this, they
use a 20 bit (I think it's really only 19) bit approximation to 1/X.
They multiply the partial remainder at each cycle to estimate the
next 17 bit quotient digit Q=R*(1/X).  Then they back-multiply to
obtain a true partial remainder  R' = R-X*Q (I conjecture that last step).
The 20 bit approximation to 1/X is obtained by Newton Raphson,
beginning with an 8 bit lookup value, and 3 iterations.
Signed digit multiplication allows negative quotient bits to be used.
The quotient digits are stored in redundant form, and subsequently
converted to non-redundant form; Cyrix does not appear to use
the on-the-fly conversion to non-redundant form that Lang, Ercegovac
describe, and Fandrianto uses in his dividers.
    I think that Cyrix's use of the multiplier will soon be emulated
(although not necessarily the same algorithm, which Matula is patenting).
Hybrid Newton-Raphson/Digit Selectoiion algorithms may well be used, with NR
up to the multiplier width, followed by a multiplicative digit selection
thereafter.  Alternatively, Lang and Ercegovac's division that uses
rounding to select quotient digits, after a normalization process that
is implicitly multiplication by a 1/X approximation, can be used. Or,
with full multiplier arrays, NR to compute 1/X, followed by an evaluation
of the remainder using a full multiplier, in order to get IEEE exact
rounding, could be used.   IE. the return to "long division" methods
was in part motivated by IEEE exact rounding, which full multipliers
make possible for faster division algorithms.


Reference:

	"Advancing the Standard in Floating Point Performance"
	Tom Brightman, Cyrix Corp, Richardson Texas,
	High Performance Systems, November 1989

	Cyrix makes reports on accuracy, etc. available on request;
	I am awaiting mine...

---

PS. I have finally finished my "survey" of computer arithmetic.
   I'll put the bibliography up here over Xmas, after I've cleaned
   it up a bit.   The report is very long and verbose, I haven't had
   time to clean it up, so I'm going to hang onto it for a while.