Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!usc!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.arch Subject: Re: Divide in 1 cycle (was Re: standard extensions) Keywords: Reciprocal,Algorithm Message-ID: <1991Mar10.201021.22058@bingvaxu.cc.binghamton.edu> Date: 10 Mar 91 20:10:21 GMT References: <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> <777@spim.mips.COM> <1991Mar8.110801.20042@bingvaxu.cc.binghamton.edu> Organization: State University of New York at Binghamton Lines: 181 My earlier mumblings about ``interpolation'' & ``table-lookups not being quite as expensive as all that'' have resulted in a modest amount of email asking for clarification. So after a bit of experimentation I am posting yet _another_ reciprocal routine which attempts to illustrate some of the ideas to which I alluded. This present routine uses linear interpolation (after some initial adjustment of the argument) to find the reciprocal of an unsigned 16-bit number. It uses 512 words of table storage. It is based, again, on an approximate expansion for 1/(L a+b) = 1/L 1/a (1 - 1/L 1/a b) = 1/L (A[a] - B[a] (1/L b)) where A[a] = 1/a and B[a] = 1/a^2 It is convenient to choose L = 2^k :-). Over the argument range the absolute error in the result is small enough to make any division accurate (i.e. the largest abs err * largest numerator < 1 bit). Since the scheme only involves (c) a `small' multiply (a) a subtract (b) two `small' shifts it is slightly :-) cheaper in terms of h/w than the scheme I posted previously. It again could perform at the rate of one result per cycle with a suitable tree multiplier and shifters. The accuracy of the calculation can be improved somewhat if the parameters A[a] and B[a] are displaced; I have tried a simple regression to minimize the square error, and that seems satisfactory; but for the numerical purists a minimax procedure might be better. Although obviously a toy, expansion to your preference in wordsize is straightforward (if expensive). -kym Since the abs error was originally largest for a power of two, I have included a shift in the program to remove trailing 0's in the initial argument; this would probably be reduced to a single-bit shift in a hardware version, or eliminated entirely if the table entries A[a] and B[a] were adjusted appropriately. The program also includes a possible single Newton-Raphson correction step; but it does not seem to be required for the purposes of implementing integer division. C Code ahead. --- program --- #include #include #include /* 1/(La+b) = 1/L 1/a (1 + 1/L 1/a b)^-1 =about 1/L 1/a (1 - 1/L 1/a b) = 1/L 1/a - 1/L^2 1/a^2 b = 1/L A(a) - 1/L^2 B(a) b Good if L = 2^k. */ #define MAX 0xffff #define NTAB (1<<8) float atab[NTAB],btab[NTAB]; int mask[10]; float shift[16]; float recip0(x) { int a; unsigned short b; unsigned char m=0; unsigned char n; /* trim trailing 0's from initial argument */ for(a=x; (a&1)==0; a>>=1, m++); /* top `nbits' of x -> a ; rest -> b */ for(n=0; a>=NTAB; a>>=1, n++); assert(n<=10); b=x & mask[n]; m += n; assert(m<=15); return shift[m]*(atab[a] - shift[n]*btab[a]*b); } #ifdef NEWTON float recip(x) { /* perform single Newton-Raphson correction step */ float r = recip0(x); r *= 2 - r*x; return r; } #else /* perform no Newton-Raphson correction step */ #define recip(x) recip0(x) #endif main() { int x; float mine,std; float abs_err,rel_err; float prod_rel_err=1,sum_abs_err=0; float max_abs_err=0,min_abs_err=1e38; int min_abs_err_x= -1,max_abs_err_x= -1; float max_rel_err=0,min_rel_err=1e38; int min_rel_err_x= -1,max_rel_err_x= -1; init(); for(x=1; x<=MAX; x++) { std=1.0/x; mine=recip(x); assert(mine>=0); /* just in case... */ abs_err=fabs(mine-std); sum_abs_err+=abs_err; rel_err=abs_err/std; prod_rel_err *=rel_err; if(abs_err>max_abs_err) max_abs_err=abs_err,max_abs_err_x=x; else if(abs_errmax_rel_err) max_rel_err=rel_err,max_rel_err_x=x; else if(rel_err