Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.arch Subject: Re: Divide in 1 cycle (was Re: standard extensions) Message-ID: <1991Mar8.110801.20042@bingvaxu.cc.binghamton.edu> Date: 8 Mar 91 11:08:01 GMT References: <1991Mar6.183927.16256@news.arc.nasa.gov> <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> <777@spim.mips.COM> Organization: State University of New York at Binghamton Lines: 136 In article <777@spim.mips.COM> mash@mips.com (John Mashey) writes: >You can make integer divison very expensive by throwing zero hardware at it >(SPARC), somewhat expensive by throwing a little bit of hardware >at it, or somewhat less expensive by throwing more hardware at it >(R3000, 88K, etc). >No matter what you do with CPU logic, you are NOT likely to make >integer division "very cheap" or even "somewhat cheap"; >once you have a reasonable implementaion, it's HARD to speed it >up by allocating it more and more space.... >(Of course, if you could afford the appropriately-sized lookup table, >you could do it by tabe lookup, but that is a LARGE table not likely >to be placed onto a CPU.:-) A table lookup, sigh... Maybe! What with all these sneaky compression techniques around these days -- not that I've looked very closely, mind you -- maybe you _could_ fit a whole division table in there. Beside the actual bits needed to store all the results (which works out at about 1 bit per result due to divides nice nature (e.g. >1/2 results are either 0 or 1)), we need to also handle all those address lines. Maybe some kind of `perfect hashing' scheme? (I remember Knuth remarked somewhere that perfect hashing was going to be very difficult -- and then some smartys came _up_ with some fairly good approximations to it in several domains.) I am not sure what has/can be done with a table and (very) simple hardware to do interpolation. And then again -- maybe you can just live with a (relatively) fast reciprocal as in the program below. (Of course this _still_ isn't cheap in terms of gates). Although the algorithm appears to bristle with multiples (apart from the one we'll need to actually get a division going), they aren't all general. One is simply a shift. (It turns out playing around with exponents via C's standard library functions is not worth the effort). The other is a `small' multiply (although indicated as a short x double in the program it could be implemented in a relatively small amount of dedicated hardware). The only remaining multiply is a full one -- but doesn't need to be performed in the majority of cases (given uniform operand to recip). (Some extant hardware relegates similar final `correction steps' to a separate function unit.) Although it's similar to a one-srep Newton-Raphson, it isn't. It's a simpler scheme based on the expansion: (a+b)^-1 = a^-1 (1 + a^-1 b)^-1 =about a^-1 (1 - a^-1 b) when b< #include #define NIT 100000 #define NTAB 1048 int mask[32]; double shift[32]; double rtab[NTAB]; #define recip(x) {\ register y;\ register n;\ for(y=x,n=0; y>=NTAB; y>>=1,n++);\ res = rtab[y];\ if(n) {\ register short b;\ res *= shift[n];\ b = x & mask[n];\ if(b) res *= 1-res*b;\ }} double test1(){ register i; register double ans=0; register double res; for(i=1;i<=NIT;i++){ recip(i); ans+= res; } return ans; } double test2() { register i; register double ans=0; for(i=1;i<=NIT;i++) ans+= 1.0/i; return ans; } main(){ double t0,t1,t2; init(); t0=clock(); printf("%g\n", test1()); t1=clock(); printf("%g\n", test2()); t2=clock(); printf("%g\n",(t2-t1)/(t1-t0)); } init(){ int i; rtab[0]=1e38; for(i=1;i