Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: Divide in 1 cycle (was Re: standard extensions)
Message-ID: <1991Mar8.110801.20042@bingvaxu.cc.binghamton.edu>
Date: 8 Mar 91 11:08:01 GMT
References: <1991Mar6.183927.16256@news.arc.nasa.gov> <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> <777@spim.mips.COM>
Organization: State University of New York at Binghamton
Lines: 136

In article <777@spim.mips.COM> mash@mips.com (John Mashey) writes:
>You can make integer divison very expensive by throwing zero hardware at it
>(SPARC), somewhat expensive by throwing a little bit of hardware
>at it, or somewhat less expensive by throwing more hardware at it
>(R3000, 88K, etc).
>No matter what you do with CPU logic, you are NOT likely to make
>integer division "very cheap" or even "somewhat cheap";
>once you have a reasonable implementaion, it's HARD to speed it
>up by allocating it more and more space....
>(Of course, if you could afford the appropriately-sized lookup table,
>you could do it by tabe lookup, but that is a LARGE table not likely
>to be placed onto a CPU.:-)

A table lookup, sigh... Maybe!

What with all these sneaky compression techniques around these days --
not that I've looked very closely, mind you -- maybe you _could_ fit a
whole division table in there. Beside the actual bits needed to store
all the results (which works out at about 1 bit per result due to
divides nice nature (e.g. >1/2 results are either 0 or 1)), we need to
also handle all those address lines. Maybe some kind of `perfect
hashing' scheme? (I remember Knuth remarked somewhere that perfect
hashing was going to be very difficult  -- and then some smartys came _up_ 
with some fairly good approximations to it in several domains.)

I am not sure what has/can be done with a table and (very) simple
hardware to do interpolation.

And then again -- maybe you can just live with a (relatively) fast
reciprocal as in the program below. (Of course this _still_ isn't cheap
in terms of gates).

Although the algorithm appears to bristle with multiples (apart from the
one we'll need to actually get a division going), they aren't all
general. One is simply a shift. (It turns out playing around with
exponents via C's standard library functions is not worth the effort).
The other is a `small' multiply (although indicated as a short x double
in the program it could be implemented in a relatively small amount of
dedicated hardware). The only remaining multiply is a full one -- but
doesn't need to be performed in the majority of cases (given uniform
operand to recip). (Some extant hardware relegates similar final 
`correction steps' to a separate function unit.)

Although it's similar to a one-srep Newton-Raphson, it isn't. It's a
simpler scheme based on the expansion:
	(a+b)^-1 = a^-1 (1 + a^-1 b)^-1
		=about a^-1 (1 - a^-1 b)
	when b<<a.
We ensure b<<a by the appropriate FFO-like manipulations at the start of
the routine.

The software recip compares with the built-in (or not :-) versions on several 
machines as follows:

Machine		ratio of 
		software/`builtin'

VAX 6440	24%
Mips Magnum	30%	
VAX 8530	30%	
Sun 3/60	55%	
Sun SS1		70%	?!

I shall not comment further.

In summary: it seems as though there are some mechanisms for obtaining divide 
pipes that return a result per clock, but at the relative low proportion of 
overall dynamic opcodes (<=3% as pixie & I measure) and possibly not much more 
not matter HOW cheap and/or fast it could be made, further effort is probably 
wasted.

-kym

C code ahead -- BEWARE.

--------------------------------------------------------------------------------
#include <stdio.h>
#include <math.h>

#define NIT 100000
#define NTAB 1048

int mask[32];
double shift[32];
double rtab[NTAB];

#define recip(x) {\
	register y;\
	register n;\
	for(y=x,n=0; y>=NTAB; y>>=1,n++);\
	res = rtab[y];\
	if(n) {\
		register short b;\
		res *= shift[n];\
		b = x & mask[n];\
		if(b) res *= 1-res*b;\
	}}

double test1(){
	register i;
	register double ans=0;
	register double res;

	for(i=1;i<=NIT;i++){ recip(i); ans+= res; }
	return ans;
	}

double test2() {
	register i;
	register double ans=0;

	for(i=1;i<=NIT;i++) ans+= 1.0/i;
	return ans;
	}

main(){
	double t0,t1,t2;
	init();
	t0=clock();
	printf("%g\n", test1());
	t1=clock();
	printf("%g\n", test2());
	t2=clock();
	printf("%g\n",(t2-t1)/(t1-t0));
	}

init(){
	int i;

	rtab[0]=1e38;
	for(i=1;i<NTAB;i++) rtab[i]=1.0/i;
	for(i=0;i<32;i++) mask[i]=(1<<i) - 1;
	for(i=0;i<32;i++) shift[i]=1.0/(1<<i);
	}

--------------------------------------------------------------------------------