Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Subject: Re: RISC integer multiply/divide (was Re: Snake)
Message-ID: <1991Apr5.191136.21806@bingvaxu.cc.binghamton.edu>
Organization: State University of New York at Binghamton
References: <1991Apr3.232757.8020@bingvaxu.cc.binghamton.edu> <9538@mentor.cc.purdue.edu> <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu>
Date: Fri, 5 Apr 1991 19:11:36 GMT

Various people have responded via email regarding the ``3-multiply
trick''.

Mostly the feeling is that any extra trouble required by the method is
not worth the results. I must concur in the majority of cases. However,
we can ask the converse question ``on which architectures will the trick
work?'' As I indicated in a previous post, I had in mind DSP-type
chips that have fancy multiply-and-add and add-and-shift instructions.

The trick is obviously worthwhile on machines with a very expensive
multiply as well.

I've written another little ``benchmark'' program to compare a 32x32->64
(yuk!) unsigned multiply written using both the naive 4-multiply method:

	(a+b)*(c+d) = a*c + b*c + a*d + b*d

and the 3-multiply trick:

	(a+b)*(c+d) = (a-b)*(c-d) + 2*(b*c + a*d)

where a, b, c and d are the obvious high & low parts of 2 xp unsigned
numbers.

The results from pixie show that unsigned multiplies account for about 10% 
of all opcodes (all subroutine calls were removed by the optimizer and lost 
delay slots counted for relatively few cycles).

Specifically in the 4-multiply case we find:

	andi	13.56
	addu	10.17
	multu	10.17
	srl	6.78
	sw	8.47
	addiu	8.48
	sll	3.39

and in the 3-multiply case we find:

	andi	13.63
	addu	9.09
	multu	7.57
	srl	6.06
	sw	9.09
	addiu	7.58
	sll	3.03

as percentages of dynamic instructions.

We can then ask ``how slow would multiply have to be to make an overall 
significant difference here?''

We set up the equation:

	10.17 T + (100-10.17) = (7.57 T + (100-7.57)) * 1.1

I.e. ``when will the total runtime for the 4-mult routine be 10% larger
than for the 3-mult routine?''. (I am assuming here that all other
instructions apart from multiply take about the same time).

We get T=
	(100-10.17) - 1.1*(100-7.57)
	----------------------------
	7.57 * 1.1 - 10.17

	= 6.43 approx.

The Sparc would therefore be a good architecture to use the trick on.

Running the benchmark on a Sun SS1 we have the output:

	3700550176	3700550176
	minfact=.75
	maxfact=1
	sumfact=8.71429
	arithavfact=.871429
	prodfact=.237305
	geoavfact=.237305

The first couple of numbers merely check that both routines produce
the same result (it totals the high-order parts of a large number of 
random 64-bit products -- a sorta ``signature''). The best the
3-multiply routine can do is 25% faster than the naive method -- as
expected. Some cases break even, however. The arithmetic average has
the 3-mult routine running a little more than 10% faster.

The next question we might ask is ``what is the situation on a machine with 
multiply-and-add?''. I can only approximate this at present (not having
my hands on a DSP chip that also has a good optimizing C compiler :-).

By combining the addu+multu figures above we arrive at:

	2*10.17 T + (100-2*10.17) = ((9.09+7.57) T + (100-9.09-7.57))*1.1

where T is the time of a synthetic multiply-and-add instruction.
(This is obviously only approximate -- not EVERY add is associated with
a multiply although vv).

We find: T =  6 approx.

Thus if a combined m&a instruction is 6 times slower ON AVERAGE than other 
instructions in a given architecture it may be worthwhile doing the 3-mult 
trick. 

If m&a is implemented by simply chaining the o/p of a multiplier to the input 
of an adder then it may be likely that the average cost of multiply-and-add
exceeds this figure.

It might also be true even taking into account a more sophisticated 
pipelined h/w. The algorithm in question is subject to lost ``m&a slots'' 
due to the high density of m&a instructions; the result of one m&a
is typically required for a subsequent operation. While the alu
is busy we can't use it for the add/subtract phase of an m&a instruction
(and we rely on the compiler therefore NOT to schedule one in cases
where this would happen).

To summarize: using the 3-multiply trick will not get you anything
on machines where multiplies are not expensive (i.e. < 6.5 average 
instructions).  This lets out VAX's, 68k's, Mips's but doesn't seem to include 
the SS1 (I don't have any later SS's to hand, sorry). In the world of smaller 
machines (:-) 386/486 without co-processors are likely another.

On machines with instructions such as multiply-and-add, etc, the 3-multiply 
trick may be more likely to pay off, as m&a will be subject to scheduling 
collisions and other slot losses for the algorithms in question.  The trick 
apparently requires a multiply-and-add to cost about 6 average instructions 
to pay off on these architectures.

-kym

BTW, I would be interested in any multiply-and-add figures people have to hand. 
If there is sufficient interest, I will collate and post a summary at a future 
date.

C code:

===compare 3-m and 4-m 32x32->64 unsigned multiply using 16x16->32 arithmetic===
#include	<stdio.h>
#include	<math.h>
#include	<assert.h>

#define	BIT(N)	(1L<<(N))
#define	MASK(N)	(BIT(N)-1)

#define	SHORTBITS	16
#define	MAX	(1L<<SHORTBITS)
#define	NSAMP	10000
#define	NIT	10

typedef unsigned short UShort;
typedef short Short;
typedef long Long;
typedef unsigned long ULong;
typedef struct {
	ULong lo,hi;
	} Bignum;

Bignum *umul3(x,y) register ULong x,y; {
	register UShort a,b,c,d;
	register ULong ac,bd;
	register ULong mid;
	static Bignum res;

	a=x>>SHORTBITS;
	b=(UShort)x;
	c=y>>SHORTBITS;
	d=(UShort)y;

/***
	(a+b)(c+d)

	32	16	16

	ac	
		bd
		ac	
		-(a-b)(c-d)
			bd
***/
	ac=a*c;
	bd=b*d;

	mid=bd+ac+(bd>>SHORTBITS);
	if(a>=b) if(c>=d) mid-=(a-b)*(c-d);
		else mid+=(a-b)*(-c+d);
	else if(c>=d) mid+=(-a+b)*(c-d);
		else mid-=(-a+b)*(-c+d);

	res.lo=((UShort)bd)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((ac+(mid>>SHORTBITS))<<SHORTBITS);

	return &res;
	}

Bignum *umul4(x,y) register ULong x,y; {
	register UShort a,b,c,d;
	register ULong lo,mid;
	static Bignum res;

	a=x>>SHORTBITS;
	b=(UShort)x;
	c=y>>SHORTBITS;
	d=(UShort)y;

/***
	(a+b)(c+d)	=	ac + ad + bc + bd

	32	16	16
	ac
		bc
		ad
			bd
***/
	
	lo=b*d;
	mid=b*c+a*d+(lo>>SHORTBITS);

	res.lo=((UShort)lo)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((a*c+(mid>>SHORTBITS))<<SHORTBITS);

	return &res;
	}

ULong x[NSAMP],y[NSAMP];

void init() {
	unsigned i;

	for(i=0;i<NSAMP;i++) {
		x[i]=rand();
		y[i]=rand();
		}
	}

void main(argc,argv) char *argv[]; {
	unsigned it,n;
	double t0,t1,t2;
	ULong sum3=0,sum4=0;
	double sumfact=0,prodfact=1;
	double minfact=1e38,maxfact= -1e38;

	for(it=0; it<NIT; it++) {
		double fact;

		init();

		t0=clock();
		for(n=0; n<NSAMP; n++) {
			sum4+=umul4(x[it],y[it])->hi;
			}
		t1=clock();

		for(n=0; n<NSAMP; n++) {
			sum3+=umul3(x[it],y[it])->hi;
			}
		t2=clock();

		fact=(t2-t1)/(t1-t0);
		prodfact*=fact;
		sumfact+=fact;
		if(fact>maxfact) maxfact=fact;
		if(fact<minfact) minfact=fact;
		}

	printf("%lu	%lu\n",sum3,sum4);/* in case of clever compiler */
	printf("minfact=%g\n",minfact);
	printf("maxfact=%g\n",maxfact);
	printf("sumfact=%g\n",sumfact);
	printf("arithavfact=%g\n",sumfact/NIT);
	printf("prodfact=%g\n",prodfact);
	printf("geoavfact=%g\n",pow(prodfact,1.0/NIT));

	exit(0);
	}
===end end end===