Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Subject: Re: RISC integer multiply/divide (was Re: Snake)
Message-ID: <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu>
Organization: State University of New York at Binghamton
References: <3237@charon.cwi.nl> <1991Apr3.232757.8020@bingvaxu.cc.binghamton.edu> <9538@mentor.cc.purdue.edu>
Date: Thu, 4 Apr 1991 21:35:50 GMT

In article <9538@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[discussion of double-prec multiply, esp 32x32->64 (yuk!)]
>This is true, but you would need either 17x17 -> 34 multiply, or 
>(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
>keeping track of signs/carries, or substantially more additions.

Well I'm not sure I understand this -- it's probably a slightly
different implementation of the what I have in mind.

One of the reasons for keeping the 3-multiply trick in mind is that some
chips include a dedicated fast multiplier. It's not unknown to link up this 
guy with the rest of the alu so that simultaneous add and mults can be done 
(after all, we spent a lot of money to get that multiplier IN there so let's 
use it for everything :-). Instructures like multiply-and-add, 
multiply-shift-and-add are therefore not unheard of. (I have in mind esp the 
TI 320 series, I presume those 6000 thingies can do similar things, but I 
haven't looked at the gory details).

So for such machines we can both make use of the trick and try to use as
many fast instruction combinations as possible; the trouble _may_ be worth it 
in some applications (although I haven't actually measured anything).

The appended programs show how to obtain single (i.e. 32x32->32) and
double (32x32->64) results for multiply using a 16x16->32 multiplier. The 
routines are pretty rough (written on the fly), so forgive any glaring (or 
subtle) errors.

>BTW, if unsigned multiplication is not available, additional problems
>arise in any case.

I'm not sure I understand this either -- my routines use signed stuff. I can 
appreciate the problems regarding missing unsigned arithmetic, however.

-kym

C code:

===single precision (32x32->32) multiply using (16x16->32) multiplier===
/***

Taking:
	(aL+b)(cL+d) = LLac+Lad+Lbc+bd
	L(a-b)(c-d) = Lac-Lad-Lbc+Lbd
where L=2^k we find:
	L(a-b)(c-d) + (aL+b)(cL+d) = LLac+Lac+Lbd+bd

Thereore:
	mul(x,y) = hix hiy <<2k + hix hiy<<k + lox loy<<k + lox loy 
		- (hix-lox)(hiy-loy)<<k

If you only need the low-order 2k bits of the product (i.e. the same size as 
the operands):
	mul(x,y) = hix hiy<<k + lox loy<<k + (-hix+lox)(hiy-loy)<<k + lox loy

We could also reformulate this as:
	mul(x,y) = hix hiy<<k + lox loy<<k + (-hix+lox)(hiy-loy)<<k + loxyhi<<k
		+ loxylo
that requires only k-bit adds. 2k-bit adds are more convenient.

***/

#define SHORTBITS 16	/* size of ``short'' */

typedef int Num;	/* regular size */
typedef short SNum;	/* 1/2 size */

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

Num mul(x,y) Num x,y; {
	SNum hix,hiy,lox,loy;
	Num t1,t2;

/*** I'll rely on a smart compiler to sort out the following 
	mess into something reasonable. ***/

	hix=x>>SHORTBITS;
	lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS;
	loy=y&MASK(SHORTBITS);

	t2 = t1 = (Num)lox*loy;
	t2 += (Num)hix*hiy;		/* multiply-and-add */
	t2 += ((Num)-hix+lox)*((Num)hiy-loy); /* multiply-and-add */
	t1 += t2<<SHORTBITS;	/* shift-and-add */

	return t1;
	}

===double prec multiply (32x32->64) using (16x16->32) multiplier===

#define SHORTBITS 16

typedef long LNum;
typedef int Num;
typedef short SNum;

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

LNum mul(x,y) Num x,y; {
	SNum hix,hiy,lox,loy;
	LNum t1;

	hix=x>>SHORTBITS;
	lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS;
	loy=y&MASK(SHORTBITS);

	t1 = (Num)lox*loy;
	t1 += ((Num)hix*hiy)<<SHORTBITS; /* multiply-shift-and-add */
	t1 += t1<<SHORTBITS;	/* shift-and-add */
	t1 += (((Num)-hix+lox)*((Num)hiy-loy))<<SHORTBITS; /* multiply-shift-and-add */

	return t1;
	}
===end end end===