Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!rutgers!att!cbnewsh!beyer
From: beyer@cbnewsh.ATT.COM (jean-david.beyer)
Newsgroups: comp.arch
Subject: Re: Integer Multiply/Divide on Sparc
Summary: Hardware and Software integer multiply instructions.
Message-ID: <6908@cbnewsh.ATT.COM>
Date: 28 Dec 89 14:06:02 GMT
References: <84768@linus.UUCP> <8840004@hpfcso.HP.COM> <1804@l.cc.purdue.edu> <1535@cbnewsi.ATT.COM>
Organization: AT&T Bell Laboratories
Lines: 41

In article <1535@cbnewsi.ATT.COM>, reha@cbnewsi.ATT.COM (reha.gur) writes:
> In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> 
> > To multiply
> > two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> > the 32 bit numbers must be divided into 16 bit parts.  The whole operation
> > takes about 20 operations (count them).  Shift and add are far slower.
> 
> The numbers I get (from looking at the data sheets and other info) for
> two machines: a 25MHz i486 and a 25MHz SPARC are as below:
> 
> Assuming no cache hits and various other items:
> 
> i486: 	18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
> SPARC:	48-52 cycles for same (including subroutine call and return time)
> 
> i486:	32 cycles for signed 32 bit division (acc by reg)
> SPARC:	41 (approximate best case) to 211 (approximate worst case)
> 	(depends on bits in dividend and divisor)

I do not have the numbers for SPARC handy, but would tend to trust reha.

However, having spent some time working on optimizers that operate at the
assembly level, I notice that, at least for the benchmarks we use, most of
the integer multiplies in the output of our C compilers arise from array
subscripting or from some kinds of pointer dereferencing (pointers to
structures). In these cases, the multiplications are mostly by constants,
and these constants frequently have a small number of 1 bits. Consequently,
instead of calling a general-purpose multiply subroutine, it suffices to
insert special-purpose inline code (a few shifts and adds) that multiplies
by the desired constant. (It might even pay to do a strength-reduction
optimization.) This can be much faster than the general-purpose multiply
subroutine, and may be faster than a hardware multiply instruction,
depending on the design of the hardware multiplier.
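
As a minimal sketch of the idea, written in C rather than in any particular
assembly language: the constants 20 and 7 and all the function names below
are hypothetical, chosen only for illustration, and this is not the output
of any actual compiler. A constant with few 1 bits becomes a few shifts and
adds, and strength reduction replaces the multiply in a loop with a running
add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical element size 20 = 16 + 4 (binary 10100, two 1 bits):
   two shifts and one add replace the general multiply. */
static uint32_t mul_by_20(uint32_t i)
{
    return (i << 4) + (i << 2);
}

/* Hypothetical constant 7 = 8 - 1: one shift and one subtract. */
static uint32_t mul_by_7(uint32_t i)
{
    return (i << 3) - i;
}

/* Strength reduction: instead of computing i * 20 on every iteration,
   keep a running offset and add 20 each time around the loop.
   Returns the sum of the offsets 0, 20, 40, ... for n elements. */
static uint32_t sum_offsets(uint32_t n)
{
    uint32_t offset = 0;
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        sum += offset;   /* offset equals i * 20 here */
        offset += 20;
    }
    return sum;
}

/* Self-check against the ordinary multiply operator. Unsigned
   arithmetic wraps modulo 2^32, so the results agree for all inputs. */
static void check_constant_multiplies(void)
{
    for (uint32_t i = 0; i < 1000; i++) {
        assert(mul_by_20(i) == i * 20u);
        assert(mul_by_7(i) == i * 7u);
    }
    assert(mul_by_20(0xFFFFFFFFu) == 0xFFFFFFFFu * 20u);
    assert(mul_by_7(0xFFFFFFFFu) == 0xFFFFFFFFu * 7u);
}
```

Whether such a sequence beats a hardware multiply depends on the machine;
the sketch shows only the transformation, not its cost.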

-- 
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer