Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!rutgers!att!cbnewsh!beyer
From: beyer@cbnewsh.ATT.COM (jean-david.beyer)
Newsgroups: comp.arch
Subject: Re: Integer Multiply/Divide on Sparc
Summary: Hardware and Software integer multiply instructions.
Message-ID: <6908@cbnewsh.ATT.COM>
Date: 28 Dec 89 14:06:02 GMT
References: <84768@linus.UUCP> <8840004@hpfcso.HP.COM> <1804@l.cc.purdue.edu> <1535@cbnewsi.ATT.COM>
Organization: AT&T Bell Laboratories
Lines: 41

In article <1535@cbnewsi.ATT.COM>, reha@cbnewsi.ATT.COM (reha.gur) writes:
> In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
>
> > To multiply
> > two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> > the 32 bit numbers must be divided into 16 bit parts.  The whole operation
> > takes about 20 operations (count them).  Shift and add are far slower.
>
> The numbers I get (from looking at the data sheets and other info) for
> two machines: a 25Mhz i486 and a 25Mhz SPARC are as below:
>
> Assuming no cache hits and various other items:
>
> i486:  18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
> SPARC: 48-52 cycles for same (including subroutine call and return time)
>
> i486:  32 cycles for signed 32 bit division (acc by reg)
> SPARC: 41 (approximate best case) to 211 (approximate worst case)
>        (depends on bits in dividend and divisor)

I do not have the numbers for SPARC handy, but would tend to trust reha.

However, having spent some time working on optimizers that work at the
assembly level, I notice that, at least in the output of our C compilers
for the benchmarks we use, most integer multiplies arise from array
subscripts or from some kinds of pointer dereferencing (pointers to
structures). In these cases, the multiply instructions are mostly
multiplications by constants. These constants frequently have a small
number of 1 bits in their binary representation.
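
To illustrate what a small number of 1 bits buys, here is a minimal
sketch, not taken from any particular compiler. It assumes an array of
structures whose elements are 12 bytes each, so that indexing element i
multiplies i by 12 (binary 1100), and a second element size of 40
(binary 101000). The element sizes and all function names are
hypothetical, chosen only for the example. Each 1 bit in the constant
costs one shifted add:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 12 = 8 + 4 (binary 1100): two shifts and one add. */
uint32_t mul12(uint32_t i)
{
    return (i << 3) + (i << 2);
}

/* Multiply by 40 = 32 + 8 (binary 101000): two shifts and one add. */
uint32_t mul40(uint32_t i)
{
    return (i << 5) + (i << 3);
}

/*
 * General shift-and-add multiply, shown for comparison. It performs
 * one add for every 1 bit in c, which is why constants with few
 * 1 bits are cheap when the loop is unrolled inline as above.
 * Results wrap modulo 2^32, like a 32x32 -> 32 multiply.
 */
uint32_t mul_shift_add(uint32_t x, uint32_t c)
{
    uint32_t sum = 0;
    while (c != 0) {
        if (c & 1u)
            sum += x;   /* this bit of c is set: add shifted x */
        x <<= 1;
        c >>= 1;
    }
    return sum;
}

/* Check the inline sequences against the ordinary multiply. */
void check_constant_multiplies(void)
{
    uint32_t i;
    for (i = 0; i < 1000; i++) {
        assert(mul12(i) == i * 12u);
        assert(mul40(i) == i * 40u);
        assert(mul_shift_add(i, 12u) == mul12(i));
    }
}
```

Calling check_constant_multiplies() from a test program exercises the
three routines. Whether the two-shift form actually beats a multiply
instruction depends on the machine, so the sketch shows only the
transformation, not a timing claim.
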
Consequently, instead of calling a general purpose multiply subroutine,
it suffices to insert special purpose inline code that multiplies by the
desired constant. (It might even pay to do a strength reduction
optimization.) This can be much faster than the general purpose multiply
subroutine, and may be faster than a hardware multiply instruction,
depending on the design of the hardware multiply.

--
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer