Xref: utzoo sci.math:9420 comp.arch:13348 comp.lang.c:25315 comp.sources.wanted:10240
Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!bloom-beacon!snorkelwacker!think!samsung!caesar.cs.montana.edu!milton!whit
From: whit@milton.acs.washington.edu (John Whitmore)
Newsgroups: sci.math,comp.arch,comp.lang.c,comp.sources.wanted
Subject: Re: Integer Multiply/Divide on Sparc
Summary: use shortcuts
Message-ID: <1543@milton.acs.washington.edu>
Date: 24 Jan 90 08:53:36 GMT
References: <84768@linus.UUCP> <15418@vlsisj.VLSI.COM>
Reply-To: whit@milton.acs.washington.edu (John Whitmore)
Organization: University of Washington, Seattle
Lines: 45

In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
>In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>>Does any have, of know of software for the SPARC [SUN-4] that will
>>perform the following:
>>
>> [standard multiply and divide]
> . . .
>There should be instructions on the order of "multiply step" and "divide 
>step", each of which will do one of the 32 adds/subtracts and then shift.  
>
	I strongly disagree.  Smarter routines can optimize a lot of 
that kind of "microcode" away, and should do so given the opportunity.

>Thus they provide you with the tools to do your own multiply and divide.  
>One of the benefits is that a compiler can optimize small multiplies and 
>divides to make them execute quicker (i.e. multiply by 10 takes 4 steps 
>instead of 32).
	Yes, EXACTLY.  So extend the principle; take four-bit nibbles
of the argument and do a 16-way JUMP (whatever the equivalent is
on a SPARC) to sixteen cases like
CASE x0000:	RTN
CASE x0001:	add (to accumulator)
CASE x0010:	shift +1, add
CASE x0011:	subtract, shift+3, add
CASE x0100:	shift+3, add
CASE x0101:	add, shift+3, add
CASE x0110:	shift+2, add, shift+3, add
CASE x0111:	subtract, shift+4, add
	and if I can figure it out, you experts are getting bored
by now.  Four operations MAX for a four-bit multiplicand, opposed
to 12 operations (estimated) for the one-bit "MULSTEP" approach.
>
>P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
>      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
>      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.

	Hm.  The principle is good, but don't think small.  Unroll it
to really large chunks of code.  The "BNZ" is a bottleneck that
shouldn't be employed when really large fanout of the code path
can be done in ONE step.
	I seem to recall that the trick (above) was employed by a
hardware multiplier IBM made, some decade or more ago.  It still works.

I am known for my brilliance,                 John Whitmore
 by those who do not know me well.