Xref: utzoo sci.math:9420 comp.arch:13348 comp.lang.c:25315 comp.sources.wanted:10240 Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!bloom-beacon!snorkelwacker!think!samsung!caesar.cs.montana.edu!milton!whit From: whit@milton.acs.washington.edu (John Whitmore) Newsgroups: sci.math,comp.arch,comp.lang.c,comp.sources.wanted Subject: Re: Integer Multiply/Divide on Sparc Summary: use shortcuts Message-ID: <1543@milton.acs.washington.edu> Date: 24 Jan 90 08:53:36 GMT References: <84768@linus.UUCP> <15418@vlsisj.VLSI.COM> Reply-To: whit@milton.acs.washington.edu (John Whitmore) Organization: University of Washington, Seattle Lines: 45 In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes: >In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes: >>Does any have, of know of software for the SPARC [SUN-4] that will >>perform the following: >> >> [standard multiply and divide] > . . . >There should be instructions on the order of "multiply step" and "divide >step", each of which will do one of the 32 adds/subtracts and then shift. > I strongly disagree. Smarter routines can optimize a lot of that kind of "microcode" away, and should do so given the opportunity. >Thus they provide you with the tools to do your own multiply and divide. >One of the benefits is that a compiler can optimize small multiplies and >divides to make them execute quicker (i.e. multiply by 10 takes 4 steps >instead of 32). Yes, EXACTLY. So extend the principle; take four-bit nibbles of the argument and do a 16-way JUMP (whatever the equivalent is on a SPARC) to sixteen cases like CASE x0000: RTN CASE x0001: add (to accumulator) CASE x0010: shift +1, add CASE x0011: subtract, shift+3, add CASE x0100: shift+3, add CASE x0101: add, shift+3, add CASE x0110: shift+2, add, shift+3, add CASE x0111: subtract, shift+4, add and if I can figure it out, you experts are getting bored by now. Four operations MAX for a four-bit multiplicand, opposed to 12 operations (estimated) for the one-bit "MULSTEP" approach. > >P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be > incredibly slow. Unroll the loop 4 or 8 times (MULSTEP, MULSTEP, > MULSTEP, MULSTEP, SUB 4, BNZ). Branches are expensive. Hm. The principle is good, but don't think small. Unroll it to really large chunks of code. The "BNZ" is a bottleneck that shouldn't be employed when really large fanout of the code path can be done in ONE step. I seem to recall that the trick (above) was employed by a hardware multiplier IBM made, some decade or more ago. It still works. I am known for my brilliance, John Whitmore by those who do not know me well.