Newsgroups: comp.arch Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Subject: Re: RISC integer multiply/divide (was Re: Snake) Message-ID: <1991Apr5.191136.21806@bingvaxu.cc.binghamton.edu> Organization: State University of New York at Binghamton References: <1991Apr3.232757.8020@bingvaxu.cc.binghamton.edu> <9538@mentor.cc.purdue.edu> <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> Date: Fri, 5 Apr 1991 19:11:36 GMT Various people have responded via email regarding the ``3-multiply trick''. Mostly the feeling is that any extra trouble required by the method is not worth the results. I must concur in the majority of cases. However, we can ask the converse question ``on which architectures will the trick work?'' As I indicated in a previous post, I had in mind DSP-type chips that have fancy multiply-and-add and add-and-shift instructions. The trick is obviously worthwhile on machines with a very expensive multiply as well. I've written another little ``benchmark'' program to compare a 32x32->64 (yuk!) unsigned multiply written using both the naive 4-multiply method: (a+b)*(c+d) = a*c + b*c + a*d + b*d and the 3-multiply trick: (a+b)*(c+d) = (a-b)*(c-d) + 2*(b*c + a*d) where a, b, c and d are the obvious high & low parts of 2 xp unsigned numbers. The results from pixie show that unsigned multiplies account for about 10% of all opcodes (all subroutine calls were removed by the optimizer and lost delay slots counted for relatively few cycles). Specifically in the 4-multiply case we find: andi 13.56 addu 10.17 multu 10.17 srl 6.78 sw 8.47 addiu 8.48 sll 3.39 and in the 3-multiply case we find: andi 13.63 addu 9.09 multu 7.57 srl 6.06 sw 9.09 addiu 7.58 sll 3.03 as percentages of dynamic instructions. We can then ask ``how slow would multiply have to be to make an overall significant difference here?'' We set up the equation: 10.17 T + (100-10.17) = (7.57 T + (100-7.57)) * 1.1 I.e. ``when will the total runtime for the 4-mult routine be 10% larger than for the 3-mult routine?''. (I am assuming here that all other instructions apart from multiply take about the same time). We get T= (100-10.17) - 1.1*(100-7.57) ---------------------------- 7.57 * 1.1 - 10.17 = 6.43 approx. The Sparc would therefore be a good architecture to use the trick on. Running the benchmark on a Sun SS1 we have the output: 3700550176 3700550176 minfact=.75 maxfact=1 sumfact=8.71429 arithavfact=.871429 prodfact=.237305 geoavfact=.237305 The first couple of numbers merely check that both routines produce the same result (it totals the high-order parts of a large number of random 64-bit products -- a sorta ``signature''). The best the 3-multiply routine can do is 25% faster than the naive method -- as expected. Some cases break even, however. The arithmetic average has the 3-mult routine running a little more than 10% faster. The next question we might ask is ``what is the situation on a machine with multiply-and-add?''. I can only approximate this at present (not having my hands on a DSP chip that also has a good optimizing C compiler :-). By combining the addu+multu figures above we arrive at: 2*10.17 T + (100-2*10.17) = ((9.09+7.57) T + (100-9.09-7.57))*1.1 where T is the time of a synthetic multiply-and-add instruction. (This is obviously only approximate -- not EVERY add is associated with a multiply although vv). We find: T = 6 approx. Thus if a combined m&a instruction is 6 times slower ON AVERAGE than other instructions in a given architecture it may be worthwhile doing the 3-mult trick. If m&a is implemented by simply chaining the o/p of a multiplier to the input of an adder then it may be likely that the average cost of multiply-and-add exceeds this figure. It might also be true even taking into account a more sophisticated pipelined h/w. The algorithm in question is subject to lost ``m&a slots'' due to the high density of m&a instructions; the result of one m&a is typically required for a subsequent operation. While the alu is busy we can't use it for the add/subtract phase of an m&a instruction (and we rely on the compiler therefore NOT to schedule one in cases where this would happen). To summarize: using the 3-multiply trick will not get you anything on machines where multiplies are not expensive (i.e. < 6.5 average instructions). This lets out VAX's, 68k's, Mips's but doesn't seem to include the SS1 (I don't have any later SS's to hand, sorry). In the world of smaller machines (:-) 386/486 without co-processors are likely another. On machines with instructions such as multiply-and-add, etc, the 3-multiply trick may be more likely to pay off, as m&a will be subject to scheduling collisions and other slot losses for the algorithms in question. The trick apparently requires a multiply-and-add to cost about 6 average instructions to pay off on these architectures. -kym BTW, I would be interested in any multiply-and-add figures people have to hand. If there is sufficient interest, I will collate and post a summary at a future date. C code: ===compare 3-m and 4-m 32x32->64 unsigned multiply using 16x16->32 arithmetic=== #include #include #include #define BIT(N) (1L<<(N)) #define MASK(N) (BIT(N)-1) #define SHORTBITS 16 #define MAX (1L<>SHORTBITS; b=(UShort)x; c=y>>SHORTBITS; d=(UShort)y; /*** (a+b)(c+d) 32 16 16 ac bd ac -(a-b)(c-d) bd ***/ ac=a*c; bd=b*d; mid=bd+ac+(bd>>SHORTBITS); if(a>=b) if(c>=d) mid-=(a-b)*(c-d); else mid+=(a-b)*(-c+d); else if(c>=d) mid+=(-a+b)*(c-d); else mid-=(-a+b)*(-c+d); res.lo=((UShort)bd)+(mid<>SHORTBITS))<>SHORTBITS; b=(UShort)x; c=y>>SHORTBITS; d=(UShort)y; /*** (a+b)(c+d) = ac + ad + bc + bd 32 16 16 ac bc ad bd ***/ lo=b*d; mid=b*c+a*d+(lo>>SHORTBITS); res.lo=((UShort)lo)+(mid<>SHORTBITS))<hi; } t1=clock(); for(n=0; nhi; } t2=clock(); fact=(t2-t1)/(t1-t0); prodfact*=fact; sumfact+=fact; if(fact>maxfact) maxfact=fact; if(fact