Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!purdue!tut.cis.ohio-state.edu!cs.utexas.edu!uunet!pdn!oz!alan
From: alan@oz.nm.paradyne.com (Alan Lovejoy)
Newsgroups: comp.arch
Subject: Re: Integer Multiply/Divide on Sparc
Message-ID: <6904@pdn.paradyne.com>
Date: 4 Jan 90 20:29:18 GMT
References: <84768@linus.UUCP> <8840004@hpfcso.HP.COM> <1804@l.cc.purdue.edu> <1535@cbnewsi.ATT.COM> <6903@pdn.paradyne.com>
Sender: usenet@pdn.paradyne.com
Reply-To: alan@oz.paradyne.com (Alan Lovejoy)
Organization: AT&T Paradyne, Largo, Florida
Lines: 89

Excuuuuuuuuuuse me!  Three errors, one very slight, two not so slight!

In article <6903@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>Sounds like an interesting contest to me!  Here's my try (for multiply,
>anyway) using the mc88k instruction set:
>
>(NOTE: r0 is the constant 0 (a hardware protocol); jsr sets r1 with the
>return address)

;dmulu -- multiply unsigned 32x32=>64
;r2 and r3 contain two 32-bit unsigned integers to be multiplied
;r12 will contain the high word (32 bits) of the product (r2 * r3)
;r13 will contain the low word (32 bits) of the product (r2 * r3)

        .proc dmulu
dmulu:
        ;divide the two 32-bit numbers into 4 16-bit parts:
        extu    r4, r2, 16<16>;   r4 = r2 >> 16; (1 cycle)

*** error #1: The original posting consistently had "16<16<" instead of
"16<16>".  The wonders of global replace...  (The very slight error).

        extu    r5, r3, 16<16>;   r5 = r3 >> 16; (1 cycle)
        mask    r2, r2, $FFFF;    r2 = r2 & 0xFFFF (1 cycle)
        mask    r3, r3, $FFFF;    r3 = r3 & 0xFFFF (1 cycle)

        ;calculate partial products:
>mul    r6, r3, r3;       r6 = r2 * r3; (4 cycles!)

*** error #2: One of the r3's in the above instruction must be an r2 (as the
comment shows).

>mul    r7, r3, r4;       r7 = r3 * r4; (4 cycles!)
>mul    r8, r2, r5;       r8 = r2 * r5; (4 cycles!)
>mul    r9, r4, r5;       r9 = r4 * r5; (4 cycles!)

*** error #3: I completely forgot that the mul instruction is fully pipelined!
It takes 4 cycles to complete, yes, but a new instruction can be issued on
the very next cycle.  Up to six mul and/or fmul instructions can be in the
pipeline at one time.  So, (after some reordering to avoid stalls) we have
instead:

        ;calculate partial products:
        mul     r6, r2, r3;       r6 = r2 * r3; (1 cycle)
        mul     r8, r2, r5;       r8 = r2 * r5; (1 cycle)
        mul     r7, r3, r4;       r7 = r3 * r4; (1 cycle)
        mul     r9, r4, r5;       r9 = r4 * r5; (1 cycle)

        ;sum partial products:
        extu    r10, r6, 16<16>;  r10 = r6 >> 16; (1 cycle)
        addu    r8, r8, r10;      r8 = r8 + r10; (1 cycle)

*** note that r7 changed to r8 to avoid a stall

        addu.co r10, r7, r8;      r10 = r7 + r8; generate carry bit (1 cycle)
        addu.ci r12, r0, r0;      r12 = carry from previous addu (1 cycle)
        mak     r12, r12, 16<16>; r12 = r12 << 16; (1 cycle)
        addu    r12, r12, r9;     r12 = r12 + r9; (1 cycle)
        mak     r10, r10, 16<16>; r10 = r10 << 16; (1 cycle)
        mask    r13, r6, $FFFF;   r13 = r6 & 0xFFFF; (1 cycle)
        jmp.n   r1;               return to caller after next instruction (1 cycle)
        or      r13, r13, r10;    r13 = r13 | r10; (1 cycle)
        ; done: 18 cycles total (without short circuit code)
        .end

; invoking dmulu:
; it is the caller's responsibility to save registers r2-13, which
; the caller may or may not need to do...

        or      r25, r0, #dmuluLo16;   r25 = low 16 bits of dmulu address (1 cycle)
        or.u    r25, r25, #dmuluHi16;  r25 = r25 | high 16 bits of dmulu address (1 cycle)
        ld      r2, r30, #factor1;     r2 = *(framePtr + offset of factor1) (1 cycle)
        jsr.n   r25;                   call dmulu after next instruction (1 cycle)
        ld      r3, r30, #factor2;     r3 = *(framePtr + offset of factor2) (1 cycle)
        st.d    r12, r30, #product;    *(framePtr + offset of product) = r12, r13

; register to register: 1 cycle call excluding execution time for dmulu
; memory to memory: 6 cycle call excluding execution time for dmulu

; GRAND TOTAL register to register: 19 cycles (was 30)
; GRAND TOTAL memory to memory: 24 cycles (was 36)

*** A significant performance improvement of 33 per cent.
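
For readers who want to check the arithmetic away from an 88k, here is a rough C model of the same 16-bit partial-product scheme.  It is only a sketch: the names dmulu_model and dmulu_result are made up for illustration and appear nowhere in the posting, and the register names in the comments are just a guide to which step each line is meant to mirror.  Note that the model folds the upper 16 bits of the middle sum (mid >> 16) into the high word; the listing above shifts r10 left by 16 for the low word but shows no matching add of r10 >> 16 into r12, so compare the two before relying on either.

```c
#include <stdint.h>

/* Hypothetical C model of a 32x32 -> 64 unsigned multiply built from
   four 16x16 -> 32 partial products.  All names are illustrative. */
typedef struct {
    uint32_t hi;   /* plays the role of r12 (high word) */
    uint32_t lo;   /* plays the role of r13 (low word)  */
} dmulu_result;

dmulu_result dmulu_model(uint32_t a, uint32_t b)
{
    /* Split each operand into 16-bit halves (the extu / mask step). */
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    /* Four partial products; each one fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;   /* r6 */
    uint32_t lh = a_lo * b_hi;   /* r8 */
    uint32_t hl = a_hi * b_lo;   /* r7 */
    uint32_t hh = a_hi * b_hi;   /* r9 */

    /* Middle column.  lh + (ll >> 16) cannot overflow 32 bits,
       but adding hl can, so the carry is captured separately. */
    uint32_t t     = lh + (ll >> 16);
    uint32_t mid   = t + hl;
    uint32_t carry = (mid < t) ? 1u : 0u;

    dmulu_result r;
    /* High word: top product, plus the carry at bit 16,
       plus the upper half of the middle column. */
    r.hi = hh + (carry << 16) + (mid >> 16);
    /* Low word: lower half of the middle column over the low 16 bits of ll. */
    r.lo = (mid << 16) | (ll & 0xFFFFu);
    return r;
}
```

One way to exercise it is to compare r.hi and r.lo against the two halves of (uint64_t)a * b for a handful of operands, including 0xFFFFFFFF * 0xFFFFFFFF, which forces a carry out of the middle column.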

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me.
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>