Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!pdn!oz!alan
From: alan@oz.nm.paradyne.com (Alan Lovejoy)
Newsgroups: comp.arch
Subject: Re: Integer Multiply/Divide on Sparc
Message-ID: <6903@pdn.paradyne.com>
Date: 4 Jan 90 16:47:39 GMT
References: <84768@linus.UUCP> <8840004@hpfcso.HP.COM> <1804@l.cc.purdue.edu> <1535@cbnewsi.ATT.COM>
Sender: usenet@pdn.paradyne.com
Reply-To: alan@oz.paradyne.com (Alan Lovejoy)
Organization: AT&T Paradyne, Largo, Florida
Lines: 90

In article <1535@cbnewsi.ATT.COM> reha@cbnewsi.ATT.COM (reha.gur) writes:
>[...], cik@l.cc.purdue.edu (Herman Rubin) writes:
>
>> It is clear that you are not to be trusted (see above).  To multiply
>> two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
>> the 32 bit numbers must be divided into 16 bit parts.  The whole operation
>> takes about 20 operations (count them).  Shift and add are far slower.
>> Divide is even worse.  Also, there is considerable overhead in a
>> subroutine call; there are registers to save and restore.  Open
>> subroutines (in-line functions) are a way around it, but they still
>> have the problem.
>>
>> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>> Phone: (317)494-6054
>> hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

[...] > 16;                                          (1 cycle)
        extu    r5, r3, 16<16>  ; r5 = r3 >> 16;     (1 cycle)
        mask    r2, r2, $FFFF   ; r2 = r2 & 0xFFFF   (1 cycle)
        mask    r3, r3, $FFFF   ; r3 = r3 & 0xFFFF   (1 cycle)
;calculate partial products:
        mul     r6, r2, r3      ; r6 = r2 * r3;      (4 cycles!)
        mul     r7, r3, r4      ; r7 = r3 * r4;      (4 cycles!)
        mul     r8, r2, r5      ; r8 = r2 * r5;      (4 cycles!)
        mul     r9, r4, r5      ; r9 = r4 * r5;      (4 cycles!)
;sum partial products:
        extu    r10, r6, 16<16> ; r10 = r6 >> 16;                          (1 cycle)
        addu    r7, r7, r10     ; r7 = r7 + r10;                           (1 cycle)
        addu.co r10, r7, r8     ; r10 = r7 + r8; generate carry bit        (1 cycle)
        addu.ci r12, r0, r0     ; r12 = carry from previous addu           (1 cycle)
        mak     r12, r12, 16<16> ; r12 = r12 << 16;                        (1 cycle)
        addu    r12, r12, r9    ; r12 = r12 + r9;                          (1 cycle)
        mak     r10, r10, 16<16> ; r10 = r10 << 16;                        (1 cycle)
        mask    r13, r6, $FFFF  ; r13 = r6 & 0xFFFF;                       (1 cycle)
        jmp.n   r1              ; return to caller after next instruction  (1 cycle)
        or      r13, r13, r10   ; r13 = r13 | r10;                         (1 cycle)
; done: 30 cycles total (without short circuit code)
        .end

; invoking dmulu:
; it is the caller's responsibility to save registers r2-13, which
; the caller may or may not need to do...
        or      r25, r0, #dmuluLo16   ; r25 = low 16 bits of dmulu address               (1 cycle)
        or.u    r25, r25, #dmuluHi16  ; r25 = r25 | high 16 bits of dmulu address        (1 cycle)
        ld      r2, r30, #factor1     ; r2 = *(framePtr + offset of factor1)             (1 cycle)
        jsr.n   r25                   ; call dmulu after next instruction                (1 cycle)
        ld      r3, r30, #factor2     ; r3 = *(framePtr + offset of factor2)             (1 cycle)
        st.d    r12, r30, #product    ; *(framePtr + offset of product) = r12, r13

; register to register: 1 cycle call excluding execution time for dmulu
; memory to memory:     6 cycle call excluding execution time for dmulu
; GRAND TOTAL register to register: 31 cycles
; GRAND TOTAL memory to memory:     36 cycles

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me.
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>
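
For readers who do not have an 88000 manual at hand, the partial-product scheme used by the dmulu listing above can be sketched in C. This is only an illustration under stated assumptions, not the code from the post: the names dmulu_sketch and dmulu_selfcheck are hypothetical, and the sketch uses nothing wider than 32 bits for the multiply itself (the 64-bit type appears only in the self-check). One point worth noting: the high word needs the upper 16 bits of the middle sum in addition to the carry, and the listing as it survives here does not show that addition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the post): 32x32 -> 64 unsigned multiply
   built from four 16x16 -> 32 partial products, in the same spirit as the
   dmulu listing.  The result is returned as two 32-bit halves. */
static void dmulu_sketch(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    /* split each factor into 16-bit halves (extu / mask in the listing) */
    uint32_t ah = a >> 16, al = a & 0xFFFFu;
    uint32_t bh = b >> 16, bl = b & 0xFFFFu;

    /* four partial products, each fits in 32 bits */
    uint32_t ll = al * bl;
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* middle column: this first add cannot overflow
       (at most 0xFFFE0001 + 0xFFFF = 0xFFFF0000) */
    uint32_t t = lh + (ll >> 16);
    /* the second add may wrap, so recover the carry by comparison */
    uint32_t mid = t + hl;
    uint32_t carry = (mid < t) ? 1u : 0u;

    *lo = (mid << 16) | (ll & 0xFFFFu);
    /* high word: hh plus the upper half of the middle sum plus the carry,
       which has weight 2^16 within the high word */
    *hi = hh + (mid >> 16) + (carry << 16);
}

/* Hypothetical self-check against the compiler's own 64-bit multiply. */
static void dmulu_selfcheck(void)
{
    static const uint32_t v[] = {
        0u, 1u, 0xFFFFu, 0x10000u, 0x20000u, 0x12345678u, 0xFFFFFFFFu
    };
    const int n = (int)(sizeof v / sizeof v[0]);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            uint32_t hi, lo;
            uint64_t want = (uint64_t)v[i] * v[j];

            dmulu_sketch(v[i], v[j], &hi, &lo);
            assert(hi == (uint32_t)(want >> 32));
            assert(lo == (uint32_t)want);
        }
    }
}
```

Calling dmulu_selfcheck() from a test program exercises the carry case (0xFFFFFFFF times itself) and a case where the middle sum spills into the high word (0xFFFF times 0x20000).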