Path: utzoo!bnr-vpa!bnr-rsc!drago From: drago@bnr-rsc.UUCP (Drago Goricanec) Newsgroups: comp.arch Subject: Re: John von Neumann, sqrt instr Message-ID: <911@bnr-rsc.UUCP> Date: 21 Aug 89 01:48:12 GMT References: <21353@cup.portal.com> <25643@obiwan.mips.COM> <1513@l.cc.purdue.edu> <2376@wyse.wyse.com> <3312@blake.acs.washington.edu> <12640@pur-ee.UUCP> Reply-To: drago@bnr-rsc.UUCP (Drago Goricanec) Distribution: na Organization: Bell-Northern Research, Ottawa, Canada Lines: 91 In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: >In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes: >>5) I'm not a big user of square roots. As long as square roots cost >> less than 100 multiplies, I don't think I'd trade faster square roots >> for slower performance in other ops. >... >Doing this all in software (and restricting ourselves to integer for >simplicity), we managed to get divide to be FASTER than multiply... but it >doesn't give you modulus/remainder for free. Yeah, it surprised me too. I >still haven't been able to find a software technique to make square root >comparably fast. Anybody got a screamer of a software technique for >computing square root? Here's a method that computes the square root directly (ie. this is not an approximation method). I don't remember what it's called, but I originally saw the method done on decimal numbers. The algorithm is much simpler on binary numbers. It resembles long division on paper. I'll present it in 68030 code, then describe it. I apologize for not using a higher level language, but this algorithm is most suited for an environment where bits can be manipulated directly. The code here is for discussion purposes only, and can be optimized. ; D1 contains the 31-bit argument whose square root is computed ; D0 will contain the 16-bit square root on exit ; D2, D3 are scratch registers ; D4 is a loop counter mov.b #15, D4 clr.l D2 ; initialize clr.l D3 clr.l D0 loop lsl.l #1, D1 ; shift D1 into D2 left 2 rol.l #1, D2 lsl.l #1, D1 rol.l #1, D2 add.l D3, D3 ; same as shift D3 left 1 add.l D0, D0 cmp.l D2, D3 ; is D3 < D2? bcc skip ; skip if not sub.l D3, D2 ; D2 := (D2 - D3 - 1) subq.l #1, D2 addq.l #2, D3 addq.l #1, D0 ; set bit in result skip subq.b #1, D4 bpl loop ; while D4 >= 0 ; at this point, D0 contains sqrt( D1 ) ; original contents of D1 lost The reason D1 contains a 31-bit, and not 32-bit result, is that the most significant bit could be lost in the shifts above. The code above was not tested, but has been derived from the C program below, which does work. #include main() { unsigned int D2, D3, D0, num; int i; scanf( "%d", &num ); D2 = D3 = D0 = 0; for( i=0; i<16; i++ ) { D2 <<= 2; D2 |= (num >> 30) & 0x03; num <<= 2; D3 <<= 1; D0 <<= 1; if( D3