Path: utzoo!bnr-vpa!bnr-rsc!drago
From: drago@bnr-rsc.UUCP (Drago Goricanec)
Newsgroups: comp.arch
Subject: Re: John von Neumann, sqrt instr
Message-ID: <911@bnr-rsc.UUCP>
Date: 21 Aug 89 01:48:12 GMT
References: <21353@cup.portal.com> <25643@obiwan.mips.COM> <1513@l.cc.purdue.edu> <2376@wyse.wyse.com> <3312@blake.acs.washington.edu> <12640@pur-ee.UUCP>
Reply-To: drago@bnr-rsc.UUCP (Drago Goricanec)
Distribution: na
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 91

In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>>5) I'm not a big user of square roots.  As long as square roots cost
>>   less than 100 multiplies, I don't think I'd trade faster square roots
>>   for slower performance in other ops.
>...
>Doing this all in software (and restricting ourselves to integer for
>simplicity), we managed to get divide to be FASTER than multiply...  but it
>doesn't give you modulus/remainder for free.  Yeah, it surprised me too.  I
>still haven't been able to find a software technique to make square root
>comparably fast.  Anybody got a screamer of a software technique for
>computing square root?

Here's a method that computes the square root directly (ie. this is not
an approximation method).  I don't remember what it's called, but I
originally saw the method done on decimal numbers.  The algorithm is
much simpler on binary numbers.  It resembles long division on paper.

I'll present it in 68030 code, then describe it.  I apologize for not
using a higher level language, but this algorithm is most suited for an
environment where bits can be manipulated directly.  The code here is
for discussion purposes only, and can be optimized.

; D1 contains the 31-bit argument whose square root is computed
; D0 will contain the 16-bit square root on exit 
; D2, D3 are scratch registers
; D4 is a loop counter
	mov.b	#15, D4
        clr.l	D2	; initialize
	clr.l	D3
	clr.l	D0

loop	lsl.l	#1, D1	; shift D1 into D2 left 2
	rol.l	#1, D2
	lsl.l	#1, D1
	rol.l	#1, D2
	add.l	D3, D3	; same as shift D3 left 1
	add.l	D0, D0
	cmp.l	D2, D3	; is D3 < D2?
	bcc	skip	; skip if not
	sub.l	D3, D2	; D2 := (D2 - D3 - 1)
	subq.l	#1, D2
	addq.l	#2, D3
	addq.l	#1, D0	; set bit in result

skip	subq.b	#1, D4
	bpl	loop	; while D4 >= 0
	
; at this point, D0 contains sqrt( D1 )
; original contents of D1 lost

The reason D1 contains a 31-bit, and not 32-bit result, is that the most
significant bit could be lost in the shifts above.

The code above was not tested, but has been derived from the C program below,
which does work.

#include <stdio.h>

main()
{
	unsigned int D2, D3, D0, num;
	int i;

	scanf( "%d", &num );
	D2 = D3 = D0 = 0;
	for( i=0; i<16; i++ ) {
		D2 <<= 2;
		D2 |= (num >> 30) & 0x03;
		num <<= 2;
		D3 <<= 1;
		D0 <<= 1;
		if( D3<D2 ) {
			D2 -= (D3 + 1);
			D3 += 2;
			D0 += 1;
		}
	}
	printf( "%d\n", D0 );
}

Essentially, this algorithm will compute the square root of an n-bit
number in approximately n/2 iterations of the loop.  The bits are
shifted into the result as necessary.  To round the result, you need
to iterate one more time (taking care not to lose the most significant
bits in D2), and round up or down, based on the last computed bit in D0.

Drago
-- 
Drago Goricanec                              (...!bnr-vpa!bnr-rsc!drago)
Bell-Northern Research Ltd., Ottawa, Ontario              (613) 763-8957