Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!hp4nl!charon!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: int x int -> long for * (or is it 32x32->64)
Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic
Message-ID: <2118@charon.cwi.nl>
Date: 14 Sep 90 00:41:25 GMT
References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 93

In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes:
 > [If someone can significantly improve my SPARC assembly code, 
 > let me know.]
Not significantly, but improvement is possible, see below.
 > 
(Leaving out Sun3 and gcc timings the latter are only really relevant
for MUL1, MUL2 and MUL3).
 > #	Approximate times:
 > #
 > #	       SUN 4/260	   (both with FPUs)
 > #      	   cc
 > # MUL1	  2.92     (simple integer arithmetic)
 > # MUL2	  2.05	   (floating point arithmetic)
 > # MUL3	  6.32     (break into 16-bit pieces)
 > # MUL4	  1.83     (assembly code)
(One big note: the verification of results requires MUL4.  So large improvements
in MUL4 will have effects on timings for other MUL's too!)

Below the results I obtained on a SPARCstation 1 (4/60), SPARCstation SLC (4/?)
and a SPARCstation 1+ (4/65).  Even the ratios are very different!
For MUL4 I give three timings, see further down for the reason.
	   SPARC-1	   SPARC-SLC	   SPARC-1+
MUL1	     3.98	     3.97	     3.18
MUL2	     2.98	     3.22	     2.40
MUL3	    23.27	    10.02	     8.09
MUL4	2.47/2.24/2.14	2.45/2.23/2.14	1.97/1.78/1.71

Note that for all three machines the same compiler was used (the SLC compiler),
but that the libraries for the individual machines were linked in.  The SPARC-1
still has SunOS 4.0.3, the others have SunOS 4.1; that might make a difference.

Now the reason for three timings for MUL4.
1.	The first timing is for the assembly code of Peter Montgomery.
	A quick analysis reveals that it uses 1 cycle per bit for the
	multiply and 5 cycles per bit for the remaindering.  (Plus overhead
	of course.)  (And I use 7 cycles for division plus remainder, which
	cannot be improved in my opinion.)
2.	In Peter Montgomery's code there are a lot of annulled branches.
	The instruction in the delay slot is always executed (and takes time),
	although the result is voided at times.  By a rearrangement of the
	instructions this can be avoided in most cases, resulting in an
	expected 4.5 cycles per bit for remaindering.  That gives the
	second timings.
3.	It is possible to do the remaindering in 4 cycles per bit, hence the
	third timings.  (The program jumps wildly through the code, but that
	is no problem.)

Now I am going to do something that is not done: comparing apples to oranges.
(And back to the architecture issue!)
SPARC:	Optimum if remainder only is required: 4 cycles per bit; if also
	the quotient is required: add 3 cycles per bit.  Multiply requires
	1 cycle per bit.
MIPS:	Multiply is 10 cycles (I found this somewhere in Kanes book, but
	can not find it now).  Divide for small numbers (i.e. 32/32) is a
	single instruction, for which I cannot find the timing now.  Larger
	numbers require more cycles.  My code uses 9 cycles per bit, which
	can be reduced to 6 if only the remainder is required.  Using hardware
	divide might give some benefits here.  Note that hardware multiply and
	hardware divide are free running, i.e. other operations can be
	performed, but the benefit would be very limited in this case.
RTPC:	(Not really RISC in this aspect, just like mips.)  Multiply 2 cycles
	per bit (4 cycles per multiply step) and divide 3 cycles per bit (a
	single divide step).

My third timings were based on the (well-known) process that is also used in
the RTPC divide step instruction.  You need a single status bit (RTPC uses
carry).  The divide step is as follows:
1.	If status bit set:
		subtract divisor, clear status bit if borrow is generated.
		subtract one from quotient.
	else
		add divisor, set status bit if carry is generated.
		add one to quotient.
2.	Shift dividend and quotient one bit.
To use it you put the dividend and divisor in the proper registers, set the
status bit and execute the required number of divide steps.  Finally, if the
status bit is clear you add the divisor to the dividend and increment
the quotient.

Now some (my) opinions:
Sun would have done well to include a divide step as outlined above in its
architecture.  If done in a single cycle it would have reduced Peter
Montgomery's timings by about 60%.  (And the operation is not very dissimilar
from the multiply step.)
Mips would have done well to include a 64/32 bit divide in addition to the
32/32 bit divide.  I do not think that would have added much complexity to
the processor.  (The fact that mips has no status bits, and so special
instructions have to be performed to obtain status makes the mips divide code
slower than the sparc divide code.)
IBM with their RTPC did very well in this aspect.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl