Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!hp4nl!charon!dik From: dik@cwi.nl (Dik T. Winter) Newsgroups: comp.arch Subject: Re: int x int -> long for * (or is it 32x32->64) Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic Message-ID: <2118@charon.cwi.nl> Date: 14 Sep 90 00:41:25 GMT References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU> Sender: news@cwi.nl Organization: CWI, Amsterdam Lines: 93 In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes: > [If someone can significantly improve my SPARC assembly code, > let me know.] Not significantly, but improvement is possible, see below. > (Leaving out Sun3 and gcc timings the latter are only really relevant for MUL1, MUL2 and MUL3). > # Approximate times: > # > # SUN 4/260 (both with FPUs) > # cc > # MUL1 2.92 (simple integer arithmetic) > # MUL2 2.05 (floating point arithmetic) > # MUL3 6.32 (break into 16-bit pieces) > # MUL4 1.83 (assembly code) (One big note: the verification of results requires MUL4. So large improvements in MUL4 will have effects on timings for other MUL's too!) Below the results I obtained on a SPARCstation 1 (4/60), SPARCstation SLC (4/?) and a SPARCstation 1+ (4/65). Even the ratios are very different! For MUL4 I give three timings, see further down for the reason. SPARC-1 SPARC-SLC SPARC-1+ MUL1 3.98 3.97 3.18 MUL2 2.98 3.22 2.40 MUL3 23.27 10.02 8.09 MUL4 2.47/2.24/2.14 2.45/2.23/2.14 1.97/1.78/1.71 Note that for all three machines the same compiler was used (the SLC compiler), but that the libraries for the individual machines were linked in. The SPARC-1 still has SunOS 4.0.3, the others have SunOS 4.1; that might make a difference. Now the reason for three timings for MUL4. 1. The first timing is for the assembly code of Peter Montgomery. A quick analysis reveals that it uses 1 cycle per bit for the multiply and 5 cycles per bit for the remaindering. (Plus overhead of course.) (And I use 7 cycles for division plus remainder, which cannot be improved in my opinion.) 2. In Peter Montgomery's code there are a lot of annulled branches. The instruction in the delay slot is always executed (and takes time), although the result is voided at times. By a rearrangement of the instructions this can be avoided in most cases, resulting in an expected 4.5 cycles per bit for remaindering. That gives the second timings. 3. It is possible to do the remaindering in 4 cycles per bit, hence the third timings. (The program jumps wildly through the code, but that is no problem.) Now I am going to do something that is not done: comparing apples to oranges. (And back to the architecture issue!) SPARC: Optimum if remainder only is required: 4 cycles per bit; if also the quotient is required: add 3 cycles per bit. Multiply requires 1 cycle per bit. MIPS: Multiply is 10 cycles (I found this somewhere in Kanes book, but can not find it now). Divide for small numbers (i.e. 32/32) is a single instruction, for which I cannot find the timing now. Larger numbers require more cycles. My code uses 9 cycles per bit, which can be reduced to 6 if only the remainder is required. Using hardware divide might give some benefits here. Note that hardware multiply and hardware divide are free running, i.e. other operations can be performed, but the benefit would be very limited in this case. RTPC: (Not really RISC in this aspect, just like mips.) Multiply 2 cycles per bit (4 cycles per multiply step) and divide 3 cycles per bit (a single divide step). My third timings were based on the (well-known) process that is also used in the RTPC divide step instruction. You need a single status bit (RTPC uses carry). The divide step is as follows: 1. If status bit set: subtract divisor, clear status bit if borrow is generated. subtract one from quotient. else add divisor, set status bit if carry is generated. add one to quotient. 2. Shift dividend and quotient one bit. To use it you put the dividend and divisor in the proper registers, set the status bit and execute the required number of divide steps. Finally, if the status bit is clear you add the divisor to the dividend and increment the quotient. Now some (my) opinions: Sun would have done well to include a divide step as outlined above in its architecture. If done in a single cycle it would have reduced Peter Montgomery's timings by about 60%. (And the operation is not very dissimilar from the multiply step.) Mips would have done well to include a 64/32 bit divide in addition to the 32/32 bit divide. I do not think that would have added much complexity to the processor. (The fact that mips has no status bits, and so special instructions have to be performed to obtain status makes the mips divide code slower than the sparc divide code.) IBM with their RTPC did very well in this aspect. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl