Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!ucla-cs!math.ucla.edu!euphemia!pmontgom From: pmontgom@euphemia.math.ucla.edu (Peter Montgomery) Newsgroups: comp.arch Subject: Re: int x int -> long for * (or is it 32x32->64) Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic Message-ID: <353@kaos.MATH.UCLA.EDU> Date: 12 Sep 90 08:20:00 GMT References: <3755@osc.COM> <4513@taux01.nsc.com> <119244@linus.mitre.org> <6837.26e7ee92@vax1.tcd.ie> <119612@linus.mitre.org> <3977@bingvaxu.cc.binghamton.edu> <119733@linus.mitre.org> <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> Sender: news@MATH.UCLA.EDU Organization: UCLA Mathematics Dept. Lines: 395 In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes: >I'm confused about the following exchange: > >In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: >... >>>amount of it. Typically, only having 32 x 32 --> 32 slows down >>>multi-precision arithmetic by about a factor of 10. Not having it > >>Well that ``factor of 10'', in my simple-minded way, sounded a bit >>too far off the mark (of course it only _asymptote_ to less than 3 for sure). >>So I ran my littul benchmark. The results for VAX 8750 and Sparc, >>using a variety of compilers and optimization levels are as >>follows: >.... >>The difference? Benchmark 2 performed all calculations using >>int x int -> int while benchmark 1 used both int x int -> int >>and long x long -> long. > > >I'm confused. Since some fairly strong results are being stated >(that Silverman is way off; anyone can be off, but on the other hand, >Silverman certainly knows this problem domain), it would help >to understand exactly what is being measured. In particular, >I'm confused becasue Silverman clearly wished for 32x32->64, >and the posted results talk about int x int -> int, long x long -> long. > >Unless I completely misunderstand the posting, I can't find any >connection whatsoever between the measurements of an apparently portable >C program on these machines, and Silverman's clear statement of wishing for >32 x 32 -> 64 bit product for a 10X difference in his work. >How about posting the benchmark, so we can understand what's happening? >It might especially be good to post a sample of the disassembled code that's >supposed to be doing 32x32->64, if that is truly supposed to be happening. I do many of the same types of computations as Silverman (and, like he, some of my processes use a month of CPU time or more). But I dispute his 10X figure. Here is a benchmark which tries to compute the remainders i*j mod p where 0 <= i, j < p, using three algorithms expressible in C and also an assembly code algorithm (which, on the SUN 3, uses the 64-bit multiply and divide instructions). Those instructions perform five times as fast as the best of the C algorithms, but my SPARC assembly code is only slightly bettter than the best of the C algorithms. [If someone can significantly improve my SPARC assembly code, let me know.] #! /bin/csh -f # # This benchmark is intended to measure how much 64-bit # integer multiplication and division hardware affect # the computation of i*j mod p, where 0 <= i, j, < p < 2**31. # The task is done four times, via C routines MUL1, MUL2, MUL3, # and assembly routine MUL4. A probable primality test # (based on Fermat's little Theorem) is used to rate the algorithms. # # MUL1 uses additions, subtractions, and bit-wise operations, # (no multiplication or division), analyzing one bit at a time. # It provides a benchmark for rating the machine's raw speed. # # MUL2 uses floating point arithmetic to estimate the quotient # i*j/p, followed by unsigned long integer arithmetic (modulo 2^32 or # other power of 2) to compute the remainder and possibly adjust it. # # MUL3 breaks the operands up into 16-bit pieces. # # MUL4 is a machine-dependent, assembly-language program. # Presently SUN 3 and SUN 4 (SPARC) versions are included. # If the machine has 64-bit integer multiply and divide instructions, # the MUL1/MUL4 ratio estimates the benefit achieved by # using these instructions (when adjusted for compiler's # level of sophistication). # # Approximate times: # # SUN 3/280 SUN 4/260 (both with FPUs) # gcc 1.37.1 cc gcc 1.37.1 cc # MUL1 10.40 13.52 3.67 2.92 (simple integer arithmetic) # MUL2 9.00 10.38 2.03 2.05 (floating point arithmetic) # MUL3 10.14 11.30 6.11 6.32 (break into 16-bit pieces) # MUL4 1.88 1.98 1.82 1.83 (assembly code) # # My optimized assembly code for the SUN 4 does little better # than my straightforward assembly code for the SUN3, # even though the SUN 4's raw CPU speed is 3.5 - 5 times # that of the SUN 3 (as estimated by the MUL1 and MUL2 times). # This is an application on which the SPARC does not do # as well as we want. Residue (modular) arithmetic is # important to some of my applications -- see, for example, # "An FFT extension to the P-1 factoring algorithm" # by Robert D. Silverman and me, in the April, 1990 Math. Comp. # # Peter L. Montgomery # Department of Mathematics # University of California # Los Angeles, CA 90024-1555 # pmontgom@math.ucla.edu # September, 1990 cat << end_program >! modprd.c #include #include #include typedef unsigned long ulong; typedef long slong; long lrand48(); /* Random number generator (SUN-specific) */ ulong MUL4(); /* Assembly routine */ ulong MUL1(i, j, p) ulong i, j, p; { register ulong ii, jj, rem = 0; if (i < j) {ii = i; jj = j;} else {ii = j; jj = i;} /* ii = MIN(i, j), jj = MAX(i,j) */ while (ii != 0) { /* Answer is (ii*jj + rem) mod p */ if ((ii & 1) != 0) { rem += jj; if (rem >= p) rem -= p; } jj <<= 1; if (jj >= p) jj -= p; ii >>= 1; } return rem; } ulong MUL2(i, j, p) ulong i, j, p; { register ulong quot, rem; quot = (ulong)(slong) ( (double)(slong)i * (double)(slong)j /(double)(slong)p + 0.5); /* true quotient, or one too high */ rem = i*j - p*quot; /* remainder as signed integer */ if ((slong)rem <= -1) rem += p; /* The above test could be "< 0", but some compilers optimize "< 0" incorrectly, using the condition code from the subtract. */ return rem; } ulong MUL3(i, j, p) ulong i, j, p; { if (p < 65536) { return (i*j) % p; /* simple case */ } else { ulong itop = i >> 16, jtop = j >> 16; /* fit in 15 bits */ ulong ibot = i & 0xffff, jbot = j & 0xffff; register ulong prdtop = itop*jtop, prdbot = ibot*jbot; register ulong prdmid = prdtop + prdbot - (itop - ibot)*(jtop - jbot); /* i*j = prdtop*2^32 + prdmid*2^16 + prdbot */ /* 0 <= prdtop < p^2/2^32 < 2^30 */ /* 0 <= prdbot, prdmid <= 2^32 - 2^17 + 1 */ register ulong pnorm = p; /* Multiple of p */ register ulong pntop, pnbot; while (pnorm < (1L<<30)) pnorm <<= 1; /* 2^30 <= pnorm < 2^31 */ pntop = pnorm >> 16; pnbot = pnorm & 0xffff; prdmid += (prdbot >> 16); prdbot &= 0xffff; /* Accumulate carries */ prdtop += (prdmid >> 16); prdmid &= 0xffff; while (prdtop > pntop) { register ulong quot = prdtop/(pntop + 1); /* Fits in 16 bits */ register ulong tmpmid = quot*pnbot; prdtop -= quot*pntop; if (prdmid < tmpmid) prdtop -= 65536; prdmid -= tmpmid; prdtop += (prdmid >> 16); prdmid &= 0xffff; } prdmid += prdtop << 16; while (prdmid > pntop) { register ulong quot = prdmid/(pntop + 1); register ulong tmpbot = quot*pnbot; prdmid -= quot*pntop; if (prdbot < tmpbot) prdmid -= 65536; prdbot -= tmpbot; prdmid += (prdbot >> 16); prdbot &= 0xffff; } prdbot += (prdmid << 16); return prdbot%p; } } double CP_TIME() { /* Return CP time used so far, in seconds */ long sec, usec; struct rusage inf; getrusage(RUSAGE_SELF, &inf); sec = inf.ru_utime.tv_sec + inf.ru_stime.tv_sec; usec = inf.ru_utime.tv_usec + inf.ru_stime.tv_usec; return (double)sec + 1.0e-6*(double)usec; } ulong MODEXP(base, expon, p, func) ulong base, expon, p; ulong (*func)(); { /* Modular exponentiation, base^expon mod p */ if (expon == 0) { return 1; } else { register ulong b = base, ipow2 = 1; while ((ipow2 <<1) <= expon) ipow2 <<= 1; /* highest power of 2 <= expon */ while ((ipow2 >>= 1) != 0) { /* Answer = [b^(2*ipow2) * base^(e % (2*ipow2))] mod p */ b = func(b, b, p); if (expon & ipow2) b = func(b, base, p); } return b; } } void TIME(func, name) ulong (*func)(); /* MUL1, MUL2, MUL3, or MUL4 */ char *name; { double t1 = CP_TIME(), t2; ulong p, nprime = 0; /* Search for primes in the interval (10^9, 10^9 + 10000), using a pseudoprime test to base 1234567. Count the number of successes. */ for (p = 1000000001; p <= 1000010000; p += 2) { ulong test = MODEXP(1234567L, (p-1)/2, p, func); if (test == 1 || test == p-1) nprime++; } t2 = CP_TIME(); printf("%s found %ld probable primes in %10.2f seconds.\n", name, nprime, t2-t1); } int main(argc, argv) int argc; char **argv; { int itest; srand48(1990); /* Set seed */ for (itest = 1; itest <= 200; itest++) { ulong n1, n2, p; ulong ans1, ans2, ans3, ans4; while ((p = lrand48()) < 1000) {}; n1 = lrand48() % p; n2 = lrand48() % p; ans1 = MUL1(n1, n2, p); ans2 = MUL2(n1, n2, p); ans3 = MUL3(n1, n2, p); ans4 = MUL4(n1, n2, p); if (ans1 != ans2 || ans2 != ans3 || ans3 != ans4) { printf("Inconsistent results: n1 = %ld, n2 = %ld, p = %ld\n", n1, n2, p); printf(" ans1 = %ld, ans2 = %ld, ans3 = %ld, ans4 = %ld\n", ans1, ans2, ans3, ans4); } } /* for itest */ TIME(MUL1, "MUL1"); TIME(MUL2, "MUL2"); TIME(MUL3, "MUL3"); TIME(MUL4, "MUL4"); return 0; } end_program cat << end_sun3assem >! sun3assem.s | iprod = MUL4(i, j, p) Returns i*j mod p .globl _MUL4 _MUL4: | Uses only d0, d1 movl sp@(4),d1 | d1 = i mulul sp@(8),d0:d1 | Get 64-bit product i*j divul sp@(12),d0:d1 | Divide by p, get remainder in d0 rts | Return end_sun3assem cat << end_sun4assem >! sun4assem.s ! SPARC assembly version of MUL4(i, j, p) (i*j mod p) #define MULT1(prod,src) mulscc prod,src,prod #define MULT2(prod,src) MULT1(prod,src);MULT1(prod,src) #define MULT4(prod,src) MULT2(prod,src);MULT2(prod,src) #define MULT8(prod,src) MULT4(prod,src);MULT4(prod,src) #define MULT16(prod,src) MULT8(prod,src);MULT8(prod,src) /* When dividing divhi*2^n + divlo by divsor, where 0 <= divhi < divsor and 0 <= divlo < 2^n, the bottom half of the dividend (divlo) should be left-justified (i.e., pre-multiplied by 2^(32-n)). After n executions of the REMAINDER_STEP macro, register "divhi" will have the remainder. */ #define REMAINDER_STEP(divhi,divlo,divsor) \ addcc divlo,divlo,divlo /* Shift low dividend */ ;\ addx divhi,divhi,divhi /* Shift high dividend */ ;\ cmp divhi,divsor /* Compare dividend and divisor */ ;\ bgeu,a 1f /* if ge, branch */ ;\ sub divhi,divsor,divhi /* but first subtract divisor */; \ 1: #define MUL4_step REMAINDER_STEP(%o5,%o4,%o2) .global _MUL4 ! iprod = MUL4(i, j, p) (leaf procedure) _MUL4: ! %o0 = i, %o1 = j, %o2 = p subcc %o0,%o1,%o3 ! i-j bgeu 1f ! Branch if i >= j nop add %o1,%o3,%o1 ! %o1 = j + (i-j) = i = MIN(i,j) sub %o0,%o3,%o0 ! %o0 = i - (i-j) = j = MAX(i,j) 1: mov %o1,%y ! set multiplier (y reg) to MIN(i,j) srl %o1,16,%o4 tst %o4 bz mul4_16 ! Branch if multiplier fits in 16 bits andncc %o4,0xff,%g0 bz mul4_24 orcc %g0,%g0,%o5 ! Initialize upper product (delay) MULT16(%o5,%o0) MULT16(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results mov %y,%o4 ! Get lower product MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x24: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x16: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x8: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step retl ! Return from leaf mov %o5,%o0 ! Move remainder (delay) mul4_24: MULT16(%o5,%o0) MULT8(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x24 mov %y,%o4 ! and get lower product (delay) mul4_16: andncc %o1,0xff,%g0 ! Test if one operand is 8 bits or less bz mul4_8 orcc %g0,%g0,%o5 ! Initialize upper product (delay) MULT16(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x16 mov %y,%o4 ! and get lower product mul4_8: MULT8(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x8 mov %y,%o4 ! and get lower product end_sun4assem # ********************** Compile and execute the program ******************** if ("`arch`" == "sun3") then as -o assem.o sun3assem.s gcc -O modprd.c assem.o echo "SUN 3 gcc times" a.out echo "SUN 3 cc times" cc -O4 modprd.c assem.o a.out else if("`arch`" == "sun4") then as -o assem.o -P sun4assem.s echo "SUN 4 gcc times" gcc -O modprd.c assem.o a.out echo "SUN 4 cc times" cc -O4 modprd.c assem.o a.out else echo "Unknown architecture - `arch`" endif -- Peter L. Montgomery pmontgom@MATH.UCLA.EDU Department of Mathematics, UCLA, Los Angeles, CA 90024-1555 If I spent as much time on my dissertation as I do reading news, I'd graduate.