Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!ucla-cs!math.ucla.edu!euphemia!pmontgom
From: pmontgom@euphemia.math.ucla.edu (Peter Montgomery)
Newsgroups: comp.arch
Subject: Re: int x int -> long for * (or is it 32x32->64)
Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic
Message-ID: <353@kaos.MATH.UCLA.EDU>
Date: 12 Sep 90 08:20:00 GMT
References: <3755@osc.COM> <4513@taux01.nsc.com> <119244@linus.mitre.org> <6837.26e7ee92@vax1.tcd.ie> <119612@linus.mitre.org> <3977@bingvaxu.cc.binghamton.edu> <119733@linus.mitre.org> <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM>
Sender: news@MATH.UCLA.EDU
Organization: UCLA Mathematics Dept.
Lines: 395

In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>I'm confused about the following exchange:
>
>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:
>...
>>>amount of it. Typically, only having 32 x 32 --> 32 slows down
>>>multi-precision arithmetic by about a factor of 10. Not having it
>
>>Well that ``factor of 10'', in my simple-minded way, sounded a bit
>>too far off the mark (of course it only _asymptote_ to less than 3 for sure).
>>So I ran my littul benchmark. The results for VAX 8750 and Sparc,
>>using a variety of compilers and optimization levels are as
>>follows:
>....
>>The difference? Benchmark 2 performed all calculations using
>>int x int -> int while benchmark 1 used both int x int -> int
>>and long x long -> long.
>
>
>I'm confused.  Since some fairly strong results are being stated
>(that Silverman is way off; anyone can be off, but on the other hand,
>Silverman certainly knows this problem domain), it would help
>to understand exactly what is being measured. In particular,
>I'm confused becasue Silverman clearly wished for 32x32->64,
>and the posted results talk about int x int -> int, long x long -> long.
>
>Unless I completely misunderstand the posting, I can't find any
>connection whatsoever between the measurements of an apparently portable
>C program on these machines, and Silverman's clear statement of wishing for
>32 x 32 -> 64 bit product for a 10X difference in his work.
>How about posting the benchmark, so we can understand what's happening?
>It might especially be good to post a sample of the disassembled code that's
>supposed to be doing 32x32->64, if that is truly supposed to be happening.

	I do many of the same types of computations as Silverman
(and, like he, some of my processes use a month of CPU time or more).
But I dispute his 10X figure.  Here is a benchmark which tries
to compute the remainders i*j mod p where 0 <= i, j < p, using three
algorithms expressible in C and also an assembly code
algorithm (which, on the SUN 3, uses the 64-bit multiply and divide
instructions).  Those instructions perform five times as fast as
the best of the C algorithms, but my SPARC assembly code
is only slightly bettter than the best of the C algorithms.
[If someone can significantly improve my SPARC assembly code, 
let me know.]

#! /bin/csh -f
#
#	This benchmark is intended to measure how much 64-bit 
#	integer multiplication and division hardware affect
#	the computation of i*j mod p, where 0 <= i, j, < p < 2**31.  
#	The task is done four times, via C routines MUL1, MUL2, MUL3,
#	and assembly routine MUL4.  A probable primality test
#	(based on Fermat's little Theorem) is used to rate the algorithms.
#
#	MUL1 uses additions, subtractions, and bit-wise operations,
#	(no multiplication or division), analyzing one bit at a time. 
#	It provides a benchmark for rating the machine's raw speed.
#
#	MUL2 uses floating point arithmetic to estimate the quotient
#	i*j/p, followed by unsigned long integer arithmetic (modulo 2^32 or
#	other power of 2) to compute the remainder and possibly adjust it.
#
#	MUL3 breaks the operands up into 16-bit pieces.
#
#	MUL4 is a machine-dependent, assembly-language program.
#	Presently SUN 3 and SUN 4 (SPARC) versions are included.
#	If the machine has 64-bit integer multiply and divide instructions,
#	the MUL1/MUL4 ratio estimates the benefit achieved by
#	using these instructions (when adjusted for compiler's
#	level of sophistication).
#
#	Approximate times:
#
#	       SUN 3/280	     SUN 4/260	   (both with FPUs)
#      gcc 1.37.1     cc     gcc 1.37.1	    cc
# MUL1	    10.40  13.52           3.67	  2.92     (simple integer arithmetic)
# MUL2	     9.00  10.38           2.03   2.05	   (floating point arithmetic)
# MUL3	    10.14  11.30           6.11	  6.32     (break into 16-bit pieces)
# MUL4	     1.88   1.98 	   1.82	  1.83     (assembly code)
#
#	My optimized assembly code for the SUN 4 does little better
#	than my straightforward assembly code for the SUN3,
#	even though the SUN 4's raw CPU speed is 3.5 - 5 times
#	that of the SUN 3 (as estimated by the MUL1 and MUL2 times).
#	This is an application on which the SPARC does not do
#	as well as we want.  Residue (modular) arithmetic is 
#	important to some of my applications -- see, for example,
#	"An FFT extension to the P-1 factoring algorithm"
#	by Robert D. Silverman and me, in the April, 1990 Math. Comp.
#
#		Peter L. Montgomery
#		Department of Mathematics
#		University of California
#		Los Angeles, CA 90024-1555
#		pmontgom@math.ucla.edu
#		September, 1990

cat << end_program >! modprd.c

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

typedef unsigned long ulong;
typedef          long slong; 

long lrand48();		/* Random number generator (SUN-specific) */
ulong MUL4();		/* Assembly routine */

ulong MUL1(i, j, p)
ulong i, j, p;
{
    register ulong ii, jj, rem = 0;

    if (i < j) {ii = i; jj = j;} else {ii = j; jj = i;}
		/* ii = MIN(i, j), 	jj = MAX(i,j)	*/

    while (ii != 0) {		/* Answer is (ii*jj + rem) mod p */
	if ((ii & 1) != 0) {
	    rem += jj;
	    if (rem >= p) rem -= p;
        }
        jj <<= 1;
        if (jj >= p) jj -= p;
	ii >>= 1;
    } 
    return rem;
}

ulong MUL2(i, j, p)
ulong i, j, p;
{
    register ulong quot, rem;

    quot = (ulong)(slong) (  (double)(slong)i * (double)(slong)j 
		            /(double)(slong)p + 0.5);
				/* true quotient, or one too high */
    rem = i*j - p*quot;		/* remainder as signed integer */
    if ((slong)rem <= -1) rem += p;
	/* The above test could be "< 0", but some compilers optimize 
	   "< 0" incorrectly, using the condition code from the subtract.  */
    return rem;
}

ulong MUL3(i, j, p) 
ulong i, j, p;
{
    if (p < 65536) {
	return (i*j) % p;	/* simple case */
    } else {
   	ulong itop = i >> 16, jtop = j >> 16;  /* fit in 15 bits */
   	ulong ibot = i & 0xffff, jbot = j & 0xffff;

   	register ulong prdtop = itop*jtop, prdbot = ibot*jbot;
   	register ulong prdmid = prdtop + prdbot - (itop - ibot)*(jtop - jbot);
		/* i*j = prdtop*2^32 + prdmid*2^16 + prdbot */
		/* 0 <= prdtop < p^2/2^32  < 2^30 */
		/* 0 <= prdbot, prdmid <= 2^32 - 2^17 + 1 */
	register ulong pnorm = p;	/* Multiple of p */
	register ulong pntop, pnbot;

 	while (pnorm < (1L<<30)) pnorm <<= 1;  /* 2^30 <= pnorm < 2^31 */
	pntop = pnorm >> 16;
	pnbot = pnorm & 0xffff;

	prdmid += (prdbot >> 16); prdbot &= 0xffff;	/* Accumulate carries */
	prdtop += (prdmid >> 16); prdmid &= 0xffff;
        while (prdtop > pntop) {
	    register ulong quot = prdtop/(pntop + 1);	/* Fits in 16 bits */
	    register ulong tmpmid = quot*pnbot;
	    prdtop -= quot*pntop;
	    if (prdmid < tmpmid) prdtop -= 65536;
	    prdmid -= tmpmid;
	    prdtop += (prdmid >> 16); prdmid &= 0xffff;
	}
	prdmid += prdtop << 16;
	while (prdmid > pntop) {
	    register ulong quot = prdmid/(pntop + 1);
	    register ulong tmpbot = quot*pnbot;
	    prdmid -= quot*pntop;
	    if (prdbot < tmpbot) prdmid -= 65536;
	    prdbot -= tmpbot;
	    prdmid += (prdbot >> 16); prdbot &= 0xffff;
	}	
	prdbot += (prdmid << 16);
	return prdbot%p;
    }
}

double CP_TIME()
{ 		/* Return CP time used so far, in seconds */
    long sec, usec;
    struct rusage inf;

    getrusage(RUSAGE_SELF, &inf);
    sec	 = inf.ru_utime.tv_sec	+ inf.ru_stime.tv_sec;
    usec = inf.ru_utime.tv_usec + inf.ru_stime.tv_usec;
    return (double)sec + 1.0e-6*(double)usec;
}

ulong MODEXP(base, expon, p, func)
ulong base, expon, p;
ulong (*func)();	
{		/* Modular exponentiation, base^expon mod p */
    if (expon == 0) {
	return 1;
    } else {
	register ulong b = base, ipow2 = 1;
	while ((ipow2 <<1) <= expon) ipow2 <<= 1;
			/* highest power of 2 <= expon */
	while ((ipow2 >>= 1) != 0) {
	    /* Answer = [b^(2*ipow2) * base^(e % (2*ipow2))] mod p */
	    b = func(b, b, p);
	    if (expon & ipow2) b = func(b, base, p);
	}	
   	return b;
    } 
}

void TIME(func, name)
ulong (*func)();	/* MUL1, MUL2, MUL3, or MUL4 */
char *name;
{
    double t1 = CP_TIME(), t2;
    ulong p, nprime = 0;
/*
	Search for primes in the interval (10^9, 10^9 + 10000),
	using a pseudoprime test to base 1234567.
	Count the number of successes.
*/
    for (p = 1000000001; p <= 1000010000; p += 2) {
	ulong test = MODEXP(1234567L, (p-1)/2, p, func);
	if (test == 1 || test == p-1) nprime++;
    }
	
    t2 = CP_TIME();
    printf("%s found %ld probable primes in %10.2f seconds.\n", 
		name, nprime, t2-t1);
}

int main(argc, argv)
int argc;
char **argv;
{
    int itest;
    srand48(1990);		/* Set seed */
    for (itest = 1; itest <= 200; itest++) {
        ulong n1, n2, p;
	ulong ans1, ans2, ans3, ans4;

	while ((p = lrand48()) < 1000) {};
	n1 = lrand48() % p;	
	n2 = lrand48() % p;	
        ans1 = MUL1(n1, n2, p);
        ans2 = MUL2(n1, n2, p);
        ans3 = MUL3(n1, n2, p);
        ans4 = MUL4(n1, n2, p);
	if (ans1 != ans2 || ans2 != ans3 || ans3 != ans4) {
	    printf("Inconsistent results: n1 = %ld, n2 = %ld, p = %ld\n", 
				n1, n2, p);
	    printf("  ans1 = %ld, ans2 = %ld, ans3 = %ld, ans4 = %ld\n",
			        ans1, ans2, ans3, ans4);
	}
    }	/* for itest */
    TIME(MUL1, "MUL1");
    TIME(MUL2, "MUL2");
    TIME(MUL3, "MUL3");
    TIME(MUL4, "MUL4");
    return 0;
}
end_program
cat << end_sun3assem >! sun3assem.s
| 	iprod = MUL4(i, j, p)		Returns i*j mod p
	.globl	_MUL4			
_MUL4:					| Uses only d0, d1
	movl	sp@(4),d1		| d1 = i
	mulul   sp@(8),d0:d1		| Get 64-bit product i*j
	divul   sp@(12),d0:d1		| Divide by p, get remainder in d0
	rts				| Return
end_sun3assem
cat << end_sun4assem >! sun4assem.s
!		SPARC assembly version of MUL4(i, j, p)   (i*j mod p)
#define MULT1(prod,src) mulscc prod,src,prod
#define MULT2(prod,src) MULT1(prod,src);MULT1(prod,src)
#define MULT4(prod,src) MULT2(prod,src);MULT2(prod,src)
#define MULT8(prod,src) MULT4(prod,src);MULT4(prod,src)
#define MULT16(prod,src) MULT8(prod,src);MULT8(prod,src)

/*
	When dividing  divhi*2^n + divlo  by  divsor,
	where 0 <= divhi < divsor  and	0 <= divlo < 2^n,
	the bottom half of the dividend (divlo) should
	be left-justified (i.e., pre-multiplied by 2^(32-n)).
	After n executions of the REMAINDER_STEP macro,
	register "divhi" will have the remainder.
*/

#define REMAINDER_STEP(divhi,divlo,divsor) \
	addcc	divlo,divlo,divlo  /* Shift low dividend */ ;\
	addx	divhi,divhi,divhi  /* Shift high dividend */ ;\
	cmp	divhi,divsor	   /* Compare dividend and divisor */ ;\
	bgeu,a	1f		   /* if ge, branch */ ;\
	sub	divhi,divsor,divhi /* but first subtract divisor */; \
1:

#define MUL4_step REMAINDER_STEP(%o5,%o4,%o2)

	.global _MUL4		! iprod = MUL4(i, j, p)		(leaf procedure)

_MUL4: 				! %o0 = i, %o1 = j, %o2 = p
	subcc	%o0,%o1,%o3	! i-j
	bgeu	1f		! Branch if i >= j
	nop
	add	%o1,%o3,%o1	! %o1 = j + (i-j) = i = MIN(i,j)
	sub	%o0,%o3,%o0	! %o0 = i - (i-j) = j = MAX(i,j)
1:
	mov	%o1,%y		! set multiplier (y reg) to MIN(i,j)
	srl	%o1,16,%o4
	tst	%o4
	bz	mul4_16		! Branch if multiplier fits in 16 bits
	andncc	%o4,0xff,%g0
	bz	mul4_24
	orcc	%g0,%g0,%o5	! Initialize upper product (delay)

	MULT16(%o5,%o0)
	MULT16(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	mov	%y,%o4		! Get lower product

	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x24:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x16:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x8:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	retl			! Return from leaf
	mov	%o5,%o0		! Move remainder (delay)

mul4_24:
	MULT16(%o5,%o0)
	MULT8(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x24
	mov	%y,%o4		! and get lower product (delay)

mul4_16:
	andncc	%o1,0xff,%g0	! Test if one operand is 8 bits or less
	bz	mul4_8
	orcc	%g0,%g0,%o5	! Initialize upper product (delay)

	MULT16(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x16		
	mov	%y,%o4		! and get lower product

mul4_8:
	MULT8(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x8
	mov	%y,%o4		! and get lower product
end_sun4assem

# ********************** Compile and execute the program ********************

if ("`arch`" == "sun3") then
	as -o assem.o sun3assem.s
	gcc -O modprd.c assem.o
	echo "SUN 3 gcc times"
	a.out
	echo "SUN 3 cc times"
	cc -O4 modprd.c assem.o
	a.out
else if("`arch`" == "sun4") then
	as -o assem.o -P sun4assem.s
	echo "SUN 4 gcc times"
	gcc -O modprd.c assem.o
	a.out
	echo "SUN 4 cc times"
	cc -O4 modprd.c assem.o
	a.out
else 
	echo "Unknown architecture - `arch`"
endif
--
        Peter L. Montgomery 
        pmontgom@MATH.UCLA.EDU 
        Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.