Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!cs.utexas.edu!usc!orion.oac.uci.edu!uci-ics!ucla-cs!math.ucla.edu!oak!pmontgom
From: pmontgom@oak.math.ucla.edu (Peter Montgomery)
Newsgroups: comp.arch
Subject: New benchmark (was: Integer Multiply/Divide)
Message-ID: <2131@sunset.MATH.UCLA.EDU>
Date: 7 Jan 90 03:16:25 GMT
References: <84768@linus.UUCP> <4057@brazos.Rice.edu>
Sender: news@MATH.UCLA.EDU
Reply-To: pmontgom@math.ucla.edu (Peter Montgomery)
Organization: UCLA Mathematics Department
Lines: 206

In article <4057@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>
>So what's your favorite application that does lots of
>integer multiplies and divides?  Code it up and pass it out
>for people to run.
>
	During the recent controversy about the SPARC and integer 
multiply/divide, multiple individuals have requested benchmarks 
which compare machine performances on applications needing these
operations.  In another posting is my response to that challenge, 
written primarily in FORTRAN 77.  It times eleven algorithms 
(some integer, some floating point, but is biased towards ones needing 
integer multiplication and division).  A machine's rating is defined 
to be the geometric mean of the times on these algorithms.
So far I have run it on the SUN 3/50, SUN 3/280, SPARC, and Alliant.
Here are the outputs (each row corresponds to a machine and compiler;
each column corresponds to an algorithm):
---------------------------------------------------------------------------
Gener  = Generate matrices (integer arithmetic, trigonometry)
Dble   = Double precision floating point matrix multiplication
Sngl   = Single precision floating point matrix multiplication
Cmplx  = Complex (single precision) matrix multiplication
Intgr  = Integer matrix multiplication
Modlr  = Matrix multiplication modulo a prime number
TrlDiv = Primality test, using trial division through square root
PrbPrm = Probable prime test, using Fermat's Little Theorem
GCD    = Euclidean GCD algorithm
BinDec = Binary-to-Decimal conversion
Cfrac  = Continued fraction expansion of quadratic irrational
Geo mean = Geometric mean of above times

SUN 3/50 with 68881, optimized (f77 -O -f68881)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.86  3.76  3.54 43.70  1.56  1.86  12.56   1.48  3.30   4.82  2.60  3.55 sec
SUN 3/280 with 68881, optimized (f77 -O -f68881)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.56  2.84  2.56 30.16  0.94  1.04   7.44   1.12  1.96   2.38  1.54  2.24 sec
SUN 3/280 with inline 68881, optimized (f77 -O -f68881 /usr/lib/f68881.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.48  2.82  2.54 18.54  0.86  1.10   7.58   1.16  1.94   2.38  1.54  2.11 sec
SUN 3/280 with inline FPA, optimized (f77 -O -ffpa /usr/lib/ffpa.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.34  0.66  0.62  4.66  0.90  1.06   7.38   0.38  1.92   2.32  1.54  1.25 sec
SPARC station optimized (f77 -O)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.30  0.38  0.25  3.25  1.06  0.86  17.79   0.44  1.24   1.56  1.07  1.03 sec
SPARC station optimized, inline (f77 -O /usr/lib/libm.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.37  0.35  0.27  2.00  1.05  0.87  17.76   0.44  1.24   1.56  1.07  1.00 sec
Alliant FX/80, one processor, optimized, unvectorized (fortran -Og)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.09  0.91  0.85  1.47  0.86  1.21   8.14   0.37  1.90   1.14  1.72  1.02 sec
Alliant FX/80, one processor, optimized, vectorized (fortran -Ovg -AS)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.11  0.18  0.16  0.58  0.20  0.37   8.04   0.37  1.84   1.07  1.44  0.54 sec
Alliant FX/80, six processors, optimized, vectorized (fortran -O -AS)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.02  0.04  0.03  0.11  0.04  0.07   8.02   0.37  1.84   1.07  1.38  0.22 sec

-------------------------------------------------------------------------------

Comments about results:

	The SUN 3s have OS version 4.2.  The SPARCs have SUN OS 4.0.
The Alliant has Concentrix version 5.0 and FX/Fortran version V4.2.40.
The Alliant has eight advanced computational elements (ACEs),
configured into one cluster of six and two detached processors.

	The timings were made during the daytime, and may be affected 
by the load averages.  In particular, our Alliant is rarely idle.

	The entries are approximately in decreasing order by geometric 
mean time.  Geometric mean is fairer than arithmetic mean, because the 
relative order of the geometric means will remain unchanged if one 
application is run twice as far (and takes twice as long on all machines), 
whereas the relative order of the arithmetic means may change.

	When we compare the SUN 3/50 and SUN 3/280 with 68881 
(first two entries), the improvements for the eleven timings are rather
uniform: each takes 50%-75% as long on the 3/280 as on the 3/50.  
The use of an inline expansion template (/usr/lib/f68881.il)
speeds the complex matrix multiplication by 30% and helps the
trigonometric calculations (Gener), while making little change elsewhere.
Using the FPA on the 280 speeds the non-trigonometric floating point 
computations by a factor of 4, while leaving integer times almost unchanged 
(N.B. PrbPrm uses floating point arithmetic to estimate an integer quotient).

	Next compare the SPARC station to the SUN 3/280 with FPA.
The floating point times are further improved, as are the Euclidean GCD,
binary-to-decimal conversion, and continued fraction times.
The integer matrix multiplication is 20% worse but the modular matrix
multiplication is 20% faster than on the 3/280.
The primality test which uses repeated division (TrlDiv) is much worse 
on the SPARC, taking 30% longer than it did on the SUN 3/50.  

	On the SPARC, modular matrix multiplication is faster than 
integer matrix multiplication even though the former inner loop is more 
complicated (it has an IF); this is presumably because the entries in 
the integer matrix are pseudorandom values modulo 2**32 whereas
those in the modular matrices require only 15 bits.
Both the TrlDiv and GCD codes have two remainder operations in the 
inner loop; the TrlDiv/GCD ratio is between 3.5 and 4.5 on SUN 3 and 
Alliant, but is 14 on a SPARC.  A possible explanation is the 
high frequency of quotients of 1 and 2 (59%) in the Euclidean 
algorithm [Cfrac also has a high frequency of low quotients].

	As on the SUN 3, the use of an inline expansion template affects
primarily the SPARC time for complex multiplication (the trigonometric 
times get worse rather than better).  In another comp.arch posting, 
Bo Thide' reports a 6-fold speedup in integer multiply time
using /usr/lib/libm.il, but .mul, .div, and .rem are missing from 
my /usr/lib/libm.il (which is dated 88/02/08).

        The unvectorized Alliant (with one processor) is faster than 
the SPARC on six of the eleven tests, and is slower on five tests;
the overall rating is too close to call.
The unvectorized Alliant beats the SUN 3/280 on three tests, loses
on five, with three (Intgr, PrbPrm, GCD) too close to call; the Alliant
beats the SUN 3/280 overall.

	The Alliant has an -autoinline option to insert user procedures 
inline.  Its use had minimal effect on these tests.  When -autoinline 
was used with multiple processors on a draft version of this
benchmark, the program aborted with arithmetic exception.

	It is interesting to compare the ratios Sngl/Intgr for
real vs. integer matrix multiplication on various machines.  
All matrix entries have the same size, and the indexing is identical, 
so the memory subsystem should be equally fast with both, 
and the compiler should be equally good at optimizing the loops 
(e.g., assigning variables to registers).  On the SUN 3s with 68881, 
the Sngl/Intgr ratio is between 2 and 3, with integer being faster.  
On the 3/280 with FPA, the ratio is 0.7, with real being faster.
On the SPARC, the ratio is about 0.25, with real being faster.
On the Alliant, the ratio is about 0.9, with real being slightly faster.

	Another interesting ratio Cmplx/Sngl.  A complex multiplication 
uses 4 real multiplication and 2 real additions; a complex addition uses 
2 real additions.  The inner loop of the complex matrix multiplication 
will therefore have 4 real multiplications and 4 real additions, 
whereas the single precision matrix multiplication has one of each.
The ratio Cmplx/Sngl should be less than 4, since loop overhead, 
indexing, and memory references are not four times as complicated; 
also there are many opportunities for parallelism and pipelining. 
The SUN 3 and SPARC ratios are between 7 and 8 with inline expansion 
templates, and between 11 and 13 without.  The Alliant ratios 
(1.6 to 1.9 unvectorized, 3 to 4 vectorized) seem more reasonable.

	All of these machines have hardware double precision arithmetic,
so the Dble/Sngl ratios are all between 1.0 and 1.5.  In many cases,
the penalty for double precision is very slight.

	The ratio of worst-to-best single-processor time is 70
for complex matrix multiplication and about 20 for single and double 
precision matrix multiplication.  On the other hand, this ratio is 
between 2 and 3 for TrlDiv, Cfrac and GCD.

	Some comments on the object code of these compilers:

	(1) The code printed by the Alliant compiler is much
	    easier to read than the others.  For example, it
	    retains the user variable names.  It also prints the 
	    hexadecimal equivalent of the object code, which is 
	    unimportant here but useful on other occasions.

	(2) All recognizes the common subexpression in
	    MOD(n,6).ne.1 .and. MOD(n,6).ne.5 
	    within TrlDiv, computing the remainder only once.

	(3) In GCD, we need both MIN(ABS(n1), ABS(n2)) and
	    MAX(ABS(n1), ABS(n2)).  None of these compilers
	    detects that MIN and MAX have identical arguments
	    (although all compiled ABS, MIN, and MAX inline).
	    Indeed, the Alliant computes ABS(n1) and ABS(n2) twice.

	(4) All generate inline code for statement function 
	    MODMUL within PrbPrm (function MODEXP). 

	(5) The lines  string(place:place) = DIGITS(MOD(left,10)) 
            and string(place:place) = DIGITS(left) in BINDEC transfer
            a single character.  The Alliant compiler recognizes these 
            and generates movb instructions.
	    The SUN 3 and SPARC compilers generate a subroutine call.
	    The Alliant also compiles the LEN (character string length)
	    intrinsic function inline; SUN 3 and SPARC do not.

	(6) Subroutine BINDEC needs both the quotient left/10
	    and the remainder MOD(left,10).  The SUN 3 compiler 
	    generates two divide instructions (one for quotient, 
	    one for remainder), even though the hardware makes 
	    both available at once.  The SPARC calls both .div and .rem .
	    The Alliant (which is also MC68020-based) generates only
	    one divide instruction, but computes the remainder from the 
	    quotient via a multiply and subtract (here and elsewhere).

	(7) The SUN 3 and SPARC compilers generate special code for
	    MOD(n,2) and n/2 within MODEXP.  The Alliant uses the same
	    instructions as for other divisors, except to optimize
	    the multiply by 2 into an add when computing the remainder.
	    None recognize that the MOD(n,2).ne.0 test (i.e., the ODD
	    function of PASCAL) requires testing only the bottom bit of n,
	    whether n is positive or negative (on a twos' complement machine).  
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024