Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!cs.utexas.edu!usc!orion.oac.uci.edu!uci-ics!ucla-cs!math.ucla.edu!oak!pmontgom From: pmontgom@oak.math.ucla.edu (Peter Montgomery) Newsgroups: comp.arch Subject: New benchmark (was: Integer Multiply/Divide) Message-ID: <2131@sunset.MATH.UCLA.EDU> Date: 7 Jan 90 03:16:25 GMT References: <84768@linus.UUCP> <4057@brazos.Rice.edu> Sender: news@MATH.UCLA.EDU Reply-To: pmontgom@math.ucla.edu (Peter Montgomery) Organization: UCLA Mathematics Department Lines: 206 In article <4057@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes: > >So what's your favorite application that does lots of >integer multiplies and divides? Code it up and pass it out >for people to run. > During the recent controversy about the SPARC and integer multiply/divide, multiple individuals have requested benchmarks which compare machine performances on applications needing these operations. In another posting is my response to that challenge, written primarily in FORTRAN 77. It times eleven algorithms (some integer, some floating point, but is biased towards ones needing integer multiplication and division). A machine's rating is defined to be the geometric mean of the times on these algorithms. So far I have run it on the SUN 3/50, SUN 3/280, SPARC, and Alliant. Here are the outputs (each row corresponds to a machine and compiler; each column corresponds to an algorithm): --------------------------------------------------------------------------- Gener = Generate matrices (integer arithmetic, trigonometry) Dble = Double precision floating point matrix multiplication Sngl = Single precision floating point matrix multiplication Cmplx = Complex (single precision) matrix multiplication Intgr = Integer matrix multiplication Modlr = Matrix multiplication modulo a prime number TrlDiv = Primality test, using trial division through square root PrbPrm = Probable prime test, using Fermat's Little Theorem GCD = Euclidean GCD algorithm BinDec = Binary-to-Decimal conversion Cfrac = Continued fraction expansion of quadratic irrational Geo mean = Geometric mean of above times SUN 3/50 with 68881, optimized (f77 -O -f68881) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.86 3.76 3.54 43.70 1.56 1.86 12.56 1.48 3.30 4.82 2.60 3.55 sec SUN 3/280 with 68881, optimized (f77 -O -f68881) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.56 2.84 2.56 30.16 0.94 1.04 7.44 1.12 1.96 2.38 1.54 2.24 sec SUN 3/280 with inline 68881, optimized (f77 -O -f68881 /usr/lib/f68881.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.48 2.82 2.54 18.54 0.86 1.10 7.58 1.16 1.94 2.38 1.54 2.11 sec SUN 3/280 with inline FPA, optimized (f77 -O -ffpa /usr/lib/ffpa.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.34 0.66 0.62 4.66 0.90 1.06 7.38 0.38 1.92 2.32 1.54 1.25 sec SPARC station optimized (f77 -O) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.30 0.38 0.25 3.25 1.06 0.86 17.79 0.44 1.24 1.56 1.07 1.03 sec SPARC station optimized, inline (f77 -O /usr/lib/libm.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.37 0.35 0.27 2.00 1.05 0.87 17.76 0.44 1.24 1.56 1.07 1.00 sec Alliant FX/80, one processor, optimized, unvectorized (fortran -Og) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.09 0.91 0.85 1.47 0.86 1.21 8.14 0.37 1.90 1.14 1.72 1.02 sec Alliant FX/80, one processor, optimized, vectorized (fortran -Ovg -AS) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.11 0.18 0.16 0.58 0.20 0.37 8.04 0.37 1.84 1.07 1.44 0.54 sec Alliant FX/80, six processors, optimized, vectorized (fortran -O -AS) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.02 0.04 0.03 0.11 0.04 0.07 8.02 0.37 1.84 1.07 1.38 0.22 sec ------------------------------------------------------------------------------- Comments about results: The SUN 3s have OS version 4.2. The SPARCs have SUN OS 4.0. The Alliant has Concentrix version 5.0 and FX/Fortran version V4.2.40. The Alliant has eight advanced computational elements (ACEs), configured into one cluster of six and two detached processors. The timings were made during the daytime, and may be affected by the load averages. In particular, our Alliant is rarely idle. The entries are approximately in decreasing order by geometric mean time. Geometric mean is fairer than arithmetic mean, because the relative order of the geometric means will remain unchanged if one application is run twice as far (and takes twice as long on all machines), whereas the relative order of the arithmetic means may change. When we compare the SUN 3/50 and SUN 3/280 with 68881 (first two entries), the improvements for the eleven timings are rather uniform: each takes 50%-75% as long on the 3/280 as on the 3/50. The use of an inline expansion template (/usr/lib/f68881.il) speeds the complex matrix multiplication by 30% and helps the trigonometric calculations (Gener), while making little change elsewhere. Using the FPA on the 280 speeds the non-trigonometric floating point computations by a factor of 4, while leaving integer times almost unchanged (N.B. PrbPrm uses floating point arithmetic to estimate an integer quotient). Next compare the SPARC station to the SUN 3/280 with FPA. The floating point times are further improved, as are the Euclidean GCD, binary-to-decimal conversion, and continued fraction times. The integer matrix multiplication is 20% worse but the modular matrix multiplication is 20% faster than on the 3/280. The primality test which uses repeated division (TrlDiv) is much worse on the SPARC, taking 30% longer than it did on the SUN 3/50. On the SPARC, modular matrix multiplication is faster than integer matrix multiplication even though the former inner loop is more complicated (it has an IF); this is presumably because the entries in the integer matrix are pseudorandom values modulo 2**32 whereas those in the modular matrices require only 15 bits. Both the TrlDiv and GCD codes have two remainder operations in the inner loop; the TrlDiv/GCD ratio is between 3.5 and 4.5 on SUN 3 and Alliant, but is 14 on a SPARC. A possible explanation is the high frequency of quotients of 1 and 2 (59%) in the Euclidean algorithm [Cfrac also has a high frequency of low quotients]. As on the SUN 3, the use of an inline expansion template affects primarily the SPARC time for complex multiplication (the trigonometric times get worse rather than better). In another comp.arch posting, Bo Thide' reports a 6-fold speedup in integer multiply time using /usr/lib/libm.il, but .mul, .div, and .rem are missing from my /usr/lib/libm.il (which is dated 88/02/08). The unvectorized Alliant (with one processor) is faster than the SPARC on six of the eleven tests, and is slower on five tests; the overall rating is too close to call. The unvectorized Alliant beats the SUN 3/280 on three tests, loses on five, with three (Intgr, PrbPrm, GCD) too close to call; the Alliant beats the SUN 3/280 overall. The Alliant has an -autoinline option to insert user procedures inline. Its use had minimal effect on these tests. When -autoinline was used with multiple processors on a draft version of this benchmark, the program aborted with arithmetic exception. It is interesting to compare the ratios Sngl/Intgr for real vs. integer matrix multiplication on various machines. All matrix entries have the same size, and the indexing is identical, so the memory subsystem should be equally fast with both, and the compiler should be equally good at optimizing the loops (e.g., assigning variables to registers). On the SUN 3s with 68881, the Sngl/Intgr ratio is between 2 and 3, with integer being faster. On the 3/280 with FPA, the ratio is 0.7, with real being faster. On the SPARC, the ratio is about 0.25, with real being faster. On the Alliant, the ratio is about 0.9, with real being slightly faster. Another interesting ratio Cmplx/Sngl. A complex multiplication uses 4 real multiplication and 2 real additions; a complex addition uses 2 real additions. The inner loop of the complex matrix multiplication will therefore have 4 real multiplications and 4 real additions, whereas the single precision matrix multiplication has one of each. The ratio Cmplx/Sngl should be less than 4, since loop overhead, indexing, and memory references are not four times as complicated; also there are many opportunities for parallelism and pipelining. The SUN 3 and SPARC ratios are between 7 and 8 with inline expansion templates, and between 11 and 13 without. The Alliant ratios (1.6 to 1.9 unvectorized, 3 to 4 vectorized) seem more reasonable. All of these machines have hardware double precision arithmetic, so the Dble/Sngl ratios are all between 1.0 and 1.5. In many cases, the penalty for double precision is very slight. The ratio of worst-to-best single-processor time is 70 for complex matrix multiplication and about 20 for single and double precision matrix multiplication. On the other hand, this ratio is between 2 and 3 for TrlDiv, Cfrac and GCD. Some comments on the object code of these compilers: (1) The code printed by the Alliant compiler is much easier to read than the others. For example, it retains the user variable names. It also prints the hexadecimal equivalent of the object code, which is unimportant here but useful on other occasions. (2) All recognizes the common subexpression in MOD(n,6).ne.1 .and. MOD(n,6).ne.5 within TrlDiv, computing the remainder only once. (3) In GCD, we need both MIN(ABS(n1), ABS(n2)) and MAX(ABS(n1), ABS(n2)). None of these compilers detects that MIN and MAX have identical arguments (although all compiled ABS, MIN, and MAX inline). Indeed, the Alliant computes ABS(n1) and ABS(n2) twice. (4) All generate inline code for statement function MODMUL within PrbPrm (function MODEXP). (5) The lines string(place:place) = DIGITS(MOD(left,10)) and string(place:place) = DIGITS(left) in BINDEC transfer a single character. The Alliant compiler recognizes these and generates movb instructions. The SUN 3 and SPARC compilers generate a subroutine call. The Alliant also compiles the LEN (character string length) intrinsic function inline; SUN 3 and SPARC do not. (6) Subroutine BINDEC needs both the quotient left/10 and the remainder MOD(left,10). The SUN 3 compiler generates two divide instructions (one for quotient, one for remainder), even though the hardware makes both available at once. The SPARC calls both .div and .rem . The Alliant (which is also MC68020-based) generates only one divide instruction, but computes the remainder from the quotient via a multiply and subtract (here and elsewhere). (7) The SUN 3 and SPARC compilers generate special code for MOD(n,2) and n/2 within MODEXP. The Alliant uses the same instructions as for other divisors, except to optimize the multiply by 2 into an add when computing the remainder. None recognize that the MOD(n,2).ne.0 test (i.e., the ODD function of PASCAL) requires testing only the bottom bit of n, whether n is positive or negative (on a twos' complement machine). -------- Peter Montgomery pmontgom@MATH.UCLA.EDU Department of Mathematics, UCLA, Los Angeles, CA 90024