Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!sdd.hp.com!decwrl!uunet!mcsun!hp4nl!charon!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision (data, long)
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <2114@charon.cwi.nl>
Date: 12 Sep 90 22:43:24 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 122

Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
here are some real facts.  First the background of it.

Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
program (in Fortran) to do primality proving for long numbers (upto
about 320 digits).  A writeup about it can be found in some back issue
of Mathematics of Computation.  For this program I did the base arithmetic
kernels.  Initially the program was written and executed on a CDC Cyber
750-75.  Later this program was ported to a Cray 1 and a Cyber 205, and
again I did the basic arithmetic kernels, using vector operations.

When we moved over to do our main work on minis etc., I brought along the
program and ported it to a large platform of minis.  The basic structure
of the arithmetic kernels is very simple, it allows for kernels written
in C or Fortran, using either the HOL arithmetic or some parts in assembler
allowing for 32x32->64 bit multiply and associated division.  Also provisions
are made to allow for simple loops over these operations to be coded in
assembler.  In this way the subroutine call overhead for a large number of
calls can be avoided.  Now whenever I get access to some machine I try
to port the program to it.  For a full port at least a Fortran compiler is
required (as the base program is still in Fortran).  Access to a C compiler
is useful.  I was not successful on one port only (to a Mac II) because of
severe bugs in the Fortran compiler.  Also there are a few platforms where
an assembler is not available (CDC Cyber 995, NEC SX/2).  And in a number of
cases I did a port of the arithmetic kernels, but could not do a full port
because I had no access to an adequate Fortran compiler (DG/MV and Intel
8086 with Xenix).  (And, no, the stuff is not yet publicly available; some
things have still to be done.)

Now, when Bob Silverman claimed that he experienced a 10 fold speedup
when going from 32x32->32 multiply to 32x32->64 multiply, I can only
say that I do not find such a speedup.  See the table below.  On the
other hand, Kym Horsell is also off base with his benchmark.  Also his
claims about input and output are besides the truth.  In my experience,
with this kind of work, a program will run happily during several seconds/
minutes/hours, doing no I/O at all and then finally write only a single
or a few numbers.  So in effect every effort spend to speed up I/O is
essentially waisted.  (The program I use does indeed twice write the number
it is testing; that is really a waste of time, but only a few milliseconds
in a run.)   Andrew Mullhaupt was of course right when he stated that
32x32->64 multiply is much more important than 64/32->32,32 division.
Even in a multiprecise division, the number of elementary divisions is order
n while the number of elementary multiplications is order n squared.

Now back to the data.  First column gives the machine, second column the
amount of time needed when using 32x32->64 multiply as compared to 32x32->32
multiply (in percents), the third column gives the same when also some
basic loops are coded.

				percentage of time needed when using
				32x32->64	coded loops
DEC VAX-780			62.84%		46.50%
Harris HCX7 (Tahoe)		61.75%		59.38%
Sun3 (3/60, 68020)		70.68%		56.23%
Alliant FX/4 (68020(1))		66.50%		48.43%
Sequent Symmetry (80386)	56.48%		 (2)
Sequent Balance (32032)		58.97%		46.08%
Encore Multimax (32332)		56.73%		45.81%
Encore Multimax (32532)		63.92%		48.02%
Gould PowerNode			59.06%		42.09%

SGI (mips R2000)		73.68%		68.30%
SGI (mips R3000)		73.28%		65.99%
DecSTATION 5100 (mips R3000)	70.20%		65.22%

Sun4 SPARCStation 1 (4/60)	57.04%		55.28%
IBM RTPC			36.79%		29.95%

Notes:
(1)	Not really a 68020, but a 68020 look-alike.  I did not use vector
	operations, because in that case I ought to go back to 32x32->32
	multiply.
(2)	I did not do the coding of basic loops because of the lack of registers,
	and I really did not want to do all kinds of loads/stores to and from
	memory.

Additional notes:

The CISC machines give in general a speed up to 55-70% of the original
program in the first case and to 40-60% in the second case.  The Tahoe
is extraordinary because of the small difference between first and second
case; there is nearly no subroutine call overhead.

The RISC machines display a lack of speedup when going to coded loops; that
is what RISC is all about.

Mips machines display a general lack of speedup.  It has hardware multiply.

SPARC and RTPC both use multiply step instructions rather than hardware
multiply.  The RTPC multiply step instruction delivers 2 bits each time
it is executed.  Both SPARC and RTPC deliver a 64 bit result.  They also
both use divide step instructions, and both are able to do 64/32 divide
in the same amount of time as 32/32 divide, although the supplied software
does not use this fact.  I am still not satisfied with the Sun 4 results and
thinking about methods to improve it.  RTPC displays a very marked speedup.
Is that a compiler problem?

An additional question is whether hardware multiply is required, or that
multiply step instructions are sufficient.  Below for one CISC and three
RISC machines the timing (in seconds) of the best version of the program:

Sun 3 (3/60, 68020)	24.36
Sun 4 (4/60)		14.80
IBM RTPC		14.10
SGI (R3000 @ 20 MHz)	 5.88

We see that the mips processor has a decided advantage.  On the other hand,
the RTPC shows that even with multiply step instructions a reasonable
performance can be obtained (the base machine is slower than a Sun 4).
Sun 4 is in fact unacceptable for this kind of programs.

And lastly the claim that 'bc' is extremely fast.  To the contrary, the
above program performs about 15000 operations of the form a*b and also
about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
and c a 106 digit number (plus of course many more operations).  On the
Sun 4 'bc' needed 13 seconds to do only 400 of those operations.  So who
is the winner?

I hope this helps.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl