Path: utzoo!attcan!uunet!aplcen!samsung!usc!apple!agate!linus!linus!bs
From: bs@linus.mitre.org (Robert D. Silverman)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision (data, long)
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <119983@linus.mitre.org>
Date: 13 Sep 90 14:21:06 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org> <2114@charon.cwi.nl>
Organization: The MITRE Corporation, Bedford MA
Lines: 97

In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
:Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
:here are some real facts.  First the background of it.
               ----
 
Does this imply my claims are not real?

My claims are based on actual benchmarks with large scale multiprecise
FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
Curve prime proving algorithm.


:Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
:program (in Fortran) to do primality proving for long numbers (upto
:about 320 digits).  A writeup about it can be found in some back issue
:of Mathematics of Computation.  For this program I did the base arithmetic
 
Math. Comp. vol. 48 (Jan 1987) 
H. Cohen & A. Lenstra, "Implementation of a New Primality Test"

A large chunk of this algorithm is spent doing Jacobi sums that don't
consist of just multi-precise multiplications. This, I believe, is
relevent to some of the benchmarks given below. 

:program and ported it to a large platform of minis.  The basic structure
:of the arithmetic kernels is very simple, it allows for kernels written
:in C or Fortran, using either the HOL arithmetic or some parts in assembler
:allowing for 32x32->64 bit multiply and associated division.  Also provisions
:are made to allow for simple loops over these operations to be coded in
:assembler.  In this way the subroutine call overhead for a large number of
             ----------------------------------------------------------------
:calls can be avoided.  Now whenever I get access to some machine I try
 --------------------
 
If one chooses not to put the control loops in assembler, but instead
relies upon calling just low level routines for 64 bit arithmetic,
the speed difference becomes more dramatic. In going from 32 to 64
bit arithmetic, one makes 4 times fewer subroutine calls to low
level arithmetic routines. 

:
:Now, when Bob Silverman claimed that he experienced a 10 fold speedup
:when going from 32x32->32 multiply to 32x32->64 multiply, I can only
:say that I do not find such a speedup.  See the table below.  On the
:other hand, Kym Horsell is also off base with his benchmark.  Also his
:claims about input and output are besides the truth.  In my experience,
:with this kind of work, a program will run happily during several seconds/
:minutes/hours, doing no I/O at all and then finally write only a single
:or a few numbers.  So in effect every effort spend to speed up I/O is
:essentially waisted.  (The program I use does indeed twice write the number
:Even in a multiprecise division, the number of elementary divisions is order
:n while the number of elementary multiplications is order n squared.
 
This is true when dividing a multi-precise number by a single precision
number. It is not true when dividing two multi-precise numbers. 
Dividing 2n bits by n bits is O(n^2) not O(n).

:
:Now back to the data.  First column gives the machine, second column the
:amount of time needed when using 32x32->64 multiply as compared to 32x32->32
:multiply (in percents), the third column gives the same when also some
:basic loops are coded.
:
:				percentage of time needed when using
:				32x32->64	coded loops
:DEC VAX-780			62.84%		46.50%

Stuff deleted. 

I don't dispute the data, but I'm a little unclear as to what is
being measured. 'percentage of time needed' is unclear to me.
Does this refer to the percentage of total execution time doing
multi-precise divides and multiples?

 
I believe what it means is that, for example on the VAX, the program
spent (respectively) 62% and 46% as much time to run when using 64
bit arithmetic, as it did when using 32 bit arithmetic.

However, he does not say what fraction of the program is spent in
doing this arithmetic. If a program (say) spends 1/2 its time doing
multi-precise multiplies, a 10-fold speed increase will reduce this
to 5%, but one still has the other 50%, resulting in a total run
time of 55% of the original.

:And lastly the claim that 'bc' is extremely fast.  To the contrary, the
:above program performs about 15000 operations of the form a*b and also
:about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
:and c a 106 digit number (plus of course many more operations).  On the
:Sun 4 'bc' needed 13 seconds to do only 400 of those operations.  So who
:is the winner?
:
-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"