Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!uunet!mcsun!hp4nl!charon!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision (data, long)
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <2119@charon.cwi.nl>
Date: 14 Sep 90 01:10:07 GMT
References: <119858@linus.mitre.org> <2114@charon.cwi.nl> <119983@linus.mitre.org>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 94

In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
 > In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > :Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
 > :here are some real facts.  First the background of it.
 >                ----
 >  
 > Does this imply my claims are not real?
No, not that.  But your claims told only about an order of magnitude.
Perhaps your claims are based on programs where one was 10 times
as fast as the other on some processor.  I have a lot of data for a
lot of different processors.
 > 
 > My claims are based on actual benchmarks with large scale multiprecise
 > FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
 > Curve prime proving algorithm.
You did not state what processor gives which advantage.  It highly depends
upon the processor used.
 > 
...
 > A large chunk of this algorithm is spent doing Jacobi sums that don't
 > consist of just multi-precise multiplications. This, I believe, is
 > relevent to some of the benchmarks given below. 
Right, but althought it is large chunk of the algorithm, it is not a large chunk
of the time required as far as I know.  But I will time that.
 > 
...
 > :            In this way the subroutine call overhead for a large number of
 >              ----------------------------------------------------------------
 > :calls can be avoided.  Now whenever I get access to some machine I try
 >  --------------------
 >  
 > If one chooses not to put the control loops in assembler, but instead
 > relies upon calling just low level routines for 64 bit arithmetic,
 > the speed difference becomes more dramatic. In going from 32 to 64
 > bit arithmetic, one makes 4 times fewer subroutine calls to low
 > level arithmetic routines. 
Especially if (as you make clear now) you also use subroutine calls to
low level routines that could be inlined and coded in the HOL.  Did you
try inlining on (for example) the Sparc?
 > 
 > :                      (The program I use does indeed twice write the number
 > :Even in a multiprecise division, the number of elementary divisions is order
 > :n while the number of elementary multiplications is order n squared.
 >  
 > This is true when dividing a multi-precise number by a single precision
 > number. It is not true when dividing two multi-precise numbers. 
 > Dividing 2n bits by n bits is O(n^2) not O(n).
I wrote that the number of elementary divisions is order n, the number of
elementary multiplications and additions is n^2.
 > 
 > :
 > :Now back to the data.  First column gives the machine, second column the
 > :amount of time needed when using 32x32->64 multiply as compared to 32x32->32
 > :multiply (in percents), the third column gives the same when also some
 > :basic loops are coded.
 > :
 > :				percentage of time needed when using
 > :				32x32->64	coded loops
 > :DEC VAX-780			62.84%		46.50%
 > 
 > Stuff deleted. 
 > 
 > I don't dispute the data, but I'm a little unclear as to what is
 > being measured. 'percentage of time needed' is unclear to me.
 > Does this refer to the percentage of total execution time doing
 > multi-precise divides and multiples?
Excuse me for my poor English.  The columns set off the time required for
the whole prime proving using an alternate form for the arithmetic compared
to the time required when doing everything in a HOL.
 > 
 > I believe what it means is that, for example on the VAX, the program
 > spent (respectively) 62% and 46% as much time to run when using 64
 > bit arithmetic, as it did when using 32 bit arithmetic.
Right.
 > 
 > However, he does not say what fraction of the program is spent in
 > doing this arithmetic. If a program (say) spends 1/2 its time doing
 > multi-precise multiplies, a 10-fold speed increase will reduce this
 > to 5%, but one still has the other 50%, resulting in a total run
 > time of 55% of the original.
Also true.  But I am *not* interested in the time it takes to do some
parts of a program; I am only interested in the time that is required
for the program complete.  The same discussion is often heard in
supercomputer regions about vectorization.  Look at Amdahl's law.
What is better, a processor that peaks at 800 MFlop but gets a performance
of 10 MFlop on realistic programs, or a processor that peaks at 400 MFlop
and gets 100 MFlop on realistic programs?

(To be honest, on more than one occasion I did this kind of micro optimization,
see my follow-up to Peter Montgomery's article.  But is it worthwile?  For me
yes, I learn a bit more about the processor, but that is all.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl