Path: utzoo!attcan!uunet!aplcen!samsung!usc!apple!agate!linus!linus!bs From: bs@linus.mitre.org (Robert D. Silverman) Newsgroups: comp.arch Subject: Re: benchmark for evaluating extended precision (data, long) Keywords: extended precision,multiply,benchmark,arithmetic Message-ID: <119983@linus.mitre.org> Date: 13 Sep 90 14:21:06 GMT References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org> <2114@charon.cwi.nl> Organization: The MITRE Corporation, Bedford MA Lines: 97 In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: :Well, after the discussion between R. Kym Horsell and Robert D. Silverman, :here are some real facts. First the background of it. ---- Does this imply my claims are not real? My claims are based on actual benchmarks with large scale multiprecise FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic Curve prime proving algorithm. :Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a :program (in Fortran) to do primality proving for long numbers (upto :about 320 digits). A writeup about it can be found in some back issue :of Mathematics of Computation. For this program I did the base arithmetic Math. Comp. vol. 48 (Jan 1987) H. Cohen & A. Lenstra, "Implementation of a New Primality Test" A large chunk of this algorithm is spent doing Jacobi sums that don't consist of just multi-precise multiplications. This, I believe, is relevent to some of the benchmarks given below. :program and ported it to a large platform of minis. The basic structure :of the arithmetic kernels is very simple, it allows for kernels written :in C or Fortran, using either the HOL arithmetic or some parts in assembler :allowing for 32x32->64 bit multiply and associated division. Also provisions :are made to allow for simple loops over these operations to be coded in :assembler. In this way the subroutine call overhead for a large number of ---------------------------------------------------------------- :calls can be avoided. Now whenever I get access to some machine I try -------------------- If one chooses not to put the control loops in assembler, but instead relies upon calling just low level routines for 64 bit arithmetic, the speed difference becomes more dramatic. In going from 32 to 64 bit arithmetic, one makes 4 times fewer subroutine calls to low level arithmetic routines. : :Now, when Bob Silverman claimed that he experienced a 10 fold speedup :when going from 32x32->32 multiply to 32x32->64 multiply, I can only :say that I do not find such a speedup. See the table below. On the :other hand, Kym Horsell is also off base with his benchmark. Also his :claims about input and output are besides the truth. In my experience, :with this kind of work, a program will run happily during several seconds/ :minutes/hours, doing no I/O at all and then finally write only a single :or a few numbers. So in effect every effort spend to speed up I/O is :essentially waisted. (The program I use does indeed twice write the number :Even in a multiprecise division, the number of elementary divisions is order :n while the number of elementary multiplications is order n squared. This is true when dividing a multi-precise number by a single precision number. It is not true when dividing two multi-precise numbers. Dividing 2n bits by n bits is O(n^2) not O(n). : :Now back to the data. First column gives the machine, second column the :amount of time needed when using 32x32->64 multiply as compared to 32x32->32 :multiply (in percents), the third column gives the same when also some :basic loops are coded. : : percentage of time needed when using : 32x32->64 coded loops :DEC VAX-780 62.84% 46.50% Stuff deleted. I don't dispute the data, but I'm a little unclear as to what is being measured. 'percentage of time needed' is unclear to me. Does this refer to the percentage of total execution time doing multi-precise divides and multiples? I believe what it means is that, for example on the VAX, the program spent (respectively) 62% and 46% as much time to run when using 64 bit arithmetic, as it did when using 32 bit arithmetic. However, he does not say what fraction of the program is spent in doing this arithmetic. If a program (say) spends 1/2 its time doing multi-precise multiplies, a 10-fold speed increase will reduce this to 5%, but one still has the other 50%, resulting in a total run time of 55% of the original. :And lastly the claim that 'bc' is extremely fast. To the contrary, the :above program performs about 15000 operations of the form a*b and also :about 15000 operations of the form c/%d, with a, b and d 53 digit numbers :and c a 106 digit number (plus of course many more operations). On the :Sun 4 'bc' needed 13 seconds to do only 400 of those operations. So who :is the winner? : -- Bob Silverman #include Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"