Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!sdd.hp.com!decwrl!uunet!mcsun!hp4nl!charon!dik From: dik@cwi.nl (Dik T. Winter) Newsgroups: comp.arch Subject: Re: benchmark for evaluating extended precision (data, long) Keywords: extended precision,multiply,benchmark,arithmetic Message-ID: <2114@charon.cwi.nl> Date: 12 Sep 90 22:43:24 GMT References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org> Sender: news@cwi.nl Organization: CWI, Amsterdam Lines: 122 Well, after the discussion between R. Kym Horsell and Robert D. Silverman, here are some real facts. First the background of it. Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a program (in Fortran) to do primality proving for long numbers (upto about 320 digits). A writeup about it can be found in some back issue of Mathematics of Computation. For this program I did the base arithmetic kernels. Initially the program was written and executed on a CDC Cyber 750-75. Later this program was ported to a Cray 1 and a Cyber 205, and again I did the basic arithmetic kernels, using vector operations. When we moved over to do our main work on minis etc., I brought along the program and ported it to a large platform of minis. The basic structure of the arithmetic kernels is very simple, it allows for kernels written in C or Fortran, using either the HOL arithmetic or some parts in assembler allowing for 32x32->64 bit multiply and associated division. Also provisions are made to allow for simple loops over these operations to be coded in assembler. In this way the subroutine call overhead for a large number of calls can be avoided. Now whenever I get access to some machine I try to port the program to it. For a full port at least a Fortran compiler is required (as the base program is still in Fortran). Access to a C compiler is useful. I was not successful on one port only (to a Mac II) because of severe bugs in the Fortran compiler. Also there are a few platforms where an assembler is not available (CDC Cyber 995, NEC SX/2). And in a number of cases I did a port of the arithmetic kernels, but could not do a full port because I had no access to an adequate Fortran compiler (DG/MV and Intel 8086 with Xenix). (And, no, the stuff is not yet publicly available; some things have still to be done.) Now, when Bob Silverman claimed that he experienced a 10 fold speedup when going from 32x32->32 multiply to 32x32->64 multiply, I can only say that I do not find such a speedup. See the table below. On the other hand, Kym Horsell is also off base with his benchmark. Also his claims about input and output are besides the truth. In my experience, with this kind of work, a program will run happily during several seconds/ minutes/hours, doing no I/O at all and then finally write only a single or a few numbers. So in effect every effort spend to speed up I/O is essentially waisted. (The program I use does indeed twice write the number it is testing; that is really a waste of time, but only a few milliseconds in a run.) Andrew Mullhaupt was of course right when he stated that 32x32->64 multiply is much more important than 64/32->32,32 division. Even in a multiprecise division, the number of elementary divisions is order n while the number of elementary multiplications is order n squared. Now back to the data. First column gives the machine, second column the amount of time needed when using 32x32->64 multiply as compared to 32x32->32 multiply (in percents), the third column gives the same when also some basic loops are coded. percentage of time needed when using 32x32->64 coded loops DEC VAX-780 62.84% 46.50% Harris HCX7 (Tahoe) 61.75% 59.38% Sun3 (3/60, 68020) 70.68% 56.23% Alliant FX/4 (68020(1)) 66.50% 48.43% Sequent Symmetry (80386) 56.48% (2) Sequent Balance (32032) 58.97% 46.08% Encore Multimax (32332) 56.73% 45.81% Encore Multimax (32532) 63.92% 48.02% Gould PowerNode 59.06% 42.09% SGI (mips R2000) 73.68% 68.30% SGI (mips R3000) 73.28% 65.99% DecSTATION 5100 (mips R3000) 70.20% 65.22% Sun4 SPARCStation 1 (4/60) 57.04% 55.28% IBM RTPC 36.79% 29.95% Notes: (1) Not really a 68020, but a 68020 look-alike. I did not use vector operations, because in that case I ought to go back to 32x32->32 multiply. (2) I did not do the coding of basic loops because of the lack of registers, and I really did not want to do all kinds of loads/stores to and from memory. Additional notes: The CISC machines give in general a speed up to 55-70% of the original program in the first case and to 40-60% in the second case. The Tahoe is extraordinary because of the small difference between first and second case; there is nearly no subroutine call overhead. The RISC machines display a lack of speedup when going to coded loops; that is what RISC is all about. Mips machines display a general lack of speedup. It has hardware multiply. SPARC and RTPC both use multiply step instructions rather than hardware multiply. The RTPC multiply step instruction delivers 2 bits each time it is executed. Both SPARC and RTPC deliver a 64 bit result. They also both use divide step instructions, and both are able to do 64/32 divide in the same amount of time as 32/32 divide, although the supplied software does not use this fact. I am still not satisfied with the Sun 4 results and thinking about methods to improve it. RTPC displays a very marked speedup. Is that a compiler problem? An additional question is whether hardware multiply is required, or that multiply step instructions are sufficient. Below for one CISC and three RISC machines the timing (in seconds) of the best version of the program: Sun 3 (3/60, 68020) 24.36 Sun 4 (4/60) 14.80 IBM RTPC 14.10 SGI (R3000 @ 20 MHz) 5.88 We see that the mips processor has a decided advantage. On the other hand, the RTPC shows that even with multiply step instructions a reasonable performance can be obtained (the base machine is slower than a Sun 4). Sun 4 is in fact unacceptable for this kind of programs. And lastly the claim that 'bc' is extremely fast. To the contrary, the above program performs about 15000 operations of the form a*b and also about 15000 operations of the form c/%d, with a, b and d 53 digit numbers and c a 106 digit number (plus of course many more operations). On the Sun 4 'bc' needed 13 seconds to do only 400 of those operations. So who is the winner? I hope this helps. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl