Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!csc.ti.com!ti-csl!tilde.csc.ti.com!osage.csc.ti.com!bmk
From: bmk@csc.ti.com (Brian M Kennedy)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <1990Sep12.223253.9574@csc.ti.com>
Date: 12 Sep 90 22:32:53 GMT
References: <3989@bingvaxu.cc.binghamton.edu>
Sender: usenet@csc.ti.com (USENET News System)
Distribution: usa
Organization: TI Computer Science Center, Dallas
Lines: 129

=>It has been claimed that a lack of 32x32->64 multiplication
=>makes a factor of 10 difference in the running time of
=>typical extended precision arithmetic routines. Although it
=>obviously makes _a_ difference in run time I do not measure
=>an order of magnitude difference.

I cannot comment directly about the performance gains possible
for general multi-precision arithmetic, but I can comment on
a special case, 64-bit arithmetic implemented via 32-bit
arithmetic.

I have a highly optimized simulator that must do 64-bit 
arithmetic.  For portability, the simulator is written 
in C++/ANSI C.  ANSI C provides no support for multi-precision
arithmetic.  It only guarantees that long is *at least* 32-bit.

Therefore, I have written a C++ type, doubleLong, which 
guarantees at least 64-bit.  The operation doubleLong*doubleLong
is implemented, as you might guess, with multiple long*long
operations.

IF ANSI C supported a library function
	void mult(long Op1, long Op2, long Product[2])
which multiplied Op1 times Op2 and placed the result in the
two longs making up the array Product,
AND my hardware provided support for 32*32->64, such that the
library function was implemented using that support,
THEN, I could implement doubleLong multiplication much more
efficiently.

The question in this thread has been "How much more efficient?"
Speed-up from 2.5 to 10 has been claimed.  Here's my measurements
using doubleLong:

Ideally, I would measure a program full of multiplications of
doubleLong's, first with the current 32*32->32 implementation,
and then with 32*32->64 implementation.  Unfortunately, I have
no hardware support for the latter.  Instead I will measure
an upper-bound on the performance increase by comparing:

  64*64->64 via 32*32->32      vs.      32*32->32

I assert that the speed of (64*64->64 via 32*32->64) calculations
is faster than (64*64->64 via 32*32->32) but slower than just
plain 32*32->32.


The Environment:
---------------
       Mips R3000 running RISC/os 4.50, medium load
       C++ code compiled with ATT C++ 2.0
       C code compiled with Mips 4.5 C compiler with -O1


The Results:
-----------
    The program was run a number of times.  The min and max times are
    shown.

      32*32->32                          64*64->64 via 32*32->32
      ---------                          -----------------------
MIN:  %multiSpeed                        %multiSpeed
      Final product = 246447234          Final product = 99_999_970_000_002 

           CPU Time = 7650 ms                 CPU Time = 83730 ms
          Real Time = 17187 ms               Real Time = 86343 ms

MAX:  %multiSpeed                        %multiSpeed
      Final product = 246447234          Final product = 99_999_970_000_002

           CPU Time = 7820 ms                 CPU Time = 85960 ms
          Real Time = 19191 ms               Real Time = 137034 ms


From these timings, I get an approximate ratio of -- 11 --.
This is about what I expected, knowing the code.
                                                 
Therefore, the speedup I would see from having a 32*32->64
operation available to me through ANSI C would be less than 11.
Knowing the code, I would guess the speed-up would be about 4.

That speedup, of course, is on a program that does little more
than 64-bit multiplies.  On my simulator that uses doubleLong,
the gain would be much less than five percent.

Finally, general multi-precision arithmetic may get more gains
than indicated here.  Also the gains possible when coding in 
assembly are probably higher than in C++.

The program (in C++):
--------------------
// Switch the definition of NUMBER between long and doubleLong
// to obtain the alternate versions of this program.

#include "doubleLong.h"      // contains definition of doubleLong

#include "timer.h"           // for software timings

#define NUMBER doubleLong    // for 64*64->64 calculations,
			     // implemented with 32*32->32 calculations

//#define NUMBER long	     // for 32*32->32 calculations

#define COUNT 10000000

main ()
   {NUMBER Product;
    timer Timer;

    Timer.start();

    for(int I = 0; I < COUNT; I++)
       {Product = NUMBER(I) * NUMBER(I-1);}

    Timer.split();

    cout << "Final product = " << Product << endl << endl;

    cout << "     CPU Time = " << Timer.cpu() << " ms" << endl;

    cout << "    Real Time = " << Timer.real() << " ms" << endl;
   }


---------------------------------
Brian M. Kennedy <bmk@csc.ti.com>
Computer Systems Laboratory
Computer Science Center
Texas Instruments Incorporated