Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!csc.ti.com!ti-csl!tilde.csc.ti.com!osage.csc.ti.com!bmk From: bmk@csc.ti.com (Brian M Kennedy) Newsgroups: comp.arch Subject: Re: benchmark for evaluating extended precision Keywords: extended precision,multiply,benchmark,arithmetic Message-ID: <1990Sep12.223253.9574@csc.ti.com> Date: 12 Sep 90 22:32:53 GMT References: <3989@bingvaxu.cc.binghamton.edu> Sender: usenet@csc.ti.com (USENET News System) Distribution: usa Organization: TI Computer Science Center, Dallas Lines: 129 =>It has been claimed that a lack of 32x32->64 multiplication =>makes a factor of 10 difference in the running time of =>typical extended precision arithmetic routines. Although it =>obviously makes _a_ difference in run time I do not measure =>an order of magnitude difference. I cannot comment directly about the performance gains possible for general multi-precision arithmetic, but I can comment on a special case, 64-bit arithmetic implemented via 32-bit arithmetic. I have a highly optimized simulator that must do 64-bit arithmetic. For portability, the simulator is written in C++/ANSI C. ANSI C provides no support for multi-precision arithmetic. It only guarantees that long is *at least* 32-bit. Therefore, I have written a C++ type, doubleLong, which guarantees at least 64-bit. The operation doubleLong*doubleLong is implemented, as you might guess, with multiple long*long operations. IF ANSI C supported a library function void mult(long Op1, long Op2, long Product[2]) which multiplied Op1 times Op2 and placed the result in the two longs making up the array Product, AND my hardware provided support for 32*32->64, such that the library function was implemented using that support, THEN, I could implement doubleLong multiplication much more efficiently. The question in this thread has been "How much more efficient?" Speed-up from 2.5 to 10 has been claimed. Here's my measurements using doubleLong: Ideally, I would measure a program full of multiplications of doubleLong's, first with the current 32*32->32 implementation, and then with 32*32->64 implementation. Unfortunately, I have no hardware support for the latter. Instead I will measure an upper-bound on the performance increase by comparing: 64*64->64 via 32*32->32 vs. 32*32->32 I assert that the speed of (64*64->64 via 32*32->64) calculations is faster than (64*64->64 via 32*32->32) but slower than just plain 32*32->32. The Environment: --------------- Mips R3000 running RISC/os 4.50, medium load C++ code compiled with ATT C++ 2.0 C code compiled with Mips 4.5 C compiler with -O1 The Results: ----------- The program was run a number of times. The min and max times are shown. 32*32->32 64*64->64 via 32*32->32 --------- ----------------------- MIN: %multiSpeed %multiSpeed Final product = 246447234 Final product = 99_999_970_000_002 CPU Time = 7650 ms CPU Time = 83730 ms Real Time = 17187 ms Real Time = 86343 ms MAX: %multiSpeed %multiSpeed Final product = 246447234 Final product = 99_999_970_000_002 CPU Time = 7820 ms CPU Time = 85960 ms Real Time = 19191 ms Real Time = 137034 ms From these timings, I get an approximate ratio of -- 11 --. This is about what I expected, knowing the code. Therefore, the speedup I would see from having a 32*32->64 operation available to me through ANSI C would be less than 11. Knowing the code, I would guess the speed-up would be about 4. That speedup, of course, is on a program that does little more than 64-bit multiplies. On my simulator that uses doubleLong, the gain would be much less than five percent. Finally, general multi-precision arithmetic may get more gains than indicated here. Also the gains possible when coding in assembly are probably higher than in C++. The program (in C++): -------------------- // Switch the definition of NUMBER between long and doubleLong // to obtain the alternate versions of this program. #include "doubleLong.h" // contains definition of doubleLong #include "timer.h" // for software timings #define NUMBER doubleLong // for 64*64->64 calculations, // implemented with 32*32->32 calculations //#define NUMBER long // for 32*32->32 calculations #define COUNT 10000000 main () {NUMBER Product; timer Timer; Timer.start(); for(int I = 0; I < COUNT; I++) {Product = NUMBER(I) * NUMBER(I-1);} Timer.split(); cout << "Final product = " << Product << endl << endl; cout << " CPU Time = " << Timer.cpu() << " ms" << endl; cout << " Real Time = " << Timer.real() << " ms" << endl; } --------------------------------- Brian M. Kennedy Computer Systems Laboratory Computer Science Center Texas Instruments Incorporated