Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uwm.edu!bionet!agate!linus!linus!bs From: bs@linus.mitre.org (Robert D. Silverman) Newsgroups: comp.arch Subject: Re: benchmark for evaluating extended precision Keywords: extended precision,multiply,benchmark,arithmetic Message-ID: <119858@linus.mitre.org> Date: 12 Sep 90 11:33:59 GMT References: <3989@bingvaxu.cc.binghamton.edu> Reply-To: bs@linus.UUCP (Robert D. Silverman) Organization: The MITRE Corporation, Bedford MA Lines: 47 In article <3989@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: > >After a number of private communications, I've managed to render >one of the little benchmarks I have presentable enough to post, >along with some performance figures from different machines >(basically, its put up or shut up time). > >It has been claimed that a lack of 32x32->64 multiplication >makes a factor of 10 difference in the running time of >typical extended precision arithmetic routines. Although it >obviously makes _a_ difference in run time I do not measure >an order of magnitude difference. some silly benchmark deleted.... You are totally confused. Nowhere in your benchmark did I see any assembler code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 [or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64 would allow you to work in radix 2^30 or 2^31 [depending on how careful you want to be about sign bit and 1's complement/2's complement portability problems]. Change of radix ALONE would account for a factor of 4 FEWER multiplies. Secondly, you are working with tiny examples. Working in a larger radix in a REAL WORLD PROGRAM means less storage required for each multi-precise variable, less loop overhead, fewer subroutine calls, fewer cache misses, etc. Try implementing (say) a prime proving algorithm, or an RSA encryption algorithm, or a multi-precise FFT over the rationals or anything but a tinkertoy program with just 1 or two multiprecise numbers where everything fits in cache. Furthermore, working in a power of two radix in performing 32 x 32--> 64 allows one to carefully implement some optimizations (in assembler) that are not readily available in C. By the way, 32 x 32 --> 64 can be needed for OTHER than just multiple precision support. Suppose A > B and A > C. How would one compute BC/A EXACTLY without 64 bit support? If all are 32 bit quantities, BC/A also fits into 32 bits, yet computing BC can overflow. Similarly, one cannot compute BC mod A without such support. -- Bob Silverman #include Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"