Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uwm.edu!bionet!agate!linus!linus!bs
From: bs@linus.mitre.org (Robert D. Silverman)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <119858@linus.mitre.org>
Date: 12 Sep 90 11:33:59 GMT
References: <3989@bingvaxu.cc.binghamton.edu>
Reply-To: bs@linus.UUCP (Robert D. Silverman)
Organization: The MITRE Corporation, Bedford MA
Lines: 47

In article <3989@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>
>After a number of private communications, I've managed to render
>one of the little benchmarks I have presentable enough to post,
>along with some performance figures from different machines
>(basically, its put up or shut up time).
>
>It has been claimed that a lack of 32x32->64 multiplication
>makes a factor of 10 difference in the running time of
>typical extended precision arithmetic routines. Although it
>obviously makes _a_ difference in run time I do not measure
>an order of magnitude difference.

some silly benchmark deleted....


You are totally confused. Nowhere in your benchmark did I see any assembler
code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 
[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
would allow you to work in radix 2^30 or 2^31 [depending on how careful
you want to be about sign bit and 1's complement/2's complement portability
problems]. Change of radix ALONE would account for a factor of 4 FEWER
multiplies. Secondly, you are working with tiny examples. Working in a
larger radix in a REAL WORLD PROGRAM means less storage required for each
multi-precise variable, less loop overhead, fewer subroutine calls, fewer
cache misses, etc.

Try implementing (say) a prime proving algorithm, or an RSA encryption
algorithm, or a multi-precise FFT over the rationals or anything but
a tinkertoy program with just 1 or two multiprecise numbers where
everything fits in cache.

Furthermore, working in a power of two radix in performing 32 x 32--> 64
allows one to carefully implement some optimizations (in assembler) that
are not readily available in C.

By the way, 32 x 32 --> 64 can be needed for OTHER than just multiple
precision support. Suppose A > B and A > C. How would one compute
BC/A EXACTLY without 64 bit support? If all are 32 bit quantities,
BC/A also fits into 32 bits, yet computing BC can overflow. Similarly,
one cannot compute BC mod A without such support.

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"