Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!munnari.oz.au!goanna!ok
From: ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <3794@goanna.cs.rmit.oz.au>
Date: 20 Sep 90 08:20:18 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <513@abccam.abcl.co.uk> <4037@bingvaxu.cc.binghamton.edu>
Organization: Comp Sci, RMIT, Melbourne, Australia
Lines: 45

In article <4037@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The program attempts to perform one of the things that tend
> to take time in XP calcs -- big multiplies -- and have adopted the naive

Horsell's benchmark had two parts:  part 1 calculated a bunch of factorials,
then part 2 added and multiplied them.  I noticed that the code to add two
bignums was wrong ("cry" is one bit too short...), so concentrated on part 1.
part 1 does a lot of bigit * bignum multiplies, except that it turns out to
use very small bigits.  That's why I looked at factorials, because that was
part of Horsell's benchmark, and it was possible to use a technique for
factorials which _is_ similar to bignum multiplication.

> O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_
> an _actual_ double-sized product is important to note.

It turned out that we have GCC here.  That makes a difference.
New figures:

Factorial test (NS32532, Encore Multimax 4.3BSD)

version	mult.size	lang	User t	System	Ratio
Horsell "8x8 -> 16"	cc	311.2	1.9	1.68
Horsell	"16x16 -> 32"	cc	228.8	2.6	1.23
My code "32x32 -> 64"	as	 52.6	0.5   1/3.52

Horsell "8x8->16"	gcc	264.1	1.3	1.43
Horsell "16x16->32"	gcc	185.1	0.8	1.00
My code "32x32->64"	as	 52.4	0.3   1/3.53		(.aligned)

So compared with Horsell's "DOUBLE" version compiled by a rather better
compiler, assembly code using 32x32->64 unsigned multiplication gets a
3.5ish speedup.  (The .aligned note refers to the trick of aligning branch
targets on word boundaries.  On machines with variable size instructions
that can often help.  Here it didn't, not to speak of.)

A figure of about 3.5 sounds right:   going from 16-bit to 32-bit chunks
we expect to do 1/4 the number of multiplications, and those multiplications
may be a wee bit slower.

It's worth noting that Horsell's representation for bignums is hairy.
I just used pointer-to-(length, bigit 0, bigit 1, ...) which amongst
other things let me use add-compare-and-branch.

-- 
Heuer's Law:  Any feature is a bug unless it can be turned off.