Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!rpi!leah!bingvaxu!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <4037@bingvaxu.cc.binghamton.edu>
Date: 16 Sep 90 18:57:34 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <513@abccam.abcl.co.uk>
Reply-To: kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell)
Organization: SUNY Binghamton, NY
Lines: 65

In article <513@abccam.abcl.co.uk> pete@abccam.abcl.co.uk (Peter Cockerell) writes:
\\\
>The benchmark time for the case when LONG and SHORT are both defined
>to be int (ie the natural length for the processor) is 0.4s!
>			
>Or am I missing something...?

Maybe. What is happening to the high-order part of the 32-bit product?
It's lost; your benchmark isn't performing the same function as mine.
[And any difference is due to memory accessing effects -- what a difference
tho'!].
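
A small sketch of the point about the lost high-order half, in modern C.
This is not the benchmark itself; the function names and the use of
<stdint.h> fixed-width types are my own assumptions, chosen only to make
the widths explicit.

```c
#include <stdint.h>

/* 32x32->32: what you get when both operand and result types are the
   same width. Unsigned arithmetic wraps modulo 2^32, so the high-order
   half of the true product is silently discarded. */
uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* 32x32->64: widen one operand before multiplying so the full
   double-size product is kept. */
uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

For example, multiplying 0x10000 by 0x10000 gives 0 from mul_lo32 but
0x100000000 from mul_full64, so the two versions are not doing the same
job.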

For those that don't want (yet another?) clarification of what I'm trying to
get at plz `n' here.
[I can't help it & doctors can't help-- working in a college environment
causes ``lecture latchup''.  But it _does_ help to clarify my own `ideas'].

To reiterate, I wish to measure the difference between performance
of XP software with & without the convenience of having multiply
produce a ``double size'' product.

Some folks argue that having a multiply that gives a double product
is crucial to efficient running of their XP software. The question
then is, _how much_ is it worth (in terms of area, running time, etc).

To this end I've released this lil' program for any interested party
to measure on their available h/w (and _I'm_ interested in the
results 4 sure).

The program attempts to perform one of the things that tend
to take time in XP calcs -- big multiplies -- and I have adopted the naive
``pencil & paper'' method because (a) it is still used a lot (see a lot of
LISP ``bignum'' packages for one thing), and (b) it has a _high dynamic
density_ of machine-level multiply operations vv adds and shifts.
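
For readers who want to see the shape of the ``pencil & paper'' method,
here is a minimal sketch, assuming 16-bit digits stored least significant
first and a 16x16->32 multiply. It is not the released program; the name
xp_mul and the digit layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiply: r[0 .. na+nb-1] = a[0 .. na-1] * b[0 .. nb-1].
   Digits are base 2^16, least significant first.
   r must have room for na+nb digits and must not overlap a or b. */
void xp_mul(uint16_t *r, const uint16_t *a, size_t na,
            const uint16_t *b, size_t nb)
{
    size_t i, j;

    for (i = 0; i < na + nb; i++)
        r[i] = 0;

    for (i = 0; i < na; i++) {
        uint32_t carry = 0;
        for (j = 0; j < nb; j++) {
            /* One machine multiply per inner step. The 16x16 product
               plus two 16-bit addends cannot exceed 2^32 - 1. */
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)(t & 0xFFFFu);
            carry = t >> 16;
        }
        r[i + nb] = (uint16_t)carry;  /* top digit of this row */
    }
}
```

The inner loop is one multiply, two adds, a mask and a shift per digit
pair, which is where the high density of multiplies comes from.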

Now, since double-sized products are not universal, I have to
``guesstimate'' their loss on some architectures. The way I have
chosen to do this is to perform some calculations using 32x32->32 and
16x16->16 arithmetic (where available). On machines where native
16x16->16 _isn't_ available we have a bit of a problem (not to mention
machines that don't have h/w multiply in _any_ form); but it's
still useful to have some numbers for these machines anyway.
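
To give a feel for what the missing double-size product costs, here is
one common way to build a full 32x32->64 result out of 32-bit operations
only. This is a sketch under my own assumptions (the name mul32_split and
the hi/lo output convention are hypothetical) and is not necessarily how
the benchmark models the loss.

```c
#include <stdint.h>

/* Full 64-bit product of two 32-bit values using only 32-bit
   arithmetic. Operands are split into 16-bit halves, so each of the
   four partial products fits in 32 bits. Result is returned as hi:lo. */
void mul32_split(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint32_t p00 = a0 * b0;   /* weight 2^0  */
    uint32_t p01 = a0 * b1;   /* weight 2^16 */
    uint32_t p10 = a1 * b0;   /* weight 2^16 */
    uint32_t p11 = a1 * b1;   /* weight 2^32 */

    /* Middle column: three 16-bit quantities, so no 32-bit overflow. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);

    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}
```

That is four multiplies plus a handful of shifts, masks and adds in place
of a single instruction on a machine with a double-size product.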

O'Keefe has concentrated on computing factorials -- and this _may_
be a good idea; the density of multiplies may be higher than
the program I released. However, the first set of figures I posted
was based on the same idea (although I didn't _actually_ use
any machine-level support for it) and the differences between
16- and 32-bit versions weren't as large as I thought they _might_
be in other contexts -- hence the _second_ program.
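
For comparison, a factorial-style workload looks roughly like the sketch
below: an XP number is repeatedly multiplied by a single small digit.
This is not O'Keefe's code or my earlier program; the names, the 16-bit
digit size and the caller-supplied buffer are all assumptions made for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply the n-digit number x (base 2^16, least significant first)
   by the single digit k, in place. Returns the new digit count.
   x must have room for n+1 digits. */
size_t xp_mul_small(uint16_t *x, size_t n, uint16_t k)
{
    uint32_t carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* 16x16->32 product plus a 16-bit carry fits in 32 bits. */
        uint32_t t = (uint32_t)x[i] * k + carry;
        x[i] = (uint16_t)(t & 0xFFFFu);
        carry = t >> 16;
    }
    if (carry != 0)
        x[n++] = (uint16_t)carry;
    return n;
}

/* Compute m! into x and return the number of digits used.
   Assumes m <= 65535 and that the caller made x large enough. */
size_t xp_factorial(uint16_t *x, unsigned m)
{
    size_t n = 1;
    unsigned k;

    x[0] = 1;
    for (k = 2; k <= m; k++)
        n = xp_mul_small(x, n, (uint16_t)k);
    return n;
}
```

Here every inner step is a multiply, but one operand is always a single
digit, so the work grows differently from the full digit-by-digit case.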

O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_
an _actual_ double-sized product is important to note. Maybe
I'll go back and _actually_ insert this into my program.
However, it's only _possible_ for machines with the actual
h/w support.

Summary -- O'Keefe has raised doubts (my own included) over
the actual speedup between 32x32->32 and 32x32->64, so I'm going back
to the bench (but not for a rest). Unrolling the loops is
an experiment that _has_ been suggested by several people.

Tnx to all who are participating.

-Kym Horsell