Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!rpi!leah!bingvaxu!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <4037@bingvaxu.cc.binghamton.edu>
Date: 16 Sep 90 18:57:34 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <513@abccam.abcl.co.uk>
Reply-To: kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell)
Organization: SUNY Binghamton, NY
Lines: 65

In article <513@abccam.abcl.co.uk> pete@abccam.abcl.co.uk (Peter Cockerell) writes:
\\\
>The benchmark time for the case when LONG and SHORT are both defined
>to be int (ie the natural length for the processor) is 0.4s!
>
>Or am I missing something...?

Maybe. What is happening to the high-order part of the 32-bit product?
It's lost; your benchmark isn't performing the same function as mine.
[And any difference is due to memory accessing effects -- what a
difference tho'!].

For those that don't want (yet another?) clarification of what I'm
trying to get at plz `n' here. [I can't help it & doctors can't help--
working in a college environment causes ``lecture latchup''. But it
_does_ help to clarify my own `ideas'].

To reiterate, I wish to measure the difference between performance of
XP software with & without the convenience of having multiply produce
a ``double size'' product. Some folks argue that having a multiply that
gives a double product is crucial to efficient running of their XP
software. The question then is, _how much_ is it worth (in terms of
area, running time, etc). To this end I've released this lil' program
for any interested party to measure on their available h/w (and _I'm_
interested in the results 4 sure).

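To make the ``lost high-order part'' concrete, here is a minimal sketch in C. It is not taken from the posted benchmark: the function names are hypothetical, and fixed-width types from <stdint.h> stand in for the SHORT and LONG macros. It only shows what a same-width multiply throws away compared with a double-size one.

```c
#include <stdint.h>

/* Hypothetical helper names; not part of the posted benchmark. */

/* 16x16->16: the result type is no wider than the operands,
   so bits 16..31 of the true product are discarded. */
uint16_t mul_narrow(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);
}

/* 16x16->32: widen before multiplying so the full
   double-size product survives. */
uint32_t mul_wide(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* The high half that the narrow multiply loses. In a
   pencil-and-paper bignum multiply this is the part that
   would have to be carried into the next digit. */
uint16_t lost_high(uint16_t a, uint16_t b)
{
    return (uint16_t)(mul_wide(a, b) >> 16);
}
```

With a = b = 300 the true product is 90000: mul_wide gives 90000, mul_narrow gives 24464 (90000 mod 65536), and lost_high gives 1. A loop built on the narrow form does the same number of multiplies but computes a different result, which is why its timing is not comparable.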
The program attempts to perform one of the things that tend to take
time in XP calcs -- big multiplies -- and I have adopted the naive
``pencil & paper'' method because (a) it is still used a lot (see a lot
of LISP ``bignum'' packages for one thing), and (b) it has a _high
dynamic density_ of machine-level multiply operations vs adds and
shifts.

Now, since double-sized products are not universal, I have to
``guesstimate'' their loss on some architectures. The way I have chosen
to do this is to perform some calculations using 32x32->32 and
16x16->16 arithmetic (where available). On machines where native
16x16->16 _isn't_ available we have a bit of a problem (not to mention
machines that don't have h/w multiply in _any_ form); but it's still
useful to have some numbers for these machines anyway.

O'Keefe has concentrated on computing factorials -- and this _may_ be a
good idea; the density of multiplies may be higher than in the program
I released. However, the first set of figures I posted was based on the
same idea (although I didn't _actually_ use any machine-level support
for it) and the differences between 16 and 32-bit versions weren't as
large as I thought they _might_ be in other contexts -- hence the
_second_ program.

O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_ an
_actual_ double-sized product is important to note. Maybe I'll go back
and _actually_ insert this into my program. However, it's only
_possible_ for machines with the actual h/w support.

Summary -- O'Keefe has raised doubts (my own included) over the actual
speedup between 32x32->32 and 32x32->64, so I'm going back to the bench
(but not for a rest). Unrolling the loops is an experiment that _has_
been suggested by several people.

Tnx to all who are participating.

-Kym Horsell