Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!rpi!leah!bingvaxu!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: int x int -> long for * (or is it 32x32->64)
Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic
Message-ID: <4020@bingvaxu.cc.binghamton.edu>
Date: 14 Sep 90 20:55:46 GMT
References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU> <2118@charon.cwi.nl> <41501@mips.mips.COM>
Reply-To: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Organization: SUNY Binghamton, NY
Lines: 71

Further to my recent posting of a simple benchmark for testing
extended precision arithmetic vv intxint->long multiply,
several people have expressed concerns regarding the veracity
of the program & its design methodology (if any).

Peter Montgomery expressed concerns in private email regarding
the apparently small numbers that were being multiplied and
that my benchmark might be criticized as being ``untypical''
in that regard.  He states he (typically?) performs calculations
on numbers in the 100's of decimal digits; Robert Silverman performs
similar (and, I presume, larger) calculations.
Greg Lindahl (gl8f@atsun7.astro.Virginia.EDU) also raised the
question of whether cacheing, specifically code cacheing,
would affect the results obtained. He was also concerned about
the (possible?) vectorizing of any code that might make
results atypical of those performed by Silverman & others.

I had intended that my benchmark give _worst possible case_ comparisons
between its short & long versions so as to give any possibility of a 10x
slowdown a good chance of success. To this end I used the naive algorithm,
made the program fairly tight, and tried to eliminate (roughly) secondary
effects due to cacheing, memory management, etc.
[Although I used malloc() to get storage for some of the intermediate
results I noted it used only 0.3% of the runtime so left it in -- thanks
for bringing it up O'Keefe].

However, in case there is some doubt about this (now that I look at it
again) somewhat questionable reasoning, I adjusted my benchmark -- by
`offsetting`` the i & j to calculate (slightly) larger factorials so that
the average size of numbers exceeds 1000 decimal digits (1028 digits on
average).

The results for this modified benchmark are as follows:
---------------------------------------------------------------------------
	16x16->32	8x8->16		ratio	original 

VAX 6640
	22.9		79.9		3.49	3.0
	(14.7)		(50.3)		3.42	2.8
VAX 8530
	60.8		197		3.24	3.0
	(58.6)		(198)		3.38	3.2
Sun Sparc 1
	48.2		142.6		2.96	2.8 	(bc -- 181.9!)
	(25.9)		(74.8)		2.89	3.0
SUN 3/60
	91.5		244.2		2.67	2.5
	(45.1)		(123.8)		2.74	2.5
---------------------------------------------------------------------------

To summarize, my estimate that I had produced a ``worst case'' comparison was 
a little off -- all the ratios, except for the Sparc, have crept up marginally 
by going to larger precision calculations. Presuming a quadratic behavior we 
might perhaps expect (on the basis of this _very_ small sample) to have a 10x 
slowdown at around 1M digits or possibly 100,000. I will investigate this 
final possibility (anyone got a Cray with a few Gb for rent?) of 10x slowdown 
and report my findings at a later date.  
[I will not point out my dispassionate attitude toward my own hypotheses (;-) ].

As for possible affects by cacheing/vectorizing, etc., I _should_ investigate
these (we _do_ have an underutilized 3090 out here). A priori I expect
no appreciable change. No promises at this stage.

-Kym Horsell

P.S.	I have been reading this group now for about 6 months; I've been a 
	bit too busy for the last few years to do so. I'm amazed (and perhaps 
	in John Mashey's case, a bit disappointed) that better use isn't made 
	of the facility.
----
I know if you go near a horses ass it might kick you into the water.