Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!rpi!leah!bingvaxu!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.arch Subject: Re: int x int -> long for * (or is it 32x32->64) Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic Message-ID: <4020@bingvaxu.cc.binghamton.edu> Date: 14 Sep 90 20:55:46 GMT References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU> <2118@charon.cwi.nl> <41501@mips.mips.COM> Reply-To: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Organization: SUNY Binghamton, NY Lines: 71 Further to my recent posting of a simple benchmark for testing extended precision arithmetic vv intxint->long multiply, several people have expressed concerns regarding the veracity of the program & its design methodology (if any). Peter Montgomery expressed concerns in private email regarding the apparently small numbers that were being multiplied and that my benchmark might be criticized as being ``untypical'' in that regard. He states he (typically?) performs calculations on numbers in the 100's of decimal digits; Robert Silverman performs similar (and, I presume, larger) calculations. Greg Lindahl (gl8f@atsun7.astro.Virginia.EDU) also raised the question of whether cacheing, specifically code cacheing, would affect the results obtained. He was also concerned about the (possible?) vectorizing of any code that might make results atypical of those performed by Silverman & others. I had intended that my benchmark give _worst possible case_ comparisons between its short & long versions so as to give any possibility of a 10x slowdown a good chance of success. To this end I used the naive algorithm, made the program fairly tight, and tried to eliminate (roughly) secondary effects due to cacheing, memory management, etc. [Although I used malloc() to get storage for some of the intermediate results I noted it used only 0.3% of the runtime so left it in -- thanks for bringing it up O'Keefe]. However, in case there is some doubt about this (now that I look at it again) somewhat questionable reasoning, I adjusted my benchmark -- by `offsetting`` the i & j to calculate (slightly) larger factorials so that the average size of numbers exceeds 1000 decimal digits (1028 digits on average). The results for this modified benchmark are as follows: --------------------------------------------------------------------------- 16x16->32 8x8->16 ratio original VAX 6640 22.9 79.9 3.49 3.0 (14.7) (50.3) 3.42 2.8 VAX 8530 60.8 197 3.24 3.0 (58.6) (198) 3.38 3.2 Sun Sparc 1 48.2 142.6 2.96 2.8 (bc -- 181.9!) (25.9) (74.8) 2.89 3.0 SUN 3/60 91.5 244.2 2.67 2.5 (45.1) (123.8) 2.74 2.5 --------------------------------------------------------------------------- To summarize, my estimate that I had produced a ``worst case'' comparison was a little off -- all the ratios, except for the Sparc, have crept up marginally by going to larger precision calculations. Presuming a quadratic behavior we might perhaps expect (on the basis of this _very_ small sample) to have a 10x slowdown at around 1M digits or possibly 100,000. I will investigate this final possibility (anyone got a Cray with a few Gb for rent?) of 10x slowdown and report my findings at a later date. [I will not point out my dispassionate attitude toward my own hypotheses (;-) ]. As for possible affects by cacheing/vectorizing, etc., I _should_ investigate these (we _do_ have an underutilized 3090 out here). A priori I expect no appreciable change. No promises at this stage. -Kym Horsell P.S. I have been reading this group now for about 6 months; I've been a bit too busy for the last few years to do so. I'm amazed (and perhaps in John Mashey's case, a bit disappointed) that better use isn't made of the facility. ---- I know if you go near a horses ass it might kick you into the water.