Path: utzoo!attcan!uunet!snorkelwacker!apple!agate!linus!linus!bs From: bs@linus.mitre.org (Robert D. Silverman) Newsgroups: comp.arch Subject: Re: int x int -> long for * (or is it 32x32->64) Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic Message-ID: <119977@linus.mitre.org> Date: 13 Sep 90 13:51:59 GMT References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU> Organization: The MITRE Corporation, Bedford MA Lines: 76 In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes: >In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes: >>I'm confused about the following exchange: >> >>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: stuff deleted. >>How about posting the benchmark, so we can understand what's happening? >>It might especially be good to post a sample of the disassembled code that's >>supposed to be doing 32x32->64, if that is truly supposed to be happening. > > I do many of the same types of computations as Silverman >(and, like he, some of my processes use a month of CPU time or more). >But I dispute his 10X figure. Here is a benchmark which tries >to compute the remainders i*j mod p where 0 <= i, j < p, using three >algorithms expressible in C and also an assembly code >algorithm (which, on the SUN 3, uses the 64-bit multiply and divide >instructions). Those instructions perform five times as fast as Even Peter reports a 5 fold speed increase. One difference between his code and mine that would tend to exaggerate the difference is that he codes the control loops directly in assembler [for his multiply routines] whereas I write my in C with low level calls to assembler routines for the 64 bit arithmetic. When I go from 32 to 64 bits, I cut my calls to these routines by a factor of 4. For me to reduce to 32 bit arithmetic is therefore more expensive. This applies especially to the VAX because the 'calls' instruction generated by the C compiler is a real pig. Secondly, he does his modular multiplication by a method that avoids the division instruction entirely (at the cost of some additional multiplies). I do mine by multiplication followed by long division [slower, but slightly more general]. Since division is slower than multiplication, the effect of going from 64/32 --> 32 to 32/16 --> 16 enhances the speed difference. My claim of 10x varies from machine to machine. On some machines [a microVAX] It has been as high as a factor of 13. On others as low as about 6. On one machine, [An Alliant] I observed a 20% speed difference just from cache effects. Storing in radix 2^30 versus 2^15 halves the storage requirements. I agree that my measurements DO NOT measure JUST the raw speed difference in arithmetic that one obtains from going from 32 to 64 bit multiplies and divides. I also agree that if I were to inline my calls to the low level assembler routines, the effect would be lessened. However, there can't be any doubt that there is a SIGNIFICANT difference between 32 and 64 bit arithmetic when doing multiple precision. The Sparc shows little difference because it has NO multiply or divide instruction for even 32 bits. It comes down to what we are measuring. I am measuring the overall performance benefit for large programs with lots of data including caching, subroutine calls etc. Peter's benchmarks tend to measure more closely just the speed difference in arithmetic. And in that respect, I agree with his assessments. If one had higher level language support for a 'long long' temporary data type so one could compute AB/C, AB mod C, (A << 30 + B)/C, (AB + C) mod 2^30, etc. directly in C, my claim for a 10x speed improvement would come down quite a bit. One thing is clear throughout ALL of this. Machines like the SPARC which provide NO integer multiply/divide are not worth much for this type of work. This is why I question ALUs like the i860 which also has no integer multiply/divide, yet does have a fancy graphics unit. -- Bob Silverman #include Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"