Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!snorkelwacker!apple!julius.cs.uiuc.edu!zaphod.mps.ohio-state.edu!rpi!leah!bingvaxu!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.arch Subject: Re: benchmark for evaluating extended precision Keywords: extended precision,multiply,benchmark,arithmetic Message-ID: <3994@bingvaxu.cc.binghamton.edu> Date: 12 Sep 90 20:10:24 GMT References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org> Reply-To: kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) Organization: SUNY Binghamton, NY Lines: 78 In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes: \\\ >You are totally confused. Nowhere in your benchmark did I see any assembler >code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 >[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64 >would allow you to work in radix 2^30 or 2^31 [depending on how careful >you want to be about sign bit and 1's complement/2's complement portability >problems]. Change of radix ALONE would account for a factor of 4 FEWER >multiplies. Secondly, you are working with tiny examples. Working in a >larger radix in a REAL WORLD PROGRAM means less storage required for each >multi-precise variable, less loop overhead, fewer subroutine calls, fewer >cache misses, etc. \\\ Funny, I don't _fell_ confused. Some of the points you make _are_ valid in some sense, but we're diverging from the topic. My whole point is that 32x32->64 is not _required_ in order to support extended precision arithmetic. There are various definitions of _required_ here: ``logically required'', ``convenience'', etc. Obviously the operation isn't _logically_ required, and as I've pointed out, its introduction poses some logical _problems_. With this benchmark, I'm also trying to show that _convenience_ is also not an issue (esp. in the light of the ``bottom line'', which is what (almost?) everything boils down to, whether we like it or not). The fact that I'm doing multiply naively is in _favour_ of your argument that ``[without 32x32->64 high prec arithmetic will] run 10 times slower''. If I had written this code using _better algorithms_ the difference between using 16x16->16 and 32x32->32 would be _even less_, thereby weakening your claim further. Consider, we all know of fairly simply divide-and-conquer high-prec multiply with complexity of O(n^lg(3)). (Actually we _can_ do them in almost linear time, but the cost isn't worth it yet). Instead of using 32x32->64 arithmetic we _could_ use 32x32->32 arithmetic and only use the low-order bits for multiply. If we do this we must perform 4 times more ``small'' multiplications to perform the ``big'' one; but each of these ``small'' multiplies is performed on, effectively, 16-bit operands and should therefore be several times faster. (Checks for effectively multiplying by zero are _almost_ universal). In the limit, see above, this would work out at each multiply (2)^lg(3) faster / 4 times more multiplies = 3/4 of ``nxn->2n'' speed That is, *asymptotically* you only lose 25% in performance by using ``proper'' 32-bit arithmetic (to *simulate* 16x16->32) over having the ``convenience'' operation of 32x32->64 multiply. A 25% reduction in speed is not significant enough, in my accounting, to justify any _extra_ costs required to maintain the 32x32->64 crock in a HLL (_despite_ the fact that, historically, ``that's the way the harware does it anyway''). However, we can say 32 bits isn't close enough to the asymptote to justify any claims -- hence a little benchmark. Considering the amount of extended precision arithmetic that is performed anyway vis a vie fixed-point or the normal floating-point hackery, the the _theoretical_ (if you like) slowdown of 25% shown above, and in the light of _actual measurements_ of factors of 2-3 using naive algorithms, I remain unconvinced that 32x32->64 is required; the folks who have dropped it off their list of things to implement are justified I think. -Kym Horsell