Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!uw-beaver!cornell!batcomputer!itsgw!steinmetz!uunet!mfci!root From: root@mfci.UUCP (SuperUser) Newsgroups: comp.arch Subject: Re: Performance increase - a suggestion Message-ID: <281@m2.mfci.UUCP> Date: 12 Feb 88 18:05:58 GMT References: <3127@phri.UUCP> <9408@steinmetz.steinmetz.UUCP> Reply-To: mfci!colwell@uunet.UUCP (Robert Colwell) Organization: Multiflow Computer Inc., Branford, CT. 06405 Lines: 64 In article <9408@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes: ]An article by colwell@m6.UUCP (Robert Colwell) says: ]] I don't believe you can do double precision math as fast as single ]] precision math if both are implemented in the same technology. ]]... ]NO. Sorry. But the fastest floating point division and root ]algorithms use Newton-Rapheson iteration, where the time to ]solution is proportional to log2( number_of_bits_of_result ). ]That is, if single-precision takes 4 iterations, double precision ]will take 5, and quad will take 6. ] ]] If we're only discussing addition, subtraction, and multiplication, ]] then I still don't believe it. There's an adder at the heart of each ]] of those, and its width decides its speed -- the wider, the slower ]] (more levels of carry-lookahead). ]]... ]Doubling the width of a carry-select adder adds one gate delay to ]its latency. Carry-lookahead schemes have a similar less-than- ]proportionate penalty for speed up. Subtract is just an add. ]Floating point adds also need de- and re-normalization. ] ]Multipliers are more complex : all the fast stuff uses big ]booth-encoded adder arrays. But the final carry-resolution, ]the justifiaction and normalization logic add signicantly to ]the latency of the multiply. ] ]] Bob Colwell ] ]Bob, you seem a little naive about floating-point hardware. ]It's not all shift-add or shift-compare-subtract anymore. ] ] Dennis O'Connor oconnor@sungoddess.steinmetz.UUCP ?? ] ARPA: OCONNORDM@ge-crd.arpa Whether I'm naive or not isn't relevant to anything. In fact, I don't see where you and I disagreed anywhere above; I said more bits takes longer, and you quantified it. On the other hand, now that you brought it up, I'm with Gideon Yuval on the subject of rounding. As far as I know, you can't get double precision IEEE round-to-nearest sqrt or divide with only 5 iterations of Newton-Rapheson. I believe that gives you enough *accuracy*, meaning all of the mantissa bits are meaningful at that point, but you still won't know if the number you just got from iteration number 5 is the best finite-precision approximation to the real answer (meaning the infinite-precision case) unless you then proceed to another step which can figure that out. Iterative hardware, of course, does one extra step and rounds the mantissa according to whether the guard bit was a 1 or 0 (and if the mantissa then overflows, the exponent must be bumped, and if that overflows...) If you do the N-R in extended precision, obviously the conversion handles the rounding exactly. You might well argue that many programs are not sensitive to noise in the lsb of an IEEE double precision sqrt or divide, and that's certainly true. I have seen, however, at least one program that required absolutely monotonic behavior in the lsb or it would take unacceptably long to converge. And then there are programs which demand IEEE accuracy everywhere. Bob Colwell Multiflow Computer 175 N. Main St. Branford CT 06405 mfci!colwell@uunet.uucp 203-488-6090