Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!uw-beaver!cornell!batcomputer!itsgw!steinmetz!uunet!mfci!root
From: root@mfci.UUCP (SuperUser)
Newsgroups: comp.arch
Subject: Re: Performance increase - a suggestion
Message-ID: <281@m2.mfci.UUCP>
Date: 12 Feb 88 18:05:58 GMT
References: <3127@phri.UUCP> <9408@steinmetz.steinmetz.UUCP>
Reply-To: mfci!colwell@uunet.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford, CT. 06405
Lines: 64

In article <9408@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
]An article by colwell@m6.UUCP (Robert Colwell) says:
]] I don't believe you can do double precision math as fast as single
]] precision math if both are implemented in the same technology.
]]...
]NO. Sorry. But the fastest floating point division and root
]algorithms use Newton-Rapheson iteration, where the time to
]solution is proportional to log2( number_of_bits_of_result ).
]That is, if single-precision takes 4 iterations, double precision
]will take 5, and quad will take 6.
]
]] If we're only discussing addition, subtraction, and multiplication,
]] then I still don't believe it.  There's an adder at the heart of each
]] of those, and its width decides its speed -- the wider, the slower
]] (more levels of carry-lookahead). 
]]...
]Doubling the width of a carry-select adder adds one gate delay to
]its latency. Carry-lookahead schemes have a similar less-than-
]proportionate penalty for speed up. Subtract is just an add.
]Floating point adds also need de- and re-normalization.
]
]Multipliers are more complex : all the fast stuff uses big
]booth-encoded adder arrays. But the final carry-resolution,
]the justifiaction and normalization logic add signicantly to
]the latency of the multiply.
]
]] Bob Colwell
]
]Bob, you seem a little naive about floating-point hardware.
]It's not all shift-add or shift-compare-subtract anymore.
]
]	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
]				ARPA: OCONNORDM@ge-crd.arpa

Whether I'm naive or not isn't relevant to anything.  In fact, I
don't see where you and I disagreed anywhere above;  I said more
bits takes longer, and you quantified it.

On the other hand, now that you brought it up, I'm with Gideon Yuval
on the subject of rounding.  As far as I know, you can't get double
precision IEEE round-to-nearest sqrt or divide with only 5 iterations
of Newton-Rapheson.  I believe that gives you enough *accuracy*,
meaning all of the mantissa bits are meaningful at that point, but
you still won't know if the number you just got from iteration number
5 is the best finite-precision approximation to the real answer
(meaning the infinite-precision case) unless you then proceed to
another step which can figure that out.  Iterative hardware, of
course, does one extra step and rounds the mantissa according to
whether the guard bit was a 1 or 0 (and if the mantissa then
overflows, the exponent must be bumped, and if that overflows...)
If you do the N-R in extended precision, obviously the conversion
handles the rounding exactly.

You might well argue that many programs are not sensitive to noise
in the lsb of an IEEE double precision sqrt or divide, and that's
certainly true.  I have seen, however, at least one program that
required absolutely monotonic behavior in the lsb or it would take
unacceptably long to converge.  And then there are programs which
demand IEEE accuracy everywhere.

Bob Colwell
Multiflow Computer  <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405    mfci!colwell@uunet.uucp  203-488-6090