Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.2 9/18/84; site osiris.UUCP Path: utzoo!linus!philabs!cmcl2!seismo!umcp-cs!aplvax!osiris!phil From: phil@osiris.UUCP (Philip Kos) Newsgroups: net.micro,net.micro.68k,net.micro.16k Subject: Re: Floating Point Comparisons Message-ID: <227@osiris.UUCP> Date: Mon, 8-Apr-85 10:45:07 EST Article-I.D.: osiris.227 Posted: Mon Apr 8 10:45:07 1985 Date-Received: Wed, 10-Apr-85 00:02:36 EST References: <133@cfa.UUCP> <6370@boring.UUCP> <2574@nsc.UUCP> Distribution: net Organization: Johns Hopkins Hospital Lines: 109 Xref: linus net.micro:8625 net.micro.68k:647 net.micro.16k:287 Into the fray! By my own reasoning (which can be pretty convoluted and obscure, when it's not just plain perverted, so feel free to correct me), the slower FADD/FSUB times are strange but not inexplicable. Several people have already posted a reasonable (and, I assume, correct) explanation of this phenomenon. The required denormalization of the smaller addend and ultimate normalization of the sum are obviously the culprits here. But the question remains: is this reasonable? Here, for your edification, is a review of how floating point arithmetic actually works. I haven't included division because it's an operation of a different color; algorithms for speeding it up quite unreasonably exist and aren't particularly hard to implement, and I don't know what algorithms different FP chip makers use. FADD/FSUB Arguments: two 32-bit (or 64-bit for DADD/DSUB) addends. Result: one 32-bit (or 64-bit) sum. Algorithm: 1. Calculate difference of exponents. 2. Shift smaller (in absolute magnitude) addend to the right until it aligns with the larger addend (can use the exponent difference as a counter preset). 3. Add. 4. Truncate and normalize sum. Notes: If the difference of exponents is greater than mantissa mantissa resolution (24 bits for single precision, 48 for double), no addition or subtraction needs to be done. The sum in these cases is simply the larger addend. Thus, no more than 24 (or 48) shifts need be done (allowing for 1 check bit). Maximum time = (max denormalization time) + (addition time) + (max normalization time) = ((exponent subtraction time) + (time for 24 (or 48) shifts)) + (addition time) + ((time for 23 (or 47) shifts) + (exponent adjust time)) Using a barrel shifter to align addends reduces the max. denormalization time by changing the order of the algorithm from O(n) to O(log2(n)). FMUL Arguments: 32-bit (or 64-bit for DMUL) multiplier and multiplicand. Result: one 64-bit (or 128-bit) product, only half of which is usually used. Algorithm: 1. Add multiplier and multiplicand exponents to generate product exponents. 2. Multiply by repeated addition (there may be special hardware for this, otherwise it's shift and con- ditionally add, shift and conditionally add, etc.) 3. Truncate and normalize product. Notes: Multiplication by repeated addition is an O(n) algorithm. If a special shift/add unit is used, multiply time is reduced because the total time is based on gate delays rather than bucket shifter clock cycles. Maximum time = (exponent addition time) + (multiplication time) + (normalization time) = (exponent addition time) + (24 (48) * (time for one add and shift)) + (one shift and exponent adjust) Conclusions: As has been noted, normalizing a product is almost always quicker than normalizing a sum. This is because in multi- plication you begin with two normalized numbers, which will yield a product needing at most 1 normalization shift. FADD, on the other hand, may need 23 (47) normalization shifts because of leading-bit cancellation. If a barrel shifter is available, the initial denormalization for FADD could be reduced significantly. A barrel shift may be as fast as a single-bit bucket shift. This would improve worst case performance by 23 (47) bucket shift clock cycles. It would probably improve the average FADD times enough to make it a faster operation than FMUL. I am surprised and dismayed that many commercial FPUs do not have barrel shifters. Granted, it's extra complexity which may not be justified by the *overall* performance increase, but we're only talking about a circuit with 48*log2(48) = ~288 multiplexors (not counting the 5 intermediate registers), which may be a significant chunk of the available silicon, but at today's densities shouldn't add that much complexity to the circuit. Am I asking for too much? (I do enjoy having my cake and eating it too...) Phil Kos The Johns Hopkins Hospital