Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site osiris.UUCP
Path: utzoo!linus!philabs!cmcl2!seismo!umcp-cs!aplvax!osiris!phil
From: phil@osiris.UUCP (Philip Kos)
Newsgroups: net.micro,net.micro.68k,net.micro.16k
Subject: Re: Floating Point Comparisons
Message-ID: <227@osiris.UUCP>
Date: Mon, 8-Apr-85 10:45:07 EST
Article-I.D.: osiris.227
Posted: Mon Apr  8 10:45:07 1985
Date-Received: Wed, 10-Apr-85 00:02:36 EST
References: <133@cfa.UUCP> <6370@boring.UUCP> <2574@nsc.UUCP>
Distribution: net
Organization: Johns Hopkins Hospital
Lines: 109
Xref: linus net.micro:8625 net.micro.68k:647 net.micro.16k:287


Into the fray!

By my own reasoning (which can be pretty convoluted and obscure,
when it's not just plain perverted, so feel free to  correct me),
the slower FADD/FSUB times are strange but not inexplicable.
Several people have already posted a reasonable (and, I assume,
correct) explanation of this phenomenon.

The required denormalization of the smaller addend and ultimate
normalization of the sum are obviously the culprits here.  But
the question remains:  is this reasonable?

Here, for your edification, is a review of how floating point
arithmetic actually works.  I haven't included division because
it's an operation of a different color; algorithms for speeding
it up quite unreasonably exist and aren't particularly hard to
implement, and I don't know what algorithms different FP chip
makers use.


FADD/FSUB

Arguments:  two 32-bit (or 64-bit for DADD/DSUB) addends.

Result:  one 32-bit (or 64-bit) sum.

Algorithm:
	1.  Calculate difference of exponents.
	2.  Shift smaller (in absolute magnitude) addend to the
		right until it aligns with the larger addend (can
		use the exponent difference as a counter preset).
	3.  Add.
	4.  Truncate and normalize sum.

Notes:
	If the difference of exponents is greater than mantissa
	mantissa resolution (24 bits for single precision, 48 for
	double), no addition or subtraction needs to be done.  The
	sum in these cases is simply the larger addend.  Thus, no
	more than 24 (or 48) shifts need be done (allowing for 1
	check bit).

	Maximum time = (max denormalization time) + (addition time)
	+ (max normalization time) = ((exponent subtraction time)
	+ (time for 24 (or 48) shifts)) + (addition time) + ((time
	for 23 (or 47) shifts) + (exponent adjust time))

	Using a barrel shifter to align addends reduces the max.
	denormalization time by changing the order of the algorithm
	from O(n) to O(log2(n)).


FMUL

Arguments:  32-bit (or 64-bit for DMUL) multiplier and multiplicand.

Result:  one 64-bit (or 128-bit) product, only half of which is
	usually used.

Algorithm:
	1.  Add multiplier and multiplicand exponents to generate
		product exponents.
	2.  Multiply by repeated addition (there may be special
		hardware for this, otherwise it's shift and con-
		ditionally add, shift and conditionally add, etc.)
	3.  Truncate and normalize product.

Notes:
	Multiplication by repeated addition is an O(n) algorithm.
	If a special shift/add unit is used, multiply time is reduced
	because the total time is based on gate delays rather than
	bucket shifter clock cycles.

	Maximum time = (exponent addition time) + (multiplication
	time) + (normalization time) = (exponent addition time) +
	(24 (48) * (time for one add and shift)) + (one shift and
	exponent adjust)


Conclusions:
	As has been noted, normalizing a product is almost always
	quicker than normalizing a sum.  This is because in multi-
	plication you begin with two normalized numbers, which will
	yield a product needing at most 1 normalization shift.  FADD,
	on the other hand, may need 23 (47) normalization shifts
	because of leading-bit cancellation.

	If a barrel shifter is available, the initial denormalization
	for FADD could be reduced significantly.  A barrel shift may
	be as fast as a single-bit bucket shift.  This would improve
	worst case performance by 23 (47) bucket shift clock cycles.
	It would probably improve the average FADD times enough to
	make it a faster operation than FMUL.


I am surprised and dismayed that many commercial FPUs do not have
barrel shifters.  Granted, it's extra complexity which may not be
justified by the *overall* performance increase, but we're only
talking about a circuit with 48*log2(48) = ~288 multiplexors (not
counting the 5 intermediate registers), which may be a significant
chunk of the available silicon, but at today's densities shouldn't
add that much complexity to the circuit.

Am I asking for too much?  (I do enjoy having my cake and eating it
too...)

					Phil Kos
					The Johns Hopkins Hospital