Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!amdcad!dvorak.amd.com!proton!tim
From: tim@proton.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: IEEE arithmetic
Message-ID: <1991Jun17.231640.3426@dvorak.amd.com>
Date: 17 Jun 91 23:16:40 GMT
References: <MCCALPIN.91Jun16095840@pereland.cms.udel.edu> <3709@charon.cwi.nl> <3710@charon.cwi.nl>
Sender: usenet@dvorak.amd.com (Usenet News)
Reply-To: tim@amd.com (Tim Olson)
Organization: Advanced Micro Devices, Austin, TX
Lines: 51

In article <3710@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
| Oh yes, they exist.  It depends on your programming model.  I take the
| 80287/80387 and the 88100.  They all have a fp control register that defines
| the rounding modes and the exception mask bits.  To change rounding modes
| you can load that register with precalculated values, as the exception mask
| bits change only rarely.  On those machines an interval add takes two load
| control register instructions and two adds, as compared to a single add.
| Cycle times:
| 	load fcr	fp add
| 80287	10		85
| 80387	19		24-32
| 88100	 1		 3
| So on the 80287 and the 88100 interval add is clearly better than 3 times a
| single add.  On the 80387 it is only slightly slower.  If we look at latency
| for the 88100, a single add has a latency of 5 cycles, an interval add a
| latency of 9 cycles.

One thing to remember, however, is that writes to control registers,
such as a floating-point control register, are usually "serializing
operations" in pipelined machines.  This means that the FP pipeline is
drained (every FP operation already in flight must complete) before the
modification of the floating-point control register takes place.

This makes no difference to your timing analysis if result
dependencies already exist in the code stream (so that the FP adds
execute at their result rate, rather than their issue rate), but if
there is enough pipelining/parallelism to allow the FP adds to run at
full issue rate, then interval adds will be significantly more
expensive:

Cycle times:
           load fcr   FP add latency   FP add issue
Am29050        1             3               1

Series of 5 dependent FADDs:
	normal:	  15 cycles
	interval: 40 cycles (2.67 X)

Series of 5 independent FADDs:
	normal:	   5 cycles
	interval: 40 cycles (8 X)
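
The figures above can be reproduced with a small cycle-count model.  This
is only a sketch under stated assumptions, not a simulation of the
Am29050: it assumes each interval add is two (load fcr + FADD) pairs,
that every load fcr waits for the preceding FADD to complete, and that
a run of independent FADDs is counted at one cycle per issue.  The
function and constant names are made up for illustration.

```python
# Cycle figures taken from the Am29050 table above.
FCR_WRITE = 1    # cycles to load the FP control register
ADD_LATENCY = 3  # cycles until an FADD result is available
ADD_ISSUE = 1    # cycles between back-to-back independent FADDs


def normal_cycles(n, dependent):
    """n plain FADDs: result rate if dependent, issue rate otherwise."""
    return n * (ADD_LATENCY if dependent else ADD_ISSUE)


def interval_cycles(n):
    """n interval adds, each being 2 x (FCR write + one FADD).

    Assumption: every FCR write serializes, so each FADD costs its full
    latency whether or not the adds depend on one another.
    """
    return n * 2 * (FCR_WRITE + ADD_LATENCY)


for dependent in (True, False):
    normal = normal_cycles(5, dependent)
    interval = interval_cycles(5)
    kind = "dependent" if dependent else "independent"
    print(f"{kind}: normal {normal}, interval {interval}, "
          f"ratio {interval / normal:.2f}")
```

Note that the interval cost is the same 40 cycles in both cases; only
the baseline it is compared against changes.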

I suppose that careful analysis of the code could identify places where
you can "batch" a series of independent calculations to reduce the
round-mode switching, and thus the serialization penalty, but the
penalty will probably still be much larger than 3X.
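
A rough way to explore the batching idea is to extend the same kind of
cycle count.  The model below is an assumption for illustration, not a
measurement: it treats all adds as fully independent, switches the
rounding mode once per batch for each bound, lets the adds within a
batch issue back to back, and requires the last add of a batch to
complete before the next load fcr.  It ignores register pressure and
any other instructions, so real code would likely do worse.

```python
# Cycle figures taken from the Am29050 table above.
FCR_WRITE = 1    # cycles to load the FP control register
ADD_LATENCY = 3  # cycles until an FADD result is available
ADD_ISSUE = 1    # cycles between back-to-back independent FADDs


def batched_interval_cycles(n, batch):
    """Cycles for n independent interval adds, done `batch` at a time.

    Per batch of k adds: set one rounding mode, issue k FADDs, wait for
    the last one to finish, then repeat with the other rounding mode.
    """
    total = 0
    remaining = n
    while remaining > 0:
        k = min(batch, remaining)
        # k pipelined adds: (k - 1) issue slots, then the last add's latency
        drain = (k - 1) * ADD_ISSUE + ADD_LATENCY
        total += 2 * (FCR_WRITE + drain)
        remaining -= k
    return total


for batch in (1, 2, 5):
    print(f"batch size {batch}: {batched_interval_cycles(5, batch)} cycles")
```

Under these assumptions five interval adds cost 40 cycles unbatched, 28
cycles in batches of two, and 16 cycles as a single batch, against 5
cycles for the plain adds.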

--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)