Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!amdcad!dvorak.amd.com!proton!tim
From: tim@proton.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: IEEE arithmetic
Message-ID: <1991Jun17.231640.3426@dvorak.amd.com>
Date: 17 Jun 91 23:16:40 GMT
References: <3709@charon.cwi.nl> <3710@charon.cwi.nl>
Sender: usenet@dvorak.amd.com (Usenet News)
Reply-To: tim@amd.com (Tim Olson)
Organization: Advanced Micro Devices, Austin, TX
Lines: 51

In article <3710@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
| Oh yes, they exist.  It depends on your programming model.  I take the
| 80287/80387 and the 88100.  They all have a fp control register that defines
| the rounding modes and the exception mask bits.  To change rounding modes
| you can load that register with precalculated values, as the exception mask
| bits change only rarely.  On those machines an interval add takes two load
| control register instructions and two adds, as compared to a single add.
| Cycle times:
|           load fcr    fp add
| 80287        10         85
| 80387        19        24-32
| 88100         1          3
| So on the 80287 and the 88100 interval add is clearly better than 3 times a
| single add.  On the 80387 it is only slightly slower.  If we look at latency
| for the 88100, a single add has a latency of 5 cycles, an interval add a
| latency of 9 cycles.

One thing to remember, however, is that writes to control registers,
such as a floating-point control register, are usually "serializing
operations" in pipelined machines.  This means that the FP pipeline
will be flushed before the modification of the floating-point control
register takes place.
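
For readers who want to see the sequence being costed here (set rounding mode, add, set rounding mode, add), the following is a minimal sketch only. It uses Python's software `decimal` module as a stand-in for a hardware FP control register, so it illustrates the order of operations and says nothing about cycle counts. The function name `interval_add` and the 4-digit precision are assumptions made for illustration; they do not come from the thread.

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING


def interval_add(a, b, prec=4):
    """Add two intervals (lo, hi) with outward rounding.

    Hypothetical helper: each change of ctx.rounding plays the role of
    one "load control register" instruction in the quoted text.
    """
    with localcontext() as ctx:
        ctx.prec = prec                  # working precision (assumed value)
        ctx.rounding = ROUND_FLOOR       # mode switch 1: round toward -infinity
        lo = a[0] + b[0]                 # add 1: lower bound
        ctx.rounding = ROUND_CEILING     # mode switch 2: round toward +infinity
        hi = a[1] + b[1]                 # add 2: upper bound
    return lo, hi


a = (Decimal("1.234"), Decimal("1.235"))
b = (Decimal("0.00011"), Decimal("0.00012"))
lo, hi = interval_add(a, b)
print(lo, hi)  # prints: 1.234 1.236
```

The exact sums 1.23411 and 1.23512 do not fit in four digits, so the lower bound is rounded down and the upper bound is rounded up, which keeps the true result inside the returned interval.
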
The serialization doesn't make a difference in your timing analysis if
result dependencies already existed in the code stream (so that the FP
adds executed at their result rate, rather than their issue rate), but
if there was enough pipelining/parallelism to allow the FP adds to run
at full issue rate, then interval adds will be significantly more
expensive:

                             FP add
            load fcr     latency   issue
Am29050         1           3        1

Series of 5 dependent FADDs:
        normal:    15 cycles
        interval:  40 cycles (2.67 X)

Series of 5 independent FADDs:
        normal:     5 cycles
        interval:  40 cycles (8 X)

I suppose that careful analysis of the code could point out where you
can "batch" a series of independent calculations to reduce the
round-mode switching, and thus the serialization penalty, but the
penalty is still probably going to be much larger than 3X.

--
        -- Tim Olson
        Advanced Micro Devices
        (tim@amd.com)
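
The Am29050 figures in the post can be reproduced with a small cycle-count model. This is only a sketch of the arithmetic as I read it: a run of dependent adds is charged at the add latency, a run of independent adds at the issue rate (ignoring the final pipeline drain, as the post's 5-cycle figure does), and every interval add pays for two control-register loads plus two adds that must each complete before the next mode switch. The function names are hypothetical.

```python
def normal_cycles(n, latency, issue, dependent):
    """Cycles for n plain FP adds.

    Dependent adds run at the result (latency) rate; independent adds
    run at the issue rate.
    """
    return n * (latency if dependent else issue)


def interval_cycles(n, load_fcr, latency):
    """Cycles for n interval adds when each control-register write serializes.

    Each interval add is (load fcr + one add run to completion), twice.
    """
    return n * 2 * (load_fcr + latency)


# Am29050 numbers from the post: load fcr = 1, FP add latency = 3, issue = 1
LOAD_FCR, LATENCY, ISSUE = 1, 3, 1
N = 5

interval = interval_cycles(N, LOAD_FCR, LATENCY)
for label, dependent in (("dependent", True), ("independent", False)):
    normal = normal_cycles(N, LATENCY, ISSUE, dependent)
    print(f"{label}: normal {normal}, interval {interval}, "
          f"ratio {interval / normal:.2f}")
```

Under these assumptions the model gives 15 versus 40 cycles for the dependent case and 5 versus 40 for the independent case, matching the 2.67 X and 8 X ratios in the post.
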