Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!mcsun!hp4nl!cwi.nl!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: IEEE arithmetic
Message-ID: <3710@charon.cwi.nl>
Date: 17 Jun 91 00:41:30 GMT
References: <3707@charon.cwi.nl> <3709@charon.cwi.nl>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 64

In article <3709@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
Sorry about that one; something went wrong.

In article mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
 > I see Shearer asking a very simple question:  Please provide *specific*
 > information on machines where Winter's algorithms for interval add and
 > interval multiply perform less than 3 times slower than the equivalent
 > simple operation with a single rounding mode.

Oh yes, they exist.  It depends on your programming model.  Take the
80287/80387 and the 88100.  They all have an fp control register that
defines the rounding modes and the exception mask bits.  To change
rounding modes you can load that register with precalculated values, as
the exception mask bits change only rarely.  On those machines an
interval add takes two load control register instructions and two adds,
as compared to a single add.  Cycle times:

           load fcr    fp add
  80287       10         85
  80387       19        24-32
  88100        1          3

So on the 80287 and the 88100 an interval add costs clearly less than 3
times a single add.  On the 80387 it is only slightly more than 3 times.
If we look at latency for the 88100, a single add has a latency of 5
cycles, an interval add a latency of 9 cycles.

So given proper support for change of rounding modes, the factor of 3
can easily be obtained for add.  Multiply requires more than 3 due to
compares and branches.  Still, I would expect a factor of about 3,
especially for the 80287 and the 88100.

The main point here is that those machines have a separate fpu control
register and fpu status register.
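
The arithmetic behind those factors can be checked with a minimal sketch.
It assumes the cost of an interval add is simply two control register
loads plus two adds, using the cycle times quoted above.  Python's
decimal contexts stand in for the two precalculated control register
values; the names cost_ratio, interval_add, ROUND_DOWN and ROUND_UP and
the 4-digit precision are illustrative assumptions, not anything from
the machines discussed.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Cycle counts (load fcr, fp add) copied from the figures quoted above.
CYCLES = {
    "80287": (10, 85),
    "80387 (fastest add)": (19, 24),
    "80387 (slowest add)": (19, 32),
    "88100": (1, 3),
}

def cost_ratio(load_fcr, fp_add):
    """Cycles for an interval add (two loads + two adds) per plain add."""
    return (2 * load_fcr + 2 * fp_add) / fp_add

# Stand-ins for the two precalculated control register values.
# Assumption: 4 significant digits, chosen only to make rounding visible.
ROUND_DOWN = Context(prec=4, rounding=ROUND_FLOOR)
ROUND_UP = Context(prec=4, rounding=ROUND_CEILING)

def interval_add(a, b):
    """Add intervals a = (lo, hi) and b = (lo, hi), rounding outward."""
    lo = ROUND_DOWN.add(a[0], b[0])  # "load fcr" (round down), then add
    hi = ROUND_UP.add(a[1], b[1])    # "load fcr" (round up), then add
    return lo, hi

if __name__ == "__main__":
    for machine, (load_fcr, fp_add) in CYCLES.items():
        print(f"{machine}: {cost_ratio(load_fcr, fp_add):.2f}")
    x = (Decimal("1.234"), Decimal("1.235"))
    y = (Decimal("0.0001234"), Decimal("0.0001235"))
    print(interval_add(x, y))
```

The two context switches per interval add mirror the two control
register loads: the lower bound is rounded toward minus infinity and the
upper bound toward plus infinity, so the true sum always lies inside the
result.
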
Machines that combine the two (SPARC, MIPS, RS6000, HPPA, 68k, 32k) will
be at a disadvantage, because a change of rounding mode implies fetching
the current status/control register (which in turn implies that the fp
execute queue must be empty), changing a single field and restoring the
register.

The situation on the RS6000 is a bit unclear.  There are instructions
that modify only a subfield of the FPSCR.  So it would have been simple
to organize the FPSCR such that a single instruction would modify only
the rounding mode bits.  This ought to be possible even if instructions
are still in progress.  I do not know whether this is done in reality
(I do not even know the layout of the FPSCR, as the stupid Assembler
manual does not give it!).

 > Then I suppose that the next step is to compare the performance of
 > interval arithmetic with 128-bit arithmetic on a machine like the
 > RS/6000.

Note: 128-bit arithmetic is *not* a replacement for interval arithmetic.
Granted, you get better results in more cases, but the guarantee on your
result is lacking.

I am *not* an advocate for interval arithmetic (the people at Karlsruhe
are).  I do not use it.  But I object to the way Shearer handles this:
a.  Shearer asks: what is the justification for the different rounding
    modes?
b.  Many responses come: interval arithmetic.
c.  Shearer asks: would it not be better helped with quad arithmetic?
d.  Response: the observed speed difference is a factor of 3 with
    hardware rounding modes, a factor of 300 in software.
e.  Shearer questions the factor of 3.
Apparently he believes the factor of 300 (does he?).  Even if the factor
of 3 were to degrade on other machines to a factor of 5 or even 10, the
difference with 300 is still striking.

I ask Shearer again: come up with an interval add assuming the base
arithmetic is round to nearest only (or even worse, with truncating
arithmetic, which you advocate in another article).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl