Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!mcsun!hp4nl!cwi.nl!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: IEEE arithmetic
Message-ID: <3710@charon.cwi.nl>
Date: 17 Jun 91 00:41:30 GMT
References: <3707@charon.cwi.nl> <MCCALPIN.91Jun16095840@pereland.cms.udel.edu> <3709@charon.cwi.nl>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 64

In article <3709@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
Sorry about that one; something went wrong.

In article <MCCALPIN.91Jun16095840@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
 > I see Shearer asking a very simple question: Please provide *specific*
 > information on machines where Winter's algorithms for interval add and
 > interval multiply perform less than 3 times slower than the equivalent
 > simple operation with a single rounding mode.
Oh yes, they exist.  It depends on your programming model.  Take the
80287/80387 and the 88100.  They all have a fp control register that holds
the rounding mode and the exception mask bits.  To change rounding modes
you can load that register with precalculated values, as the exception mask
bits change only rarely.  On those machines an interval add takes two load
control register instructions and two adds, as compared to a single add.
Cycle times:
         load fcr    fp add
80287       10         85
80387       19        24-32
88100        1          3
So on the 80287 and the 88100 interval add is clearly better than 3 times a
single add (190 against 85 cycles, and 8 against 3 cycles).  On the 80387 it
is only slightly worse than that (86-102 against 24-32 cycles, a factor of
about 3.2 to 3.6).  If we look at latency for the 88100, a single add has a
latency of 5 cycles, an interval add a latency of 9 cycles.

So given proper support for change of rounding modes the factor of 3 can
easily be obtained for add.  Multiply requires more than 3 due to compares
and branches.  Still, I would expect a factor of about 3, especially for the
80287 and the 88100.  The main point here is that those machines have separate
fpu control and fpu status registers.  Machines that combine the two
(SPARC, MIPS, RS6000, HPPA, 68k, 32k) will be at a disadvantage because a
change of rounding mode implies fetching the current status/control register
(which in turn implies that the fp execute queue must be empty), changing a
single field and restoring the register.  The situation on the RS6000 is a
bit unclear.  There are instructions that modify only a subfield of the
FPSCR.  So it would have been simple to organize the FPSCR such that a single
instruction would modify only the rounding mode bits.  This ought to be
possible even if instructions are still in progress.  I do not know whether
this is done in reality (I do not even know the layout of the FPSCR, as the
stupid Assembler manual does not give it!).
 > 
 > Then I suppose that the next step is to compare the performance of
 > interval arithmetic with 128-bit arithmetic on a machine like the
 > RS/6000.
Note: 128-bit arithmetic is *not* a replacement for interval arithmetic.
Granted, you get better results in more cases, but the guarantee on your
result is lacking.

I am *not* an advocate for interval arithmetic (the people at Karlsruhe are).
I do not use it.  But I object to the way Shearer handles this:
a.  Shearer asks: what is the justification for the different rounding modes?
b.  Many responses come: interval arithmetic.
c.  Shearer asks: would it not be better served by quad arithmetic?
d.  Response:  observed speed difference is a factor of 3 with hardware
	rounding modes, a factor of 300 in software.
e.  Shearer questions the factor of 3.  Apparently he believes the factor
	of 300 (does he?).
Even if the factor of 3 degraded on other machines to a factor of 5 or even
10, the difference with 300 would still be striking.

I ask Shearer again: come up with an interval add assuming the base arithmetic
is round to nearest only (or even worse, with truncating arithmetic, which
you advocate in another article).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl