Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!labrea!aurora!ames!lll-tis!lll-lcc!pyramid!prls!mips!rowen
From: rowen@mips.UUCP (Chris Rowen)
Newsgroups: comp.arch
Subject: Re: MIPS Floating Point processor.
Message-ID: <627@dumbo.UUCP>
Date: Wed, 26-Aug-87 18:34:46 EDT
Article-I.D.: dumbo.627
Posted: Wed Aug 26 18:34:46 1987
Date-Received: Sat, 29-Aug-87 06:12:58 EDT
References: <987@omepd>
Reply-To: rowen@dumbo.UUCP (Chris Rowen)
Distribution: comp.arch
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 108
Keywords: Floating Point, MIPS, R2010

>From: randys@mipon3.intel.com (Randy Steck)
>The August 20 issue of Electronics has an interesting description of the
>MIPS floating point processor in their "Technology to Watch" section.  The
>implementation looks pretty impressive and I was wondering if someone from
>MIPS (John?) could enlighten us on some of the more interesting features
>of the part?  

The R2010 is a single-chip, closely coupled CMOS floating point coprocessor 
for the R2000 CPU.  It includes its own file of sixteen 64-bit registers and
interprets the instruction stream in parallel with the CPU.  Operands
can be loaded directly from memory (data cache, usually) or from the
CPU's registers.  Its instructions look a lot like the CPU's: loads and 
stores, arithmetic ops, compares and branches.  The tight handshake between
the two chips handles machine stalls and exceptions without sacrificing
speed or error recovery.  MIPS sells it as part of a chip-set, in our line 
of 5-10MIPS processor boards and in our M/Series UNIX boxes.  Our optimizing 
compiler suite (C, FORTRAN, Pascal and more) takes advantage of the 
parallelism of the R2010's independent add, multiply and divide units.  

>The execution times look very impressive and the fact that the adder,
>multiplier, and divider can operate all in parallel could really increase
>performance.  However, what sort of algorithm is used to provide a double
>precision divide in 5 clocks?

Gee, I wish WE knew how to do it in 5 cycles.  Unfortunately, this was a 
misprint.  Here is a table of correct operation cycle counts for the R2010:

	R2010 operation		cycles

	{add,sub}.{s,d}		2	
	mul.s			4	
	mul.d			5	
	div.s			12	
	div.d			19	
	{mov,abs,neg}.{s,d}	1	
	cvt.{s,d,w}.{s,d,w}	2-3	

The divider effectively retires 4 bits per cycle, plus 3 cycles of
overhead for quotient adjustment and IEEE rounding at the end.  The 
multiplier retires about 14 bits per cycle, with 1 cycle of overhead
for IEEE rounding.

For comparison, here are instruction latencies on back-to-back operations
for a bunch of FP units.  The Intel 80287 and Motorola 68881 are single-chip 
coprocessors like the MIPS R2010. The Weitek 1164/1165 multiplier and ALU 
require external hardware for register file, instruction decode and exception
control to marry them to any particular CPU.  These numbers are for 64-bit 
arithmetic on the Weitek and MIPS chips.  The Motorola and Intel chips do 
all internal operations in extended precision (~80 bits).  All values assume 
register-to-register operations.

		R2010		1164/65		80287		68881
	      16.67MHz		16.67MHz	10MHz		20MHz

add		120ns		600ns		7000ns		2550ns
mul		300ns		660ns		9000ns		3550ns
div		1140ns		3840ns		19300ns		5150ns

(Sorry, we don't have numbers handy for the 80387, 68882, Clipper)
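
As a sanity check on the R2010 column, its latencies follow from the
double-precision cycle counts in the earlier table and the 16.67MHz clock
(a period of roughly 60ns).  A minimal Python sketch; the names are made up
for illustration:

```python
# Double-precision cycle counts, taken from the R2010 cycle table in this post.
R2010_CYCLES = {"add": 2, "mul": 5, "div": 19}
CLOCK_MHZ = 16.67

def latency_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    period_ns = 1000.0 / clock_mhz      # about 59.99ns at 16.67MHz
    return round(cycles * period_ns)

latencies = {op: latency_ns(n) for op, n in R2010_CYCLES.items()}
for op, ns in latencies.items():
    print(f"{op}\t{ns}ns")
```

This gives 120ns, 300ns and 1140ns for add, mul and div respectively.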

Other inaccuracies in the Electronics article:	

* The article mentions that the R2010 chip replaces 16KB of SRAM on MIPS's
  old R2360 Weitek-based FPA board.  It's true there is a bunch of RAM on that
  board, but only a tiny section of it is used for the FP register file.

* The chip dissipates 3.8W worst case, not the 2-3W mentioned in the article.

* Whetstones at 16.67MHz are in the range of 12.0MWhets (single) and 9.3MWhets
  (double), not the 10.7MWhets and 8.9MWhets quoted.  (Your mileage may vary,
  etc.)

>One of the interesting items was the way in which the pipeline of the
>processor can be shut down when an exception in one of the floating point
>operations is found.  Apparently the instruction stream can be restarted
>on the failing instruction.  But, since the execution units operate in
>parallel, what happens when a 5 clock multiply is followed by a 2 clock add
>and the multiply overflows or signals some other exception?  The add could
>have already completed and changed one of the input operands to the previous
>multiply so that simply restarting the multiply instruction would not be
>sufficient to guarantee the correct result.

We actually dedicate a good bit of hardware to make parallel operations
("flushing three toilets at the same time") work in the presence of 
exceptions.  We delay committing state for an instruction until earlier 
instructions are known to be exception-free.  A second write port into 
the register file makes this easier.
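
To make the idea concrete, here is a toy Python model of "finish out of
order, commit in order" for Randy's mul-then-add case.  It is only a sketch
of the general technique, not a description of the R2010's actual pipeline;
the function names, register names and issue timing are all assumptions
made up for the example.

```python
import math

def fmul(a, b):
    r = a * b
    if math.isinf(r):                  # model an overflow exception
        raise OverflowError("mul overflow")
    return r

def fadd(a, b):
    return a + b

# (function, latency in cycles); latencies are the double-precision counts
OPS = {"mul": (fmul, 5), "add": (fadd, 2)}

def execute(program, regs):
    """program is a list of (op, dest, src1, src2), one issued per cycle.

    Returns (index of faulting instruction or None, order in which the
    units finished).  regs is updated only with committed results.
    """
    spec = dict(regs)                  # speculative view used to read operands
    pending = []                       # (index, done_cycle, dest, value, fault)
    for i, (op, dest, s1, s2) in enumerate(program):
        fn, latency = OPS[op]
        try:
            value, fault = fn(spec[s1], spec[s2]), None
        except OverflowError as exc:
            value, fault = None, exc
        if fault is None:
            spec[dest] = value
        pending.append((i, i + latency, dest, value, fault))

    # The units finish out of order...
    finish_order = [p[0] for p in sorted(pending, key=lambda p: p[1])]

    # ...but results are committed strictly in program order, and nothing
    # at or after a faulting instruction is ever written back.
    for i, _, dest, value, fault in pending:
        if fault is not None:
            return i, finish_order
        regs[dest] = value
    return None, finish_order

regs = {"f0": 1e200, "f2": 1e200, "f4": 1.0, "f6": 0.0}
program = [("mul", "f6", "f0", "f2"),  # overflows
           ("add", "f0", "f4", "f4")]  # would clobber the mul's input f0
faulting, finish_order = execute(program, regs)
print(faulting, finish_order, regs["f0"])
```

In this model the add finishes first, but because the multiply faults, f0
still holds its original value and the multiply can simply be restarted.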

>Also, do all operations conform to IEEE 754?  This would include rounding
>and precision considerations.

Yes, systems based on the R2010 conform to the requirements and recommendations 
of the standard (what a mouthful :-)).  The hardware for rounding to nearest, 
zero, +infinity and -infinity is pretty complicated, so it is implemented 
only once, in the add unit.  Multiply and divide operations must get access 
to the adder at the end of their executions.  The chip adopts the "RISC 
philosophy" in its handling of certain infrequent operands (like denormalized 
numbers) and exceptional results -- it punts the problem over to system 
software.  This lets us concentrate hardware on the frequent cases and moves 
complexity over to the system software, which has to be there anyway to 
provide user-level exception support.  The overhead for software handling of 
these special operands and operations makes no perceivable difference in 
normal floating point performance.


Chris Rowen		decwrl!mips!rowen		930 Arques Ave.
Mark Johnson		decwrl!mips!mark		Sunnyvale CA 94086

Generic disclaimer: We speak only for us...