Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!labrea!aurora!ames!lll-tis!lll-lcc!pyramid!prls!mips!rowen
From: rowen@mips.UUCP (Chris Rowen)
Newsgroups: comp.arch
Subject: Re: MIPS Floating Point processor.
Message-ID: <627@dumbo.UUCP>
Date: Wed, 26-Aug-87 18:34:46 EDT
Article-I.D.: dumbo.627
Posted: Wed Aug 26 18:34:46 1987
Date-Received: Sat, 29-Aug-87 06:12:58 EDT
References: <987@omepd>
Reply-To: rowen@dumbo.UUCP (Chris Rowen)
Distribution: comp.arch
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 108
Keywords: Floating Point, MIPS, R2010

>From: randys@mipon3.intel.com (Randy Steck)

>The August 20 issue of Electronics has an interesting description of the
>MIPS floating point processor in their "Technology to Watch" section. The
>implementation looks pretty impressive and I was wondering if someone from
>MIPS (John?) could enlighten us on some of the more interesting features
>of the part?

The R2010 is a single-chip CMOS floating point coprocessor, closely coupled
to the R2000 CPU. It includes its own file of 16 64-bit registers and
interprets the instruction stream in parallel with the CPU. Operands can be
loaded directly from memory (data cache, usually) or from the CPU's
registers. Its instructions look a lot like the CPU's: loads and stores,
arithmetic ops, compares and branches. The tight handshake between the two
chips handles machine stalls and exceptions without sacrificing speed or
error recovery.

MIPS sells it as part of a chip-set, in our line of 5-10MIPS processor
boards and in our M/Series UNIX boxes. Our optimizing compiler suite (C,
FORTRAN, Pascal and more) takes advantage of the parallelism of the R2010's
independent add, multiply and divide units.

>The execution times look very impressive and the fact that the adder,
>multiplier, and divider can operate all in parallel could really increase
>performance.
>However, what sort of algorithm is used to provide a double
>precision divide in 5 clocks?

Gee, I wish WE knew how to do it in 5 cycles. Unfortunately, this was a
misprint. Here is a table of correct operation cycle counts for the R2010:

	R2010 operation            cycles
	{add,sub}.{s,d}            2
	mul.s                      4
	mul.d                      5
	div.s                      12
	div.d                      19
	{mov,abs,neg}.{s,d}        1
	cvt.{s,d,w}.{s,d,w}        2-3

The divider effectively retires 4 bits per cycle, plus 3 cycles of overhead
for quotient adjustment and IEEE rounding at the end. The multiplier
retires about 14 bits per cycle, with 1 cycle of overhead for IEEE rounding.

For comparison, here are instruction latencies on back-to-back operations
for a bunch of FP units. The Intel 80287 and Motorola 68881 are single-chip
coprocessors like the MIPS R2010. The Weitek 1164/1165 multiplier and ALU
require external hardware for register file, instruction decode and
exception control to marry them to any particular CPU. These numbers are
for 64-bit arithmetic on the Weitek and MIPS chips. The Motorola and Intel
chips do all internal operations in extended precision (~80 bits). All
values assume register-to-register operations.

	        R2010       1164/65     80287       68881
	        16.67MHz    16.67MHz    10MHz       20MHz
	add      120ns       600ns       7000ns     2550ns
	mul      300         660         9000       3550
	div     1140        3840        19300       5150

(Sorry, we don't have numbers handy for the 80387, 68882, Clipper)

Other inaccuracies in the Electronics article:

* The article mentions that the R2010 chip replaces 16KB of SRAM on
  MIPS's old R2360 Weitek-based FPA board. It's true there is a bunch
  of RAM on that board, but only a tiny section of it is used for the
  FP register file.

* The chip dissipates 3.8W worst case, not the 2-3W mentioned in the
  article.

* Whetstones at 16.67MHz are in the range of 12.0MWhets (single) and
  9.3MWhets (double), not 10.7MWhets and 8.9MWhets (Your mileage may
  vary, etc. etc.)
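
The R2010 column of the latency table follows directly from the cycle counts
and the 16.67MHz clock (about 60ns per cycle). A minimal sketch of that
arithmetic is below; the names CLOCK_MHZ, CYCLES, OTHERS, latency_ns and
speedup are made up for illustration, and the figures for the other parts
are simply copied from the table above.

```python
# Sketch: turn the double-precision cycle counts quoted above into
# nanoseconds at 16.67MHz and compare with the other FP units.
# All names here are hypothetical; the numbers come from the tables above.

CLOCK_MHZ = 16.67
CYCLES = {"add": 2, "mul": 5, "div": 19}  # add.d, mul.d, div.d

# Latencies in ns, copied from the comparison table.
OTHERS = {
    "1164/65": {"add": 600, "mul": 660, "div": 3840},
    "80287": {"add": 7000, "mul": 9000, "div": 19300},
    "68881": {"add": 2550, "mul": 3550, "div": 5150},
}


def latency_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Latency in ns of an operation that takes `cycles` clocks."""
    return cycles * 1000.0 / clock_mhz


def speedup(part, op):
    """How many times longer `part` takes than the R2010 for `op`."""
    return OTHERS[part][op] / latency_ns(CYCLES[op])


if __name__ == "__main__":
    for op, cycles in CYCLES.items():
        ratios = ", ".join(
            f"{part} {speedup(part, op):.1f}x" for part in OTHERS
        )
        print(f"{op}: {latency_ns(cycles):.0f}ns  ({ratios})")
```

Rounded to the nearest nanosecond this gives 120, 300 and 1140 for add, mul
and div, matching the R2010 column.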
>One of the interesting items was the way in which the pipeline of the
>processor can be shut down when an exception in one of the floating point
>operations is found. Apparently the instruction stream can be restarted
>on the failing instruction. But, since the execution units operate in
>parallel, what happens when a 5 clock multiply is followed by a 2 clock add
>and the multiply overflows or signals some other exception? The add could
>have already completed and changed one of the input operands to the previous
>multiply so that simply restarting the multiply instruction would not be
>sufficient to guarantee the correct result.

We actually dedicate a good bit of hardware to make parallel operations
("flushing three toilets at the same time") work in the presence of
exceptions. We delay committing state for an instruction until earlier
instructions are known to be exception-free. A second write port into the
register file makes this easier.

>Also, do all operations conform to IEEE 754? This would include rounding
>and precision considerations.

Yes, systems based on the R2010 conform with requirements and
recommendations of the standard (what a mouthful :-)). The hardware for
rounding to nearest, zero, +infinity and -infinity is pretty complicated,
so it is implemented only once, in the add unit. Multiply and divide
operations must get access to the adder at the end of their executions.

The chip adopts the "RISC philosophy" in handling of certain infrequent
operands (like denormalized numbers) and exceptional results -- it punts
the problem over to system software. This lets us concentrate hardware on
the frequent cases and moves complexity over to the system software, which
has to be there anyway to provide user-level exception support. The
overhead for software handling of these special operands and operations
makes no perceivable difference in normal floating point performance.

Chris Rowen     decwrl!mips!rowen     930 Arques Ave.
Mark Johnson    decwrl!mips!mark      Sunnyvale CA 94086

Generic disclaimer: We speak only for us...
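
The delayed-commit idea in the exceptions reply above (hold back an
instruction's register write until earlier instructions are known to be
exception-free) can be pictured with a toy model. This is only a sketch and
not the R2010's control logic: the names Op and run, the one-op-per-cycle
issue, and treating "known exception-free" as "finished" are all
assumptions made for illustration.

```python
# Toy model of in-order commit: results may finish out of order, but they
# are written to the register file strictly in program order, and nothing
# younger than a faulting op is ever written.
# Hypothetical names and timing; not the actual R2010 mechanism.
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    dest: str            # destination register
    latency: int         # cycles from issue to result
    value: float         # result the op would write (precomputed here)
    faults: bool = False  # True if the op raises an exception


def run(ops, regs):
    """Issue one op per cycle and commit in program order.

    Returns (index of the faulting op or None, resulting register file).
    """
    regs = dict(regs)
    finish = [issue + op.latency for issue, op in enumerate(ops)]
    head = 0   # oldest op whose result is not yet committed
    cycle = 0
    while head < len(ops):
        cycle += 1
        # Commit from the head only; a finished younger op has to wait.
        while head < len(ops) and finish[head] <= cycle:
            if ops[head].faults:
                # Younger results were never written, so restarting at
                # `head` sees the original operands.
                return head, regs
            regs[ops[head].dest] = ops[head].value
            head += 1
    return None, regs


if __name__ == "__main__":
    regs = {"f2": 1.0, "f4": 2.0}
    # A 5-clock multiply that reads f2 and overflows, followed by a
    # 2-clock add that would overwrite f2.
    ops = [
        Op("mul.d", "f4", 5, 0.0, faults=True),
        Op("add.d", "f2", 2, 3.0),
    ]
    print(run(ops, regs))
```

In this model the add finishes at cycle 3 but its write to f2 is still
pending when the multiply faults at cycle 5, so f2 keeps its old value and
the multiply can be restarted.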