Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!ames!ames.arc.nasa.gov!lamaster
From: lamaster@ames.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: 3010 fp (was linpack)
Message-ID: <34443@ames.arc.nasa.gov>
Date: 25 Oct 89 20:40:02 GMT
References: <36621@lll-winken.LLNL.GOV> <3300080@m.cs.uiuc.edu> <30100@obiwan.mips.COM>
Sender: usenet@ames.arc.nasa.gov
Organization: NASA - Ames Research Center
Lines: 41

In article <30100@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:

>  >At some point I heard that MIPS pulled out just about every stopper to
>  >speed up the floating point speed of the R2000/R3000.  In other words,

Actually, as good as the R3010 is, MIPSCo has a lot of room for improvement:
by throwing more hardware at the job, they could lower f.p. multiply to two
cycles latency, and division to 8 cycles latency, with full segmentation
(1 operation started per clock cycle) on every operation.  They could then
add vector instructions, and more memory bandwidth, to make full use of the
additional FPU bandwidth.  Fortunately, the Cray and CDC/ETA designs offer
ideas good for further improvement up to about 1 million gates.  (Some of the
CDC Cyber 205 models actually had fully segmented division, but it cost a lot
of extra real estate...)
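
To make the latency vs. segmentation distinction concrete, here is a rough
back-of-the-envelope sketch.  It assumes a stream of independent operations
on a single functional unit and ignores operand fetch, dependencies and
memory stalls.  The function name and the stream length are made up for
illustration; the 2-cycle and 8-cycle latencies are the targets mentioned
above, not measured R3010 figures, and the "unsegmented" case simply assumes
a new operation cannot start until the previous one finishes.

```python
def cycles_for_stream(n_ops, latency, initiation_interval):
    """Cycles to complete n_ops independent operations on one unit.

    latency: cycles from issue to result for a single operation.
    initiation_interval: cycles between successive issues
        (1 = fully segmented; equal to latency = not segmented at all).
    """
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # operation adds one initiation interval.
    return latency + (n_ops - 1) * initiation_interval


N = 1000  # hypothetical stream length, chosen only for illustration

# Multiply at the proposed 2-cycle latency.
mul_segmented = cycles_for_stream(N, latency=2, initiation_interval=1)
mul_unsegmented = cycles_for_stream(N, latency=2, initiation_interval=2)

# Divide at the proposed 8-cycle latency.
div_segmented = cycles_for_stream(N, latency=8, initiation_interval=1)
div_unsegmented = cycles_for_stream(N, latency=8, initiation_interval=8)

print("multiply:", mul_segmented, "vs", mul_unsegmented, "cycles")
print("divide:  ", div_segmented, "vs", div_unsegmented, "cycles")
```

For long streams the segmented unit approaches one result per cycle whatever
the latency, which is why the extra memory bandwidth matters: something has
to keep it fed.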

The first increment of improvement could come about by segmenting addition only
and giving the multiply unit its own round/normalize capability.  This might 
result in something like (a wild guess) a 50% improvement on codes like Linpack,
without adding very many more transistors.

>However, the chip contains only 75,000 transistors.  So there were

(A question, I know, which can't be answered precisely, but how many
"gates" is that, very roughly ...?)

>transistors on a chip, perhaps it's reasonable to postulate that the
>additional transistors could be used to improve f.p. performance.
>(On top of speedups due to faster technology)

Another approach would be to integrate the FPU on the same chip as the CPU.
When you consider the cost of off-chip communication vs. the low on-chip gate 
delays you get these days, it might make more sense to first
put the entire CPU/FPU on one chip, using the current low-transistor-count
design, and then add more gates to the FPU as space allows with future
technology.  (Hmmm, sounds like another competitor's approach ... )

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117