Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!ames!ames.arc.nasa.gov!lamaster
From: lamaster@ames.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: 3010 fp (was linpack)
Message-ID: <34443@ames.arc.nasa.gov>
Date: 25 Oct 89 20:40:02 GMT
References: <36621@lll-winken.LLNL.GOV> <3300080@m.cs.uiuc.edu> <30100@obiwan.mips.COM>
Sender: usenet@ames.arc.nasa.gov
Organization: NASA - Ames Research Center
Lines: 41

In article <30100@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
> >At some point I heard that MIPS pulled out just about every stopper to
> >speed up the floating point speed of the R2000/R3000.  In other words,

Actually, MIPSCo has a lot of room for improvement, as good as the R3010
is: by throwing more hardware at the job, they could lower f.p. multiply
to two cycles latency, and division to 8 cycles latency, with full
segmentation (1 operation started/clock cycle) on every operation.  They
could then add vector instructions, and more memory bandwidth, to make
full use of the additional FPU bandwidth.  Fortunately, there are ideas,
based on Cray and CDC/ETA designs, that are good for further improvement
up to about 1 million gates.  (Some of the CDC Cyber 205 models actually
had fully segmented division, but it added a lot of extra real estate...)

The first increment of improvement could come about by segmenting
addition only and giving the multiply unit its own round/normalize
capability.  This might result in something like (a wild guess) a 50%
improvement on codes like Linpack, without adding very many more
transistors.

>However, the chip contains only 75,000 transistors.  So there were

(A question, I know, which can't be answered precisely, but how many
"gates" is that, very roughly ...?)

>transistors on a chip, perhaps it's reasonable to postulate that the
>additional transistors could be used to improve f.p. performance.
>(On top of speedups due to faster technology)

Another approach would be to integrate the FPU on the same chip as the
CPU.  When you consider the cost of off-chip communication vs. the low
on-chip gate delays you get these days, it might make more sense to
first put the entire CPU/FPU on one chip, using the current
low-transistor-count design, and then add more gates to the FPU as
space allows with future technology.  (Hmmm, sounds like another
competitor's approach ... )

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035
  Phone:  (415)694-6117
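
As a rough illustration of the segmentation (pipelining) argument in the
post above, the sketch below compares cycle counts for a run of
independent operations on an unsegmented unit versus a fully segmented
one that starts one operation per clock.  The latencies used (2 cycles
for multiply, 8 for divide) are the targets proposed in the post, not
measured R3010 figures.  The function names are hypothetical, and the
model assumes no data dependences between operations and unlimited
memory bandwidth, so it is only an upper bound on what real code such as
Linpack would see.

```python
def unpipelined_cycles(n_ops, latency):
    """Cycles when each operation must finish before the next starts."""
    return n_ops * latency


def pipelined_cycles(n_ops, latency, issue_interval=1):
    """Cycles when a new operation starts every `issue_interval` clocks.

    The last operation still needs its full latency to complete.
    Assumes the operations are independent of one another.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency


def speedup(n_ops, latency):
    """Ratio of unpipelined to fully pipelined cycle counts."""
    return unpipelined_cycles(n_ops, latency) / pipelined_cycles(n_ops, latency)


if __name__ == "__main__":
    # Latencies taken from the post's proposal (assumed, not measured).
    for name, latency in (("multiply", 2), ("divide", 8)):
        for n in (1, 10, 1000):
            print(name, n,
                  unpipelined_cycles(n, latency),
                  pipelined_cycles(n, latency),
                  round(speedup(n, latency), 2))
```

For 1000 independent multiplies this gives 2000 cycles unsegmented and
1001 segmented; for 1000 divides, 8000 versus 1007.  For long runs the
speedup approaches the latency itself, which is why segmenting the
long-latency operations pays off most on vector-style loops.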