Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!cica!sol.ctr.columbia.edu!samsung!think!mintaka!ogicse!blake!talex
From: talex@blake.acs.washington.edu (Thomas Alexander)
Newsgroups: comp.arch
Subject: Re: Dual FPUs?
Message-ID: <6283@blake.acs.washington.edu>
Date: 17 Mar 90 02:32:28 GMT
References: <24915@princeton.Princeton.EDU>
Reply-To: talex@blake.acs.washington.edu (Thomas Alexander)
Distribution: comp
Organization: University of Washington, Seattle
Lines: 45


>In article <24915@princeton.Princeton.EDU> Paul Haahr writes:
> .....
> For codes like these, wouldn't it be possible to take advantage
> of two (or more) independent, off-the-shelf floating point units?
>
>For example, given a MIPS R3000, could one attach two R3010s, one as
>the usual coprocessor 1, and the other as coprocessor 2.  In loops in
>which the computations on elements i and i+1 do not interfere with
>each other (i.e., are vectorizable), do the computations on the different
>fpus.  This gives you a shot at overlapping (say) multiplications.
> ......

As a matter of fact, the TMS34020 Graphics System Processor (from Texas
Instruments) lets you hook up as many as 8 of its floating-point
coprocessors in parallel. Each coprocessor has a 50 nsec cycle, which
gives a (for advertising purposes only) peak performance of 40 MFLOPs
per coprocessor when it does one multiply-accumulate every cycle. Each
coprocessor can be addressed and accessed independently by the CPU (the
34020), and the coprocessors can execute different instructions
simultaneously. To top things off, each FPU can access external microcode
through a separate microcode bus, allowing you to custom-tailor your
coprocessor instructions.

Some people over here recently put together a system with one 34020
and four coprocessors. Expected peak performance of 160 MFLOPs, right?
They got about 3 MFLOPs on compiled code, rising to 10 or so on
hand-optimized assembly language. And this was on a heavily vectorizable
workload - 2-D convolution on 1024 x 1024 images, where you can work on
million-element vectors at a time doing nothing but multiply-accumulates
(a slight oversimplification here).
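
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The cycle time, coprocessor count and measured figures are the
ones quoted above; counting one multiply-accumulate per cycle as 2 flops
is my assumption about how the 40 MFLOPs figure is arrived at, and the
function name is made up for illustration.

```python
# Peak vs. measured throughput for the 34020 + 4 coprocessor system.
CYCLE_NS = 50          # coprocessor cycle time, as quoted above
FLOPS_PER_CYCLE = 2    # ASSUMPTION: one multiply-accumulate per cycle = 2 flops
NUM_FPUS = 4           # coprocessors in the system described above


def peak_mflops(cycle_ns, flops_per_cycle, num_fpus=1):
    """Peak MFLOPs if every FPU retires flops_per_cycle on every cycle."""
    mcycles_per_sec = 1000.0 / cycle_ns   # 50 ns -> 20 million cycles/sec
    return mcycles_per_sec * flops_per_cycle * num_fpus


per_fpu = peak_mflops(CYCLE_NS, FLOPS_PER_CYCLE)           # 40.0
system = peak_mflops(CYCLE_NS, FLOPS_PER_CYCLE, NUM_FPUS)  # 160.0

# Measured figures from above: about 3 (compiled), about 10 (hand-coded).
for label, measured in (("compiled", 3.0), ("hand-optimized", 10.0)):
    fraction = measured / system
    print(f"{label}: {measured} of {system} MFLOPs, "
          f"{fraction:.3f} of peak")
```

Either way the machine delivers only a few percent of its paper peak.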

Multiple coprocessors appear to suffer from one or more of the
following:

* limited instruction issue rate - can't keep them all busy.
* VERY limited data transfer rate - trying to support several data-hungry
	FPUs on one bus does not pay, especially when the same bus/memory
	system is already maxed out trying to keep up with the CPU.
* limited internal register resources - too many data transfers
	in and out of the FPU when large vectors are involved.
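
To put a number on the second point, here is a rough Python sketch of
how much bus traffic it takes to feed the FPUs. Everything in it is an
assumption for illustration: 4-byte operands, either 8 bytes fetched per
multiply-accumulate (both operands from memory) or 4 bytes (one operand
held in an FPU register), and a made-up 100 MB/sec bus. None of these
are 34020 specifications, and the function names are hypothetical.

```python
def required_mb_per_sec(target_mflops, bytes_per_mac, flops_per_mac=2):
    """Bus traffic (MB/sec, 1 MB = 10**6 bytes) needed to sustain target_mflops."""
    mmacs_per_sec = target_mflops / flops_per_mac   # millions of MACs/sec
    return mmacs_per_sec * bytes_per_mac


def bandwidth_bound_mflops(peak_mflops, bus_mb_per_sec, bytes_per_mac,
                           flops_per_mac=2):
    """Sustainable MFLOPs: the smaller of the FPU peak and what the bus can feed."""
    fed_mflops = bus_mb_per_sec / bytes_per_mac * flops_per_mac
    return min(peak_mflops, fed_mflops)


PEAK = 160.0                  # four coprocessors at 40 MFLOPs each
HYPOTHETICAL_BUS_MB_S = 100   # ASSUMPTION: illustrative only, not a real spec

# Both operands from memory, 4 bytes each (ASSUMPTION): 8 bytes per MAC.
print(required_mb_per_sec(PEAK, bytes_per_mac=8))   # 640.0
# One operand kept in a register (ASSUMPTION): 4 bytes per MAC.
print(required_mb_per_sec(PEAK, bytes_per_mac=4))   # 320.0
# What the hypothetical bus could sustain at 8 bytes per MAC.
print(bandwidth_bound_mflops(PEAK, HYPOTHETICAL_BUS_MB_S, bytes_per_mac=8))
```

Under those assumptions the FPUs want hundreds of megabytes/sec before
the CPU has fetched a single instruction of its own, and adding more
coprocessors only raises the first argument to min().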

In a word - BANDWIDTH! Personally, I'll trade all the MFLOPs you can get
for one good megabyte/sec of data transfer :-)

- Tom