Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!cica!sol.ctr.columbia.edu!samsung!think!mintaka!ogicse!blake!talex
From: talex@blake.acs.washington.edu (Thomas Alexander)
Newsgroups: comp.arch
Subject: Re: Dual FPUs?
Message-ID: <6283@blake.acs.washington.edu>
Date: 17 Mar 90 02:32:28 GMT
References: <24915@princeton.Princeton.EDU>
Reply-To: talex@blake.acs.washington.edu (Thomas Alexander)
Distribution: comp
Organization: University of Washington, Seattle
Lines: 45

>In article <24915@princeton.Princeton.EDU> Paul Haahr writes:
> .....
> For codes like these, wouldn't it be possible to take advantage
> of two (or more) independent, off-the-shelf floating point units?
>
>For example, given a MIPS R3000, could one attach two R3010s, one as
>the usual coprocessor 1, and the other as coprocessor 2. In loops in
>which the computations on elements i and i+1 do not interfere with
>each other (ie, are vectorizable), do the computations on the different
>fpus. This gives you a shot at overlapping (say) multiplications.
> ......

As a matter of fact, the TMS34020 Graphics System Processor (from Texas
Instruments) lets you hook up as many as 8 of its floating-point
coprocessors in parallel. Each coprocessor has a 50 nsec cycle, resulting
in the (for advertising purposes only) peak performance of 40 MFLOPs when
doing a multiply-accumulate. Each coprocessor can be addressed and accessed
independently by the CPU (the 34020), and the coprocessors can execute
different instructions simultaneously. To top things off, each FPU can
access external microcode through a separate microcode bus, allowing you to
custom-tailor your coprocessor instructions.

Some people over here recently put together a system with one 34020 and
four coprocessors. Expected peak performance of 160 MFLOPs, right? They got
about 3 MFLOPs on compiled code, rising to 10 or so on hand-optimized
assembly language.
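
For anyone who wants to check the arithmetic, here is a small Python sketch
of peak versus achieved throughput. It uses only the figures quoted above;
the one assumption is that a multiply-accumulate counts as 2 floating-point
operations per cycle, which is what makes a 50 nsec cycle come out to
40 MFLOPs. The function names are mine, purely for illustration.

```python
# Back-of-the-envelope peak vs. achieved throughput, using the figures
# quoted in the post.

CYCLE_NS = 50         # FPU cycle time quoted in the post
FLOPS_PER_CYCLE = 2   # ASSUMPTION: one multiply-accumulate = 2 flops
NUM_FPUS = 4          # coprocessors in the system described


def peak_mflops(cycle_ns, flops_per_cycle, num_fpus=1):
    """Peak rate in MFLOPs: flops per cycle times cycles per microsecond."""
    cycles_per_us = 1000 / cycle_ns
    return flops_per_cycle * cycles_per_us * num_fpus


def efficiency(achieved_mflops, peak):
    """Fraction of the peak rate that was actually delivered."""
    return achieved_mflops / peak


single = peak_mflops(CYCLE_NS, FLOPS_PER_CYCLE)
system = peak_mflops(CYCLE_NS, FLOPS_PER_CYCLE, NUM_FPUS)

print(f"one FPU:   {single:.0f} MFLOPs peak")
print(f"four FPUs: {system:.0f} MFLOPs peak")
for label, achieved in (("compiled", 3), ("hand-optimized", 10)):
    share = efficiency(achieved, system)
    print(f"{label}: {achieved} of {system:.0f} MFLOPs ({share:.1%} of peak)")
```

Under that assumption the measured rates work out to roughly 2% of peak for
compiled code and about 6% for hand-optimized assembly.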
Those numbers came from heavily vectorizable workloads - 2-D convolution on
1024 x 1024 images, where you can work on million-element vectors at a time
doing nothing but multiply-accumulates (a little oversimplification here).

Multiple coprocessors appear to suffer from one or more of the following:

* limited instruction issue rate - can't keep them all busy.

* VERY limited data transfer rate - trying to support several
  data-hungry FPUs on one bus does not pay, especially when the same
  bus/memory system is already maxed out trying to keep up with the CPU.

* limited internal register resources - too many data transfers in and
  out of the FPU when large vectors are involved.

In a word - BANDWIDTH!

Personally, I'll trade all the MFLOPs you can get for one good
megabyte/sec of data transfer :-)

- Tom
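
To put a rough number on the bandwidth point, here is a Python sketch of
the bus traffic needed to keep every FPU issuing one multiply-accumulate per
cycle. Only the 50 nsec cycle and the four-FPU count come from the post. The
operand count, the 4-byte operand size, and the reuse factor are illustrative
assumptions, and the function name is hypothetical.

```python
# Rough estimate of the shared-bus traffic needed to keep several FPUs busy.
# Operand sizes and reuse factors are ASSUMPTIONS, not measured values.

def required_mb_per_s(num_fpus, cycle_ns, operands_per_mac,
                      bytes_per_operand, reuse=1.0):
    """MB/s that must cross the bus so each FPU does one MAC per cycle.

    reuse is the average number of MACs performed per operand fetched;
    1.0 means nothing is held in FPU registers between operations.
    """
    macs_per_us = num_fpus * 1000 / cycle_ns   # million MACs per second
    return macs_per_us * operands_per_mac * bytes_per_operand / reuse


# ASSUMPTION: two 4-byte operands fetched per multiply-accumulate.
no_reuse = required_mb_per_s(4, 50, operands_per_mac=2, bytes_per_operand=4)
with_reuse = required_mb_per_s(4, 50, operands_per_mac=2,
                               bytes_per_operand=4, reuse=8)

print(f"no register reuse: {no_reuse:.0f} MB/s")
print(f"8x register reuse: {with_reuse:.0f} MB/s")
```

With these assumed figures, four FPUs want 640 MB/s when nothing is reused,
and the demand falls in proportion to how many operations each fetched
operand serves. That ties the bus limit and the register limit in the list
above together: fewer registers means less reuse, which means more traffic.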