Xref: utzoo comp.arch:10836 comp.misc:6670
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!cs.utexas.edu!uunet!mitel!melair!low
From: low@melair.UUCP (Rick Low)
Newsgroups: comp.arch,comp.misc
Subject: Re: Info on DSP chips
Summary: real results: 17 MFLOPs for a hand-coded 320C30 FFT
Message-ID: <277@melair.UUCP>
Date: 23 Jul 89 02:29:10 GMT
References: <337@venus.iotek.UUCP> <23379@winchester.mips.COM>
Organization: MEL Defence Systems Ltd., Ottawa, Canada
Lines: 49

Sorry if I digress (I missed the original posting), but...

In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
> >
> >IEEE Micro December 1988
> >
> >    A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002. These are fairly good articles
> >about the various DSP processors. The only part of the issue I really
> >question is the editors afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS). I question these ratings as most supercomputers
> >only get about 1/10 of there peak performance rating on the LINPACK
> >benchmark. The chips may do better than that but I for one would like
> >to see some real numbers using real systems and real compilers. (Can
> >anyone provide these) I'll be surprised if these chips get anywhere
> >near these figures on the LINPACK benchmark

Here are some real numbers. Well, simulated anyway. Not Linpack either.

I did a project for Bob Morris (one of the guest editors for this Micro)
in which I studied how to build efficient DFT algorithms for the 320C30.
I had a good, long look at the C30 architecture and what it means to DFT
algorithms, then I wrote a 1024-point, radix-4, complex, floating-point
(obviously), looped (i.e. not inline coded) FFT for this beast.
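
As a rough sizing of the transform described above, the sketch below
counts the stages and butterflies of a radix-4 FFT and multiplies by a
nominal per-butterfly cost. The function name radix4_fft_counts and the
figure of 34 real floating-point operations per butterfly (3 complex
multiplies plus 8 complex additions, with no credit for trivial twiddle
factors) are assumptions made for illustration; the post does not say
how its flops were counted.

```python
def radix4_fft_counts(n, flops_per_butterfly=34):
    """Return (stages, butterflies, flops) for an n-point radix-4 FFT.

    flops_per_butterfly defaults to a textbook-style estimate
    (an assumption, not a figure from the post):
      3 complex multiplies = 12 real multiplies + 6 real adds
      8 complex adds       = 16 real adds
    """
    if n < 1:
        raise ValueError("n must be a positive power of 4")

    # Count how many times n divides by 4; this is the number of stages.
    stages = 0
    m = n
    while m > 1:
        if m % 4 != 0:
            raise ValueError("n must be a power of 4")
        m //= 4
        stages += 1

    # Each stage processes n/4 butterflies of 4 points each.
    butterflies = stages * (n // 4)
    return stages, butterflies, butterflies * flops_per_butterfly


stages, butterflies, flops = radix4_fft_counts(1024)
print(stages, butterflies, flops)  # prints: 5 1280 43520
```

Dividing a flop count of this kind by a measured run time gives a
sustained rate; a different per-butterfly count changes the result
proportionally.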

I simulated this FFT using TI's C30 simulator and assuming zero wait
states for external memory. This FFT ran in 2.71 ms for an average of
about 17 MFLOPs. The control structure of this FFT -- i.e. non-butterfly
code -- took 18 percent of the total execution time.

> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable.

TI's C30 User's Guide shows you how to build zero wait-state memory and
gives examples using Cypress CY7C164 25 ns SRAMs and IDT7198 25 ns SRAMs.

In any case, my FFT only needed external memory for storage of some
control variables. The rest of the code and data resided in the 4K words
of on-chip ROM (code and twiddle factors) and 2K words of on-chip RAM
(data). Accesses to all three memory areas were done in a way to cause
no access conflict pipeline delays -- in effect zero wait states for all
memory accesses (internal and external), even with parallel memory
accesses, e.g.
ADDF3 *AR0,R3,R2 || STF R0,*+AR1(IR1).

Just more fuel for the fire.

Cheers.

 __   __   _____   _
|  \ /  | |_____| | |
|   V   |  _____  | |      Rick Low
| |\_/| | |_____| | |      MEL Defence Systems Limited, Ottawa, Canada
| |   | |  _____  | |___   +1 613 836 6860
|_|   |_| |_____| |_____|  mitel!melair!low@uunet.UU.NET