Xref: utzoo comp.arch:10836 comp.misc:6670
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!cs.utexas.edu!uunet!mitel!melair!low
From: low@melair.UUCP (Rick Low)
Newsgroups: comp.arch,comp.misc
Subject: Re: Info on DSP chips
Summary: real results: 17 MFLOPs for a hand-coded 320C30 FFT
Message-ID: <277@melair.UUCP>
Date: 23 Jul 89 02:29:10 GMT
References: <337@venus.iotek.UUCP> <23379@winchester.mips.COM>
Organization: MEL Defence Systems Ltd., Ottawa, Canada
Lines: 49

Sorry if I digress (I missed the original posting), but...

In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
> 
> >IEEE Micro December 1988
> >
> >	A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
> >about the various DSP processors.  The only part of the issue I really
> >question is the editors' afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS).  I question these ratings as most supercomputers
> >only get about 1/10 of their peak performance rating on the LINPACK
> >benchmark.  The chips may do better than that but I for one would like
> >to see some real numbers using real systems and real compilers.  (Can
> >anyone provide these) I'll be surprised if these chips get anywhere
> >near these figures on the LINPACK benchmark

Here are some real numbers.  Well, simulated anyway.  Not Linpack either.

I did a project for Bob Morris (one of the guest editors for this Micro)
in which I studied how to build efficient DFT algorithms for the
320C30.  I had a good, long look at the C30 architecture and what it
means for DFT algorithms, then I wrote a 1024-point, radix-4,
complex, floating-point (obviously), looped (i.e. not inline coded)
FFT for this beast.
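
For readers who have not seen one, here is a minimal sketch of a radix-4
decimation-in-time FFT.  It is not the C30 assembly described above: it is
recursive Python, the name fft_radix4 is hypothetical, and it recomputes the
twiddle factors instead of reading them from a table.  It only shows the
shape of the radix-4 butterfly that the inner loop has to execute.

```python
import cmath


def fft_radix4(x):
    """Radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")

    # Transform the four interleaved subsequences x[r], x[r+4], x[r+8], ...
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    q = n // 4
    out = [0j] * n

    # One radix-4 butterfly per pass of this loop.
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        w2 = w1 * w1
        w3 = w2 * w1

        a = subs[0][k]
        b = w1 * subs[1][k]
        c = w2 * subs[2][k]
        d = w3 * subs[3][k]

        t0 = a + c
        t1 = a - c
        t2 = b + d
        t3 = -1j * (b - d)  # multiplying by -j is a swap and a negate

        out[k] = t0 + t2
        out[k + q] = t1 + t3
        out[k + 2 * q] = t0 - t2
        out[k + 3 * q] = t1 - t3
    return out


if __name__ == "__main__":
    # 1024 = 4**5, so a 1024-point transform takes five radix-4 stages.
    spectrum = fft_radix4([1 + 0j] + [0j] * 1023)
    print(len(spectrum), spectrum[0])
```

Each butterfly needs three complex multiplies by twiddle factors and eight
complex additions; the multiply by -j costs no arithmetic.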

I simulated this FFT using TI's C30 simulator and assuming zero wait
states for external memory.  This FFT ran in 2.71 ms, for an average
of about 17 MFLOPS.  The control structure of this FFT -- i.e. the
non-butterfly code -- took 18 percent of the total execution time.
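
The figures above can be cross-checked with a little arithmetic.  The sketch
below is only that: the post does not say how operations were counted, so the
nominal count used here (three complex multiplies at 6 flops each plus eight
complex adds at 2 flops each per radix-4 butterfly, with no credit for
trivial twiddle factors) is an assumption, and all variable names are
made up for illustration.

```python
# Figures quoted in the post.
N = 1024                 # FFT length in complex points
RUN_TIME_S = 2.71e-3     # simulated execution time
QUOTED_MFLOPS = 17.0     # "about 17 MFLOPS"
CONTROL_FRACTION = 0.18  # share of time spent in non-butterfly code

# Operation count implied by the quoted rate and run time.
implied_flops = QUOTED_MFLOPS * 1e6 * RUN_TIME_S

# Split of the run time between control structure and butterflies.
control_ms = RUN_TIME_S * 1e3 * CONTROL_FRACTION
butterfly_ms = RUN_TIME_S * 1e3 - control_ms

# Number of radix-4 stages: how many times N divides by 4.
stages = 0
m = N
while m > 1:
    m //= 4
    stages += 1

# ASSUMPTION: nominal flops per radix-4 butterfly (see lead-in).
FLOPS_PER_BUTTERFLY = 3 * 6 + 8 * 2
nominal_flops = (N // 4) * stages * FLOPS_PER_BUTTERFLY
nominal_mflops = nominal_flops / RUN_TIME_S / 1e6

print(f"implied flops:   {implied_flops:.0f}")
print(f"control time:    {control_ms:.3f} ms")
print(f"butterfly time:  {butterfly_ms:.3f} ms")
print(f"nominal flops:   {nominal_flops} over {stages} stages")
print(f"nominal rate:    {nominal_mflops:.2f} MFLOPS")
```

Under that counting assumption the rate comes out at roughly 16 MFLOPS, in
the same ballpark as the quoted figure; the exact number depends on the
counting convention.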

> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable.

TI's C30 User's Guide shows you how to build zero-wait-state
memory and gives examples using Cypress CY7C164 25 ns SRAMs and
IDT7198 25 ns SRAMs.  In any case, my FFT only needed external
memory for storage of some control variables.  The rest of the
code and data resided in the 4K words of on-chip ROM (code and
twiddle factors) and 2K words of on-chip RAM (data).  Accesses
to all three memory areas were arranged so that no access
conflicts stalled the pipeline -- in effect, zero wait states
for all memory accesses (internal and external), even with parallel
memory accesses, e.g. ADDF3 *AR0,R3,R2 || STF R0,*+AR1(IR1).
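
A quick budget for the data buffer, as a sketch.  It assumes one 32-bit word
for each real and each imaginary part and an in-place transform; neither is
stated in the post, and the names are hypothetical.

```python
N = 1024                      # complex points, from the post
ON_CHIP_RAM_WORDS = 2 * 1024  # "2K words of on-chip RAM", from the post

# ASSUMPTION: each complex point occupies two words (real, imaginary).
WORDS_PER_COMPLEX = 2

data_words = N * WORDS_PER_COMPLEX
spare_ram_words = ON_CHIP_RAM_WORDS - data_words

print(f"data buffer: {data_words} words")
print(f"RAM left over: {spare_ram_words} words")
```

On those assumptions the data array fills the on-chip RAM exactly, which
would be consistent with the control variables having to live in external
memory.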

Just more fuel for the fire.  Cheers.
 __   __   _____   _
|  \ /  | |_____| | |
|   V   |  _____  | |       Rick Low
| |\_/| | |_____| | |       MEL Defence Systems Limited, Ottawa, Canada
| |   | |  _____  | |___    +1 613 836 6860
|_|   |_| |_____| |_____|   mitel!melair!low@uunet.UU.NET