Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!ll-xn!husc6!cs.utexas.edu!oakhill!wca
From: wca@oakhill.UUCP (william anderson)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks (LONG)
Keywords: FFT benchmarks M88K MIPS
Message-ID: <1384@claude.oakhill.UUCP>
Date: 14 Jul 88 18:54:05 GMT
Organization: Motorola Inc. Austin, Tx
Lines: 82

In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:

  [FFT inner loop in C:]

>> 	for (i=0 ; i<n ; i+=n1) {
>> 		l = i+n2;
>> 		xrt = xr[i] - xr[l];
>> 		xr[i] = xr[i] + xr[l];
>> 		xit = xi[i] - xi[l];
>> 		xi[i] = xi[i] + xi[l];
>> 		xr[l] = c*xrt - s*xit;
>> 		xi[l] = c*xit + s*xrt;
>> 	}

Mr. Killian responded with a hand-coded inner loop for the MIPS
architecture:

>	l.s	f0, 0(t0)		# xr[i]
>l:	l.s	f2, 0(t1)		# xr[l]
>	l.s	f4, 0(t2)		# xi[i]
>	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
>	l.s	f6, 0(t3)		# xi[l]
>	mul.s	f20, f28, f10		# c*xrt
>	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
>	addu	t0, 4
>	addu	t1, 4
>	mul.s	f22, f30, f12		# s*xit
>	add.s	f14, f0, f2		# xr[i] + xr[l]
>	addu	t2, 4
>	addu	t3, 4
>	mul.s	f24, f28, f12		# c*xit
>	add.s	f16, f4, f6		# xi[i] + xi[l]
>	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
>	l.s	f0, 0(t0)		# xr[i+1]
>	mul.s	f26, f30, f10		# s*xrt
>	sub.s	f20, f20, f22		# c*xrt - s*xit
>	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
>	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
>	add.s	f24, f24, f26		# c*xit + s*xrt
>	bne	t0, t4, l
>	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt
>
>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>native mips.  No pipelining within an op unit required (the R3010 has
>none).  Multiply/add/load/store overlap is extensively used.
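
As a rough check on the summary figures quoted above, the arithmetic
can be written out.  This is only a sketch: the helper names
loop_mflops and loop_mips are made up for illustration, and the count
of 10 floating-point operations per pass is read off the C loop body
quoted earlier (6 adds/subtracts and 4 multiplies).

```c
#include <assert.h>

/* Floating-point operations in one pass of the quoted C loop body:
 * 2 subtracts (xrt, xit) + 2 adds (xr[i], xi[i]) + 4 multiplies
 * + 1 subtract and 1 add to combine the products = 10. */
enum { FLOPS_PER_ITER = 10 };

/* Hypothetical helper: MFLOPS for a loop that takes `cycles` clocks
 * per pass on a machine clocked at `clock_mhz`. */
double loop_mflops(double clock_mhz, int cycles)
{
    assert(cycles > 0);
    return clock_mhz * FLOPS_PER_ITER / cycles;
}

/* Hypothetical helper: native MIPS for `instrs` instructions issued
 * in `cycles` clocks at `clock_mhz`. */
double loop_mips(double clock_mhz, int instrs, int cycles)
{
    assert(cycles > 0);
    return clock_mhz * instrs / cycles;
}
```

With the quoted numbers, loop_mflops(25.0, 23) comes to about 10.9 and
loop_mips(25.0, 23, 23) to 25, which agrees with the summary.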

One problem with this code is that it assumes the "stride" of the loop
(the variable "n1" in the C code segment above) is unity!

What about the code for the inner loop in the general case?  What
effect does a non-unity stride have on the MIPS loop timing?  The
posted code bumps each pointer by 4 in an FP latency slot and then
stores through a fixed -4 offset, which works only because the stride
is a known constant.  Since the results must be stored at the same
addresses the loads used, must the MIPS code for the general case do
its pointer incrementing somewhere other than the FP latency slots?
Since it lacks an indexed addressing mode, the MIPS architecture must
in this case increment 4 pointers instead of one array index.
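
In C terms, the pointer-walking form for a general stride looks
roughly like the sketch below.  The function name butterfly_ptr and
the pointer names are invented for illustration; this is a C rendering
of the idea, not MIPS code and not a claim about what any compiler
emits.  The point is only that four separate pointers each advance by
n1 elements per pass.

```c
#include <assert.h>

/* Pointer-walking form of the quoted loop for a general stride n1.
 * Hypothetical name; computes the same results as the indexed C loop,
 * assuming xr and xi are valid for every index that loop touches. */
void butterfly_ptr(float *xr, float *xi, int n, int n1, int n2,
                   float c, float s)
{
    assert(n1 > 0);
    float *pri = xr, *prl = xr + n2;   /* &xr[i], &xr[l] */
    float *pii = xi, *pil = xi + n2;   /* &xi[i], &xi[l] */
    int i = 0;

    while (i < n) {
        float xrt = *pri - *prl;
        *pri = *pri + *prl;
        float xit = *pii - *pil;
        *pii = *pii + *pil;
        *prl = c*xrt - s*xit;
        *pil = c*xit + s*xrt;

        i += n1;
        if (i < n) {
            /* Four increments per pass, each by the run-time stride.
             * Skipped after the last pass so that no pointer is moved
             * past the end of its array. */
            pri += n1; prl += n1;
            pii += n1; pil += n1;
        }
    }
}
```

Every store goes through the same pointer as the matching load, so
the increments cannot be moved between a load and its store without
some other way to reach the old address.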

The Motorola 88000's ability to do more sophisticated (indexed and
scaled-indexed) addressing in its data memory unit allows a compiler
(or an assembly code writer) to loop-induce the index l out of the loop
and allows reference to all array elements with only one index
increment/loop (as opposed to 4 pointer increments/loop).  Thus the
loop instruction count for the M88K is reduced and performance is
maximized (22 instructions/23 clocks/8.7 MFLOPS/19.1 native MIPS at 20
MHz as per <1359@claude.oakhill.UUCP>).
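
The single-index form described above can be sketched in C as
follows.  The names butterfly_one_index, xrl and xil are invented for
illustration, and the code only approximates at source level what
indexed addressing permits; it is not the M88K assembly from the
earlier article.  The offset n2 is folded into two loop-invariant base
pointers, so i is the only value that changes from pass to pass.

```c
#include <assert.h>

/* Single-index form of the quoted loop.  Hypothetical name; assumes
 * n2 lies within the arrays, as it does in the indexed original. */
void butterfly_one_index(float *xr, float *xi, int n, int n1, int n2,
                         float c, float s)
{
    assert(n1 > 0);
    float *xrl = xr + n2;   /* loop-invariant: xrl[i] is xr[i + n2] */
    float *xil = xi + n2;   /* loop-invariant: xil[i] is xi[i + n2] */

    for (int i = 0; i < n; i += n1) {   /* one increment per pass */
        float xrt = xr[i] - xrl[i];
        xr[i] = xr[i] + xrl[i];
        float xit = xi[i] - xil[i];
        xi[i] = xi[i] + xil[i];
        xrl[i] = c*xrt - s*xit;
        xil[i] = c*xit + s*xrt;
    }
}
```

All four array references use a fixed base plus the one index i,
which is the access pattern that base-plus-scaled-index addressing
handles without any further address arithmetic in the loop.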

>UUCP: {ames,decwrl,prls,pyramid}!mips!earl

Thanks to Marvin Denman and Mitch Alsup of the M88K Design Group for
their help with this note.

The statements and opinions presented in this article are my own.
They should not be interpreted as being the opinions or policy,
official or otherwise, of Motorola Inc.

       /\        /\ 		William C. Anderson
      //\\      //\\		Member of the M88000 Design Group
     ///\\\    ///\\\		Motorola Microprocessor Division
    //    \\  //    \\		Oak Hill, TX.
   /        \/        \
  /                    \