Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!ll-xn!husc6!cs.utexas.edu!oakhill!wca
From: wca@oakhill.UUCP (william anderson)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks (LONG)
Keywords: FFT benchmarks M88K MIPS
Message-ID: <1384@claude.oakhill.UUCP>
Date: 14 Jul 88 18:54:05 GMT
Organization: Motorola Inc. Austin, Tx
Lines: 82

In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:

[FFT inner loop in C:]

>>	for (i=0 ; i<...) {
>>	    l = i+n2;
>>	    xrt = xr[i] - xr[l];
>>	    xr[i] = xr[i] + xr[l];
>>	    xit = xi[i] - xi[l];
>>	    xi[i] = xi[i] + xi[l];
>>	    xr[l] = c*xrt - s*xit;
>>	    xi[l] = c*xit + s*xrt;
>>	}

Mr. Killian responded with a hand-coded inner loop for the MIPS
architecture:

>	l.s	f0, 0(t0)	# xr[i]
>l:	l.s	f2, 0(t1)	# xr[l]
>	l.s	f4, 0(t2)	# xi[i]
>	sub.s	f10, f0, f2	# xrt = xr[i] - xr[l]
>	l.s	f6, 0(t3)	# xi[l]
>	mul.s	f20, f28, f10	# c*xrt
>	sub.s	f12, f4, f6	# xit = xi[i] - xi[l]
>	addu	t0, 4
>	addu	t1, 4
>	mul.s	f22, f30, f12	# s*xit
>	add.s	f14, f0, f2	# xr[i] + xr[l]
>	addu	t2, 4
>	addu	t3, 4
>	mul.s	f24, f28, f12	# c*xit
>	add.s	f16, f4, f6	# xi[i] + xi[l]
>	s.s	f14, -4(t0)	# xr[i] = xr[i] + xr[l]
>	l.s	f0, 0(t0)	# xr[i+1]
>	mul.s	f26, f30, f10	# s*xrt
>	sub.s	f20, f20, f22	# c*xrt - s*xit
>	s.s	f16, -4(t2)	# xi[i] = xi[i] + xi[l]
>	s.s	f20, -4(t1)	# xr[l] = c*xrt - s*xit
>	add.s	f24, f24, f26	# c*xit + s*xrt
>	bne	t0, t4, l
>	s.s	f24, -4(t3)	# xi[l] = c*xit + s*xrt
>
>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>native mips.  No pipelining within an op unit required (the R3010 has
>none).  Multiply/add/load/store overlap is extensively used.

One problem with this code is that it assumes the "stride" of the loop
(the variable "n1" in the C code segment above) is unity!  What about
the code for the inner loop in the general case?  What effect does the
assumption of non-unity stride have on the MIPS loop timing?
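
To make the stride question concrete, here is a minimal C sketch (it is
not taken from either posting) of the same butterfly written two ways
for a general stride n1: once with a single index, as in the quoted C
loop, and once with four pointers that are each advanced by the stride,
which is roughly what a machine without indexed addressing has to do.
The function names, the array length n, and the loop bound i + n2 < n
are assumptions made for illustration, since the quoted loop header is
incomplete.

```c
/* Sketch only: names, the length n, and the bound i + n2 < n are
   assumptions; n1 (stride) and n2 (butterfly span) follow the post. */

/* One index i; l is derived from it, as in the quoted C loop. */
void butterfly_indexed(float *xr, float *xi, int n, int n1, int n2,
                       float c, float s)
{
    for (int i = 0; i + n2 < n; i += n1) {
        int l = i + n2;
        float xrt = xr[i] - xr[l];
        xr[i] = xr[i] + xr[l];
        float xit = xi[i] - xi[l];
        xi[i] = xi[i] + xi[l];
        xr[l] = c * xrt - s * xit;
        xi[l] = c * xit + s * xrt;
    }
}

/* Four pointers, each bumped by the stride n1 once per iteration. */
void butterfly_pointers(float *xr, float *xi, int n, int n1, int n2,
                        float c, float s)
{
    if (n1 <= 0 || n2 >= n)
        return;
    int count = (n - n2 - 1) / n1 + 1;  /* iterations with i + n2 < n */
    float *pri = xr, *prl = xr + n2;    /* &xr[i], &xr[l] */
    float *pii = xi, *pil = xi + n2;    /* &xi[i], &xi[l] */
    for (;;) {
        float xrt = *pri - *prl;
        *pri = *pri + *prl;
        float xit = *pii - *pil;
        *pii = *pii + *pil;
        *prl = c * xrt - s * xit;
        *pil = c * xit + s * xrt;
        if (--count == 0)
            break;                      /* never step past the arrays */
        pri += n1; prl += n1;           /* four increments per trip,  */
        pii += n1; pil += n1;           /* versus one for the index   */
    }
}
```

Both routines should produce the same results for any positive stride;
the difference is only in how much address arithmetic sits inside the
loop, which is the point at issue in the timing comparison.
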
Since the results must be stored at the same addresses as the loads
used, must the MIPS code be altered for the general case to do its
pointer incrementing in some other place than the FP latency slots?

Since it lacks an indexed addressing mode, in this case the MIPS
architecture must increment 4 pointers instead of one array index.
The Motorola 88000's ability to do more sophisticated (indexed and
scaled-indexed) addressing in its data memory unit allows a compiler
(or an assembly code writer) to factor the index l out of the loop and
to reference all array elements with only one index increment per loop
iteration (as opposed to 4 pointer increments per iteration).  Thus the
loop instruction count for the M88K is reduced and performance is
maximized (22 instructions/23 clocks/8.7 MFLOPS/19.1 native MIPS at
20 MHz as per <1359@claude.oakhill.UUCP>).

>UUCP: {ames,decwrl,prls,pyramid}!mips!earl

Thanks to Marvin Denman and Mitch Alsup of the M88K Design Group for
their help with this note.

The statements and opinions presented in this article are my own.
They should not be interpreted as being the opinions or policy,
official or otherwise, of Motorola Inc.

      /\        /\          William C. Anderson
     //\\      //\\         Member of the M88000 Design Group
    ///\\\    ///\\\        Motorola Microprocessor Division
   //    \\  //    \\       Oak Hill, TX.
  /        \/        \
 /                    \