Path: utzoo!utgpu!water!watmath!clyde!att!ucbvax!decwrl!granite!jmd From: jmd@granite.dec.com (John Danskin) Newsgroups: comp.arch Subject: Re: m88000 benchmarks Keywords: FFT benchmarks M88K MIPS Message-ID: <249@granite.dec.com> Date: 15 Jul 88 17:28:55 GMT Organization: DEC Technology Development, Palo Alto, CA Lines: 127 x x x I have screwed around a little bit with Earl's code to try to answer some of wca@oakhill (William C. Anderson)'s questions about performance with non unity stride. There is probably a better answer, but here goes: >In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes: >>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote: > > [FFT inner loop in C:] > >>> for (i=0 ; i>> l = i+n2; >>> xrt = xr[i] - xr[l]; >>> xr[i] = xr[i] + xr[l]; >>> xit = xi[i] - xi[l]; >>> xi[i] = xi[i] + xi[l]; >>> xr[l] = c*xrt - s*xit; >>> xi[l] = c*xit + s*xrt; >>> } > >Mr. Killian responded with a hand-coded inner loop for the MIPS >architecture: > >> l.s f0, 0(t0) # xr[i] >>l: l.s f2, 0(t1) # xr[l] >> l.s f4, 0(t2) # xi[i] >> sub.s f10, f0, f2 # xrt = xr[i] - xr[l] >> l.s f6, 0(t3) # xi[l] >> mul.s f20, f28, f10 # c*xrt >> sub.s f12, f4, f6 # xit = xi[i] - xi[l] >> addu t0, 4 >> addu t1, 4 >> mul.s f22, f30, f12 # s*xit >> add.s f14, f0, f2 # xr[i] + xr[l] >> addu t2, 4 >> addu t3, 4 >> mul.s f24, f28, f12 # c*xit >> add.s f16, f4, f6 # xi[i] + xi[l] >> s.s f14, -4(t0) # xr[i] = xr[i] + xr[l] >> l.s f0, 0(t0) # xr[i+1] >> mul.s f26, f30, f10 # s*xrt >> sub.s f20, f20, f22 # c*xrt - s*xit >> s.s f16, -4(t2) # xi[i] = xi[i] + xi[l] >> s.s f20, -4(t1) # xr[l] = c*xrt - s*xit >> add.s f24, f24, f26 # c*xit + s*xrt >> bne t0, t4, l >> s.s f24, -4(t3) # xi[l] = c*xit + s*xrt >> >>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25 >>native mips. No pipelining within an op unit required (the R3010 has >>none). Multiply/add/load/store overlap is extensively used. > >One problem with this code is that is assumes the "stride" of the loop >(the varible "n1" in the C code segment above) is unity! > >What about the code for the inner loop in the general case? What >effect does the assumption of non-unity stride have on the MIPS loop >timing? # 4 * n1 is in a register named 'n1' l.s f0, 0(t0) # xr[i] l: l.s f2, 0(t1) # xr[l] l.s f4, 0(t2) # xi[i] sub.s f10, f0, f2 # xrt = xr[i] - xr[l] l.s f6, 0(t3) # xi[l] mul.s f20, f28, f10 # c*xrt sub.s f12, f4, f6 # xit = xi[i] - xi[l] addu t5, t0, n1 addu t6, t1, n1 mul.s f22, f30, f12 # s*xit add.s f14, f0, f2 # xr[i] + xr[l] addu t7, t2, n1 addu t8, t3, n1 mul.s f24, f28, f12 # c*xit add.s f16, f4, f6 # xi[i] + xi[l] s.s f14, 0(t0) # xr[i] = xr[i] + xr[l] l.s f0, 0(t5) # xr[i+1] mul.s f26, f30, f10 # s*xrt sub.s f20, f20, f22 # c*xrt - s*xit s.s f16, 0(t2) # xi[i] = xi[i] + xi[l] s.s f20, 0(t1) # xr[l] = c*xrt - s*xit add.s f24, f24, f26 # c*xit + s*xrt be t5, t4, e s.s f24, 0(t3) # xi[l] = c*xit + s*xrt l.s f2, 0(t6) # xr[l] l.s f4, 0(t7) # xi[i] sub.s f10, f0, f2 # xrt = xr[i] - xr[l] l.s f6, 0(t8) # xi[l] mul.s f20, f28, f10 # c*xrt sub.s f12, f4, f6 # xit = xi[i] - xi[l] addu t0, t5, n1 addu t1, t6, n1 mul.s f22, f30, f12 # s*xit add.s f14, f0, f2 # xr[i] + xr[l] addu t2, t7, n1 addu t3, t8, n1 mul.s f24, f28, f12 # c*xit add.s f16, f4, f6 # xi[i] + xi[l] s.s f14, 0(t5) # xr[i] = xr[i] + xr[l] l.s f0, 0(t0) # xr[i+1] mul.s f26, f30, f10 # s*xrt sub.s f20, f20, f22 # c*xrt - s*xit s.s f16, 0(t7) # xi[i] = xi[i] + xi[l] s.s f20, 0(t6) # xr[l] = c*xrt - s*xit add.s f24, f24, f26 # c*xit + s*xrt bne t0, t4, l s.s f24, 0(t3) # xi[l] = c*xit + s*xrt e: This code has exactly the same performance as earl's code except that is also handles the non-unity case. Yet another example of needing lots more registers to write good code in pipelined machines. Can I have lots more registers please? -- John Danskin | decwrl!jmd DEC Technology Development | (415) 853-6724 100 Hamilton Avenue | My comments are my own. Palo Alto, CA 94306 | I do not speak for DEC.