Path: utzoo!utgpu!water!watmath!clyde!att!ucbvax!decwrl!granite!jmd
From: jmd@granite.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks
Keywords: FFT benchmarks M88K MIPS
Message-ID: <249@granite.dec.com>
Date: 15 Jul 88 17:28:55 GMT
Organization: DEC Technology Development, Palo Alto, CA
Lines: 127


I have screwed around a little bit with Earl's code to try to
answer some of wca@oakhill (William C. Anderson)'s questions
about performance with non-unity stride. There is probably a better
answer, but here goes:


>In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:
>
>  [FFT inner loop in C:]
>
>>> 	for (i=0 ; i<n ; i+=n1) {
>>> 		l = i+n2;
>>> 		xrt = xr[i] - xr[l];
>>> 		xr[i] = xr[i] + xr[l];
>>> 		xit = xi[i] - xi[l];
>>> 		xi[i] = xi[i] + xi[l];
>>> 		xr[l] = c*xrt - s*xit;
>>> 		xi[l] = c*xit + s*xrt;
>>> 	}
>
>Mr. Killian responded with a hand-coded inner loop for the MIPS
>architecture:
>
>>	l.s	f0, 0(t0)		# xr[i]
>>l:	l.s	f2, 0(t1)		# xr[l]
>>	l.s	f4, 0(t2)		# xi[i]
>>	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
>>	l.s	f6, 0(t3)		# xi[l]
>>	mul.s	f20, f28, f10		# c*xrt
>>	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
>>	addu	t0, 4
>>	addu	t1, 4
>>	mul.s	f22, f30, f12		# s*xit
>>	add.s	f14, f0, f2		# xr[i] + xr[l]
>>	addu	t2, 4
>>	addu	t3, 4
>>	mul.s	f24, f28, f12		# c*xit
>>	add.s	f16, f4, f6		# xi[i] + xi[l]
>>	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
>>	l.s	f0, 0(t0)		# xr[i+1]
>>	mul.s	f26, f30, f10		# s*xrt
>>	sub.s	f20, f20, f22		# c*xrt - s*xit
>>	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
>>	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
>>	add.s	f24, f24, f26		# c*xit + s*xrt
>>	bne	t0, t4, l
>>	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt
>>
>>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>>native mips.  No pipelining within an op unit required (the R3010 has
>>none).  Multiply/add/load/store overlap is extensively used.
>
>One problem with this code is that it assumes the "stride" of the loop
>(the variable "n1" in the C code segment above) is unity!
>
>What about the code for the inner loop in the general case?  What
>effect does the assumption of non-unity stride have on the MIPS loop
>timing?


# The byte stride, 4 * n1, is in a register named 'n1'

	l.s	f0, 0(t0)		# xr[i]
l:	l.s	f2, 0(t1)		# xr[l]
	l.s	f4, 0(t2)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t3)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t5, t0, n1
	addu	t6, t1, n1
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t7, t2, n1
	addu	t8, t3, n1
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, 0(t0)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t5)		# xr[i+n1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, 0(t2)		# xi[i] = xi[i] + xi[l]
	s.s	f20, 0(t1)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	beq	t5, t4, e
	 s.s	f24, 0(t3)		# xi[l] = c*xit + s*xrt

	l.s	f2, 0(t6)		# xr[l]
	l.s	f4, 0(t7)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t8)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t0, t5, n1
	addu	t1, t6, n1
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t2, t7, n1
	addu	t3, t8, n1
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, 0(t5)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t0)		# xr[i+n1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, 0(t7)		# xi[i] = xi[i] + xi[l]
	s.s	f20, 0(t6)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	bne	t0, t4, l
	 s.s	f24, 0(t8)		# xi[l] = c*xit + s*xrt

e:

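This code has exactly the same performance as Earl's code (still 23
instructions per butterfly), except that it also handles the non-unity
case. The loop is unrolled twice so the pointers can alternate between
t0-t3 and t5-t8: the stores still need the old addresses after the new
ones have been computed, and with the stride in a register there is no
constant offset like Earl's -4 to back up with. Yet another example of
needing lots more registers to write good code on pipelined machines.
Can I have lots more registers please?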
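
For anyone who would rather check the control flow in C than in
assembler, here is a rough sketch of the same two-way unrolling. The
names butterfly_one, butterfly_ref and butterfly_pingpong are made up
for this example, and it uses array indices where the assembler uses
four pointers, so it only shows the alternation and the two exit tests,
not the instruction scheduling. It assumes n is a multiple of n1, which
is what lets the exit tests be equality compares as in the assembler.

```c
#include <assert.h>
#include <stddef.h>

/* One butterfly on elements i and l (hypothetical helper). */
void butterfly_one(float *xr, float *xi, size_t i, size_t l,
                   float c, float s)
{
    float xrt = xr[i] - xr[l];
    float xit = xi[i] - xi[l];
    xr[i] = xr[i] + xr[l];
    xi[i] = xi[i] + xi[l];
    xr[l] = c * xrt - s * xit;
    xi[l] = c * xit + s * xrt;
}

/* Straight transcription of the quoted C loop, for comparison. */
void butterfly_ref(float *xr, float *xi, size_t n, size_t n1,
                   size_t n2, float c, float s)
{
    for (size_t i = 0; i < n; i += n1)
        butterfly_one(xr, xi, i, i + n2, c, s);
}

/* Unrolled twice: set "a" plays the role of t0-t3, set "b" of t5-t8.
 * Each half computes the other set from its own before it finishes,
 * so the old indices are still there for the stores. */
void butterfly_pingpong(float *xr, float *xi, size_t n, size_t n1,
                        size_t n2, float c, float s)
{
    assert(n > 0 && n1 > 0 && n % n1 == 0); /* assumption, see above */
    size_t a_i = 0, a_l = n2;
    for (;;) {
        size_t b_i = a_i + n1, b_l = a_l + n1;  /* addu t5.., t0.., n1 */
        butterfly_one(xr, xi, a_i, a_l, c, s);
        if (b_i == n)                           /* beq t5, t4, e */
            break;
        a_i = b_i + n1;                         /* addu t0.., t5.., n1 */
        a_l = b_l + n1;
        butterfly_one(xr, xi, b_i, b_l, c, s);
        if (a_i == n)                           /* bne t0, t4, l */
            break;
    }
}
```

In C the second index set buys nothing, since the compiler is free to
keep the old values around by itself; the point is only that both
halves and both exits visit the same elements as the plain loop.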
-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.