Path: utzoo!attcan!uunet!husc6!mailrus!ames!ubvax!vsi1!wyse!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks (LONG)
Message-ID: <2532@wright.mips.COM>
Date: 2 Jul 88 07:11:22 GMT
References: <1359@claude.oakhill.UUCP>
Lines: 62
In-reply-to: wca@oakhill.UUCP's message of 30 Jun 88 18:58:18 GMT

In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
> 	for (i=0 ; i<n ; i+=n1) {
> 		l = i+n2;
> 		xrt = xr[i] - xr[l];
> 		xr[i] = xr[i] + xr[l];
> 		xit = xi[i] - xi[l];
> 		xi[i] = xi[i] + xi[l];
> 		xr[l] = c*xrt - s*xit;
> 		xi[l] = c*xit + s*xrt;
> 	}
> As a limit to the optimal instruction ordering for this fft inner loop
> example, we include here an equivalent inner loop, hand-coded in M88000
> assembler by Marvin Denman of the MC88100 Design Team.  Although the
> set-up and register usage for this loop differ slightly from the compiler
> examples given above, the number and selection of instructions is identical
> to compiler B.  This loop of 22 instructions executes in 23 clocks.
> [88100 code excised]
> Although the last two FP instructions in the loop do not complete before
> the next iteration, there are no data dependencies which cause a stall
> from loop to loop.  This code loop executes at 8.70 Mflops on an M88000
> system running at 20 MHz.  This corresponds to 19.1 (native) Mips.

A fun exercise.  Here's the MIPSco R3000 code for the inner loop:

	l.s	f0, 0(t0)		# xr[i]
l:	l.s	f2, 0(t1)		# xr[l]
	l.s	f4, 0(t2)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t3)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t0, 4
	addu	t1, 4
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t2, 4
	addu	t3, 4
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t0)		# xr[i+1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	bne	t0, t4, l
	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt
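
For anyone checking the assembly against the quoted C, here is a rough
C rendering of what the scheduled loop computes.  It is only a sketch
under assumptions read off the comments in the assembly: t0..t3 walk
xr[i], xr[l], xi[i], xi[l], each advancing 4 bytes (one single-precision
element) per iteration, t4 is the end value for t0, and f28/f30 hold c
and s.  The function name butterfly_pass and the count parameter are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C view of the scheduled loop: four walking pointers,
 * each stepping one float per iteration (the assembly's "addu tN, 4"). */
void butterfly_pass(float *xr_i, float *xr_l, float *xi_i, float *xi_l,
                    size_t count, float c, float s)
{
    assert(xr_i && xr_l && xi_i && xi_l);

    float *end = xr_i + count;          /* plays the role of t4 */

    while (xr_i != end) {
        float xrt = *xr_i - *xr_l;      /* xrt = xr[i] - xr[l] */
        float xit = *xi_i - *xi_l;      /* xit = xi[i] - xi[l] */

        *xr_i = *xr_i + *xr_l;          /* xr[i] = xr[i] + xr[l] */
        *xi_i = *xi_i + *xi_l;          /* xi[i] = xi[i] + xi[l] */
        *xr_l = c * xrt - s * xit;      /* xr[l] = c*xrt - s*xit */
        *xi_l = c * xit + s * xrt;      /* xi[l] = c*xit + s*xrt */

        ++xr_i; ++xr_l; ++xi_i; ++xi_l;
    }
}
```

Unlike the assembly, which tests at the bottom and so always runs at
least once, this version also tolerates count == 0.  It does not model
the early load of the next xr[i] that the assembly hoists above the
branch.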

Summary: 23 instructions in 23 cycles, which at 25 MHz is 10.9 Mflops
and 25 native Mips.  No pipelining within an op unit is required (the
R3010 has none).  Multiply/add/load/store overlap is used extensively.
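
The rates above are just operations per iteration times clock rate
divided by cycles per iteration.  A small sketch of that arithmetic,
assuming 10 flops per iteration (6 add/subtract plus 4 multiply); the
function name is made up:

```c
#include <assert.h>

/* Millions of operations per second, given operations per loop
 * iteration, cycles per loop iteration, and the clock rate in MHz. */
double mega_ops_per_sec(double ops_per_iter, double cycles_per_iter,
                        double clock_mhz)
{
    assert(cycles_per_iter > 0.0);
    return ops_per_iter * clock_mhz / cycles_per_iter;
}
```

With (10, 23, 25) this gives about 10.87 Mflops for the R3000 loop, and
with (23, 23, 25) it gives 25 native Mips.  The quoted 88100 figures
come out the same way: (10, 23, 20) is about 8.70 Mflops and
(22, 23, 20) is about 19.1 Mips.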

Analyzing this in terms of the previous posting: m=.8 (8 load/store
cycles per 10 flops), f1=.6 (6 add/subtract out of 10 flops), and f2=.4
(4 multiply out of 10 flops).  Thus a new add/subtract is required every
(m+1)/f1 = 3 cycles, and a new multiply is required every (m+1)/f2 = 4.5
cycles.  Since the respective latencies are 2 and 4 cycles, each is
shorter than its issue interval, so no pipelining is required.
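
The same check as a sketch in C, using the (m+1)/f formula exactly as
stated above.  The function names are made up, and "needs pipelining"
here just means the unit's latency exceeds the interval at which new
operations of that kind arrive:

```c
#include <assert.h>

/* Cycles between successive operations of one kind, where m is
 * load/store cycles per flop and f is that kind's fraction of flops. */
double issue_interval(double m, double f)
{
    assert(f > 0.0);
    return (m + 1.0) / f;
}

/* Nonzero if an op unit with the given latency cannot finish one
 * operation before the next one of the same kind is due. */
int needs_pipelining(double latency, double interval)
{
    return latency > interval;
}
```

For this loop, issue_interval(.8, .6) is about 3 and
issue_interval(.8, .4) is about 4.5, so needs_pipelining returns 0 for
both the add unit (latency 2) and the multiply unit (latency 4).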
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086