Path: utzoo!attcan!uunet!husc6!mailrus!ames!ubvax!vsi1!wyse!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks (LONG)
Message-ID: <2532@wright.mips.COM>
Date: 2 Jul 88 07:11:22 GMT
References: <1359@claude.oakhill.UUCP>
Lines: 62
In-reply-to: wca@oakhill.UUCP's message of 30 Jun 88 18:58:18 GMT

In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
> for (i=0 ; i l = i+n2;
>     xrt = xr[i] - xr[l];
>     xr[i] = xr[i] + xr[l];
>     xit = xi[i] - xi[l];
>     xi[i] = xi[i] + xi[l];
>     xr[l] = c*xrt - s*xit;
>     xi[l] = c*xit + s*xrt;
> }
> As a limit to the optimal instruction ordering for this fft inner loop
> example, we include here an equivalent inner loop, hand-coded in M88000
> assembler by Marvin Denman of the MC88100 Design Team. Although the
> set-up and register usage for this loop differ slightly from the compiler
> examples given above, the number and selection of instructions is identical
> to compiler B. This loop of 22 instructions executes in 23 clocks.
> [88100 code excised]
> Although the last two FP instructions in the loop do not complete before
> the next iteration, there are no data dependencies which cause a stall
> from loop to loop. This code loop executes at 8.70 Mflops on an M88000
> system running at 20 MHz. This corresponds to 19.1 (native) Mips.

A fun exercise.
Here's the MIPSco R3000 code for the inner loop:

        l.s     f0, 0(t0)       # xr[i]
l:      l.s     f2, 0(t1)       # xr[l]
        l.s     f4, 0(t2)       # xi[i]
        sub.s   f10, f0, f2     # xrt = xr[i] - xr[l]
        l.s     f6, 0(t3)       # xi[l]
        mul.s   f20, f28, f10   # c*xrt
        sub.s   f12, f4, f6     # xit = xi[i] - xi[l]
        addu    t0, 4
        addu    t1, 4
        mul.s   f22, f30, f12   # s*xit
        add.s   f14, f0, f2     # xr[i] + xr[l]
        addu    t2, 4
        addu    t3, 4
        mul.s   f24, f28, f12   # c*xit
        add.s   f16, f4, f6     # xi[i] + xi[l]
        s.s     f14, -4(t0)     # xr[i] = xr[i] + xr[l]
        l.s     f0, 0(t0)       # xr[i+1]
        mul.s   f26, f30, f10   # s*xrt
        sub.s   f20, f20, f22   # c*xrt - s*xit
        s.s     f16, -4(t2)     # xi[i] = xi[i] + xi[l]
        s.s     f20, -4(t1)     # xr[l] = c*xrt - s*xit
        add.s   f24, f24, f26   # c*xit + s*xrt
        bne     t0, t4, l
        s.s     f24, -4(t3)     # xi[l] = c*xit + s*xrt

Summary: 23 instructions, 23 cycles, 10.9 Mflops @ 25 MHz, and 25 native
Mips. No pipelining within an op unit is required (the R3010 has none).
Multiply/add/load/store overlap is used extensively.

Analyzing in terms of the previous posting: m=.8 (8 load/store cycles per
10 flops), f1=.6 (6 add/subtract out of 10 flops), f2=.4 (4 multiply out of
10 flops). Thus a new add/subtract is required every (m+1)/f1 = 3 cycles,
and a new multiply is required every (m+1)/f2 = 4.5 cycles. Since the
respective latencies are 2 and 4, no pipelining is required.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086