Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!usc!apple!spies!zorch!ardent!mips!wright.mips.com
From: earl@wright.mips.com (Earl Killian)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <34780@mips.mips.COM>
Date: 17 Jan 90 23:36:05 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP> <2811@yogi.oakhill.UUCP> <34446@mips.mips.COM> <2831@yogi.oakhill.UUCP>
Sender: news@mips.COM
Reply-To: earl@wright.mips.com (Earl Killian)
Organization: MIPS Computer Systems Inc.
Lines: 67
In-reply-to: marvin@oakhill.UUCP (Marvin Denman)

In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
>It should be noted that the 88k numbers you repeated are apparently
>at 20Mhz

I see.  I included both 25 and 33MHz numbers because I wasn't sure
what clock to compare to.  I didn't think of 20.

>and for the specific code fragment posted:
>      DO 10 J = I,N
>10    A(I,J) = A(I,J) + B(I,K) * C(K,J)
>
>The numbers you posted for the R3000 are PROBABLY for a slightly different
>code fragment: ( I am more conversant in C so I will translate)
>  for (i=0 ; i
>    for (j=0 ; j
>      for (k=0, a[i][j]=0.0 ; k
>        a[i][j] = a[i][j] + (b[i][k] * c[k][j]);
>Is that true?

Yes, the matrix multiply library routine quoted uses an algorithm
close to the above (the appropriate algorithm for matrix multiply does
depend on the machine).  One difference from the above is that it
appears you're assuming the array bounds are known at compile time,
which is not true for the library subroutine I used (the stride is a
parameter).  This makes the address arithmetic more expensive (it adds
a whole instruction per flop).  The second is that unrolling was done
for the middle loop, not the inner loop.

>When I recoded this inner loop for the 88100 unrolling the loop 8
>times, I got 10.8 Mflops at 25 Mhz and 14.4 Mflops at 33 Mhz for
>single precision.

What about double? ;-)

>The inner loop written in this style can accumulate a[i][j] into a
>register and remove the stores from the inner loop.
>...
>This loop had only 1 cycle of stalling out of 37 cycles so the
>floating point latencies had a neglible effect.

But accumulates into a[i][j] are dependent, and I thought the fp add
was 5 cycles, so 8 dependent fp adds should take a minimum of 40
cycles, true?  Did you convert to multiple parallel accumulation
registers to get around the fp latency?

>How much was the inner loop unrolled for the R3000?

The middle loop was unrolled 8 times.

Anyway, the point of my response to the original

    It will be interesting to see if MIPS goes to pipelining
    floating point instructions in future parts.

is that we're not going to add pipelining at the expense of latency,
because low latency lets you do two things well (scalar and vector),
whereas pipelining lets you do only vector well.  I was surprised that
a high-latency, highly pipelined machine like the 88100 actually
appeared to be slower on a vector problem than the R3000, and, as you
correctly pointed out, that was only because the originally posted
code was somewhat sub-optimal for the 88100.

On a vector problem, both machines should be instruction-issue
limited.  The latency or pipelining required to run at peak rate is a
function of the instruction-to-flop ratio.  We try to keep our
latencies below that ratio, whereas the 88100 keeps its pipelining
below that ratio.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
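
The latency question raised above (8 dependent fp adds at an assumed
5 cycles each, versus several parallel accumulation registers) can be
sketched in C.  This is only an illustration under stated assumptions:
the function names are hypothetical, the 5-cycle add latency is the
figure guessed at in the post rather than a checked datasheet value,
and the cycle count is a simple lower bound from add latency alone,
ignoring issue limits, multiplies, and loads.

```c
#include <stddef.h>

/* Lower bound, in cycles, imposed by fp add latency on n_adds
 * accumulations spread round-robin over n_acc independent registers.
 * Adds into the same register form a dependent chain; chains for
 * different registers can overlap in a pipelined adder.
 * (Hypothetical model, not a simulation of either machine.) */
int latency_bound_cycles(int n_adds, int add_latency, int n_acc)
{
    int longest_chain = (n_adds + n_acc - 1) / n_acc;  /* ceiling */
    return longest_chain * add_latency;
}

/* Inner product with four parallel accumulators.  The adds into
 * s0..s3 do not depend on one another, so only every fourth add
 * waits on a previous result.  Assumes n is a multiple of 4. */
double dot4(const double *b, const double *c, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t k = 0; k < n; k += 4) {
        s0 += b[k]     * c[k];
        s1 += b[k + 1] * c[k + 1];
        s2 += b[k + 2] * c[k + 2];
        s3 += b[k + 3] * c[k + 3];
    }
    /* Combine the partial sums once, outside the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

With one accumulator, latency_bound_cycles(8, 5, 1) gives the 40
cycles mentioned above; with four accumulators the same model gives
10.  Note that splitting the sum reorders the floating-point
additions, so the result can differ in the last bits from the
single-accumulator loop.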