Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!usc!apple!spies!zorch!ardent!mips!wright.mips.com
From: earl@wright.mips.com (Earl Killian)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <34780@mips.mips.COM>
Date: 17 Jan 90 23:36:05 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP> <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> <2811@yogi.oakhill.UUCP> <34446@mips.mips.COM> <2831@yogi.oakhill.UUCP>
Sender: news@mips.COM
Reply-To: earl@wright.mips.com (Earl Killian)
Organization: MIPS Computer Systems Inc.
Lines: 67
In-reply-to: marvin@oakhill.UUCP (Marvin Denman)

In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
>It should be noted that the 88k numbers you repeated are apparently
>at 20Mhz

I see.  I included both 25 and 33MHz numbers because I wasn't sure
what clock to compare to.  I didn't think of 20.

>and for the specific code fragment posted:
>   DO 10 J = I,N
>10 A(I,J) = A(I,J) + B(I,K) * C(K,J)
>
>The numbers you posted for the R3000 are PROBABLY for a slightly different
>code fragment:   ( I am more conversant in C so I will translate)
>  for (i=0 ; i<MAXI ; i++)
>    for (j=0 ; j<MAXJ ; j++)
>      for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
>        a[i][j] = a[i][j] + (b[i][k] * c[k][j]);
>Is that true?

Yes, the matrix multiply library routine quoted uses an algorithm
close to the above (the appropriate algorithm for matrix multiply does
depend on the machine).  There are two differences.  The first is that
you appear to be assuming the array bounds are known at compile-time,
which is not true for the library subroutine I used (the stride is a
parameter).  This makes the address arithmetic more expensive (it adds
a whole instruction per flop).  The second is that unrolling was done
for the middle loop, not the inner loop.
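
A minimal C sketch of those two differences, for illustration only.  It
is not the library routine itself: the name matmul_mid_unrolled, the
row-major layout, and the unroll factor of 4 (rather than 8) are all
assumptions made to keep the example short.  The row stride ld arrives
as a run-time parameter, and the middle (j) loop is the one unrolled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the MIPS library routine: a = b * c for
   n x n row-major matrices.  The row stride ld (in elements) is a
   run-time parameter, so every index needs explicit address arithmetic. */
void matmul_mid_unrolled(size_t n, size_t ld,
                         double *a, const double *b, const double *c)
{
    assert(ld >= n);
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        /* Middle (j) loop unrolled by 4: four independent sums share
           each load of b[i][k]. */
        for (; j + 4 <= n; j += 4) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            for (size_t k = 0; k < n; k++) {
                double bik = b[i * ld + k];
                const double *ck = c + k * ld + j;
                s0 += bik * ck[0];
                s1 += bik * ck[1];
                s2 += bik * ck[2];
                s3 += bik * ck[3];
            }
            a[i * ld + j]     = s0;
            a[i * ld + j + 1] = s1;
            a[i * ld + j + 2] = s2;
            a[i * ld + j + 3] = s3;
        }
        /* Leftover columns when n is not a multiple of 4. */
        for (; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += b[i * ld + k] * c[k * ld + j];
            a[i * ld + j] = s;
        }
    }
}
```

One side effect of unrolling the middle loop is that the sums inside the
k loop belong to different elements of a, so no add has to wait for the
add just before it.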

>When I recoded this inner loop for the 88100 unrolling the loop 8
>times, I got 10.8 Mflops at 25 Mhz and 14.4 Mflops at 33 Mhz for
>single precision.

What about double? ;-)

>The inner loop written in this style can accumulate a[i][j] into a
>register and remove the stores from the inner loop.
>...
>This loop had only 1 cycle of stalling out of 37 cycles so the
>floating point latencies had a negligible effect.

But accumulates into a[i][j] are dependent, and I thought the fp add
was 5 cycles, so 8 dependent fp adds should take a minimum of 40
cycles, true?  Did you convert to multiple parallel accumulation
registers to get around the fp latency?
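
To spell out the arithmetic: if the fp add really is 5 cycles (the
figure assumed above, not checked here), then 8 adds into one register
form a dependent chain of at least 8 x 5 = 40 cycles, which is more
than the 37 quoted.  A minimal C sketch of the workaround being asked
about follows.  The function names and the choice of 4 partial sums are
hypothetical, and this is not the 88100 code under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every add depends on the one before it, so the
   loop can go no faster than one add latency per element. */
double dot_single(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double s = 0.0;
    for (size_t k = 0; k < n; k++)
        s += x[k] * y[k];
    return s;
}

/* Four partial sums: the adds into s0..s3 do not depend on each other,
   so a pipelined adder can overlap them. */
double dot_multi(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        s0 += x[k]     * y[k];
        s1 += x[k + 1] * y[k + 1];
        s2 += x[k + 2] * y[k + 2];
        s3 += x[k + 3] * y[k + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; k < n; k++)
        s0 += x[k] * y[k];
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum reorders the floating point additions, so the two
versions can differ in the last bits for general data.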

>How much was the inner loop unrolled for the R3000?

The middle loop was unrolled 8 times.

Anyway, the point of my response to the original
	It will be interesting to see if MIPS goes to pipelining
	floating point instructions in future parts.
is that we're not going to add pipelining at the expense of latency,
because low-latency lets you do two things well (scalar and vector),
whereas pipelining lets you only do vector well.  I was surprised that
a high-latency highly-pipelined machine like the 88100 actually
appeared to be slower on a vector problem than the R3000, and as you
correctly pointed out, that was only because the originally posted code
was somewhat sub-optimal for the 88100.  On a vector problem, both
machines should be instruction-issue limited.  The latency or
pipelining required to run at peak rate is a function of the
instruction to flop ratio.  We try to keep our latencies below that
ratio, whereas the 88100 keeps its pipelining below that same ratio.
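
A toy model of that last point, as a sketch only.  It assumes one
instruction issues per cycle and that the flops are independent of each
other; the function name and the numbers in the note below are made up
for illustration and are not measurements of either chip.

```c
#include <assert.h>

/* Toy model, not from the original discussion.
   instr_per_flop:   instructions issued per flop (loads, address
                     arithmetic, and the flop itself).
   unit_busy_cycles: cycles before the fp unit can accept the next
                     independent operation: the full latency for an
                     unpipelined unit, the initiation interval for a
                     pipelined one.
   Returns the steady-state cycles per flop under those assumptions. */
double cycles_per_flop(double instr_per_flop, double unit_busy_cycles)
{
    assert(instr_per_flop > 0.0 && unit_busy_cycles > 0.0);
    return instr_per_flop > unit_busy_cycles ? instr_per_flop
                                             : unit_busy_cycles;
}
```

With a made-up ratio of 3 instructions per flop, an unpipelined unit
with 2-cycle latency and a pipelined unit that accepts one operation
per cycle both come out at 3 cycles per flop in this model, i.e. both
are limited by instruction issue and not by the fp unit.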
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086