Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <2831@yogi.oakhill.UUCP>
Date: 16 Jan 90 20:31:22 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP> <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> <2811@yogi.oakhill.UUCP> <34446@mips.mips.COM>
Reply-To: cs.utexas.edu!oakhill!marvin (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 52

In article <34446@mips.mips.COM>, earl@wright.mips.com (Earl Killian) writes:

>Consider the application being discussed,
>matrix multiply, which is highly vectorizable.  If the original poster
>is correct in that the 88100, with its pipelined floating-point units,
>tops out in 6.7 mflop/s in single precision matrix multiplies, it
>really proves this point.  The MIPS R3000, with non-pipelined
>floating-point units, can do matrix multiplies at
>      			   25MHz	   33MHz
>	single		11.8 mflop/s	15.7 mflop/s
>	double		 7.8 mflop/s	10.4 mflop/s
>This is an example of why MIPS prefers low-latency to pipelined fp.
>--
>UUCP: {ames,decwrl,prls,pyramid}!mips!earl
>USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

Note that the 88k numbers you repeated are apparently at 20 MHz, and are for
the specific code fragment posted:
   DO 10 J = I,N
10 A(I,J) = A(I,J) + B(I,K) * C(K,J)

The numbers you posted for the R3000 are PROBABLY for a slightly different
code fragment (I am more conversant in C, so I will translate):

  for (i=0 ; i<MAXI ; i++)
    for (j=0 ; j<MAXJ ; j++)
      for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
        a[i][j] = a[i][j] + (b[i][k] * c[k][j]);

Is that true?

The inner loop written in this style can accumulate a[i][j] in a register,
which removes the stores from the inner loop.  (Note that this requires the
assumption that the arrays do not overlap; otherwise a store to a[i][j] could
change a value of b or c that a later iteration reads.)
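
To make the register-accumulation point concrete, here is a minimal sketch in
C.  The function names matmul_store and matmul_reg and the 4x4 sizes are
hypothetical, chosen only for illustration; this is not the code that was
actually timed on either machine.  It assumes a, b and c do not overlap.

```c
#define MAXI 4
#define MAXJ 4
#define MAXK 4

/* Form from the post: a[i][j] is loaded and stored on every k iteration. */
void matmul_store(float a[MAXI][MAXJ], float b[MAXI][MAXK],
                  float c[MAXK][MAXJ])
{
    int i, j, k;

    for (i = 0; i < MAXI; i++)
        for (j = 0; j < MAXJ; j++)
            for (k = 0, a[i][j] = 0.0f; k < MAXK; k++)
                a[i][j] = a[i][j] + (b[i][k] * c[k][j]);
}

/* Same computation with the running sum held in a local variable, which a
 * compiler can keep in a register.  The only store to a[i][j] happens after
 * the k loop.  Assumption: a does not overlap b or c. */
void matmul_reg(float a[MAXI][MAXJ], float b[MAXI][MAXK],
                float c[MAXK][MAXJ])
{
    int i, j, k;

    for (i = 0; i < MAXI; i++)
        for (j = 0; j < MAXJ; j++) {
            float sum = 0.0f;

            for (k = 0; k < MAXK; k++)
                sum += b[i][k] * c[k][j];
            a[i][j] = sum;
        }
}
```

Both functions perform the same multiplies and adds in the same order; the
second simply makes explicit that the inner loop needs two loads, one
multiply and one add per iteration, with no store.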

When I recoded this inner loop for the 88100, unrolling it 8 times, I got
10.8 Mflops at 25 MHz and 14.4 Mflops at 33 MHz for single precision.  This
loop had only 1 cycle of stalling out of 37 cycles, so the floating-point
latencies had a negligible effect.  How much was the inner loop unrolled for
the R3000?  By my rough calculation, I suspect it would have to be 16 or so
to get the numbers quoted.  This is probably a legitimate difference, but I
would be interested to know whether the extra unrolling is the cause of it.
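
The arithmetic behind the 88100 figures can be sketched as below.  This
assumes each inner-loop iteration counts as 2 floating-point operations (one
multiply, one add), so 8 unrolled iterations are 16 flops per 37 cycles, and
that "33 MHz" means a nominal 33.33 MHz clock.  Both are my assumptions, and
the helper names mflops and cycles_per_iteration are hypothetical.

```c
/* Mflops = (flops per loop pass / cycles per loop pass) * clock in MHz. */
double mflops(double flops, double cycles, double mhz)
{
    return flops / cycles * mhz;
}

/* Cycles per inner-loop iteration implied by a quoted Mflops rating,
 * assuming 2 flops (multiply + add) per iteration. */
double cycles_per_iteration(double rated_mflops, double mhz)
{
    return 2.0 * mhz / rated_mflops;
}
```

Under those assumptions mflops(16, 37, 25.0) is about 10.8 and
mflops(16, 37, 33.33) is about 14.4, matching the figures above.  Likewise,
cycles_per_iteration(11.8, 25.0) puts the quoted R3000 single-precision number
at roughly 4.2 cycles per iteration, against 37/8, or about 4.6, for the
88100 loop.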

Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin
