Path: utzoo!attcan!uunet!cs.utexas.edu!samsung!uakari.primate.wisc.edu!uflorida!mephisto!udel!udccvax1!mccalpin
From: mccalpin@vax1.acs.udel.EDU (John D Mccalpin)
Newsgroups: comp.sys.sgi
Subject: Re: Processor efficiency
Message-ID: <6604@vax1.acs.udel.EDU>
Date: 15 Jun 90 14:30:12 GMT
References: <9006150334.AA03405@physics.phy.duke.edu>
Reply-To: mccalpin@vax1.udel.edu (John D Mccalpin)
Organization: College of Marine Studies, Univ. of Delaware
Lines: 64

In article <9006150334.AA03405@physics.phy.duke.edu> rgb@PHY.DUKE.EDU ("Robert G. Brown") writes:
>
>We have a Power Series 220S [....]
>[....] small jobs [...] run at around 3.5 MFLOPS (as advertised). 
>[...] if one takes these jobs (typically a loop containing just one 
>equation with a multiply, a divide, an add, and a subtract) and
>scales them up by making the loop set every element of a vector and
>increasing the size of the vector and the loop, there is a point
>(which I have not yet tried to precisely pinpoint) where the speed
>degrades substantially -- by more than a factor of two.

This degradation is a bit larger than is typical, but it is exactly
what one expects to find with many algorithms on a cached machine.
On my 4D/25 I typically see slowdowns of about 25% on dense linear
algebra benchmarks once the working set exceeds the cache size.
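
A rough way to guess where the slowdown should set in is to compare the
size of the working set with the size of the data cache.  The sketch
below is only back-of-the-envelope arithmetic: the function names are
mine, the three-vector loop is a hypothetical example, 1 kB is taken as
1024 bytes, and cache line size and mapping conflicts are ignored.

```python
import math

CACHE_BYTES = 32 * 1024  # 32 kB data cache of the 4D/25, as quoted in this post


def max_vector_length(cache_bytes, n_arrays, bytes_per_element=8):
    """Largest length for which n_arrays vectors fit in the cache together."""
    return cache_bytes // (n_arrays * bytes_per_element)


def largest_square_order(cache_bytes, bytes_per_element=8):
    """Largest n for which a single n x n matrix fits in the cache."""
    return math.isqrt(cache_bytes // bytes_per_element)


# Hypothetical loop that touches three 64-bit vectors, e.g. a(i) = b(i) + c(i)
print(max_vector_length(CACHE_BYTES, 3))

# A single matrix of 64-bit words, with nothing else in the cache
print(largest_square_order(CACHE_BYTES))
```

In practice the knee shows up somewhat earlier than this estimate,
since other data competes for the same cache.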

(Side note:  It is unfortunate that SGI put a 32 kB data cache in
the 4D/25, as it is just a bit too small to handle the 100x100 LINPACK
benchmark case.  The rated performance is 1.6 MFLOPS for the 64-bit
case, while the SPARCstation 1 is rated at 2.6 MFLOPS.  Despite these
ratings, the 4D/25 is faster than the SPARCstation on almost every
realistic FP benchmark that I have run.  Also, re-arranging the LINPACK
test case to run in block mode produces performance of up to 3.1 MFLOPS
for the same test case.)

(Side Question:  Does anyone at SGI want to tell me what the cache 
line size and refill delays are for the 4D/25?  Thanks for any info!)

>My current hypothesis is that this phenomenon is caused by saturation
>of some internal cache on the R3000.  Has anyone else noticed or
>documented this? Dr. Robert G. Brown rgb@phy.duke.edu

Here are some numbers from the port of LAPACK that I have been playing with
on my 4D/25 (32 kB data cache).  These use hand-coded BLAS routines from
earl@mips.com.  Size is the matrix order; times are in seconds.

  size     factor      solve      total     mflops
  ------------------------------------------------
    32  0.000E+00  9.398E-03  9.398E-03  2.542E+00
    50  2.819E-02  0.000E+00  2.819E-02  3.133E+00
   100  1.692E-01  0.000E+00  1.692E-01  4.059E+00
   150  6.296E-01  1.880E-02  6.484E-01  3.539E+00
   200  1.626E+00  2.819E-02  1.654E+00  3.273E+00
   250  3.411E+00  4.699E-02  3.458E+00  3.048E+00
   300  6.137E+00  6.578E-02  6.202E+00  2.931E+00
   500  2.904E+01  1.692E-01  2.921E+01  2.870E+00

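Going from the peak at size 100 (4.06 MFLOPS) to size 500 (2.87 MFLOPS),
I get a bit more than 25% degradation (about 29%).  The 0.000E+00 entries
are intervals too short for the timer to resolve; the nonzero times are
all multiples of about 9.4 ms.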
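
For anyone who wants to check the mflops column: it appears to follow
from the conventional LINPACK operation count, 2n^3/3 + 2n^2, divided by
the total time.  That formula is my assumption (the post does not state
it), but it reproduces the column to about three digits.  A small
sketch, with function names of my own choosing:

```python
def lu_solve_flops(n):
    """Conventional LINPACK count for factoring and solving an n x n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, total_seconds):
    """Millions of floating-point operations per second for one run."""
    return lu_solve_flops(n) / total_seconds / 1.0e6


# (size, total time in seconds) copied from the table above
runs = [
    (32, 9.398e-03), (50, 2.819e-02), (100, 1.692e-01), (150, 6.484e-01),
    (200, 1.654e+00), (250, 3.458e+00), (300, 6.202e+00), (500, 2.921e+01),
]

for n, t in runs:
    print(f"{n:5d}  {mflops(n, t):.3f}")

# Relative loss between the best case (size 100) and the largest case (size 500)
peak = mflops(100, 1.692e-01)
large = mflops(500, 2.921e+01)
degradation = 1.0 - large / peak
print(f"degradation: {100.0 * degradation:.1f}%")
```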

So what does one do about it?

Mostly it depends on the problem.  If your problem makes extensive use
of reduction operations (sums and dot products), you should be able to
improve the cache locality by unrolling the outer loops, so that each
operand brought into the cache is used several times before it is
evicted.  This is roughly equivalent to the block-mode algorithms used
in LAPACK.

If your operations are of the form vector <- vector + vector, then you
are basically out of luck: each element is used only once, so there is
nothing to reuse and the problem will be limited by memory bandwidth.
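
To make the outer-loop unrolling concrete, here is a minimal sketch of
a matrix-vector product (one dot product per row) with the row loop
unrolled by two.  It is an illustration only, not code from LAPACK or
the BLAS; the function names are mine.  Python will not show the speedup
itself; the point is the loop structure, which carries over directly to
Fortran or C: each x[j] is fetched once and used for two rows.

```python
def matvec_plain(a, x):
    """y = A x, computed as one reduction (dot product) per row."""
    y = []
    for row in a:
        s = 0.0
        for j in range(len(x)):
            s += row[j] * x[j]
        y.append(s)
    return y


def matvec_unrolled2(a, x):
    """Same product with the outer (row) loop unrolled by two."""
    n = len(a)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        r0, r1 = a[i], a[i + 1]
        s0 = 0.0
        s1 = 0.0
        for j in range(len(x)):
            xj = x[j]          # fetched once, used for two rows
            s0 += r0[j] * xj
            s1 += r1[j] * xj
        y[i], y[i + 1] = s0, s1
        i += 2
    if i < n:                  # leftover row when the row count is odd
        s = 0.0
        for j in range(len(x)):
            s += a[i][j] * x[j]
        y[i] = s
    return y


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
x = [1.0, 0.5, 2.0]
print(matvec_unrolled2(a, x) == matvec_plain(a, x))
```

Unrolling by two halves the number of times x is read from memory.  No
such trick exists for a(i) = b(i) + c(i), where every operand is touched
exactly once.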

Please let me know if I have not made myself clear!
-- 
John D. McCalpin                               mccalpin@vax1.udel.edu
Assistant Professor                            mccalpin@delocn.udel.edu
College of Marine Studies, U. Del.             mccalpin@scri1.scri.fsu.edu