Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!mips!zaphod.mps.ohio-state.edu!wuarchive!udel!brahms.udel.edu!mccalpin
From: mccalpin@brahms.udel.edu (John D McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Summary: improvements to my previous estimates
Message-ID: <20459@brahms.udel.edu>
Date: 15 Apr 91 19:53:53 GMT
References: <7834@idunno.Princeton.EDU> <45780003@hpcupt3.cup.hp.com> <MCCALPIN.91Apr8194820@pereland.cms.udel.edu>
Organization: College of Marine Studies, U. Del.
Lines: 64

In article <MCCALPIN.91Apr8194820@perelandra.cms.udel.edu> I wrote:

>The memory bandwidth of the machine limits the long-vector DAXPY
>performance to under 22 MFLOPS.  How well does the Fortran-produced
>code perform for long vectors?  (My estimate would be in the range of 
>18-20 MFLOPS).

I have since realized that this estimate is far too high.  Using reasonable
values for the cache miss latency, I now estimate the long-vector
DAXPY performance of the 66 MHz machines at about 9 MFLOPS, or
less than 1/2 of the figure implied by the "peak" memory bandwidth.
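
For anyone who wants to check the arithmetic behind a bandwidth-limited
DAXPY figure, here is a minimal sketch.  It assumes 8-byte operands and
that every element of x and y has to come from (and y go back to) main
memory.  The function and constant names are made up for illustration.

```python
def daxpy(a, x, y):
    """Compute y <- a*x + y element by element, updating y in place."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


BYTES_PER_DOUBLE = 8          # assumption: 64-bit operands
# Per element: load x[i], load y[i], store y[i].
BYTES_PER_ELEMENT = 3 * BYTES_PER_DOUBLE
FLOPS_PER_ELEMENT = 2         # one multiply and one add
BYTES_PER_FLOP = BYTES_PER_ELEMENT / FLOPS_PER_ELEMENT


def bandwidth_ceiling_mflops(peak_mb_per_s):
    """Upper bound on long-vector DAXPY MFLOPS if memory ran at peak rate."""
    return peak_mb_per_s / BYTES_PER_FLOP
```

With these assumptions DAXPY moves 12 bytes per floating-point operation.
A peak rate of 264 MB/s (a hypothetical value, picked only because it
reproduces the 22 MFLOPS limit quoted above) gives
bandwidth_ceiling_mflops(264.0) == 22.0.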


>For comparison, the IBM RS/6000 machines run uncached DAXPY at:

I have revised these numbers to take into account the memory access
pattern and cache miss pattern exhibited by long DAXPYs.  The
new results are shown below:

Model     Measured   Estimated   Theory    Measured/
           MFLOPS     MFLOPS     MFLOPS     Theory
------------------------------------------------------------------
320         6.25                   6.67      93.7%
320H        ----        7.81       8.33
530        10.53                  11.11      94.8%
540         ----       12.64      13.33
550         ----       17.26      18.22
------------------------------------------------------------------
HP 720       ???                   6.67
HP 750       ???                   8.80
------------------------------------------------------------------

The "sustained" MFLOPS of the machine is now modelled by:

                                        N words
 MFLOPS = (peak bandwidth) * ------------------------------- / (12 bytes/flop)
                              (N cycles + 8 cycles latency)


where N=8 for the 320, 320H, and 520, and N=16 for the 530, 540, and 550.

Note that this means that the first three machines can only sustain
1/2 of the advertised "peak" memory bandwidth, while the last three
can sustain 2/3 of the "peak" memory bandwidth.
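
The model above can be sketched in a few lines.  This is only my reading
of the formula: the fraction N/(N + latency) is treated as a
dimensionless efficiency, and the divisor is taken as 12 bytes per flop
(DAXPY moves 24 bytes for every 2 flops).  The function names are
hypothetical.

```python
def line_fill_efficiency(line_words, latency_cycles):
    """Fraction of peak bandwidth sustained while refilling one cache line.

    Assumes one word is delivered per cycle once the latency has elapsed.
    """
    return line_words / (line_words + latency_cycles)


def sustained_mflops(peak_mb_per_s, line_words, latency_cycles,
                     bytes_per_flop=12.0):
    """Sustained long-vector DAXPY rate predicted by the latency model."""
    efficiency = line_fill_efficiency(line_words, latency_cycles)
    return peak_mb_per_s * efficiency / bytes_per_flop
```

line_fill_efficiency(8, 8) gives 1/2 and line_fill_efficiency(16, 8)
gives 2/3, matching the fractions stated above.  Peak rates of 160 MB/s
and 200 MB/s (assumptions back-solved from the Theory column, not quoted
figures) reproduce the 6.67 and 11.11 MFLOPS entries for the 320 and 530.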


For the HP machines, I can only guess, but "reasonable" guesses give
an estimate like:

                                        N words
 MFLOPS = (peak bandwidth) * -------------------------------- / (12 bytes/flop)
                              (N cycles + 12 cycles latency)

I have HP glossies that tell me that the cache line size (N) on the 720
is 8 words.  I estimate a larger latency because the HP does not have
the "critical-word-first" cache refill hardware that IBM does.  So if the
memory latency to the first word is the same, the HP will wait
(on average) another 4 cycles before it gets the word it was waiting
for.

I have no info on the cache line size for the 750.
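
One way to see where the extra 4 cycles could come from is sketched
below.  It assumes the line is always filled starting from word 0, one
word per cycle, and that the word the processor is waiting for is
equally likely to be at any position in the line.  Those assumptions and
the function names are mine, not anything from HP.

```python
def mean_extra_wait_cycles(line_words):
    """Average cycles spent waiting for the needed word after the first
    word of the line arrives, without critical-word-first refill.

    Assumes the needed word is uniformly distributed over positions
    0 .. line_words-1 and that one word arrives per cycle.
    """
    return sum(range(line_words)) / line_words


def hp_efficiency(line_words=8, first_word_latency=8, extra_wait=4):
    """Fraction of peak bandwidth under the guessed HP model above."""
    return line_words / (line_words + first_word_latency + extra_wait)
```

For an 8-word line the average extra wait comes out to 3.5 cycles, which
rounds to the 4 cycles used above.  With a total latency of 12 cycles the
efficiency is 8/20 = 0.4 of peak.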

Anybody from HP want to correct my errors on the latencies and provide
some numbers for long DAXPY operations to see how well the compiler 
manages?