Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID: <MCCALPIN.91Apr8194820@pereland.cms.udel.edu>
Date: 8 Apr 91 23:48:20 GMT
References: <7834@idunno.Princeton.EDU> <45780003@hpcupt3.cup.hp.com>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 55
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: daryl@hpcupt3.cup.hp.com's message of 6 Apr 91 20:38:43 GMT

>>>>> On 6 Apr 91 20:38:43 GMT, daryl@hpcupt3.cup.hp.com (Daryl Odnert) said:

Daryl> Here are the key things to consider in formulating an answer to
Daryl> Steve's question:

Thanks for the helpful posting....

I just thought I would point out one detail that got buried near the
bottom of your posting:

Daryl> Thus, peak performance for the DAXPY loop on the 66MHz Snakes
Daryl> box is:  66 million instructions per second * (2 flops / 5 instructions)
Daryl> = 26.4 MFLOPS

Daryl> This (compiled code) result(s) in performance rating of 66 *
Daryl> (8/ 22) = 24 MFLOPS in the inner loop.  Thus at the present
Daryl> time, the compilers are achieving about 90% of the peak
Daryl> performance potential on this particular loop 
Daryl> (assuming all data fits in the cache.)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is, of course, a big assumption!  It is true that the HP caches
are rather large, but the 256kB data cache will not even hold one
200x200 double-precision matrix (320,000 bytes).
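
The size comparison can be checked in a few lines.  This is only a
sketch, assuming 8-byte doubles and kB = 1024 bytes; the function name
matrix_bytes is made up for the illustration.

```python
def matrix_bytes(rows, cols, bytes_per_element=8):
    """Storage needed for a dense rows x cols matrix of doubles."""
    return rows * cols * bytes_per_element

# 256 kB data cache, taking kB = 1024 bytes (an assumption)
CACHE_BYTES = 256 * 1024

need = matrix_bytes(200, 200)
print(need, CACHE_BYTES, need > CACHE_BYTES)
```

Any benchmark that sweeps through even one such matrix is therefore
running at least partly out of main memory, not out of the cache.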

Memory bandwidth limits long-vector DAXPY performance on this machine
to under 22 MFLOPS.  How well does the compiled Fortran code perform
for long vectors?  (My estimate would be in the range of 18-20
MFLOPS.)
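
For readers who want to see how a bandwidth limit turns into a MFLOPS
limit, here is a minimal sketch.  It assumes a simple traffic model
that is not spelled out above: each DAXPY element loads x(i), loads
y(i) and stores y(i) (24 bytes for 8-byte doubles) for 2 flops, with
no other memory traffic, and MB meaning 10^6 bytes.  The function
names are hypothetical.

```python
BYTES_PER_DOUBLE = 8
MEMORY_OPS_PER_ELEMENT = 3   # load x(i), load y(i), store y(i)
FLOPS_PER_ELEMENT = 2        # one multiply, one add

def daxpy_bound_mflops(bandwidth_mb_per_s):
    """Upper bound on DAXPY MFLOPS for a given sustained bandwidth."""
    bytes_per_element = MEMORY_OPS_PER_ELEMENT * BYTES_PER_DOUBLE
    return bandwidth_mb_per_s * FLOPS_PER_ELEMENT / bytes_per_element

def bandwidth_for_mflops(mflops):
    """Sustained bandwidth (MB/s) that a given DAXPY rate would need."""
    bytes_per_element = MEMORY_OPS_PER_ELEMENT * BYTES_PER_DOUBLE
    return mflops * bytes_per_element / FLOPS_PER_ELEMENT

# Bandwidth implied by a 22 MFLOPS bound under this traffic model
print(bandwidth_for_mflops(22.0))
```

Under this model DAXPY needs 12 bytes of memory traffic per flop, so
the bandwidth figure printed is only what the 22 MFLOPS limit implies,
not a number taken from the hardware specifications.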

It is also important to remember that the 66 MHz machines are not the
$12000 machines; those are the 50 MHz boxes.  For the 50 MHz boxes,
the numbers scale down by a factor of 0.76 (50/66), giving long-vector
DAXPY performance of about 14-15 MFLOPS.
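
The scaling is just the clock ratio applied to the 18-20 MFLOPS
estimate.  A small sketch, assuming (as the estimate does) that the
memory system speed scales with the CPU clock; scale_by_clock is a
name invented here.

```python
def scale_by_clock(mflops, from_mhz, to_mhz):
    """Scale a performance figure linearly with clock frequency."""
    return mflops * to_mhz / from_mhz

ratio = 50 / 66                        # clock ratio between the two boxes
low = scale_by_clock(18.0, 66, 50)     # low end of the 66 MHz estimate
high = scale_by_clock(20.0, 66, 50)    # high end of the 66 MHz estimate

print(f"{ratio:.2f} {low:.1f} {high:.1f}")
```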

For comparison, the IBM RS/6000 machines run uncached DAXPY at:

	Model    Uncached DAXPY (MFLOPS)    Theoretical (MFLOPS)
	320       6.25 (measured)           13.3
	320H      7.81 (est)                16.7
	530      10.53 (observed)           16.7
	540      12.64 (est)                20.0
	550      17.26 (est)                27.3

The 320H is slower than the 530 even though they both have the same
clock speed (25 MHz) because the 530 has a wider bus (128-bit vs
64-bit), larger cache line size (128 bytes vs 64 bytes), and a larger
data cache (64kB vs 32kB).

Perhaps part of the reason that the IBM performance is so much less
than the memory-bandwidth-limited performance (labeled "Theoretical"
above) is that stores cannot overlap with reads or computations....
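
To put a number on "so much less", the ratio of achieved to
theoretical performance can be computed straight from the table
above.  This sketch uses only the figures already quoted; the
efficiency helper is a name invented for the illustration.

```python
# (model, achieved MFLOPS, theoretical MFLOPS), copied from the table
RS6000_DAXPY = [
    ("320",   6.25, 13.3),
    ("320H",  7.81, 16.7),
    ("530",  10.53, 16.7),
    ("540",  12.64, 20.0),
    ("550",  17.26, 27.3),
]

def efficiency(achieved, theoretical):
    """Fraction of the bandwidth-limited rate actually achieved."""
    return achieved / theoretical

for model, achieved, theoretical in RS6000_DAXPY:
    pct = 100 * efficiency(achieved, theoretical)
    print(f"Model {model:<5} {pct:5.1f}% of theoretical")
```

The 320 and 320H come out at about 47% of the theoretical rate, and
the 530, 540 and 550 at about 63%.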
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET