Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID: <MCCALPIN.91Apr4093107@pereland.cms.udel.edu>
Date: 4 Apr 91 14:31:07 GMT
References: <1991Apr2.120117.21406@nas.nasa.gov> <7834@idunno.Princeton.EDU>
	<MCCALPIN.91Apr3142704@pereland.cms.udel.edu>
Sender: usenet@ee.udel.edu
Followup-To: comp.benchmarks
Organization: College of Marine Studies, U. Del.
Lines: 59
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: mccalpin@perelandra.cms.udel.edu's message of 3 Apr 91 19:27:04 GMT

>>>>> On 3 Apr 91 19:27:04 GMT, I (mccalpin@perelandra.cms.udel.edu)
wrote about the "Peak" vs. "Streaming" speeds (MFLOPS) of the new HP
Snake and IBM RS/6000 computers.

Thanks to jbs@watson.ibm.com for pointing out some errors that I would
like to clear up below:

-----------------------
Me> (2) Except for specially coded, cache-friendly stuff like Matrix Multiply
Me> and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
Me> calculations will be limited by the memory bandwidth.  The formula for
Me> 64-bit vector dyad operations (e.g. DSCAL) is

Me> 	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

This is almost correct.  Just delete the word DSCAL and consider
operations of the form:

	a(i) = b(i) op c(i)

where op is one of +,-,*.  This requires two 8-byte reads and one
8-byte write (24 bytes of memory traffic) per FP operation.

This operation dominates most scientific codes (in my experience), and
is therefore to be preferred over DAXPY for estimating machine speed.
DAXPY requires the same amount of memory traffic per iteration, but
squeezes in an extra FP op by using a loop-invariant scalar, and so
needs only 12 bytes/FLOP:

	y(i) = y(i) + scalar*x(i)

DSCAL requires only one read and one write per op, and so scales like:

	DSCAL MFLOPS = (MB/s)/(16 bytes/FLOP)
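
The three ratios above can be checked with a short sketch.  The kernel
table and the function names are my own illustration, not part of any
library; the only inputs are the read/write/FLOP counts described in
the text, and the DSCAL loop is assumed to be x(i) = scalar*x(i).

```python
WORD = 8  # bytes per 64-bit operand

# Per-iteration counts: (reads, writes, flops).  Illustrative table only.
KERNELS = {
    "dyad":  (2, 1, 1),  # a(i) = b(i) op c(i)
    "daxpy": (2, 1, 2),  # y(i) = y(i) + scalar*x(i)
    "dscal": (1, 1, 1),  # x(i) = scalar*x(i)  (assumed form)
}

def bytes_per_flop(name):
    """Memory traffic in bytes for each floating-point operation."""
    reads, writes, flops = KERNELS[name]
    return WORD * (reads + writes) / flops

def streaming_mflops(mbytes_per_sec, name):
    """Bandwidth-limited MFLOPS for the named kernel."""
    return mbytes_per_sec / bytes_per_flop(name)

if __name__ == "__main__":
    for name in KERNELS:
        print(name, bytes_per_flop(name), "bytes/FLOP")
```

The scalar in DAXPY and DSCAL is assumed to stay in a register, so it
contributes no memory traffic.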

-------------------------
Me> The formula for the IBM RS/6000 is slightly different since the
Me> bottleneck is between cache and the FPU, not between main memory and
Me> cache.  The "Streaming MFLOPS" for these machines are

To clarify: What I was trying to say was that the memory bandwidth to
be used to calculate streaming MFLOPS on these machines is the
bandwidth of 8 bytes/clock from the cache to the registers, NOT the
bandwidth of 16 bytes/clock from the main memory to the cache.

This extra bandwidth from main memory to cache is not wasted (since it
decreases the cache refill time considerably) but the machine is not
capable of 128 bit/clock (16 byte/clock) transfers to and from the FPU
where the work is done.
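
As a rough sketch of the difference the choice of bandwidth makes: the
20 MHz clock below is an assumption, inferred from the 160 MB/s figure
for the 320/520 divided by 8 bytes/clock, and the function name is
purely illustrative.

```python
def mflops_from_clock(bytes_per_clock, clock_mhz, bytes_per_flop=24):
    """Bandwidth-limited MFLOPS for a(i) = b(i) op c(i)."""
    # bytes/clock * (1e6 clocks/sec per MHz) gives MBytes/sec directly
    mbytes_per_sec = bytes_per_clock * clock_mhz
    return mbytes_per_sec / bytes_per_flop

CLOCK_MHZ = 20  # assumption: 160 MB/s divided by 8 bytes/clock

# Cache-to-register path: the limit that actually applies.
cache_to_fpu = mflops_from_clock(8, CLOCK_MHZ)

# Memory-to-cache path: twice as wide, but not reachable at the FPU.
memory_to_cache = mflops_from_clock(16, CLOCK_MHZ)

if __name__ == "__main__":
    print(round(cache_to_fpu, 1), round(memory_to_cache, 1))
```

Using the wider memory-to-cache path would double the estimate, which
is why the narrower cache-to-register figure is the one to use.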

-------------------------
Finally:

Me> 	IBM RS/6000 Models:
Me> 	320,520 	= 160/24 =  6.8 MFLOPS
                                    ^^^
This should of course be 6.7 MFLOPS.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET