Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID: <MCCALPIN.91Apr3142704@pereland.cms.udel.edu>
Date: 3 Apr 91 19:27:04 GMT
References: <1991Apr2.120117.21406@nas.nasa.gov> <7834@idunno.Princeton.EDU>
Sender: usenet@ee.udel.edu
Followup-To: comp.benchmarks
Organization: College of Marine Studies, U. Del.
Lines: 61
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: ssr@stokes.Princeton.EDU's message of 3 Apr 91 14:34:49 GMT

>>>>> On 3 Apr 91 14:34:49 GMT, ssr@stokes.Princeton.EDU (Steve S. Roy) said:

Steve> With all the discussion of the speed of HP's hot new Snake
Steve> systems, I've been wondering what their peak speeds are.
Steve> Suppose you hand coded a matrix multiply, trig function, FFT or
Steve> whatever.  What is the maximum speed you could get and what
Steve> would the limiting factors be?  Is a daxpy ( x = a*x+y )
Steve> limited by the FPU or by cache or by main memory?  How flexible
Steve> is the multiply accumulate?  Can the integer and fp units run
Steve> in parallel as on the i860?

(1) It looks like the "Peak Speed" of the Snakes will be 50 MFLOPS for
the 720 and 66 MFLOPS for the 730.  This is based on the comment from
someone at HP that the adder and multiplier could each accept a new
pair of operands every other clock.  Since they are fully pipelined
and can run simultaneously, this gives a peak MFLOPS = MHz.

Note that this is different from the IBM RS/6000, whose
adder/multiplier can accept operands every clock, so that its
add/multiply peak MFLOPS = 2*MHz.
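
The arithmetic in (1) can be written out as a small sketch.  The
function name and its parameters are mine, not HP's or IBM's, and the
25 MHz clock on the RS/6000-style line is only an illustrative value,
not a figure from this post:

```python
def peak_mflops(clock_mhz, units, clocks_per_issue):
    """Peak MFLOPS when `units` pipelined FP units each accept a new
    pair of operands every `clocks_per_issue` clocks."""
    return clock_mhz * units / clocks_per_issue

# HP Snakes: adder + multiplier (2 units), each issuing every other clock,
# so peak MFLOPS = MHz.
hp_720 = peak_mflops(50, units=2, clocks_per_issue=2)
hp_730 = peak_mflops(66, units=2, clocks_per_issue=2)

# RS/6000-style: add and multiply both issue every clock, so peak = 2*MHz.
# The 25 MHz clock is an assumed, illustrative value.
rs6000_example = peak_mflops(25, units=2, clocks_per_issue=1)

print(hp_720, hp_730, rs6000_example)
```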


(2) Except for specially coded, cache-friendly stuff like Matrix Multiply
and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
calculations will be limited by the memory bandwidth.  The formula for
64-bit vector dyad operations (e.g. DSCAL) is

	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

The HP has a 32-bit memory bus capable of 1 word/clock transfer rates
(4 bytes * 50 MHz = 200 MBytes/sec on the 720, 4 bytes * 66 MHz = 264
MBytes/sec on the 730), so the "Streaming MFLOPS" for these machines is:

	HP/9000 Models:
		720	= 200/24 =  8.3 MFLOPS
		730	= 264/24 = 11.0 MFLOPS
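
As a rough check of the HP figures, here is a minimal sketch of the
same calculation.  It assumes the clock rates implied by (1), 50 MHz
for the 720 and 66 MHz for the 730, and the helper name is
hypothetical:

```python
BYTES_PER_FLOP = 24  # from the streaming formula above

def streaming_mflops(bus_bytes, clock_mhz, words_per_clock=1):
    """Bandwidth-limited MFLOPS for a bus moving `words_per_clock`
    words of `bus_bytes` bytes each clock."""
    mbytes_per_sec = bus_bytes * clock_mhz * words_per_clock
    return mbytes_per_sec / BYTES_PER_FLOP

# 32-bit bus = 4 bytes per word; clock rates are the assumed 50 and 66 MHz.
for model, mhz in (("720", 50), ("730", 66)):
    print(f"{model}: {streaming_mflops(4, mhz):.1f} MFLOPS")
```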

The formula for the IBM RS/6000 is slightly different since the
bottleneck is between cache and the FPU, not between main memory and
cache.  The "Streaming MFLOPS" figures for these machines are:

	IBM RS/6000 Models:
	320,520 	= 160/24 =  6.7 MFLOPS
	530,730,930	= 200/24 =  8.3 MFLOPS
	540		= 240/24 = 10.0 MFLOPS
	550		= 328/24 = 13.7 MFLOPS

The corresponding numbers for triadic vector operations like DAXPY are
exactly twice these estimates.
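
One way to see the factor of two is to count memory traffic per FLOP.
The sketch below is my own accounting, not something from HP or IBM:
it assumes 24 bytes/FLOP means three 8-byte words moved per floating
point operation, and that DAXPY moves the same three words per element
while doing two FLOPs.  The 200 MBytes/sec figure is just one of the
bandwidths quoted above:

```python
WORD_BYTES = 8  # 64-bit operands

def bytes_per_flop(loads, stores, flops):
    """Memory traffic per floating point operation, per vector element."""
    return (loads + stores) * WORD_BYTES / flops

def streaming_mflops(mbytes_per_sec, loads, stores, flops):
    return mbytes_per_sec / bytes_per_flop(loads, stores, flops)

# Assumed dyad, z(i) = x(i) + y(i): 2 loads, 1 store, 1 FLOP -> 24 bytes/FLOP
dyad = streaming_mflops(200, loads=2, stores=1, flops=1)

# DAXPY, y(i) = a*x(i) + y(i): 2 loads, 1 store, 2 FLOPs -> 12 bytes/FLOP
daxpy = streaming_mflops(200, loads=2, stores=1, flops=2)

print(f"dyad {dyad:.1f} MFLOPS, DAXPY {daxpy:.1f} MFLOPS")
```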


(3) How fast your code runs will depend to a great degree on how
much re-use you get of cached data.  Long streaming vector ops will
run at the memory bandwidth-limited speed, while short (cacheable),
reused vectors will run closer to the "Peak Speed".
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET