Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!gatech!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: Price/Performance figures for Number-Crunching
Message-ID: <MCCALPIN.91Mar20163118@pereland.cms.udel.edu>
Date: 20 Mar 91 21:31:18 GMT
References: <MCCALPIN.91Mar18165912@pereland.cms.udel.edu>
	<1991Mar20.104926@IASTATE.EDU>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 104
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: carter@IASTATE.EDU's message of 20 Mar 91 16:49:26 GMT

> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:

Carter> We found your LINPACK/peak MFLOPS table quite interesting, and
Carter> have some data to add to it; we applaud the use of "stream
Carter> MFLOPS," which we generally refer to as "Level 1 BLAS MFLOPS"
Carter> around here.  Your "MFLOPS Max" is what we would call "Level 3
Carter> BLAS MFLOPS", meaning that operands are re-used and processors
Carter> that are bandwidth-starved can still show high numbers.

I am glad you found it interesting....

Carter> However, we find this kind of performance evaluation to be
Carter> limited and misleading.  MFLOPS measures correlate poorly with
Carter> actual performance on complete applications, and the LINPACK
Carter> measure is particularly inaccurate in that it ignores all but
Carter> a particular, not very common, form of linear algebra
Carter> operation.  

What you mean is that you find that LINPACK MFLOPS correlate poorly
with *your* complete applications.  I find pretty good correlation
with *my* complete applications.  As for someone *else's*
applications, the appropriate response is, "Well, it depends...."

But in any event, please notice that I only used the LINPACK 1000x1000
hand-optimized number to calculate a price/performance ratio (Max
MFLOPS/million $).  The other price/performance number (Stream
MFLOPS/million $) comes from other sources.

The LINPACK 1000x1000 number is a particularly *good* estimate of the
maximum speed attainable by cache-friendly applications on vector and
mildly parallel machines (ncpus=2,4,8 but not ncpus=256).  It is not
intended to be a good estimate of the speed of anyone's application.

The "Stream MFLOPS" is a good estimator (in *my* experience) of the
performance of well-structured vectorizable codes with no specific
optimizations.
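
For anyone unfamiliar with the terminology, here is a rough sketch of
the kind of kernel that a "Level 1 BLAS" or "stream" style rate refers
to: a unit-stride DAXPY, in which every operand comes from memory and
is used exactly once.  The function names and the two-flops-per-element
accounting are illustrative assumptions, not necessarily how the
numbers in the table were obtained, and interpreted Python will run
far slower than compiled FORTRAN on any machine.

```python
import time

def daxpy(a, x, y):
    """y <- a*x + y with unit stride; each operand is loaded once, never reused."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y

def daxpy_flops(n):
    """Assumed accounting: one multiply and one add per element."""
    return 2 * n

def mflops(flops, seconds):
    """Convert an operation count and an elapsed time into MFLOPS."""
    return flops / seconds / 1.0e6

if __name__ == "__main__":
    n = 100_000
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    daxpy(3.0, x, y)
    elapsed = time.perf_counter() - start
    print(f"{mflops(daxpy_flops(n), elapsed):.2f} MFLOPS (interpreted Python)")
```

The point of the sketch is only the shape of the measurement: with no
operand reuse, the rate is bounded by how fast memory can feed the
arithmetic units, which is why bandwidth-starved machines score low.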


Carter> [....]  LINPACK cannot assess parallel computers because of
Carter> its ethnocentric uniprocessor FORTRAN rules, and it does not
Carter> scale to the amount of computing power available.  Do people
Carter> buy computers to perform MFLOPS or to solve problems?

You are getting carried away by your own propaganda here.  The LINPACK
100x100 test case has some specific rules.  Following those rules
allows one to make specific statements about the results that could
not be made in the absence of those "ethnocentric uniprocessor FORTRAN
rules".  

It is certainly true that these rules do not allow massively parallel
machines to show off their best potential. So what?  If you don't like
those rules, make up another benchmark.  I would suggest something
like solving Laplace's equation on a 512x512x512 finite-difference
grid.  (That's 128 MW of data per array, and is probably big enough for
just about any massively parallel computer.)
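
The 128 MW figure can be checked with a few lines of arithmetic.  This
is a sketch under two assumptions that are not stated above: that "MW"
means 2**20 words, and that each grid point holds a single 64-bit word
in a single array.

```python
MEGAWORD = 2 ** 20        # assumption: "MW" means 2**20 words
BYTES_PER_WORD = 8        # assumption: one 64-bit word per grid point

def grid_points(n):
    """Number of points in an n x n x n finite-difference grid."""
    return n ** 3

points = grid_points(512)
megawords = points / MEGAWORD
gibibytes = points * BYTES_PER_WORD / 2 ** 30

print(points, megawords, gibibytes)   # 134217728 128.0 1.0
```

An iterative solver that keeps a second copy of the grid would need
twice that storage.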

Carter> All measurements are for 64-bit, IEEE floating-point
Carter> arithmetic.  (You need to state this in your table.  Some
Carter> vendors, such as Convex, are notorious for citing 32-bit
Carter> MFLOPS and hoping you won't notice.) 

(1) I thought that my table was clear enough.  Dongarra's report
only contains 64-bit results and my "Stream MFLOPS" are clearly based
on 64-bit arithmetic.
(2) Are you sure you have your libel in order here?  The Convex machines
do not run noticeably faster in 32-bit than in 64-bit arithmetic, so why
would they bother?  CDC is the company that would have had something to
gain, but their results were generally clearly labelled as well....

Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
Carter> The LINPACK benchmark has become so important that compilers
Carter> now have "LINPACK recognizers" that drop in super-optimized
Carter> code whenever they see a structure that looks like the LINPACK
Carter> kernel.

Documentation?  Or is this another "urban myth"?

Carter>  We have found that LINPACK overpredicts actual
Carter> application performance by an order of magnitude for some
Carter> computers... FPS, Stardent, Convex, Alliant, and to some
Carter> extent, CRAY, are all guilty.  So you might find that
Carter> traditional vector computers aren't holding up that well when
Carter> asked to do something other than DAXPY with unit stride!

Which LINPACK number overpredicts performance by an order of magnitude?

The LINPACK 100x100 test case gives very reasonable numbers for
vectorizable applications.  If your applications are running an
order of magnitude slower than the LINPACK 100x100 numbers, then you
are doing something seriously wrong -- either in implementing your
code or in choosing what machine to run it on....

On the other hand, the LINPACK 1000x1000 numbers could be an order of
magnitude faster than your application, especially on parallel
machines.  That test case was never intended to give you an estimate
of your program's performance; it was intended to verify that it is
possible to write an application that runs at a substantial fraction
of the machine's peak speed.  Thus machines like the Alliant FX/80
stuck out like a sore thumb, since the best performance was only about
1/3 of the peak advertised performance (69 MFLOPS vs 188 MFLOPS).
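
The Alliant comparison is just a ratio of the best measured rate to the
advertised peak.  A minimal sketch, using only the two figures quoted
above (the helper name is made up for illustration):

```python
def fraction_of_peak(achieved_mflops, peak_mflops):
    """Ratio of the best measured rate to the advertised peak rate."""
    return achieved_mflops / peak_mflops

# Figures quoted above for the Alliant FX/80.
alliant = fraction_of_peak(69.0, 188.0)
print(f"{alliant:.3f}")   # 0.367
```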
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET