Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!sun-barr!newstop!texsun!convex!news
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Price/Performance figures for Number-Crunching
Message-ID: <1991Mar21.000302.10103@convex.com>
Date: 21 Mar 91 00:03:02 GMT
References: <MCCALPIN.91Mar18165912@pereland.cms.udel.edu> <1991Mar20.104926@IASTATE.EDU> <MCCALPIN.91Mar20163118@pereland.cms.udel.edu>
Sender: news@convex.com (news access account)
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 89
Nntp-Posting-Host: mozart.convex.com

In article <MCCALPIN.91Mar20163118@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:
>
lots of interesting discussion of LINPACK vs $$ vs other things.
...
...until we get to here...which leads to what I want to comment on
...
>
>Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
>Carter> The LINPACK benchmark has become so important that compilers
>Carter> now have "LINPACK recognizers" that drop in super-optimized
>Carter> code whenever they see a structure that looks like the LINPACK
>Carter> kernel.
>
>McCalpin> Documentation?  Or is this another "urban myth"?
>

I would recommend the phrase "benchmark smoothing" over "benchmark rot",
in the sense that the benchmark has been used like sandpaper to smooth
rough edges out of the compiler.  The smaller the benchmark, the coarser
and more specific the smoothing.  Thus, any benchmark which has been
widely used for a number of years will be much less likely to trip over
compiler weaknesses than newer benchmarks.  Many customers use known
benchmarks to select their first round of bidders, and then ask that
their own benchmarks be run for final selection purposes.  This procedure
explains why having "LINPACK only recognizers" is not a winning approach.
Yes, you get invited to bid, but no, you don't win the business.
I'm sure such recognizers exist, but they are quickly found to be a waste
of effort.

On the compiler development side (where I sit), LINPACK certainly has
been a "compiler smoother".  Several of our current optimizations were
driven by LINPACK.  They are not specific to LINPACK, since they will
work for most dense vector*vector or matrix*vector code, but they
certainly are important to getting the LINPACK results we get.  Is this
the sort of "LINPACK recognizer" that Carter was referring to?  If so,
I am concerned about the apparent disdain for "pattern recognizers".
Pattern recognition is a critical tool in the optimizing compiler's tool bag.
There are large numbers of such "pattern recognizers" in any good optimizing
compiler, and a good compiler development group is continually trying
to identify new ones that have at least moderate degrees of utility.
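
To make the "not specific to LINPACK" point concrete, here is a rough
sketch (in Python, purely for illustration; the function names are my own
and nothing here is taken from any compiler).  The inner loop of the LINPACK
kernel is the vector update y = a*x + y, and the same loop shape turns up
when a matrix*vector product is done column by column.  An optimization
keyed to that loop shape helps both.

```python
def daxpy(a, x, y):
    """Update y in place with y[i] = a * x[i] + y[i] and return it.

    This is the loop shape that dominates the LINPACK kernel.
    """
    for i in range(len(x)):
        y[i] += a * x[i]
    return y


def matvec_by_columns(columns, x):
    """Compute y = A * x where A is given as a list of its columns.

    The product is built as one daxpy-style update per column, so the
    inner loop is exactly the same pattern as above.
    """
    y = [0.0] * len(columns[0])
    for j, column in enumerate(columns):
        daxpy(x[j], column, y)
    return y
```

A recognizer tuned for the loop in daxpy() pays off in matvec_by_columns()
too, which is the sense in which such optimizations are general.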

So, I would argue that you should not disdain the old benchmarks for
"benchmark rot", but be wary of any single number to characterize an
architecture.  Look at several benchmark results (as many as you can find),
and cross check the different machines on the different tests.  When a large
variety of tests show machine A to be twice as fast as machine B, then you
can have more confidence in your result than when a single benchmark shows
such a result.  If the results vary, then understanding why will lead to
better understanding of how the machines would work in your environment.
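
One way to do this cross check is sketched below in Python.  The function
names and the choice of a geometric mean as the summary are my own
assumptions, not an established procedure.  The idea is to compute the
speed ratio per benchmark and then look at the spread as well as the
central value.

```python
import math


def speed_ratios(times_a, times_b):
    """Return {benchmark: time_b / time_a} for benchmarks run on both machines.

    Inputs map benchmark name to run time in seconds.  A ratio above 1.0
    means machine A was faster on that benchmark.
    """
    common = sorted(set(times_a) & set(times_b))
    return {name: times_b[name] / times_a[name] for name in common}


def summarize(ratios):
    """Summarize the ratios with their geometric mean, minimum and maximum."""
    values = list(ratios.values())
    geomean = math.exp(sum(math.log(v) for v in values) / len(values))
    return {"geomean": geomean, "min": min(values), "max": max(values)}
```

If min and max sit close to the geometric mean, the "A is twice as fast as
B" claim is well supported.  If they are far apart, the benchmarks at the
extremes are the ones worth understanding.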

I agree with McCalpin that the Linpack 1000x1000 data provide a good
'truth in advertising' comparison with "PEAK/guaranteed not to exceed"
numbers quoted by marketing glossies.  Most applications will not approach
those rates without some serious optimization effort, but at least you know
what the potentials are.  Data for new systems may be subject to later
improvement due to better understanding of the system by the benchmarkers,
but otherwise, it is a useful guideline.
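
For reference, a small Python sketch of how a measured solve time turns
into a rate that can be set beside a quoted peak.  It assumes the
conventional LINPACK operation count of 2/3 n^3 + 2 n^2 floating point
operations for an n by n system.  The function names and the peak_mflops
parameter are illustrative, and no measured numbers are implied.

```python
def linpack_flops(n):
    """Nominal operation count for solving an n-by-n dense system.

    Uses the conventional LINPACK figure of 2/3 n^3 + 2 n^2.
    """
    return (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2


def mflops(n, seconds):
    """Rate in MFLOPS for an n-by-n solve that took the given wall time."""
    return linpack_flops(n) / seconds / 1.0e6


def fraction_of_peak(n, seconds, peak_mflops):
    """Measured rate as a fraction of the vendor's quoted peak rate."""
    return mflops(n, seconds) / peak_mflops
```

The last function is the 'truth in advertising' number: how much of the
"guaranteed not to exceed" rate actually shows up on a dense solve.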

That problem size should be sufficient for up to about 10 Gflop size
systems.  For larger and faster systems, I like the concept put forth by
Carter of "how large of a problem can be solved in fixed time?".  A
debatable point is "what is the ideal fixed time?"  Consider the following
two extremes:

1) Anything less than 1/25 of a second is 'apparently instant' from the
   human perception point of view.
2) For anything longer than a week, you forget why you ran the program. :-)

I know there are exceptions on both ends, but these limits serve as
convenient bounds for the domain of discourse.  The "one minute" suggested
by Carter has several good points.  As a model of real work, it is small
enough so that a researcher will not switch to some other major activity
before getting results.  This allows the researcher to continue an
incremental train of thought.  It also is small enough for the benchmarker
to run repeatedly with different configurations or tuning options.  For
these reasons, it is a useful length of time.  However, it does not cover
the 'overnight' run category.  For this category of test, issues involving
very large data sets and management of same come into play.  It is an
important area of benchmarking that is generally neglected due to the high
cost of developing the benchmark and getting vendors to run and tune such a
benchmark.  Perhaps we (vendors and customers) could encourage the
development of such 'super benchmarks' by supporting two versions of the
same code, the one minute version and the one night version.  Then, most
tuning could be done with the one minute version, and the one night version
could be used to confirm the effectiveness of the system for truly large
problems.
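
A rough Python sketch of the "largest problem in fixed time" measurement
follows.  Everything in it is an assumption for illustration: the names,
the doubling followed by bisection, and above all the premise that run time
does not decrease as the problem size n grows.  The same routine would
serve both versions by changing only the time budget.

```python
import time


def wall_time(run, n):
    """Return the wall-clock seconds taken by one call of run(n)."""
    start = time.perf_counter()
    run(n)
    return time.perf_counter() - start


def largest_size_within(budget_s, measure, n_max=1 << 30):
    """Return the largest n in [1, n_max] with measure(n) <= budget_s.

    measure(n) returns the run time in seconds for problem size n and is
    assumed to be nondecreasing in n.  Returns 0 if even n = 1 is too slow.
    """
    if measure(1) > budget_s:
        return 0
    lo, hi = 1, 2  # lo is always a size known to fit in the budget
    # Doubling phase: grow until a size fails or the cap is passed.
    while hi <= n_max and measure(hi) <= budget_s:
        lo = hi
        hi *= 2
    if hi > n_max:
        hi = n_max + 1  # treat the first size past the cap as not fitting
    # Bisection phase: lo fits, hi does not; narrow the gap to one.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(mid) <= budget_s:
            lo = mid
        else:
            hi = mid
    return lo
```

With a real kernel one would pass something like
lambda n: wall_time(kernel, n) as measure, where kernel is whatever code is
being benchmarked.  Real timings are noisy, so repeating each measurement
and taking the best or the median would be prudent before trusting the
answer.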

I seem to have wandered over several related topics, so I will stop now
and wait for the net's comments.