Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!gatech!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: Approximate MFLOPS
Message-ID: <MCCALPIN.90Nov22090957@pereland.cms.udel.edu>
Date: 22 Nov 90 14:09:57 GMT
References: <1131@cnw01.storesys.coles.oz.au> <6760@uceng.UC.EDU>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 82
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: mfinegan@uceng.UC.EDU's message of 21 Nov 90 16:35:17 GMT

> On 21 Nov 90 16:35:17 GMT, mfinegan@uceng.UC.EDU (michael k finegan) said:

michael> Does anyone have some reasonable way of measuring MFLOPS (Linpack
michael> or otherwise :-)) ? Could they email/post the code, or mix used ?
michael> 					mfinegan@uceng.uc.edu

I would normally reply by e-mail, but the word "reasonable" caught my
attention and I would like to know if my idea of "reasonable" matches
anyone else's.

First, the LINPACK 100x100 test is a fairly exact and repeatable
measure of MFLOPS.  It is only "fairly" exact since the vendor is
allowed to re-write the BLAS-1 (Basic Linear Algebra Subroutines)
however s/he wants in order to improve performance.  On the other
hand, the vendor is *not* allowed to modify the rest of the source at
all -- not even the comments!  Presumably this is intended to mimic the
situation of running large dusty decks with no time or expertise
available for detailed optimization.  On the down side, the test is
awfully small.
The array has only 10,000 elements, which means it can be fully
contained in a 128kB cache for the 64-bit problem and in a 64kB cache
for the 32-bit problem.  For machines with vector units or pipelined
FPUs, the overhead of calling the BLAS routines can be large compared
with the time required to process the average vector length of 67
elements.   
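
As a rough check on the sizes quoted above, here is a small Python
sketch.  It assumes a dense n-by-n array with no padding in the leading
dimension, and it assumes that elimination step k of an unblocked LU
factorization issues n-k vector updates of n-k elements each.  The
function names are made up for illustration.

```python
N = 100

def matrix_bytes(n, bytes_per_element):
    """Storage for a dense n-by-n array, in bytes (no padding assumed)."""
    return n * n * bytes_per_element

def mean_update_length(n):
    """Mean vector length over the column updates of unblocked LU.

    Assumption: step k (k = 1..n-1) issues n-k updates of length n-k.
    """
    lengths = range(1, n)                    # m = n - k runs over 1..n-1
    calls = sum(lengths)                     # one call per updated column
    elements = sum(m * m for m in lengths)   # total elements processed
    return elements / calls

print(matrix_bytes(N, 8))      # 64-bit problem, bytes
print(matrix_bytes(N, 4))      # 32-bit problem, bytes
print(mean_update_length(N))   # comes out near 2n/3
```

Under these assumptions the 64-bit array needs 80,000 bytes and the
32-bit array 40,000 bytes, and the mean update length is about 2n/3,
in line with the figure of roughly 67 elements.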

(By the way, it has never been clear to me whether producing an
"unsafe" BLAS is legal for this test.  By that I mean a BLAS which
assumes only strides of 1 and which removes all the silly "IF" tests
in the original source.  If the compilation system were smart enough
to use the "unsafe" BLAS only for the LINPACK library (probably by
inlining) and the "safe" BLAS for direct user calls, then this should
be workable.  I have seen significant performance improvements on
vector machines (ETA-10 and Cray Y/MP) by following this approach.)
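
To make the "safe" versus "unsafe" distinction concrete, here is a
Python sketch of a vector update y := a*x + y written both ways.  It is
only an illustration in the style of a reference BLAS-1 routine, not
the LINPACK source; the names axpy_general and axpy_unit are
hypothetical.

```python
def axpy_general(n, a, x, incx, y, incy):
    """y := a*x + y with arbitrary strides, including the argument tests."""
    if n <= 0 or a == 0.0:
        return                          # nothing to do
    if incx == 1 and incy == 1:
        for i in range(n):              # unit-stride fast path
            y[i] += a * x[i]
        return
    # General strides; a negative stride starts from the far end.
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += a * x[ix]
        ix += incx
        iy += incy

def axpy_unit(n, a, x, y):
    """y := a*x + y assuming unit strides and n > 0; no argument tests."""
    for i in range(n):
        y[i] += a * x[i]

x = [1.0, 2.0, 3.0]
y_safe = [10.0, 20.0, 30.0]
y_fast = [10.0, 20.0, 30.0]
axpy_general(3, 2.0, x, 1, y_safe, 1)
axpy_unit(3, 2.0, x, y_fast)
print(y_safe == y_fast)
```

The two agree whenever the caller really does pass unit strides and a
positive length, which is the only case the inlined version would need
to handle.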

There are two other LINPACK tests of interest.  The LINPACK 300x300
test was devised to use the level-2 BLAS (Matrix-Vector ops) and to
allow a slightly larger problem.  This produced huge speed
improvements on the Cray machines, but was a disaster for the
memory-to-memory Cyber 205/ETA-10 because of the non-unit strides
employed.   In any event, this test was the least popular, and has
since been dropped from the report.

The LINPACK 1000x1000 test is often called the "anything goes" test.
The vendor is allowed to solve the system of equations in any way s/he
sees fit, including hand-coding the entire solver in assembly
language.  The only requirements are that the original driver code be
used and that the MFLOPS calculation be based on the number of
operations required for the original LU-decomposition code.  Almost
all vendors of high-performance machines have been able to achieve
something close to their hardware peak performance on this test.  In
the interests of politeness, I will not name those who could *not* do
so well -- they are all in the report (see below).
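
For reference, the MFLOPS figure can be sketched in Python as below.
The operation count 2n^3/3 + 2n^2 is the one conventionally credited
to the LU factorization plus solve; it is stated here as an assumption,
since the exact formula is not spelled out above.  The function names
and the 2.5 second timing are made up.

```python
def linpack_ops(n):
    """Nominal operation count for factoring and solving an order-n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def linpack_mflops(n, seconds):
    """MFLOPS credited for an order-n solve that took `seconds`."""
    return linpack_ops(n) / seconds / 1.0e6

# Hypothetical timing: 2.5 seconds for the 1000x1000 problem.
print(linpack_mflops(1000, 2.5))
```

Because the count is fixed by the original algorithm, a vendor who
solves the system with fewer actual operations is still credited with
the nominal count.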

My experience has been that the LINPACK 1000x1000 test (using the
vendor's best technique) is a good estimate of the *real* peak
performance of a computer.  *Very* seldom is it possible to write user
code that performs at a significantly higher MFLOPS rate.  I have also
noticed that for code with very simple vector constructs (simple dyads
and triads, with few library functions) the LINPACK 100x100 case gives
a surprisingly good estimate of the performance attainable by "real"
codes.   By "surprisingly good", I mean that my real codes almost
always run at speeds within a factor of 2 of the LINPACK 100x100
results.  The largest differences are on machines whose cache
refills are slow (like my Silicon Graphics 4D/25), which run close to a
factor of 2 slower on large codes than on the LINPACK 100x100.

Enough rambling.  The LINPACK codes and the paper tabulating the
results are available from the netlib server.  Send an e-mail message
to netlib@ornl.gov, with the text:
	send index for benchmark
and the server will send an e-mail message back listing the names and
descriptions of the benchmark codes and other related material.  To
get the LINPACK 100x100 single-precision code, for example, send a
message with the text:
	send linpacks from benchmark

Have fun....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET