Path: utzoo!attcan!uunet!husc6!bloom-beacon!mit-eddie!ll-xn!ames!pioneer!eugene
From: eugene@pioneer.arpa (Eugene N. Miya)
Newsgroups: comp.arch
Subject: Re: benchmarks
Message-ID: <8734@ames.arc.nasa.gov>
Date: 13 May 88 20:15:58 GMT
References: <30872@amdahl.uts.amdahl.com> <3460014@hpsrla.HP.COM> <2175@winchester.mips.COM> <31755@amdahl.uts.amdahl.com>
Sender: usenet@ames.arc.nasa.gov
Reply-To: eugene@pioneer.UUCP (Eugene N. Miya)
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 86

In summary from my paper:

What makes good benchmarks:

You require reproducibility, comprehensibility, and you would like
simplicity.  In fact, benchmarks are too simple.  Machines are becoming
more diverse: multiprocessors of different architectures, smart
software, etc.

The simple linear model of measurement (take start time, do work, take
stop time) is susceptible to these influences.  What you want are strict
controls of the pre-test condition, the test condition, and the post-test
condition.  Also important is the during-test condition.  One of the
biggest challenges to performance measurement is parallelism: the
mythical MIPS problem, confusing work and effort, as Brooks would say.
So linearity is something we have to fight.  Our problem is
oversimplicity.
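
As a rough illustration of putting controls around the start/work/stop
model, here is a minimal Python sketch.  The names (measure, setup,
teardown, trials) are my own assumptions for illustration, not anything
from the paper.  It repeats the test, runs explicit pre-test and post-test
steps, and records wall-clock and CPU time separately so the two are not
confused.

```python
import statistics
import time


def measure(work, setup=None, teardown=None, trials=5):
    """Time `work` over several trials with explicit pre/post-test steps.

    `setup` and `teardown` are hypothetical hooks for establishing and
    restoring the pre-test and post-test conditions.
    """
    walls, cpus = [], []
    for _ in range(trials):
        state = setup() if setup else None   # pre-test condition
        wall0 = time.perf_counter()
        cpu0 = time.process_time()
        work(state)                          # test condition
        cpu1 = time.process_time()
        wall1 = time.perf_counter()
        if teardown:
            teardown(state)                  # post-test condition
        walls.append(wall1 - wall0)
        cpus.append(cpu1 - cpu0)
    return {
        "trials": trials,
        "wall_min": min(walls),
        "wall_median": statistics.median(walls),
        "wall_spread": max(walls) - min(walls),  # run-to-run variation
        "cpu_median": statistics.median(cpus),
    }


if __name__ == "__main__":
    # Each trial sorts a fresh reversed list built by the setup hook.
    print(measure(lambda data: sorted(data),
                  setup=lambda: list(range(100000, 0, -1))))
```

A large wall_spread, or wall-clock time well above CPU time, hints that
something outside the test (other users, I/O, the scheduler) got into the
measurement.  It says nothing yet about what.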

Test data cases must be carefully selected.  Do you execute a
significant (not necessarily most) portion of the code?  Data can be as
important as the program itself.  Don't even think about asking about
interactive characterization.  It's mostly a joke.  (Pardons to those
reading who did their PhD thesis on characterizing interactive
systems).

We have to know when hardware/software is being beneficial or detrimental.
We must throw off the idea that a program is an experiment [Feigenbaum?];
it is not.  A single program measurement lacks the experimental controls
necessary for good measurements.

We have synthetic as well as real program benchmarks.  The former are
usually derided as unrealistic, but the problem is that our concept of
the execution of a program is too simplistic.  We talk of memory-CPU
benchmarks when most of the time is spent doing I/O.  We have few I/O
benchmarks.  There's no such thing as a standard application.  What we
need to do is run programs over applications, then take the resultant
data to form a performance prediction.  (Notice we are never told
expected Linpacks, only measured Linpacks.)
My survey covers the shortest and longest synthetic programs I know.
Some syntactic analysis, some dynamic, etc.
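
To make "expected versus measured" concrete, here is one very simple way
a prediction could be formed; this is my own toy model for illustration,
not the method in the paper.  It assumes you have measured a rate for each
class of operation (I/O included, not just CPU) and a count of how much of
each class an application performs.  All names and numbers are invented.

```python
def predict_seconds(workload, rates):
    """Expected run time: sum over operation classes of count / rate.

    workload: operations performed, per class (hypothetical counts)
    rates:    measured operations per second, per class
    """
    missing = set(workload) - set(rates)
    if missing:
        raise KeyError(f"no measured rate for: {sorted(missing)}")
    return sum(count / rates[kind] for kind, count in workload.items())


def relative_error(expected, measured):
    """Signed error of a measurement relative to the prediction."""
    return (measured - expected) / expected


if __name__ == "__main__":
    rates = {"flop": 2.0e6, "io_byte": 5.0e5}     # invented numbers
    workload = {"flop": 4.0e6, "io_byte": 1.0e6}  # invented numbers
    expected = predict_seconds(workload, rates)   # 2.0 s + 2.0 s
    print(expected, relative_error(expected, measured=5.0))
```

The model is linear and ignores overlap between CPU and I/O, so it is
itself too simple; the point is only that a stated expectation gives the
measured number something to be compared against.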

The problem is we treat machines and benchmarks like black boxes.  We
expect near worthless single figures of merit to select machines (I
wonder what types of cars these people buy).  I'm interested in starting
a new "study": call it computer cardiology.

We must systematize the measurement process.  I'm talking to one special
software house on a benchmark test program generator and am working on
prototypes now, in my spare time.
The ideal measurement tool must have a high degree of portability.  It
must be reasonably simple, and the analysis portion must be separable
from the measurement portion.  Unfortunately, most machines make poor
measurement environments: IBM 370s, VAXen, Macs, PCs.  Quantity does not
make something good; my standalone time on our X-MP has been curtailed
because our users also need the machine.  Cray-2s don't have HPMs.

The ideal tools should allow one to vary parameters carefully, one at a
time.  Linpack, while a nice-appearing, simple single figure of merit, has
diminishing parallelism (since it's a direct solution).  Any 32-bit result
should be viewed with suspicion (it is a 64-bit test).  Its value is
that Jack Dongarra dares to name names and does not fear getting sued for
holding damning numbers; nor, for that matter, does Rick Richardson with
Dhrystone.  We want the computer equivalent of the pocket tape measure.
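
Varying parameters one at a time can be sketched in a few lines of
Python.  The parameter names below (n, stride, threads) are hypothetical
examples, not a list from the paper; the generator holds everything at a
baseline and changes a single parameter per run.

```python
def one_at_a_time(baseline, sweeps):
    """Yield (name, value, config), varying one parameter per run.

    baseline: dict of default parameter values
    sweeps:   dict mapping a parameter name to the values to try
    """
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)   # copy, so the baseline is untouched
            config[name] = value      # change exactly one parameter
            yield name, value, config


if __name__ == "__main__":
    baseline = {"n": 100, "stride": 1, "threads": 1}
    sweeps = {"n": [100, 200, 400], "stride": [1, 2, 4, 8]}
    for name, value, config in one_at_a_time(baseline, sweeps):
        print(name, value, config)
```

Each config would then be handed to whatever does the timing.  Because
only one thing changed, a difference in the result can be attributed to
that parameter rather than to a mix of several.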

Computers really don't differ in fundamental construction all that much
currently (well, Multiflow, CM[12], DAP, etc.).  These represent new
challenges for benchmarking.  No, the paper does not read like this; it's
being typed in "stream of consciousness" during the "heat of passion."

There is a mailing list devoted to performance measurement (@cs.wisc.edu).
But they are mostly queueing theorists, not benchmarkers.  Largely
quiet; after all, SIGMETRICS '88 is, what, next week?

P.S. Don't ask me for a copy yet; I will announce availability.  I've
promised far too many people and get side-tracked too often.
I have a shorter paper which is undergoing review on a tiny aspect of
the bigger picture, but I have to send a copy of the bigger one to
John, Chuck, and lots of others.  Don't worry.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."