Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!usc!apple!agate!eos!eugene
From: eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.benchmarks
Subject: More issues of benchmarking
Summary: sorry for a long posting, but I have raw data inside
Keywords: benchmarking balance,
Message-ID: <7601@eos.arc.nasa.gov>
Date: 21 Nov 90 11:08:03 GMT
Reply-To: eugene@eos.UUCP (Eugene Miya)
Organization: NASA Ames Research Center, Calif.
Lines: 261

I hope it is okay that I defer responding to what Rafael and the DG
fellow mentioned about setting a reference machine aside.  I have certainly
thought about those issues in the past and I will get to them (honest);
actually, part of the reason appears below.
I don't think I will have time to post again this week (I am actively
trying to benchmark, at XXX, XXXX, XXXXXX, XXXXXX, and X-X 8^) due to the short
week and that I am away from my office.

Instead, let me present some issues having to do with metrics, how to take
a simple one, some of the intrusive forces which act on them, and
show some results.  This is an intentionally naive survey.

There are lots of metrics: FLOPS (the joke was "When does a FLOPS become a
success?" [humor out of the way]) and its variants (MFLOPS, GFLOPS, TFLOPS,
and I just saw EFLOPS at SC'90), but also MIPS (BIPS or GIPS), and
counts of specific arbitrary units of work:
	Whet|Dhry|NFS|IO|LHYNE|KHORNER|insert your own STONES
	RhollingStones......(at NBSLIB)
There are also vaguer things like "Logical Inferences Per Second" (LIPS).
If you see a good measure of the latter, I want to talk to you.
I have no qualms about the Mega-, Giga-, Tera-, Peta-, or Exa- prefixes.
I have no qualms about the "Second."  I, and many others, do have
problems counting instructions or floating-point operations (while ignoring
other instructions).  Yes, it must be admitted early, "the only real
measure is how fast you get your solution done."  If that is all you
see, and all you care about is buying machines, stop reading here
and go to the next post.

If you have continued, I hope it is because you are interested in
the other purpose of benchmarking: understanding how to make machines faster.

I have come to the conclusion that raw timing and limited counts are
the most useful measures.  Yes, they are too simple, but I have to try
and build a framework from the bottom up to understand benchmarking.

It is possible to define any number of metrics, and write any number of
programs (I have a paper under review in which one co-worker likes my
line "Any program can be a benchmark, but good benchmarks are hard to find.").
Sure, "A metric is defined by a function d() which maps elements on a space
X...," goes on to prove d maps to the nonnegative real numbers, "triangle
inequality."  I want to toss that aside.

I'd like to use a different, more empirical approach.
As a former mathematician, I know that one of the common techniques of proof by
induction typically begins: If X exists, then there must be a smallest
representative of X.  X here is a benchmark.  We'd like it to be
representative of something we do (computation).  So I began a search for
"the smallest benchmark."  Smallest has lots of advantages: carry it anywhere,
solves the portability problem (simplicity problem, sort of), ignoring
the database of results and changing time, no card images to carry, etc.

The smallest (fewest characters) of any benchmark I know is in APL.
Pardon my APL
	256 (APL iota) ...(I have it written precisely somewhere)
Basically create a vector of elements 1..256 and sum them (Gaussian sum).
255 additions.  Time this.
It also illustrates the very first example of optimization I know:
when Carl F. Gauss "found" n(n+1)/2.  This optimization in a compiler
was reported to me by a person in the APL community who wishes to remain
anonymous, because he did this optimization prank, yes he did it as a prank,
on a user who used the above benchmark (doing a multiply, divide and one add
instead of 255 adds).
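
For readers without APL, here is a minimal C sketch of the same idea. The function names are made up for illustration and are not from the original benchmark. The first function does the work the benchmark intends to time; the second is the closed form the prank substituted for it.

```c
#include <assert.h>

/* The "benchmark" workload: add up 1..n one element at a time.
 * Starting from the first element, n = 256 costs 255 additions. */
long sum_by_adding(long n)
{
    assert(n >= 1);
    long total = 1;                 /* start from the first element */
    for (long i = 2; i <= n; i++)
        total += i;                 /* n - 1 additions in all */
    return total;
}

/* The prank: Gauss's closed form, one add, one multiply, one divide. */
long sum_by_gauss(long n)
{
    assert(n >= 1);
    return n * (n + 1) / 2;
}
```

Both return 32896 for n = 256, so a timing harness that only checks the answer cannot tell which one ran. Some current optimizing compilers may perform a similar rewrite on a loop like the first one without being asked.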

This is the shortest Unix benchmark I know:
	echo 99k2vp8opq | /bin/time dc > /dev/null
Time the arbitrary precision desk calculator in Unix to compute
99 digits of the square root of 2 (a "CPU benchmark").  Now you have learned
something.  Try it.  Suspend or move to another window and you, too, can
run a benchmark.  {Pause: exercise for reader.  Users without
BSD-style job control can escape to the shell using '!sh' or '!csh';
non-Unix users, so sorry.}  I thought about collecting
timings (see a problem ahead).  It's found in Musbus
among other places.  We used it for some folk at Bell Labs when we
first got an X-MP/12 (actually Cray-1M).
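
For the curious, the real/user/sys triple that /bin/time prints can be collected with a few POSIX calls. The following is only a sketch under assumptions: `time_command` and `struct cmd_times` are hypothetical names, and because it runs the command through /bin/sh it charges the whole pipeline (shell and echo included), not just dc as the one-liner above does.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result record: the three numbers /bin/time reports. */
struct cmd_times {
    double real, user, sys;         /* all in seconds */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run `command` through /bin/sh, wait for it, and fill in *out.
 * Returns the child's exit status, or -1 on failure. */
int time_command(const char *command, struct cmd_times *out)
{
    struct rusage before, after;
    struct timespec t0, t1;
    int status;

    assert(command != NULL && out != NULL);
    getrusage(RUSAGE_CHILDREN, &before);    /* CPU already charged to children */
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                         /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_CHILDREN, &after);

    out->real = (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out->user = tv_seconds(after.ru_utime) - tv_seconds(before.ru_utime);
    out->sys  = tv_seconds(after.ru_stime) - tv_seconds(before.ru_stime);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as `time_command("echo 99k2vp8opq | dc > /dev/null", &t)` would then leave the three times in `t`. The measurement is itself a small intrusion: the fork, the shell startup, and the clock calls all land in the "real" figure.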

Examples:
VAX-11/780 (convenient):
csh> echo 99k2vp8opq | /bin/time dc > /dev/null
        5.8 real         5.3 user         0.2 sys  
csh> uptime
  1:28am  up 2 days, 18:29,  2 users,  load average: 0.20, 0.06, 0.06

SGI Iris 4D/60 (just me):
echo 99k2vp8opq | /bin/time dc > /dev/null

real        1.9
user        1.2
sys         0.2

Cray Y-MP/8128 (heavily loaded):
echo 99k2vp8opq | /bin/time dc > /dev/null

           seconds          clocks
elapsed    0.62413       104021600
user       0.36663        61104464
sys        0.03782         6302641

A large 32-bit mainframe (in use, maybe 12 users):
echo 99k2vp8opq | /bin/time dc > /dev/null

real        0.5
user        0.3
sys         0.0

You see some of the problems benchmarkers have to go thru (like formatting).
This took about 5 minutes of mousing around.  It would have been much worse
if I had to deal with different operating systems, different timing
techniques, and different languages.  These are all called
"CPU-memory" intensive measurements.  They certainly minimize I/O
but they ignore any potential paging, etc.  You can knock these
programs (too small, too simple, too easily optimized [Gauss himself]).
They are easily transported black boxes.  They can be run so fast
their actual execution time (or the execution time of any benchmark)
improves too well to be a good measurement tool (converging on 0.0 seconds).
You may think I am presenting a straw man, but I am trying to show you that
"play" is an important part of benchmarking.  We cannot explain how the
different machines above behaved.  We took timings, but they can raise more
questions (*Why is the mainframe so close to the Cray?  Sure, the Cray is
a fast, word-oriented, floating-point engine.  How do you get bytes?  You mask
out portions of the word (expensive).  I think it is important that students
of benchmarking have experience on machines with some architectural
diversity, so that they can see some hardware (and software) artifacts
{this is oversimplified}.  If students don't, if they stay on VAXen, IBM
PCs, or the smaller mainframes, they will form some very biased views.
E.g., I am concerned about benchmarking data flow machines. *)
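
One way to see how coarse some of these numbers are is to ask what range of speed ratios each reported time allows. The sketch below is an illustration only: the struct and function names are invented, and it assumes each reported time was rounded to the nearest step of its printed resolution (a truncating clock would shift the bounds).

```c
#include <assert.h>

/* Hypothetical record: a reported time and the granularity it was printed with. */
struct timing {
    double seconds;       /* reported user time */
    double resolution;    /* smallest step the report can show */
};

/* Ratio of reference time to machine time ("how many times faster"). */
double speed_ratio(struct timing ref, struct timing m)
{
    assert(m.seconds > 0.0);
    return ref.seconds / m.seconds;
}

/* Lowest and highest ratios consistent with each time having been
 * rounded to the nearest multiple of its resolution (an assumption). */
void speed_ratio_bounds(struct timing ref, struct timing m,
                        double *lo, double *hi)
{
    assert(m.seconds - m.resolution / 2.0 > 0.0);
    *lo = (ref.seconds - ref.resolution / 2.0) / (m.seconds + m.resolution / 2.0);
    *hi = (ref.seconds + ref.resolution / 2.0) / (m.seconds - m.resolution / 2.0);
}
```

Taking the VAX user time (5.3 s, printed to 0.1 s) as the reference, the mainframe's 0.3 s gives a ratio anywhere from roughly 15 to 21, while the Cray's 0.36663 s (printed to 0.00001 s) gives about 14.3 to 14.6. Under that rounding assumption, the mainframe figure is too coarse to say how much faster than the Cray it really was on this run.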

So suppose you are very fed up with the above "dc" benchmark
(I think from Bell Labs).  You go off and naively write your own.
Let's suppose you have heard the term MFLOPS or MIPS as a metric:
Millions of floating point operations per second or millions of
instructions per second.  If you constrain yourself to high-level languages
(at least in these days), again, you should have a smallest and simplest.
You know what a Million is.  Suppose you wrote Fortran and timed:
C This is called a Pre-test initialization condition
      X=1.0
      Y=1.0
Then timed just:
      Z=X+Y
/* for C programmers*/
	float x=1.0, y=1.0, z;
and
	z = x + y;  /* 8^) */
Remember, let's be naive a moment: we don't know about registers,
caches (I'm going to run on a Cray; for those unfamiliar with Crays,
it is an exercise for the reader to determine cache size), virtual memory,
etc.  If I run a million of these, that's a MFLOPS, right?
[Additive linear models.]  Using an integer real time clock I get:

 1
 STOP  (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.002s,  4.3% of 8-CPU Machine
 HWM mem: 98502, HWM stack: 2048, Stack overflows: 0

That's 1 clock period.  This is a 6.0 nanosecond machine.  If I substitute
a floating point second clock, I get

 2.8800000000002E-6
 STOP  (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.003s,  3.2% of 8-CPU Machine
 HWM mem: 98505, HWM stack: 2048, Stack overflows: 0

That's not 6.0 nanoseconds (2.88 microseconds is 480 clock periods at
6.0 ns).  Consider the following: if I had timed that
on a VAX (VMS or Unix) the time would have been zero (or very close;
it will not always be zero unless you synchronize your clock calls).
We assume several things are invariant: that variable names, data types, and
spaces are insignificant to performance (this is not always the case).
If we had a 780 we would have to resort to some tricks to make a significant
"tick."  If I have a 1 MFLOPS machine, why can't I run the following in 1 second
(assuming an unloaded machine) [Fortran version]:
      X = 0.0
10    X = X + 1.0
      IF(X.LT.1000000.0) GO TO 10
C equivalent C is left as another reader exercise
That's a million floating-point additions (operations).  Right?  Oh, so
you see the comparison; is that a FLOP?  Many people don't think of that.
Suppose it is.  We cut the 1000000 to, say, 500000.  We see that portions
of measurement can interfere (I will make a distinction later between
intrusion and interference).  Remember we are trying to keep this small.
Did you know some architectures can execute A*B+C very fast (nearly as fast
as a single + or *)?  Why aren't we assuming time(+) == time (*)?
I have seen one FP measurement based on a machine which I cannot name
because of non-disclosure; they gave a sample line of code which included
not only +,* to get fastest speed, but also trig functions (I have that
somewhere).  That was the way to attain maximum speed on that machine.
There is another interference problem with that loop.  A smart compiler
might eliminate the expression by a compile time evaluation.  Such
loops are common when some programs repeat some unit of work many times
so that their low-resolution clocks can record some time (you then divide
by the number of repetitions to get an average rate [this ignores an initial
page fault]).  So some of this stuff is subtle, and it's why we need
to start thinking about measurement equivalence.
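
Two of the points above, that the clock call itself costs time and that a repetition loop can be optimized away, can be sketched in C. This is an illustration under assumptions, not the code behind the numbers in this post: the function names are invented, and it uses the POSIX `clock_gettime` call rather than the Cray clock routines.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Cost of the measurement itself: average gap between two
 * back-to-back clock reads with no work in between. */
double timer_overhead(long samples)
{
    assert(samples > 0);
    double total = 0.0;
    for (long i = 0; i < samples; i++) {
        double t0 = now_seconds();
        double t1 = now_seconds();
        total += t1 - t0;
    }
    return total / (double)samples;
}

/* Time `reps` floating-point additions and return seconds per iteration
 * (loop control included).  The volatile qualifier keeps the compiler
 * from evaluating the loop at compile time; the final sum goes to *result. */
double time_adds(long reps, double *result)
{
    assert(reps > 0 && result != NULL);
    volatile double x = 0.0;
    double t0 = now_seconds();
    for (long i = 0; i < reps; i++)     /* integer counter: no FP compare */
        x = x + 1.0;
    double t1 = now_seconds();
    *result = x;
    return (t1 - t0) / (double)reps;
}
```

Comparing `timer_overhead(1000)` with the time for a single add shows why one operation cannot be timed directly. Note that `volatile` is itself an interference: it forces a load and a store on every iteration, so the figure is not the cost of the add alone.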

As a preview of advanced (play) issues to come, I offer:
It helps to have cycle time clocks, but non-intrusive (or minimally
intrusive) hardware monitoring tools will be needed.  The real estate
is very expensive.  Again, the Y-MP and the X-MP are good for this.
Can I post an hpm(1) man page without violating CRI's copyright?  I
suspect not.  But I think I can post simple output, to give you an idea
what it does, and we can get into details later.  This is an output
for a simple FORTRAN program:

 STOP  (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.002s,  4.1% of 8-CPU Machine
 HWM mem: 97697, HWM stack: 2048, Stack overflows: 0
Group 0:  CPU seconds   :       0.00      CP executing     :         193512

Million inst/sec (MIPS) :      44.30      Instructions     :          51437
Avg. clock periods/inst :       3.76
% CP holding issue      :      42.81      CP holding issue :          82845
Inst.buffer fetches/sec :       0.77M     Inst.buf. fetches:            897
Floating adds/sec       :       0.21M     F.P. adds        :            246
Floating multiplies/sec :       0.23M     F.P. multiplies  :            267
Floating reciprocal/sec :       0.05M     F.P. reciprocals :             54
I/O mem. references/sec :       0.00M     I/O references   :              0
CPU mem. references/sec :      14.72M     CPU references   :          17092

Floating ops/CPU second :       0.49M

For a C program of equivalent functionality:
Group 0:  CPU seconds   :       0.00      CP executing     :          35317

Million inst/sec (MIPS) :      46.83      Instructions     :           9923
Avg. clock periods/inst :       3.56
% CP holding issue      :      43.34      CP holding issue :          15308
Inst.buffer fetches/sec :       0.66M     Inst.buf. fetches:            140
Floating adds/sec       :       0.00M     F.P. adds        :              1
Floating multiplies/sec :       0.00M     F.P. multiplies  :              0
Floating reciprocal/sec :       0.00M     F.P. reciprocals :              0
I/O mem. references/sec :       0.00M     I/O references   :              0
CPU mem. references/sec :      17.20M     CPU references   :           3645

Floating ops/CPU second :       0.00M
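
The rate columns in these reports can be rederived from the raw counters, which is a useful sanity check on any monitoring tool. The sketch below assumes the 6.0 ns clock period mentioned earlier; the struct and function names are invented for illustration and are not part of hpm.

```c
#include <assert.h>

/* Assumed clock period in seconds (6.0 ns, as stated for this machine). */
#define CLOCK_PERIOD_S 6.0e-9

/* Hypothetical record holding a few raw counters from one report. */
struct hpm_counts {
    long clock_periods;     /* "CP executing" */
    long instructions;      /* "Instructions" */
    long hold_issue;        /* "CP holding issue" */
    long fp_adds, fp_mults, fp_recips;
};

double cpu_seconds(struct hpm_counts c)
{
    assert(c.clock_periods > 0);
    return (double)c.clock_periods * CLOCK_PERIOD_S;
}

double mips(struct hpm_counts c)
{
    return (double)c.instructions / cpu_seconds(c) / 1e6;
}

double clocks_per_inst(struct hpm_counts c)
{
    assert(c.instructions > 0);
    return (double)c.clock_periods / (double)c.instructions;
}

double pct_hold_issue(struct hpm_counts c)
{
    assert(c.clock_periods > 0);
    return 100.0 * (double)c.hold_issue / (double)c.clock_periods;
}

/* Adds, multiplies, and reciprocals are all counted as floating ops here,
 * which is what reproduces the reported total. */
double mflops(struct hpm_counts c)
{
    return (double)(c.fp_adds + c.fp_mults + c.fp_recips) / cpu_seconds(c) / 1e6;
}
```

Fed the Fortran counters (193512, 51437, 82845, 246, 267, 54), these give about 44.30 MIPS, 3.76 clock periods per instruction, 42.81% holding issue, and 0.49 MFLOPS, matching the report. The same arithmetic on the C counters gives about 46.83 MIPS from one floating add.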

ls -l on the executables:
-rwxr-xr-x   1 eugene   xxx       717264 Nov 20 23:55 xxx
-rwxr-xr-x   1 eugene   xxx       108128 Nov 20 23:56 xx
You would probably be shocked at what these programs do for so much storage.
I should compile VAX and SUN equivalents. (I will tell in time.)
VAX:
-rwxr-xr-x  1 eugene      25600 Nov 21 02:29 xxx
-rwxr-xr-x  1 eugene       4096 Nov 21 02:31 xx
SUN
/* can't get the fortran compiler on my sun, NFS down*/
-rwxr-xr-x  1 eugene      24576 Nov 21 02:39 xx

If you don't see the concept, it doesn't exist  --Arthur Pyster
	paraphrasing George Orwell.
We need more tools like the HPM.  Software alone will not help the performance
measurement problem.  Students MUST have exposure to tools of this type.
We need to develop frames of reference and calibration tools to better
understand our benchmarks and what they are doing.  Then you too
can do a `man hpm`.

Let me investigate more specific, less naive details next week.
Whole bunch of issues to cover.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.
  It's 3 AM and I need to get at least 2 sleep periods.
  Resident cynic, Rock of Ages Home for Retired Hackers