Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!agate!shelby!eos!eugene
From: eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.benchmarks
Subject: Re: benchmarks (SPECmarks)
Message-ID: <7589@eos.arc.nasa.gov>
Date: 15 Nov 90 06:54:35 GMT
References: <7581@eos.arc.nasa.gov> <1146@dg.dg.com>
Reply-To: eugene@eos.UUCP (Eugene Miya)
Organization: NASA Ames Research Center, Calif.
Lines: 123

In article <1146@dg.dg.com> uunet!dg!lewine writes:
>	But SPEC took a particular VAX-11/780.  The 11/780 time for
>	gcc is 1482 seconds.  It is not what you get on your particular
>	VAX.  This is more like taking a gold bar in Paris and saying
>	that is the standard meter.  As long as there is only one gold
>	bar, that is not a problem.

Particular: that's right.  Two things to add: 1) DEC knew that the
performance of 780 models varied by as much as 10%.  Is 10% acceptable?
In some cases yes, others no.  Using your bar analogy (I recall it's
really platinum-iridium) that's why I gave the Metrology paper as a
reference.  The former NBS director used the term "Gold plating."
John Mash[ey]@mips.com said "VAX under glass" at one time (Cute, I
like it!).  Will you use a ruler which may be as much as 10% off?
I think our society is beyond that.  That is why the US NIST (was NBS)
maintains an atomic clock.  A highly instrumented multi-million $$
piece of hardware.

2) Taken to the extreme: why a VAX (or PC)?  Well, why not an ENIAC?
You don't want an ENIAC for the same reason you won't want a VAX in the
future.  You are cutting your own long-term throat.
That's what a platinum bar is.  That's why the NIST uses the frequency
of Kp atoms to specify not only time, but also length (distance).
We must go beyond that.  Your H/W engineers use the best oscilloscopes? 
Right?  Yet we software types are in the dark ages.

>	I think that 29 SPECmarks is more understandable than saying
>	that the Geometric Mean of the benchmark times is 133.3 
>	seconds.

John Hennessy came today and blasted geometric mean (in favor of
weighted arithmetic mean).  I will commit to no statistic before its time.

My opinion is that we must understand the sample before applying
any statistic.  I hate to say it: give me raw numbers and then I will think
about sending them to S (or BMDP or whatever).

About 29 (or 42), I don't think it's the number of benchmarks.
I had a talk at one time entitled "The Next 700 Benchmarks."
[If you didn't know there have been a string of papers beginning with
"The Next 700 Programming Languages."]  And in fact Carl Ponder (LLNL)
gave a talk about how adding benchmark information can just cloud the issue.
It's not just the number of measurements or observations you take.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

I copy Hennessy's five viewgraphs below; I do not think John will mind since
he brought Net discussion up.  I generally support most of what he had to say.
Only 1-3 were presented, 4 and 5 were left over and covered orally:

#1
	Some Comments to SPEC
+ Means for summarizing performance
+ Choosing benchmarks
+ Guidelines for running benchmarks
	[Comment: "You (SPEC members) have a responsibility for what
	your marketing people say."  I agree.]
+ The dangers of SPECthroughput
[Hennessy expressed worry about ideas cast in concrete.  I really fear
this as well, and it may be too late.]

#2	
	Why not geometric mean?
+Example

	Absolute time			Relative performance
	M1  M2  M3			M1  M2  M3
B1	5   10  10			1  0.5 0.5
B2	10   5  10			1   2   1
				GM	1   1  0.7
B means benchmark, M means Machine, GM is geometric mean

+ To replace the summary indicated by geometric mean for
  M1 and M2: run each 50% of total workload
  M1 and M3: run B1 57% and B2 43% of total workload!
  M2 and M3: run B1 43% and B2 57% of total workload!
  ----------------------------------------------------
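
[A minimal sketch, not part of the viewgraphs, of how the GM row in #2
can be recomputed.  It assumes relative performance means M1's time
divided by each machine's time; the function names are made up for
illustration.  The totals show what the GM hides: M1 and M2 tie, but
M3 is slower on the plain sum of times.]

```python
from math import prod

# Absolute times from viewgraph #2 (rows: benchmarks, columns: machines).
times = {
    "B1": {"M1": 5, "M2": 10, "M3": 10},
    "B2": {"M1": 10, "M2": 5, "M3": 10},
}
machines = ("M1", "M2", "M3")


def relative_performance(times, base="M1"):
    """Assumed definition: base machine time / machine time, per benchmark."""
    return {
        bench: {m: row[base] / t for m, t in row.items()}
        for bench, row in times.items()
    }


def geometric_mean(values):
    """n-th root of the product of n values."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


rel = relative_performance(times)
gm = {m: geometric_mean(rel[b][m] for b in rel) for m in machines}
total = {m: sum(times[b][m] for b in times) for m in machines}

for m in machines:
    print(m, "GM =", round(gm[m], 2), "total time =", total[m])
```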

#3
	Why weighted arithmetic mean

+ Single weighting yields results proportional to execution time!
+ Suggested weighting: equal time on base machine.
Results: weights for earlier example are 2/3, 1/3.

	Weighted execution times
	M1	M2	M3
B1	10/3	20/3	20/3		can also use a weighted harmonic
B2	10/3	5/3	10/3		mean [I know this ref to be Worlton]
AM	20/3	25/3	30/3
Perf.	1.0	0.8	0.67		--inverse of execution time
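
[A sketch, again not from the viewgraphs, of the arithmetic behind #3.
It assumes "equal time on base machine" means each benchmark's weight
is proportional to the inverse of its M1 time, with weights summing
to 1.  The helper names are hypothetical.]

```python
from fractions import Fraction

# Absolute times from viewgraph #2.
times = {
    "B1": {"M1": 5, "M2": 10, "M3": 10},
    "B2": {"M1": 10, "M2": 5, "M3": 10},
}


def equal_time_weights(times, base="M1"):
    """Weights that give every benchmark the same weighted time on the base."""
    inverse = {b: Fraction(1, row[base]) for b, row in times.items()}
    norm = sum(inverse.values())
    return {b: w / norm for b, w in inverse.items()}


def weighted_arithmetic_mean(times, weights):
    """Sum over benchmarks of weight * time, for each machine."""
    machines = list(next(iter(times.values())))
    return {m: sum(weights[b] * times[b][m] for b in times) for m in machines}


weights = equal_time_weights(times)
wam = weighted_arithmetic_mean(times, weights)
# Performance relative to M1 is the inverse ratio of weighted times.
perf = {m: wam["M1"] / t for m, t in wam.items()}

for m in wam:
    print(m, "weighted time =", wam[m], "perf =", float(perf[m]))
```

With exact fractions the weights come out as 2/3 and 1/3 and the
weighted times as 20/3, 25/3 and 30/3, matching the table above.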

Not shown but discussed:
#4
	Choosing benchmarks
+ Some evaluation procedures need to be established to choose benchmarks.

+ These need to focus on questions like:
 - is this a real program
 - how many lines constitute the 90% or 95% point
 - is the input appropriate
+ How will you know the potential defects before choosing the benchmark?
[I have thought of some of these questions and I wish to discuss them
and some ideas and will try to present them in the coming days and weeks.]

#5
	Guidelines for running programs
+ Serious problems can arise because guidelines for running benchmarks
(typo) are not precise.
	[No kidding, this was a point in one SPEC discussion,
	I am not a SPEC member but was invited.  Maybe I should
	post a few notes or impressions.  Basically SPEC is kind of
	a good thing; only I wish it had been ANSI instead [some minuses]]
+ Some examples
 - what routines can be replaced by libraries?
 - what are the requirements for runtime checks such as bounds checking
   and FP exception checks.

I should note that I am not innocent, and one of the SPEC benchmarks came
from me (and we have serious constraints on running that program,
it was renamed).