Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!ucbvax!ucbarpa.Berkeley.EDU!rafael
From: rafael@ucbarpa.Berkeley.EDU (Rafael H. Saavedra-Barrera)
Newsgroups: comp.benchmarks
Subject: Re: benchmarks (SPECmarks)
Message-ID: <39622@ucbvax.BERKELEY.EDU>
Date: 16 Nov 90 07:18:41 GMT
References: <7581@eos.arc.nasa.gov> <1146@dg.dg.com> <7589@eos.arc.nasa.gov>
Sender: usenet@ucbvax.BERKELEY.EDU
Reply-To: rafael@ucbarpa.Berkeley.EDU.UUCP (Rafael H. Saavedra-Barrera)
Organization: University of California, Berkeley
Lines: 104

In article <7581@eos.arc.nasa.gov> ucbvax!agate!shelby!eos!eugene writes:
> Particular: that's right.  Two things to add: 1) DEC knew that the
> performance of 780 models varied by as much as 10%.  Is 10% acceptable?
> In some cases yes, others no.  ...

Gene, you are completely missing the point. The 10% variation on the 
VAX 780 has nothing to do with defining a unit of performance and using
it to measure. If you look at the SPEC reports, you'll notice that they 
use the SAME execution times for the reference machine in all reports 
and for all machines. There is no 10% variation. What is needed is 
something we can use to measure, and that everyone uses, period. From 
the Encyclopaedia Britannica:

	Measuring a quantity means ascertaining its ratio to 
	some other fixed quantity of the same kind, known as 
	the unit of that kind of quantity. A unit is an *abstract 
	conception*, defined either by reference to some arbitrary 
	material or to natural phenomena.
	
The key words are: ratio, fixed, and arbitrary. The SPEC people made 
a reasonable, but arbitrary definition of what represents their fixed 
quantity. Once you have done that, everything else follows, and there 
is nothing else to discuss. All you require from a unit of 
measurement is: 1) that it is fixed; 2) that it has validity for some 
significant group of people; 3) that it can be verified. The SPECratio 
certainly satisfies all 3 conditions. As long as the SPEC people keep 
the *same* VAX 780, with the same software, and run the programs under 
the same conditions, there is nothing to object to. I don't know if 
they are doing this, but they should, if they want to avoid problems 
in the future.

> Taken to the extreme 2) why a VAX (or PC), well, why not an ENIAC?
> You don't want an ENIAC for the same reason you won't want a VAX in the
> future.  You are cutting your own long-term throat.

Wrong! The SPEC people didn't choose an ENIAC because there are no
ENIACs that can be used to run benchmarks, and very few people alive
have ever used an ENIAC. Why not a CRAY? Because it is very expensive 
to keep a CRAY under glass (same software, etc.) just to run benchmarks 
on it. But in principle an ENIAC, a PC, or a CRAY is as good as a 
VAX 780 for the purposes of what the SPEC people are doing. 

One of the nice properties of the SPECmark is that it is INVARIANT to
the machine you use as your reference point. The relative performance 
between a MIPS M/2000 and a Sparcstation I, or between a DEC 3100 and
an IBM RS/6000 530, is the same, independent of whether you use a VAX 
780, an ENIAC, or any other machine. So the issue of using a 780 
disappears.
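
To make the invariance concrete, here is a minimal sketch in Python. 
It assumes the SPECmark is the geometric mean of the per-program 
SPECratios (reference time divided by measured time). All the times 
below are made-up numbers for two hypothetical reference machines and 
two hypothetical test machines, not real SPEC data. The arithmetic 
mean of the same ratios is computed only for contrast.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

def ratios(ref_times, machine_times):
    """Per-program ratios: reference time / measured time."""
    return [r / m for r, m in zip(ref_times, machine_times)]

# Hypothetical execution times in seconds, one entry per benchmark.
ref_a = [100.0, 400.0, 250.0]    # hypothetical reference machine A
ref_b = [7.0, 90.0, 33.0]        # hypothetical reference machine B
machine_x = [10.0, 20.0, 50.0]   # hypothetical machine under test X
machine_y = [5.0, 40.0, 25.0]    # hypothetical machine under test Y

# Relative performance of X versus Y under each reference machine,
# summarizing the ratios with the geometric mean.
gm_rel_a = (geometric_mean(ratios(ref_a, machine_x))
            / geometric_mean(ratios(ref_a, machine_y)))
gm_rel_b = (geometric_mean(ratios(ref_b, machine_x))
            / geometric_mean(ratios(ref_b, machine_y)))

# Same comparison, summarizing the ratios with the arithmetic mean.
am_rel_a = (arithmetic_mean(ratios(ref_a, machine_x))
            / arithmetic_mean(ratios(ref_a, machine_y)))
am_rel_b = (arithmetic_mean(ratios(ref_b, machine_x))
            / arithmetic_mean(ratios(ref_b, machine_y)))

print("geometric mean, X/Y with ref A:", gm_rel_a)
print("geometric mean, X/Y with ref B:", gm_rel_b)
print("arithmetic mean, X/Y with ref A:", am_rel_a)
print("arithmetic mean, X/Y with ref B:", am_rel_b)
```

With the geometric mean the reference times cancel out of the X/Y 
quotient, so both reference machines give the same relative 
performance. With the arithmetic mean of the ratios they do not 
cancel, and in this made-up data the answer changes with the reference.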

Your arguments sound very similar to the ones that one of your
fictitious ancestors, the Marquis Eugene De La Milla, made about the 
use of a quadrant of the Earth to define the meter in 1790. He asked: 
why use a quadrant of the Earth as reference when there are bigger 
planets that may be more relevant to future generations of human beings?
Why 1/10,000,000th of the quadrant instead of 1/3,141,592th, which looks 
more like pi? Where is the center of Paris? The center of the Ile de la 
Cite, or Notre Dame? [I bet you didn't know you had a French ancestor].

Do you know how the French measured the particular quadrant of the Earth 
that runs from the North Pole to the Equator and passes through Paris, 
one hundred years before the first man reached the North Pole? Did it 
make a difference?

There are many more interesting questions to ask about the SPEC 
benchmarks. For example: What does each program measure? Are the 
programs really exercising different aspects of the machine? How 
representative is matrix300 of typical linear algebra codes? Can a 
clever compiler writer make minimal changes to the compiler that will 
significantly improve the SPECmark for a particular machine, but will
have only a marginal benefit for most users' workloads? How do I estimate 
the performance of my workload by looking at the SPEC results? Why is the 
SPECratio of spice2g6 low on most machines when other double precision 
codes have better performance? Is the geometric mean a good statistic?
These are a few of the questions I would like to know the answers to 
(some I know).

> About 29 (or 42), I don't think it's the number of benchmarks.
> I had a talk at one time entitled "The Next 700 Benchmarks."
> [If you didn't know there have been a string of papers beginning with
> "The Next 700 Programming Languages."]  And in fact Carl Ponder (LLNL)
> gave a talk about adding benchmark information can just cloud the issue.
> It's not just the number of measurements or observations you take.

I don't agree. 29 benchmarks are better than 10 benchmarks, *if* the 
29 benchmarks are well chosen. Every benchmark represents an empirical
observation of the performance of the machine. More observations are
better than fewer, especially after seeing the results for the Stardent
3010. Here all benchmarks have SPECratios between 14.7 and 62.9, except 
matrix300, which has a ratio of 108.5! Is this an isolated point or
are there many more programs that give similar results? However, you 
are right in saying that every time we add a new benchmark we have to 
know what it measures, why we are including it, and what new 
information it provides.

I like the SPEC methodology for measuring SPECmarks, but I agree with
J. Hennessy about the SPECthruput: the SPEC guys erred here.

I don't agree with J. Hennessy that the weighted arithmetic mean (WAM) 
is better than the geometric mean for the SPEC benchmarks, but I
agree with him when he says that the WAM is the correct statistic to 
use in the example he presented. Am I contradicting myself? No, the 
two problems are different and therefore different statistics should 
be used. More on this later.

rafael