Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!bellcore!petrus!hammond
From: hammond@petrus.UUCP (Rich A. Hammond)
Newsgroups: net.arch
Subject: Re: The correct mean to use when comparing benchmark performance
Message-ID: <327@petrus.UUCP>
Date: Thu, 2-Oct-86 09:07:46 EDT
Article-I.D.: petrus.327
Posted: Thu Oct  2 09:07:46 1986
Date-Received: Sat, 4-Oct-86 05:41:48 EDT
References: <549@cubsvax.UUCP>
Organization: Bell Communications Research, Inc
Lines: 66

Peter S. Shenkin writes:

> ...
> HOW TO NORMALIZE:
> 
> Suppose this is the raw data:
> 		Machine A	Machine B
> Benchmark 1	10.0		 5.0
> Benchmark 2	10.0		20.0
> -----------------------------------------
> arith mean	10.0		12.5
> 
> Now, it is DUMB to normalize each benchmark separately.  THAT, and not
> arithmetic mean, is what gives rise to artifacts.  ...
 ... (read original or do arithmetic yourself)...
> 
> The results are identical throughout, despite the use of the arithmetic
> mean throughout.  Any other mean used throughout would, I believe, also
> give identical results.
> 
> SO WHY SHOULD WE PREFER ARITHMETIC MEAN?:
> 
> However, the arithmetic mean is directly related to the application
> of benchmark timings to real-world (nebulous though this entire field may
> be).  These averages would reflect real-world performance -- modulo this
> cloudiness -- if the application set were well-represented equally by
> the two benchmarks;  but this approach also works for weighted averages.
> Other means, such as the geometric, do not have this property.
> 
> SUMMARY:  arithmetic mean wins, if normalization is properly performed.

I respectfully disagree.  If you work the arithmetic out, you don't need to
normalize at all using your method; just compare the sums of the benchmark
times.  This is because you assume that:
a) The benchmarks are a representative sample of actual load,
and
b) that comparison of the component times is unimportant.
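
To make the "just compare the sums" point concrete, here is a minimal sketch
using the two-benchmark table quoted above.  It assumes the method in question
divides every time by one machine's arithmetic mean before averaging; the
function name normalized_mean and the choice of seconds as the unit are
illustrative assumptions, not taken from the original articles.

```python
# Raw times from the quoted table (unit assumed to be seconds).
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}


def normalized_mean(machine, reference):
    """Arithmetic mean of `machine`'s times after dividing each one by the
    arithmetic mean of `reference`'s times (hypothetical helper)."""
    ref_mean = sum(times[reference]) / len(times[reference])
    scaled = [t / ref_mean for t in times[machine]]
    return sum(scaled) / len(scaled)


# Ratio of B to A after normalizing both to machine A's arithmetic mean.
ratio_normalized = normalized_mean("B", "A") / normalized_mean("A", "A")

# The same ratio taken directly from the unnormalized sums.
ratio_of_sums = sum(times["B"]) / sum(times["A"])

print(ratio_normalized, ratio_of_sums)  # prints "1.25 1.25"
```

The normalizing constant divides both numerator and denominator, so it
cancels and the result is the ratio of the raw sums.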

I claim that for the environment of most network news readers the first
is in fact false: most people have no idea what the load on their system
is composed of, or even where a given application program spends its time.
For example, I asked programmers on our system for FORTRAN programs that
would "vectorize" well.  Of the 4 programs submitted, half were dominated by
subroutine calls and not vector/matrix arithmetic.

The second is often false, in that many benchmark suites are composed of
programs which stress one particular aspect of a system, e.g. Ackermann's
function gives a picture of subroutine call/return costs.  In this case
one would like to compare the individual programs' run times.  With
normalization to the arithmetic mean of a processor, one is still left
with taking the ratio of the times of interest, so the normalization to
the arithmetic mean factors out and need not be done at all.  Normalization
to the individual component time, on the other hand, gives cases where the
ratio is trivial to compute because you're dividing by 1.
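
A small sketch of both halves of that argument, again using the times from
the quoted table.  The names per_benchmark_ratios and b_relative_to_a are
hypothetical, chosen only for this illustration.

```python
# Times from the quoted table, keyed by benchmark (unit assumed to be seconds).
a_times = {"benchmark 1": 10.0, "benchmark 2": 10.0}
b_times = {"benchmark 1": 5.0, "benchmark 2": 20.0}


def per_benchmark_ratios(scale=1.0):
    """B/A ratio for each benchmark after dividing every time by `scale`."""
    return {
        name: (b_times[name] / scale) / (a_times[name] / scale)
        for name in a_times
    }


# Normalizing to machine A's arithmetic mean leaves each ratio unchanged.
a_mean = sum(a_times.values()) / len(a_times)
raw = per_benchmark_ratios()
scaled = per_benchmark_ratios(scale=a_mean)
print(raw == scaled)  # prints "True"

# Normalizing to A's individual times makes every A entry 1, so B's
# normalized time already is the ratio of interest.
a_normalized = {name: a_times[name] / a_times[name] for name in a_times}
b_relative_to_a = {name: b_times[name] / a_times[name] for name in a_times}
print(a_normalized, b_relative_to_a)
```

With these numbers B takes 0.5 times as long as A on benchmark 1 and 2.0
times as long on benchmark 2, which is the comparison a single summed figure
hides.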

In the context of the CACM article, both assumptions are false:
the benchmarks aren't representative of the load (no system calls),
and the comparison of interest was individual program times and not the
sum.  What the CACM article pointed out was that under those conditions,
the geometric mean was the only one to use to get ratios of machine
performance that were independent of the machine normalized to.  What
the CACM article didn't say (and should have) was that the performance
ratio was pretty worthless anyway, so that computing it "correctly" is
a moot point.
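
The independence property can be checked on the two-benchmark table quoted
at the top.  This is a sketch under the assumption that each benchmark is
normalized separately to a reference machine and the normalized values are
then averaged; relative_score is a hypothetical helper, and the numbers are
the quoted ones, not the CACM article's data.

```python
import math

# Times from the quoted table (unit assumed to be seconds).
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))


def relative_score(machine, reference, mean):
    """Mean of `machine`'s times, each divided by `reference`'s time on the
    same benchmark."""
    ratios = [m / r for m, r in zip(times[machine], times[reference])]
    return mean(ratios)


for mean in (arithmetic_mean, geometric_mean):
    b_vs_a = relative_score("B", "A", mean)  # normalized to machine A
    a_vs_b = relative_score("A", "B", mean)  # normalized to machine B
    print(mean.__name__, b_vs_a, a_vs_b)
```

With the arithmetic mean, each machine comes out 1.25 times slower than the
other, depending on which one is the reference.  With the geometric mean the
two scores are reciprocals of each other (both 1.0 here), so the verdict does
not depend on the reference machine.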

Rich Hammond	Bell Communications Research	hammond@bellcore.com