Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!husc6!cmcl2!rna!cubsvax!peters
From: peters@cubsvax.UUCP (Peter S. Shenkin)
Newsgroups: net.arch
Subject: Re: The correct mean to use when comparing benchmark performance
Message-ID: <552@cubsvax.UUCP>
Date: Sat, 4-Oct-86 18:41:03 EDT
Article-I.D.: cubsvax.552
Posted: Sat Oct  4 18:41:03 1986
Date-Received: Tue, 7-Oct-86 19:16:19 EDT
References: <549@cubsvax.UUCP> <tekchips.698>
Reply-To: peters@cubsvax.UUCP (Peter S. Shenkin)
Organization: Columbia Univ. Bio. CG Fac., NY
Lines: 76

In article <tekchips.698> willc@tekchips.UUCP (Will Clinger) writes:
>In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>>HOW TO NORMALIZE:
>>
>>Suppose this is the raw data:
>>		Machine A	Machine B
>>Benchmark 1	10.0		 5.0
>>Benchmark 2	10.0		20.0
>>-----------------------------------------
>>arith mean	10.0		12.5
>>

[ I'm deleting the rest of my quoted original article;  I showed that if one
normalizes all benchmarks to the arithmetic mean of EITHER machine, the
relative performance of the two machines is identical, no matter which
of the two machines is chosen as the norm. ]
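
A minimal sketch of that normalization, using the table above. It assumes
"normalizing" means dividing every time by the arithmetic mean of the chosen
machine and then averaging; the names `arith_mean` and `normalized_means` are
illustrative and do not come from the original article.

```python
# Raw times from the table above (lower is better).
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}

def arith_mean(xs):
    return sum(xs) / len(xs)

def normalized_means(times, norm):
    """Scale every time by the arithmetic mean of machine `norm`,
    then take each machine's arithmetic mean of the scaled times."""
    scale = arith_mean(times[norm])
    return {m: arith_mean([t / scale for t in ts]) for m, ts in times.items()}

by_a = normalized_means(times, "A")   # A is the norm
by_b = normalized_means(times, "B")   # B is the norm

# The B-to-A ratio does not depend on which machine was the norm.
print(by_a["B"] / by_a["A"], by_b["B"] / by_b["A"])
```

Because both machines are divided by the same constant, the constant cancels
in the ratio, which is why the choice of norm does not matter here.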

>So the advertising manager for Machine B notices that Benchmark 1 consists
>of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That
>doesn't seem quite fair, so he/she re-runs benchmark 1 with 1000 iterations
>instead of 1 to obtain the raw data:
>
>		Machine A	Machine B
>Benchmark 1	10000.0		 5000.0
>Benchmark 2	   10.0		   20.0
>-----------------------------------------
>arith mean	 5005.0		 2510.0

[ William goes on to point out, correctly, that even though following my
recommended procedure continues to give the same relative performance of A and 
B, it now appears that B is faster...]

...and I reply, OF COURSE now B is faster... the benchmark has changed!  And
B would also be faster using the geometric mean, or any other mean, with this
altered data.  Therefore this is not an issue of which mean is better, but
one of which benchmarks are fair or applicable.  And advertising managers
will always pick the ones that make their machines look better.  If it's the
end-user picking the benchmarks, however, rather than the advertising manager,
then the answer depends on the proposed machine usage:  if the benchmarks in
my article represent it, then A is faster;  if William's benchmarks represent
it, then B is faster.  The arithmetic means support this conclusion.
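
As a rough check of the arithmetic-mean comparison, the two data sets quoted
above can be run through the same function. This is only a sketch; the helper
name `faster_machine` is made up for illustration.

```python
def arith_mean(xs):
    return sum(xs) / len(xs)

def faster_machine(times):
    """Return the machine with the smaller arithmetic-mean time."""
    means = {m: arith_mean(ts) for m, ts in times.items()}
    return min(means, key=means.get)

# Benchmark 1 with 1 iteration (my table) and with 1000 iterations (William's).
original = {"A": [10.0, 10.0], "B": [5.0, 20.0]}
altered = {"A": [10000.0, 10.0], "B": [5000.0, 20.0]}

print(faster_machine(original))  # A
print(faster_machine(altered))   # B
```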

>Artifacts are neither art nor facts.

I agree; and probably the difficulty of choosing good benchmarks and/or
predicting the use of the machine contributes more to artifacts than the
type of mean one uses;  except that if you use the arithmetic mean, you MUST
normalize the way I've shown, and if you don't, your results don't mean
anything.

>By the way, I see no flaw in the proof that appears in Philip J Fleming
>and John J Wallace, "How not to lie with statistics: the correct way to
>summarize benchmark results", CACM Volume 29 Number 3 (March 1986),
>pages 218-221.  I'm not very happy with their presentation, primarily
>because they never give a clear statement of their theorem, which I
>paraphrase:  The geometric mean is the only function of n positive real
>arguments that is reflexive, symmetric, and multiplicative.  It's fair
>to take issue with their proof, but if you're going to do so I'd like to
>know which step(s) of their proof you find unconvincing, or which of the
>three properties you feel is dispensable for an unweighted average of
>normalized benchmark results.

Well, here I have to admit that I've been talking through my hat all along;
I've not read the article.  I suppose I will, now.  I probably object to
the relevance of the multiplicative property.  Since the actual time it
will take for a machine to perform a series of tasks is the SUM of the
times it takes for the tasks, one wants a mean which predicts this SUM.
The (weighted, if necessary) arithmetic mean of the times for the types of
tasks which the machine will carry out is directly proportional to this SUM.  Geometric
and other means may require less care in calculation, but give a number, in
the end, which bears no direct relation to the time it will take a machine
to perform the tasks for which it is intended.  And I believe this time is
the desired performance criterion.
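
A small sketch of the proportionality claim, using the times from the first
table. The workload mix in `runs` (three runs of Benchmark 1 for every run of
Benchmark 2) is purely hypothetical, chosen only to illustrate weighting.

```python
# Times from the first table (lower is better).
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}

# Hypothetical workload: how often each benchmark-like task is run.
runs = [3, 1]

def total_time(ts, runs):
    """Total time to carry out the whole workload."""
    return sum(n * t for n, t in zip(runs, ts))

def weighted_mean(ts, runs):
    """Arithmetic mean of the times, weighted by how often each task runs."""
    return total_time(ts, runs) / sum(runs)

totals = {m: total_time(ts, runs) for m, ts in times.items()}
means = {m: weighted_mean(ts, runs) for m, ts in times.items()}

# The weighted mean is the total divided by the same constant (sum(runs))
# for every machine, so ratios of means equal ratios of total times.
print(totals["A"] / totals["B"], means["A"] / means["B"])
```

With a different mix in `runs` the ranking of the two machines can change,
which is the point about matching the benchmarks to the intended usage.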

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA