Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!husc6!cmcl2!rna!cubsvax!peters
From: peters@cubsvax.UUCP (Peter S. Shenkin)
Newsgroups: net.arch
Subject: Re: The correct mean to use when comparing benchmark performance
Message-ID: <552@cubsvax.UUCP>
Date: Sat, 4-Oct-86 18:41:03 EDT
Article-I.D.: cubsvax.552
Posted: Sat Oct 4 18:41:03 1986
Date-Received: Tue, 7-Oct-86 19:16:19 EDT
References: <549@cubsvax.UUCP>
Reply-To: peters@cubsvax.UUCP (Peter S. Shenkin)
Organization: Columbia Univ. Bio. CG Fac., NY
Lines: 76

In article willc@tekchips.UUCP (Will Clinger) writes:
>In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>>HOW TO NORMALIZE:
>>
>>Suppose this is the raw data:
>>                Machine A       Machine B
>>Benchmark 1        10.0            5.0
>>Benchmark 2        10.0           20.0
>>-----------------------------------------
>>arith mean         10.0           12.5
>>

[ I'm deleting the rest of my quoted original article; I showed that if
one normalizes all benchmarks to the arithmetic mean of EITHER machine,
the relative performance of the two machines is identical, no matter
which of the two machines is chosen as the norm. ]

>So the advertising manager for Machine B notices that Benchmark 1 consists
>of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That
>doesn't seem quite fair, so he/she re-runs benchmark 1 with 1000 iterations
>instead of 1 to obtain the raw data:
>
>                Machine A       Machine B
>Benchmark 1     10000.0         5000.0
>Benchmark 2        10.0           20.0
>-----------------------------------------
>arith mean       5005.0         2510.0

[ William goes on to point out, correctly, that even though following my
recommended procedure continues to give the same relative performance of
A and B, it now appears that B is faster...]

...and I reply, OF COURSE now B is faster... the benchmark has changed!
With this altered data, B would also come out ahead under any mean that
tracks total running time.  (The geometric mean is the exception: it
rates A and B equal on both sets of data, since 10*10 = 5*20 and
10000*10 = 5000*20.)
Therefore this is not an issue of which mean is better, but one of which
benchmarks are the fair or applicable ones to use.  And advertising
managers will always pick the ones that make their machines look better.
If, however, it's the end-user rather than the advertising manager
picking the benchmarks, then A is faster if the benchmarks in my article
represent the proposed machine usage, and B is faster if William's do.
The arithmetic means support this conclusion.

>Artifacts are neither art nor facts.

I agree.  The difficulty of choosing good benchmarks and/or predicting
the use of the machine probably contributes more to artifacts than the
type of mean one uses.  The exception: if you use the arithmetic mean,
you MUST normalize the way I've shown; if you don't, your results don't
mean anything.

>By the way, I see no flaw in the proof that appears in Philip J Fleming
>and John J Wallace, "How not to lie with statistics: the correct way to
>summarize benchmark results", CACM Volume 29 Number 3 (March 1986),
>pages 218-221.  I'm not very happy with their presentation, primarily
>because they never give a clear statement of their theorem, which I
>paraphrase:  The geometric mean is the only function of n positive real
>arguments that is reflexive, symmetric, and multiplicative.  It's fair
>to take issue with their proof, but if you're going to do so I'd like to
>know which step(s) of their proof you find unconvincing, or which of the
>three properties you feel is dispensable for an unweighted average of
>normalized benchmark results.

Well, here I have to admit that I've been talking through my hat all
along; I've not read the article.  I suppose I will, now.  I probably
object to the relevance of the multiplicative property.  Since the actual
time it will take for a machine to perform a series of tasks is the SUM
of the times for the individual tasks, one wants a mean which predicts
this SUM.
The (weighted, if necessary) arithmetic mean of the times for the types
of tasks which the machine will carry out is directly proportional to
this SUM.  Geometric and other means may require less care in
calculation, but give a number, in the end, which bears no direct
relation to the time it will take a machine to perform the tasks for
which it is intended.  And I believe this time is the desired performance
criterion.

Peter S. Shenkin        Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters            cubsvax!peters@columbia.ARPA
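
For anyone who wants to check the arithmetic in this exchange, here is a
minimal sketch in Python.  The function and variable names are
illustrative only and come from neither article.  It assumes equal
weights, i.e. each benchmark is run exactly once, and takes the two sets
of raw times quoted above.  For each set it reports the total time per
machine, the arithmetic mean, the A-over-B ratio of arithmetic means
after normalizing to either machine, and the geometric mean for
comparison.

```python
import math

# Raw times from the two data sets quoted in the post.
# Each row is (Machine A, Machine B); units are whatever the benchmarks report.
ORIGINAL = [(10.0, 5.0), (10.0, 20.0)]
ALTERED = [(10000.0, 5000.0), (10.0, 20.0)]


def arith_mean(xs):
    """Unweighted arithmetic mean."""
    return sum(xs) / len(xs)


def geom_mean(xs):
    """Unweighted geometric mean; assumes every value is positive."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def column(data, i):
    """Times for one machine (0 = A, 1 = B) across all benchmarks."""
    return [row[i] for row in data]


def normalized_ratio(data, norm):
    """A-over-B ratio of arithmetic means after dividing every time by
    the arithmetic mean of machine `norm` (0 = A, 1 = B)."""
    scale = arith_mean(column(data, norm))
    a = arith_mean([t / scale for t in column(data, 0)])
    b = arith_mean([t / scale for t in column(data, 1)])
    return a / b


def summarize(data):
    """Collect the figures discussed in the post for one data set."""
    a, b = column(data, 0), column(data, 1)
    return {
        "total": (sum(a), sum(b)),            # time to run every benchmark once
        "arith": (arith_mean(a), arith_mean(b)),
        "geom": (geom_mean(a), geom_mean(b)),
        "ratio_norm_A": normalized_ratio(data, 0),
        "ratio_norm_B": normalized_ratio(data, 1),
    }


if __name__ == "__main__":
    for name, data in (("original", ORIGINAL), ("altered", ALTERED)):
        print(name, summarize(data))
```

The "total" entry is simply the number of benchmarks times the "arith"
entry, which is the sense in which the arithmetic mean predicts the time
to run the whole workload.  A workload in which some tasks run more often
than others would need weights, which this sketch leaves out.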