Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!bellcore!petrus!hammond
From: hammond@petrus.UUCP (Rich A. Hammond)
Newsgroups: net.arch
Subject: Re: The correct mean to use when comparing benchmark performance
Message-ID: <327@petrus.UUCP>
Date: Thu, 2-Oct-86 09:07:46 EDT
Article-I.D.: petrus.327
Posted: Thu Oct 2 09:07:46 1986
Date-Received: Sat, 4-Oct-86 05:41:48 EDT
References: <549@cubsvax.UUCP>
Organization: Bell Communications Research, Inc
Lines: 66

Peter S. Shenkin writes:
> ...
> HOW TO NORMALIZE:
>
> Suppose this is the raw data:
>                  Machine A     Machine B
> Benchmark 1        10.0           5.0
> Benchmark 2        10.0          20.0
> -----------------------------------------
> arith mean         10.0          12.5
>
> Now, it is DUMB to normalize each benchmark separately.  THAT, and not
> arithmetic mean, is what gives rise to artifacts. ...
... (read original or do arithmetic yourself)...
>
> The results are identical throughout, despite the use of the arithmetic
> mean throughout.  Any other mean used throughout would, I believe, also
> give identical results.
>
> SO WHY SHOULD WE PREFER ARITHMETIC MEAN?:
>
> However, the arithmetic mean is directly related to the application
> of benchmark timings to real-world (nebulous though this entire field may
> be).  These averages would reflect real-word performance -- modulo this
> cloudiness -- if the application set were well-represented equally by
> the two benchmarks; but this approach also works for weighted averages.
> Other means, such at the geometric, do not have this property
>
> SUMMARY:  arithmetic mean wins, if normalization is properly performed.

I respectfully disagree. If you work the arithmetic out, you don't need
to normalize at all using your method; just compare the sums of the
benchmark times.  This is because you assume that:
	a) the benchmarks are a representative sample of the actual load, and
	b) comparison of the component times is unimportant.
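
To make the "just compare the sums" point concrete, here is a minimal sketch using the raw data quoted above. It assumes that the quoted method means dividing every time by the arithmetic mean of one reference machine (my reading of it); the function and variable names are made up for illustration.

```python
# Raw benchmark times from the quoted table (units as given there).
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}

def arith_mean(xs):
    return sum(xs) / len(xs)

def ratio_after_normalizing(times, reference):
    """Divide every time by the reference machine's arithmetic mean,
    then return mean(B) / mean(A) of the normalized times."""
    scale = arith_mean(times[reference])
    norm = {m: [t / scale for t in ts] for m, ts in times.items()}
    return arith_mean(norm["B"]) / arith_mean(norm["A"])

# The same comparison with no normalization at all.
ratio_of_sums = sum(times["B"]) / sum(times["A"])

for ref in ("A", "B"):
    # The scale factor cancels, so both lines print the same ratio
    # as the plain ratio of sums.
    print(ref, ratio_after_normalizing(times, ref), ratio_of_sums)
```

Whichever machine supplies the scale factor, it divides both numerator and denominator, so the result is the ratio of the summed times (25/20 here).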

I claim that for the environment of most network news readers the first
is in fact false: most people have no idea what the load on their system
is composed of, or even where a given application program spends its time.
E.g., I asked programmers on our system for FORTRAN programs that would
"vectorize" well.  Of the 4 programs submitted, half were dominated by
subroutine calls and not vector/matrix arithmetic.

The second is often false, in that many benchmark suites are composed of
programs which each stress one particular aspect of a system; e.g.,
Ackermann's function gives a picture of subroutine call/return costs.
In this case one would like to compare the individual programs' run times.
With normalization to the arithmetic mean of a processor, one is still left
with taking the ratio of the times of interest, and the normalization
factor cancels out of that ratio, so it need not be applied at all.
Normalization to the individual component time, on the other hand, gives
cases where the ratio is trivial to compute because you're dividing by 1.

In the context of the CACM article, both assumptions are false:
the benchmarks aren't representative of the load (no system calls), and
the comparison of interest was individual program times and not the sum.
What the CACM article pointed out was that under those conditions, the
geometric mean was the only one to use to get ratios of machine performance
that were independent of the machine normalized to.  What the CACM article
didn't say (and should have) was that the performance ratio was pretty
worthless anyway, so computing it "correctly" is a moot point.

Rich Hammond	Bell Communications Research	hammond@bellcore.com
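
For anyone who wants to check the geometric-mean point with the raw data quoted at the top, here is a rough sketch. It normalizes each benchmark separately to a reference machine and then compares the two means; the helper names are hypothetical, and this is only an illustration of the claim, not the CACM article's own procedure.

```python
import math

# Raw benchmark times from the quoted table.
times = {"A": [10.0, 10.0], "B": [5.0, 20.0]}

def arith_mean(xs):
    return sum(xs) / len(xs)

def geo_mean(xs):
    # Geometric mean computed via logs; all times here are positive.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def ratio_b_over_a(times, reference, mean):
    """Normalize each benchmark to the reference machine's time on that
    same benchmark, then return mean(B) / mean(A)."""
    ref = times[reference]
    norm = {m: [t / r for t, r in zip(ts, ref)] for m, ts in times.items()}
    return mean(norm["B"]) / mean(norm["A"])

for mean in (arith_mean, geo_mean):
    for ref in ("A", "B"):
        print(mean.__name__, "normalized to", ref,
              ratio_b_over_a(times, ref, mean))
```

With the arithmetic mean the ratio flips depending on the reference machine (B looks slower when normalized to A, faster when normalized to B); with the geometric mean both references give the same ratio.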