Path: utzoo!attcan!uunet!know!zaphod.mps.ohio-state.edu!usc!apple!amdcad!mozart.amd.com!proton!tim
From: tim@proton.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: Benchmark performance ratios
Message-ID: <1990Nov19.170400.12437@mozart.amd.com>
Date: 19 Nov 90 17:04:00 GMT
References: <39896@ut-emx.uucp>
Sender: usenet@mozart.amd.com (Usenet News)
Reply-To: tim@amd.com (Tim Olson)
Organization: Advanced Micro Devices; Sunnyvale, CA
Lines: 54

In article <39896@ut-emx.uucp> guru@ut-emx.uucp (chen  liehgong) writes:
| I have a few queries regarding benchmark performance ratios.
| 
| 1. If the benchmark consists of a set of programs (eg. the 
| livermore loops) is the overall performance ratio of the architecture
| under test (as compared to a standard one) calculated as the 
| harmonic mean of the performance ratios (say speed-ups) obtained 
| for each program (or livermore loop)? If so, Why is the harmonic mean
| used instead of the arithmetic or geometric means?

Fleming and Wallace, in their paper "How Not to Lie With Statistics:
The Correct Way to Summarize Benchmark Results" [CACM, March 1986,
Volume 29, #3], say that the arithmetic mean should be used when the
individual benchmarks are reported in absolute time, while the
geometric mean should be used when individual benchmarks are
normalized to some "known machine."  James Smith, in the paper
"Characterizing Computer Performance With a Single Number" [CACM,
October 1988, Volume 31, #10], argues that the harmonic mean should be
used, but again only with absolute quantities such as MFLOPS
(normalization should occur after the mean has been calculated).
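
To make the difference between the means concrete, here is a small
sketch.  The run times and the 100-MFLOP operation count are made-up
numbers for illustration, not measurements, and the helper names are
my own.  It shows that the arithmetic mean of normalized ratios can
claim each machine is faster than the other, that the geometric mean
does not depend on which machine is the reference, and that the
harmonic mean of MFLOPS rates matches total work over total time when
every program does the same amount of work.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

# Hypothetical run times (seconds) of two programs on machines A and B.
time_a = [10.0, 100.0]
time_b = [20.0, 50.0]

# Per-program speed-up of B relative to A, and of A relative to B.
b_over_a = [ta / tb for ta, tb in zip(time_a, time_b)]  # [0.5, 2.0]
a_over_b = [tb / ta for ta, tb in zip(time_a, time_b)]  # [2.0, 0.5]

# Arithmetic mean of ratios: each machine appears 1.25x faster than
# the other, depending only on which one was chosen as the reference.
am_b = arithmetic_mean(b_over_a)
am_a = arithmetic_mean(a_over_b)

# Geometric mean of ratios: 1.0 in both directions, so the choice of
# reference machine does not change the conclusion.
gm_b = geometric_mean(b_over_a)
gm_a = geometric_mean(a_over_b)

# Absolute rates: ASSUME each program performs 100 million floating-
# point operations, so the rates on A are 10 and 1 MFLOPS.
mflop = 100.0
rates_a = [mflop / t for t in time_a]
hm_rate_a = harmonic_mean(rates_a)
total_rate_a = mflop * len(time_a) / sum(time_a)  # total work / total time

print(am_b, am_a)
print(gm_b, gm_a)
print(hm_rate_a, total_rate_a)
```

With these numbers the harmonic mean and the total-work/total-time
rate agree (about 1.82 MFLOPS), which is the sense in which the
harmonic mean of rates tracks total execution time.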

The problem with mean calculations based upon absolute quantities
(seconds, MFLOPS, etc.) is that there is an implicit weighting of the
benchmarks based upon how long they run.  This is fine if the
benchmarks are designed such that the relative runtimes of the
benchmarks correspond to the actual runtime ratios expected in the
real application(s).  However, this is rarely the case -- a benchmark
suite typically contains a large number of varied programs whose
runtimes bear no particular relationship to one another or to any real
workload.  Because of this, I think that the best thing that can be
done is to give each benchmark equal weighting.  If this is done, then
the geometric mean of the normalized performances should be used (as
SPEC does, for example).
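
A rough sketch of the implicit weighting, using invented run times
(three hypothetical benchmarks, one of which simply runs much longer
than the others).  Summarizing by total time lets the long benchmark
swamp the result; the geometric mean of the normalized performances
counts each benchmark equally.

```python
import math

# Hypothetical run times in seconds; the third benchmark is simply long.
ref_times = [1.0, 10.0, 1000.0]   # reference ("known") machine
test_times = [0.5, 5.0, 1000.0]   # machine under test

# Share of the total reference run time contributed by each benchmark.
# This is the implicit weight each one gets in a time-based summary.
weights = [t / sum(ref_times) for t in ref_times]

# Summary based on absolute time: dominated by the long benchmark.
total_time_ratio = sum(ref_times) / sum(test_times)

# Equal weighting: geometric mean of the per-benchmark normalized
# performances (reference time / test time).
speedups = [r / t for r, t in zip(ref_times, test_times)]  # [2.0, 2.0, 1.0]
equal_weight = math.prod(speedups) ** (1.0 / len(speedups))

print(weights)           # the third benchmark carries about 99% of the weight
print(total_time_ratio)  # about 1.005
print(equal_weight)      # about 1.587
```

Two of the three programs run twice as fast, yet the time-based
summary shows almost no improvement because the third program
accounts for nearly all of the run time.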

| 2. If different kinds of benchmarks (eg. integer performance, floating-
| point performance or livermore loops, whetstones and dhrystones) are 
| used, how is the overall performance ratio (speed-up) calculated? i.e.,
| Which mean (AM, GM or HM) should be used?

The type of benchmark makes no difference, as long as it is measured
consistently across all of the machines to get a normalized performance.

| 3. If the performance ratio is changed (say from speed-up to percentage
| decrease in execution time - in clock cycles) do the answers to 1 and 2
| above, remain the same?

I don't believe you can average %increase/decrease figures directly --
you must convert them into normalized performance first, average using
the geometric mean, then convert the result back into %increase/decrease.
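
As a sketch of that conversion: I am assuming here that "percentage
decrease in execution time" means 100 * (old - new) / old, so that a
negative value is an increase.  The function names and the sample
figures are made up for illustration.

```python
import math

def pct_decrease_to_speedup(pct):
    # new_time = old_time * (1 - pct/100), so speed-up = old / new.
    # Only meaningful for pct < 100.
    return 1.0 / (1.0 - pct / 100.0)

def speedup_to_pct_decrease(speedup):
    return 100.0 * (1.0 - 1.0 / speedup)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def mean_pct_decrease(pcts):
    # Convert to normalized performance, average, then convert back.
    speedups = [pct_decrease_to_speedup(p) for p in pcts]
    return speedup_to_pct_decrease(geometric_mean(speedups))

# One program's run time is halved (50% decrease); another's is
# doubled (a -100% "decrease").
changes = [50.0, -100.0]

naive = sum(changes) / len(changes)    # -25.0, suggests a net slowdown
combined = mean_pct_decrease(changes)  # about 0, the two changes cancel

print(naive, combined)
```

Averaging the percentages directly reports a 25% slowdown, while
going through normalized performance reports no net change, since a
2x speed-up and a 2x slowdown cancel under equal weighting.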


--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)