Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!ucbvax!ucbarpa.Berkeley.EDU!rafael
From: rafael@ucbarpa.Berkeley.EDU (Rafael H. Saavedra-Barrera)
Newsgroups: comp.benchmarks
Subject: Re: benchmarks (SPECmarks)
Message-ID: <39622@ucbvax.BERKELEY.EDU>
Date: 16 Nov 90 07:18:41 GMT
References: <7581@eos.arc.nasa.gov> <1146@dg.dg.com> <7589@eos.arc.nasa.gov>
Sender: usenet@ucbvax.BERKELEY.EDU
Reply-To: rafael@ucbarpa.Berkeley.EDU.UUCP (Rafael H. Saavedra-Barrera)
Organization: University of California, Berkeley
Lines: 104

In article <7581@eos.arc.nasa.gov> ucbvax!agate!shelby!eos!eugene writes:
> Particular: that's right. Two things to add: 1) DEC knew that the
> performance of 780 models varied by as much as 10%. Is 10% acceptable?
> In some cases yes, others no. ...

Gene, you are completely missing the point. The 10% variation on the
VAX 780 has nothing to do with defining a unit of performance and using
it to measure. If you look at the SPEC reports, you'll notice that they
use the SAME execution times for the reference machine in all reports
and for all machines. There is no 10% variation. What is needed is
something we can use to measure with, and that everyone uses, period.

From the Encyclopaedia Britannica:

    Measuring a quantity means ascertaining its ratio to some other
    fixed quantity of the same kind, known as the unit of that kind of
    quantity. A unit is an *abstract conception*, defined either by
    reference to some arbitrary material or to natural phenomena.

The key words are: ratio, fixed, and arbitrary. The SPEC people made a
reasonable, but arbitrary, definition of what represents their fixed
quantity. Once you have done that, everything else follows, and there
is nothing else to discuss. All you require from a unit of measurement
is: 1) that it is fixed; 2) that it has validity for some significant
group of people; 3) that it can be verified. The SPECratio certainly
satisfies the 3 conditions.
As long as the SPEC people keep the *same* VAX 780, with the same
software, and run the programs under the same conditions, there is
nothing to object to. I don't know if they are doing this, but they
should, if they want to avoid problems in the future.

> Taken to the extreme 2) why a VAX (or PC), well, why not an ENIAC?
> You don't want an ENIAC for the same reason you won't want a VAX in the
> future. You are cutting your own long-term throat.

Wrong! The SPEC people didn't choose an ENIAC because there are no
ENIACs that can be used to run benchmarks, and very few people alive
ever used an ENIAC. Why not a CRAY? Because it is very expensive to
keep a CRAY under glass (same software, etc.) just to run benchmarks on
it. But in principle an ENIAC, a PC, or a CRAY is as good as a VAX 780
for the purposes of what the SPEC people are doing.

One of the nice properties of the SPECmark is that it is INVARIANT to
the machine you use as your reference point. The relative performance
between a MIPS M/2000 and a Sparcstation I, or between a DEC 3100 and
an IBM RS/6000 530, is the same, independent of whether you use a VAX
780, an ENIAC, or any other machine: the SPECmark is a geometric mean
of ratios, so the reference times cancel when you divide one machine's
SPECmark by another's. So the issue of using a 780 disappears.

Your arguments sound very similar to the ones that one of your
fictitious ancestors, the Marquis Eugene De La Milla, made about the
use of a quadrant of the Earth to define the meter in 1790. He asked:
why use a quadrant of the Earth as reference when there are bigger
planets that may be more relevant to future generations of human
beings? Why 1/10,000,000th of the quadrant instead of 1/3,141,592th,
which looks more like pi? Where is the center of Paris? The center of
the Ile de la Cite, or Notre Dame? [I bet you didn't know you had a
French ancestor.] Do you know how the French measured the particular
quadrant of the Earth that runs from the North Pole to the Equator and
passes through Paris, one hundred years before the first man reached
the North Pole? Did it make a difference?
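
Returning to the invariance property of the SPECmark mentioned above,
here is a minimal sketch of why it holds. It assumes the SPECmark is
the geometric mean of the per-program SPECratios (reference time
divided by measured time). All timings are invented for illustration
and are not SPEC results; the function names are hypothetical.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean of a non-empty list of positive numbers."""
    return prod(values) ** (1.0 / len(values))


def spec_ratios(reference_times, machine_times):
    """Per-program ratio: reference time / measured time."""
    return [ref / t for ref, t in zip(reference_times, machine_times)]


def specmark(reference_times, machine_times):
    """Geometric mean of the per-program ratios (assumed definition)."""
    return geometric_mean(spec_ratios(reference_times, machine_times))


def arithmetic_score(reference_times, machine_times):
    """Plain arithmetic mean of the same ratios, for contrast."""
    ratios = spec_ratios(reference_times, machine_times)
    return sum(ratios) / len(ratios)


# Made-up execution times in seconds for three hypothetical programs.
machine_a = [10.0, 40.0, 5.0]
machine_b = [20.0, 20.0, 20.0]

# Two very different, equally made-up reference machines.
reference_1 = [100.0, 200.0, 300.0]
reference_2 = [900.0, 50.0, 70.0]

# Relative performance of A over B under each reference machine.
rel_geo_1 = specmark(reference_1, machine_a) / specmark(reference_1, machine_b)
rel_geo_2 = specmark(reference_2, machine_a) / specmark(reference_2, machine_b)

rel_ari_1 = (arithmetic_score(reference_1, machine_a)
             / arithmetic_score(reference_1, machine_b))
rel_ari_2 = (arithmetic_score(reference_2, machine_a)
             / arithmetic_score(reference_2, machine_b))

print("geometric mean: ", rel_geo_1, rel_geo_2)
print("arithmetic mean:", rel_ari_1, rel_ari_2)
```

With the geometric mean the two relative figures agree up to rounding,
because each reference time appears once in the numerator and once in
the denominator and cancels. With the arithmetic mean of the same
ratios they differ, so the choice of reference machine would matter.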
There are a lot of more interesting questions to ask about the SPEC
benchmarks. For example: What does each program measure? Are the
programs really exercising different aspects of the machine? How
representative is matrix300 of typical linear algebra codes? Can a
clever compiler writer make minimal changes to the compiler that will
significantly improve the SPECmark for a particular machine, but will
have only a marginal benefit in most users' workloads? How do I
estimate the performance of my workload by looking at the SPEC results?
Why is the SPECratio of spice2g6 low on most machines when other double
precision codes have better performance? Is the geometric mean a good
statistic? These are a few of the questions I would like to know the
answers to (some I know).

> About 29 (or 42), I don't think it's the number of benchmarks.
> I had a talk at one time entitled "The Next 700 Benchmarks."
> [If you didn't know there have been a string of papers beginning with
> "The Next 700 Programming Languages."] And in fact Carl Ponder (LLNL)
> gave a talk about adding benchmark information can just cloud the issue.
> It's not just the number of measurements or observations you take.

I don't agree. 29 benchmarks are better than 10 benchmarks, *if* the 29
benchmarks are well chosen. Every benchmark represents an empirical
observation of the performance of the machine. More observations are
better than few, especially after seeing the results for the Stardent
3010. Here all benchmarks have SPECratios between 14.7 and 62.9, except
matrix300, which has a ratio of 108.5! Is this an isolated point, or
are there many more programs that give similar results? However, you
are right in saying that every time we add a new benchmark we have to
know what it measures, why we are including it, and what new
information it provides.

I like the SPEC methodology for measuring SPECmarks, but I agree with
J. Hennessy about the SPECthruput; the SPEC guys erred here. I don't
agree with J. Hennessy that the weighted arithmetic mean (WAM) is
better than the geometric mean for the SPEC benchmarks, but I agree
with him when he says that the WAM is the correct statistic to use in
the example he presented. Am I contradicting myself? No, the two
problems are different, and therefore different statistics should be
used. More on this later.

rafael