Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!agate!shelby!eos!eugene
From: eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.benchmarks
Subject: Re: benchmarks (SPECmarks)
Message-ID: <7589@eos.arc.nasa.gov>
Date: 15 Nov 90 06:54:35 GMT
References: <7581@eos.arc.nasa.gov> <1146@dg.dg.com>
Reply-To: eugene@eos.UUCP (Eugene Miya)
Organization: NASA Ames Research Center, Calif.
Lines: 123

In article <1146@dg.dg.com> uunet!dg!lewine writes:
> But SPEC took a particular VAX-11/780.  The 11/780 time for
> gcc is 1482 seconds.  It is not what you get on your particular
> VAX.  This is more like taking a gold bar in Paris and saying
> that is the standard meter.  As long as there is only one gold
> bar, that is not a problem.

Particular: that's right.  Two things to add:

1) DEC knew that the performance of 780 models varied by as much as
10%.  Is 10% acceptable?  In some cases yes, others no.  Using your bar
analogy (I recall it's really platinum-iridium), that's why I gave the
Metrology paper as a reference.  The former NBS director used the term
"Gold plating."  John Mash[ey]@mips.com said "VAX under glass" at one
time (Cute, I like it!).  Will you use a ruler which may be as much as
10% off?  I think our society is beyond that.  That is why the US NIST
(was NBS) maintains an atomic clock: a highly instrumented
multi-million $$ piece of hardware.  Taken to the extreme:

2) Why a VAX (or PC)?  Well, why not an ENIAC?  You don't want an ENIAC
for the same reason you won't want a VAX in the future.  You are
cutting your own long-term throat.  That's what a platinum bar is.
That's why the NIST uses the frequency of Kr atoms to specify not only
time, but also length (distance).  We must go beyond that.  Your H/W
engineers use the best oscilloscopes, right?  Yet we software types are
in the dark ages.

> I think that 29 SPECmarks is more understandable than saying
> that the Geometric Mean of the benchmark times is 133.3
> seconds.

John Hennessy came today and blasted geometric mean (in favor of
weighted arithmetic mean).  I will commit to no statistic before its
time.  My opinion is that we must understand the sample before applying
any statistic.  I hate to say it: give me raw numbers and then I will
think about sending them to S (or BMDP or whatever).

About 29 (or 42), I don't think it's the number of benchmarks.  I had a
talk at one time entitled "The Next 700 Benchmarks."  [If you didn't
know, there has been a string of papers beginning with "The Next 700
Programming Languages."]  And in fact Carl Ponder (LLNL) gave a talk
about how adding benchmark information can just cloud the issue.  It's
not just the number of measurements or observations you take.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

I copy Hennessy's five viewgraphs; I do not think John will mind, since
he brought the Net discussion up.  I generally support most of what he
had to say.  Only 1-3 were presented; 4 and 5 were left over and
covered orally:

#1 Some Comments to SPEC
+ Means for summarizing performance
+ Choosing benchmarks
+ Guidelines for running benchmarks
  [Comment: "You (SPEC members) have a responsibility for what
  your marketing people say."  I agree.]
+ The dangers of SPECthroughput
  [Hennessy expressed worry about ideas cast in concrete.  I really
  fear this as well, and it may be too late.]

#2 Why not geometric mean?
+ Example
          Absolute time        Relative performance
          M1    M2    M3       M1    M2    M3
    B1     5    10    10       1     0.5   0.5
    B2    10     5    10       1     2     1
    GM                         1     1     0.7

  B means benchmark, M means machine, GM is geometric mean

+ To replace the summary indicated by geometric mean for
    M1 and M2: run each 50% of total workload
    M1 and M3: run B1 57% and B2 43% of total workload!
    M2 and M3: run B1 43% and B2 57% of total workload!
----------------------------------------------------
#3 Why weighted arithmetic mean
+ Single weighting yields results proportional to execution time!
+ Suggested weighting: equal time on base machine.
  Results: weights for earlier example are 2/3, 1/3.

  Weighted execution times
             M1      M2      M3
    B1       10/3    20/3    20/3
    B2       10/3    5/3     10/3
    AM       20/3    25/3    30/3
    Perf.    1.0     0.8     0.67    --inverse of execution time

  can also use a weighted harmonic mean
  [I know this ref to be Worlton]

Not shown but discussed:

#4 Choosing benchmarks
+ Some evaluation procedures need to be established to choose
  benchmarks.
+ These need to focus on questions like:
  - is this a real program
  - how many lines constitute the 90% or 95% point
  - is the input appropriate
+ How will you know the potential defects before choosing the
  benchmark?
  [I have thought of some of these questions and I wish to discuss
  them and some ideas, and will try to present them in the coming
  days and weeks.]

#5 Guidelines for running programs
+ Serious problems can arise because guidelines for running
  benchmarks (typo) are not precise.
  [No kidding, this was a point in one SPEC discussion; I am not a
  SPEC member but was invited.  Maybe I should post a few notes or
  impressions.  Basically SPEC is kind of a good thing; only I wish
  it had been ANSI instead [some minuses]]
+ Some examples
  - what routines can be replaced by libraries?
  - what are the requirements for runtime checks such as bounds
    checking and FP exception checks?

I should note that I am not innocent: one of the SPEC benchmarks came
from me (and we have serious constraints on running that program; it
was renamed).
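
The arithmetic in viewgraphs #2 and #3 can be checked with a few lines
of code.  What follows is only a minimal sketch in Python, not anything
shown in the talk.  It assumes M1 is the base machine (as the relative
performance columns imply) and that "equal time on base machine" means
weights inversely proportional to each benchmark's time on M1.  The
names `times`, `equal_time_weights`, and the rest are made up for
illustration.

```python
from fractions import Fraction
from math import prod

# Absolute execution times from viewgraph #2.
times = {
    "B1": {"M1": 5, "M2": 10, "M3": 10},
    "B2": {"M1": 10, "M2": 5, "M3": 10},
}
machines = ["M1", "M2", "M3"]
base = "M1"  # assumption: M1 is the base machine


def geometric_mean_perf(machine):
    """Geometric mean of per-benchmark performance relative to the base."""
    ratios = [times[b][base] / times[b][machine] for b in times]
    return prod(ratios) ** (1 / len(ratios))


def equal_time_weights():
    """Weights giving each benchmark equal weighted time on the base."""
    inverse = {b: Fraction(1, times[b][base]) for b in times}
    total = sum(inverse.values())
    return {b: w / total for b, w in inverse.items()}


def weighted_time(machine, weights):
    """Weighted arithmetic mean of execution times on one machine."""
    return sum(weights[b] * times[b][machine] for b in times)


weights = equal_time_weights()
gm = {m: geometric_mean_perf(m) for m in machines}
wam = {m: weighted_time(m, weights) for m in machines}
perf = {m: wam[base] / wam[m] for m in machines}  # inverse of time

for m in machines:
    print(m, "GM:", round(gm[m], 2), "AM:", wam[m],
          "Perf:", round(float(perf[m]), 2))
```

Under those assumptions the weights come out as 2/3 and 1/3, the
geometric means as 1, 1, 0.7, and the weighted results as 1.0, 0.8,
0.67, matching the viewgraphs.  The geometric mean rates M1 and M2 as
equal, while the weighted arithmetic mean, which tracks total execution
time on this workload, does not.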