Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!utkcs2!de5
From: de5@ornl.gov (Dave Sill)
Newsgroups: comp.benchmarks
Subject: Re: benchmarks (SPECmarks)
Keywords: validity meaning
Message-ID: <1990Nov20.210922.4090@cs.utk.edu>
Date: 20 Nov 90 21:09:22 GMT
References: <7581@eos.arc.nasa.gov> <1146@dg.dg.com> <7589@eos.arc.nasa.gov> <1148@dg.dg.com> <108988@convex.convex.com>
Sender: news@cs.utk.edu (USENET News System)
Reply-To: Dave Sill <de5@ornl.gov>
Organization: Oak Ridge National Laboratory
Lines: 136

In article <108988@convex.convex.com>, rosenkra@convex.com (William Rosencranz) writes:
>
>first off: what are SPEC ratings (or any standard bm ratings for that
>matter) meant to do? answer this question in your mind first before
>proceeding...

They're meant to make it possible to get some idea of the performance
one can expect a system to provide, without requiring that one observe
the performance directly.

>i really see no point whatsoever in relating an execution time on one
>machine to that of another "standard" machine, no matter how standard,
>(except possibly the old "that's the way we've ALWAYS done it before",
>e.g. "MIPS"), just to come up with some single "standard" unit of
>performance.

The reason for relating performance on an unknown system to that of a
known one is to give the numbers some relevance.  If you tell me your
system does 15 gigafloogles/second, that tells me nothing unless I
know what a floogle is.  But if you tell me your system scored 11.2
SPECfloogles, I can get a handle on whether 15 GF/s is fast or not, at
least if I have any VAX experience--or another machine whose
SPECfloogle score I know.

>if I were buying (instead of selling :-), i'd want to see wallclock and
>cpu times, because i, as a human being, can relate to time far easier
>the "SPECs" or whatever. 

Sure, but the absolute wall clock time alone isn't going to tell you
anything.  It's when you compare the values for different systems that
you gain information from the results.  So what if the floogle benchmark runs
in 1:26?  That means nothing.  Give me a list of floogle times, and
I'll probably normalize them on some machine I'm familiar with (or
maybe the slowest machine in the list).  It's the relative performance
that's important.
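
As a rough sketch of that normalization: the machine names and times
below are invented for illustration (only the 1:26 figure echoes the
example above), and normalize() is a hypothetical helper, not part of
any benchmark suite.

```python
# Hypothetical floogle benchmark times in seconds. The machine names and
# values are made up for illustration; they are not real measurements.
floogle_times = {
    "machine_a": 86.0,   # the 1:26 run
    "machine_b": 43.0,
    "machine_c": 172.0,
}


def normalize(times, reference=None):
    """Return each machine's speed relative to a reference machine.

    The ratio is reference_time / machine_time, so larger means faster.
    If no reference is named, the slowest machine in the list is used,
    which makes every ratio >= 1.0.
    """
    if reference is None:
        reference = max(times, key=times.get)  # slowest = largest time
    ref_time = times[reference]
    return {name: ref_time / t for name, t in times.items()}


relative = normalize(floogle_times)
for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x the slowest machine")
```

Passing reference="machine_a" instead normalizes on a machine you
happen to know, which is all a VAX-relative score does.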

>if something runs in 10 seconds, compared to
>100 seconds, i know i can sit and wait, call it "interactive". if
>something runs in 10 min vs 1 hour, i know i can go out to lunch in the
>latter case. a SPEC of 1.345 vs a SPEC of 4.345 means nothing, until
>i translate to time anyway. time is easier to "heft", as it were.

Only if you are intimately familiar with what's being done.  What's
the difference between 10 minutes versus 100 minutes and 10
SPECfloogles versus 100 SPECfloogles?  Both indicate the same relative
performance, and both measure the same absolute performance.  The
difference is that with the former you need two numbers to compare,
but with the latter you have the built-in VAX value: 10 SPECfloogles
is 10 times faster than SPEC's VAX 11/780.  Not perfect, but better
than nothing.

>further, i'd want to see how the "standard" bm results scale with
>problem size, especially on cache-based memory systems. because a
>buy decision based on a single number could come back to haunt me.

This is a valid point, but has nothing to do with whether wall clock
or relative-to-known values are reported.

>i'd also want to know what sort of performance enhancements i could
>expect if i wanted to put 1 hour, 1 day, and 1 week's effort into
>the optimization of any particular code, if possible.

Lotsa luck.  I don't know of any benchmarks that attempt to anticipate
what gains could be made by optimization, by you or anyone else.

>i'd also want to compare a vendor's peak performance with how well
>it did on standard bm's or on my own.

Just ask the vendors, they'll be glad to give you peak performance
figures.  :-)

>finally, i'd want to see what sort of support i can expect from the
>vendor. granted, pre-sales and post-sales activities can vary
>greatly, but i think i can shake out a vendor during the sales
>cycle, as most saavy buyers can.

This isn't a benchmarking issue at all.  Benchmarking can't and
shouldn't attempt to prevent the foolish buyer from buying foolishly.
Raw performance is just one criterion that should be part of a
procurement effort.

>and
>believe me, if i see 2 or 3 systems with uni-number ratings within
>say 5% of each other, i sure as heck would not say "these machines
>are identical, so let's buy the cheaper one becasue it has better
>price/SPECperformance".

I couldn't agree more.

>i'd want to look at the raw data anyway, and
>probably run my sort of workload on them to really get an idea of
>what i can expect.

SPEC provides the real data.  The SPECmark is just a handy single
figure of merit.  Better that than dhrystone-mips.  As for testing
them yourself: have at it.  Sometimes that's not feasible, and that's
what benchmarks are for.
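
To make the relation between the per-benchmark data and the single
figure concrete, here is a minimal sketch.  It assumes the single
figure is the geometric mean of per-benchmark ratios against the
reference machine; that formula is not spelled out in this post, and
the benchmark names and times are invented, not SPEC's programs or
published results.

```python
import math

# Hypothetical per-benchmark times in seconds. Names and values are
# invented for illustration only.
reference_times = {"bench1": 1000.0, "bench2": 2000.0, "bench3": 4000.0}
measured_times = {"bench1": 100.0, "bench2": 250.0, "bench3": 250.0}


def ratios(reference, measured):
    """Per-benchmark ratio: reference time / measured time (larger = faster)."""
    return {name: reference[name] / measured[name] for name in reference}


def geometric_mean(values):
    """Geometric mean via logs; assumes every value is positive."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


per_benchmark = ratios(reference_times, measured_times)
single_figure = geometric_mean(per_benchmark.values())

for name, r in per_benchmark.items():
    print(f"{name}: {r:.1f}")
print(f"single figure of merit: {single_figure:.2f}")
```

The per-benchmark ratios differ by a factor of two here even though
they collapse into one number, which is why the detailed data is worth
keeping alongside the summary.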

>similarly, if i see two machines that differ by
>alot in some particular individual tests, i' want to know why.

Again, I agree.  But identifying the reason is not a benchmarking issue.
Identifying the difference *is*.

>the basic problem i see with these uni-number ratings is that
>people can make up their minds, even subconsciously, based on a
>first impression.

So what do you propose?  Outlawing single figures of merit?  Better to
have one that is well understood and subject to much scrutiny than to
have something ad hoc, unreliable, informal, etc.

>this is human nature. you always have that in
>the back of your mind. and it is easy to just say "2 > 1.5"
>rather than "based on some real workload, and on problem size,
>and on vendor support, and on application availability, and on
>whatever, 2 is not necessarily > 1.5".

No, 2 *is* greater than 1.5.  Always.  The problem is that there may
be more important issues that aren't so easily quantifiable.

>surely we can give more credit to the intellect of people making
>buy decisions than that?

You're the one who seems to think people are going to base their
decisions solely on a SFM.  The detailed data is available to those
who want it.

I don't mean to come across as some kind of SPEC apologist; I just
think what they're doing is better than what was done before they
existed. 

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support