Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!samsung!uunet!convex!newsadm
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Which benchmarks are useless?
Keywords: benchmarks  date  statistical correlation
Message-ID: <1991Apr29.225104.26828@convex.com>
Date: 29 Apr 91 22:51:04 GMT
References: <1749@marlin.NOSC.MIL> <2717@spim.mips.COM> <1751@marlin.NOSC.MIL>
Sender: newsadm@convex.com (news access account)
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Distribution: comp.benchmarks
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 53
Nntp-Posting-Host: mozart.convex.com

In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>
>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
...
Of the systems listed, I believe all but the DEC VAX 11/780 came after
Dhrystone started to be a widely quoted benchmark.  Because it is a fairly
small code, it is easy to use as a tool for focusing compiler tuning
efforts.  [After all, if you only have limited resources for tuning,
you might as well use them where they can be shown to make a difference]

I suspect that if DEC were to spend a few months tuning their C compiler
(or maybe just rerun their latest release on an 11/780), they also could
get another 20% out of the Dhrystone benchmarks.

I also would not be surprised to see the SPEC numbers improve in the next
few years for existing hardware with better compilers.  What gets measured
gets worked on.  The advantage of benchmark suites like SPEC (and for you
number crunchers, the Perfect Club and Slalom benchmarks) is that there is
such a variety of coding styles and usages that the improvements are likely
to benefit many real codes.  In some cases, special tricks will be found
that only benefit those codes, but for the most part, improvements will be
made that help many programs run faster.

Back to the issue that started this discussion:

When you propose a new benchmark, consider how vendors will respond if
people start using it for serious competitive evaluation.  [If you don't
want people to use it, why are you proposing it??]  Will it encourage
the vendors to improve the things you want improved?  If not, can it
be changed to do so?  Show it to a few people (with DRAFT, DO NOT DUPLICATE
marked all over it), and get their feedback.  Then ask yourself again if it
is useful.

The 'date' benchmark has a number of serious flaws.  A key one is that
if the date operation were added to the command shell, it would go many
times faster.  Since the intent is to measure process spawning time, the
benchmark would then report a large speedup without any process being
spawned faster.
A similar thing happened to the getpid system call.  Some people at Berkeley
wanted to know how fast a trivial system call was, so they could tune the
syscall interface.  They wrote a loop to call getpid() many times.  This
test was appropriate for their purposes.  Later, this test (and many others)
was made generally available.  Some vendors chose to speed up this test by
caching the process id in user space on the first getpid(), and avoiding the
system call overhead for subsequent getpid()'s.  There is nothing wrong with
that optimization, except that it does nothing for real user programs.
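
A rough sketch of the getpid story, written in Python for brevity rather than
the C the Berkeley test presumably used.  The names cached_getpid and
per_call_seconds are made up for illustration, and the cache stands in for
whatever a vendor library actually did.  Interpreter overhead dominates both
timings here, so treat the output as the shape of the test, not as a
measurement of syscall cost.

```python
import os
import timeit

# Hypothetical user-space cache, standing in for a vendor's library trick.
# A real implementation would also have to invalidate it after fork().
_cached_pid = None

def cached_getpid():
    """Return the process id, asking the OS only on the first call."""
    global _cached_pid
    if _cached_pid is None:
        _cached_pid = os.getpid()
    return _cached_pid

def per_call_seconds(fn, calls=200_000):
    """Average wall-clock seconds per call of fn over `calls` iterations."""
    return timeit.timeit(fn, number=calls) / calls

if __name__ == "__main__":
    print(f"os.getpid():     {per_call_seconds(os.getpid):.3e} s/call")
    print(f"cached_getpid(): {per_call_seconds(cached_getpid):.3e} s/call")
```

Both functions return the same answer, which is the point: the loop can no
longer tell you anything about the syscall interface once the second one is
what it ends up timing.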

The main reason I don't like the date benchmark is that it encourages me (as
a vendor) to fix the wrong things.  In addition, as a user, the benchmark does
me little good, because I have little confidence that it will measure the
same things I care about (at least it won't after the vendors start working
on it if they take it seriously).  The same reasoning applies to the 'bc'
benchmark which ran through this news stream a while ago.
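
To make the date case concrete, here is a minimal sketch in Python comparing
the cost of spawning a child process with the cost of formatting the date
in-process, which is roughly what a shell builtin would do.  The function
names spawn_seconds and builtin_seconds and the run counts are my own
assumptions, not part of the original benchmark.

```python
import shutil
import subprocess
import sys
import time

def spawn_seconds(argv, runs=20):
    """Average wall-clock seconds to start argv as a child and wait for it."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    return (time.perf_counter() - start) / runs

def builtin_seconds(runs=20):
    """Average seconds to format the date in-process, with no child spawned."""
    start = time.perf_counter()
    for _ in range(runs):
        time.strftime("%a %b %d %H:%M:%S %Z %Y")
    return (time.perf_counter() - start) / runs

if __name__ == "__main__":
    # Fall back to spawning the Python interpreter if no date program is on PATH.
    argv = ["date"] if shutil.which("date") else [sys.executable, "-c", "pass"]
    print(f"spawned {argv[0]}: {spawn_seconds(argv):.3e} s/run")
    print(f"in-process date: {builtin_seconds():.3e} s/run")
```

A vendor who moves date into the shell shifts the benchmark from the first
number to the second, while process spawning itself stays exactly as slow as
it was.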