Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!nosc!marlin!aburto
From: aburto@marlin.NOSC.MIL (Alfred A. Aburto)
Newsgroups: comp.benchmarks
Subject: Re: bc benchmark [really: One Number]
Message-ID: <1685@marlin.NOSC.MIL>
Date: 2 Jan 91 21:45:19 GMT
References: <44342@mips.mips.COM> <15379@ogicse.ogi.edu> <44353@mips.mips.COM>
Reply-To: aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto)
Organization: Naval Ocean Systems Center, San Diego
Lines: 59
Distribution:comp.benchmarks

In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:

>(Note, for example, that published Dhrystone results easily mis-predict
>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>realistic integer benchmarks.)
>-- 
>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

This is an interesting observation (result).

Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
programs. That is, hundreds (I believe) of programs were
analyzed to come up with the (ahem) 'typical' high-level
language statements and their frequency of usage. In view of this
I would, at first sight, expect Dhrystone to be more accurate
than SPEC, as SPEC is based upon only a few integer programs.
What happened? Why does Dhrystone fail?
Is it due to:

 (a) Instruction Mix is WRONG?

 (b) Optimization Problems? 
     This is not a problem in my view. We just need people to
     report results using various compiler options; then we gain
     a better perspective on the variation in performance.
     Of course, in general, people tend to publicly report only the
     'Max' or 'Best' performance.  The 'Min' or 'Mean' results
     are more difficult to find. I know Dhrystone (1.0, 1.1, 2.0,
     2.1) can all be optimized a great deal (up to a factor of 2
     or so, because I've done it), but this should not be a problem
     as long as we know which result corresponds to which compiler
     options. That helps to define the RANGE of expected
     performance (Min, Max and/or Std. Dev.) with a certain compiler
     and system, and also the 'Mean' or 'Median' performance.

 (c) Program Size TOO small?
     I suppose that if it were not for caching (cache size)
     effects then program size should not be a problem, but I'm
     no expert ...

 (d) Something else?
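
To make the reporting idea in (b) concrete, here is a minimal sketch in
Python of how one might summarize a set of results for one benchmark on
one system across several compiler options. The option labels and the
scores are made up purely for illustration (they are not measurements
of any real machine or compiler), and the function name summarize() is
my own invention.

```python
import statistics

def summarize(results):
    """Summarize benchmark scores keyed by compiler-option label.

    `results` maps a label (e.g. "-O2") to one score, where a higher
    score means faster. Returns the range and central tendency.
    """
    scores = list(results.values())
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# HYPOTHETICAL scores (Dhrystones/sec), invented for this example only.
hypothetical = {
    "-O0": 10000.0,
    "-O1": 14000.0,
    "-O2": 18000.0,
    "-O2 + inlining": 20000.0,
}

summary = summarize(hypothetical)
for name, value in summary.items():
    print(f"{name:>6}: {value:10.1f}")
print(f"max/min ratio: {summary['max'] / summary['min']:.2f}")
```

Reporting the whole summary, with the options attached to each score,
shows the spread directly. A lone 'Max' figure hides it.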

Why should one expect the integer SPEC results to be more 'accurate'
than the Dhrystone?  I'm just wondering.  What is a 'typical' program
or 'typical' frequency of instruction usage?  It seems to me there is no
one real 'typical' anything, but rather a wide variety of 'typical' programs,
instruction mixes, and frequencies of usage, depending upon the application.

Real programs also show a great variation in performance.  I noticed
this recently in a Scientific American article (Jan 1991) which
showed the comparison of 13 different real programs on a wide
variety of supercomputers.  The program-to-program variation in 'megaflop'
performance was truly tremendous, especially for the fastest systems
(Cray and a NEC computer I think).

Al Aburto
aburto@marlin.nosc.mil