Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!clyde.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!apple!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.benchmarks
Subject: Re: bc benchmark [sigh]
Message-ID: <44342@mips.mips.COM>
Date: 26 Dec 90 06:37:57 GMT
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 152
References:

Having finally caught up with the net after a long trip, I'm sad to see
that 1 out of 3 postings in this newsgroup concern the "bc" benchmark
or some variety thereof.  I had higher hopes for this, especially as
at least some people have read previous discussions in comp.arch.
This %#@!$% thing is like a vampire: every time you think you've finally
put a stake thru its heart, it returns one more time.

1. Small benchmarks are very prone to misinterpretation, prone to
compiler gimmickry, and seldom exercise modern machines very well.
About their only even-slightly-rational use is to compare machines
with the same chips running at different clock rates.
Small, synthetic benchmarks can easily over- or under-emphasize language and/or
machine features out of all proportion to mixtures found in more realistic
benchmarks.

As a matter of faith, I consider small benchmarks guilty until proven
innocent, i.e., if you can prove their results correlate well, across
product lines, with much more substantial real programs, then maybe
you have something (and in fact, this is a good thing to have;
for instance, I've often thought of offering a small prize for anyone
who can create a small program that predicts performance on the 10
SPEC benchmarks across machine lines, but I haven't figured out
how to describe this well enough to figure out if someone has achieved it.)


2. Filling the net with timings for a benchmark where no one even explains
what code is being executed, how big it is, whether or not it correlates
with ANYTHING, etc, etc, is like trying to predict the speed of automobiles
by ripping out their steering wheels, and seeing how fast they roll.

3.  NOW, here are SOME FACTS about this benchmark:
	1) It is tiny:
		99.57% of the instruction cycles (on a MIPS machine)
		are accounted for by 9 LINES OF CODE
		71% of the cycles are consumed in 3 LINES OF CODE
		In addition, unlike matrix kernels, whose code is small,
		but whose data references are big, this doesn't even
		have that property: all the code & data fit in tiny caches.
	2) Its instruction usage bears little resemblance to much of
	anything: see Hennessy & Patterson for typical characteristics
	of code.  In particular, this code almost never makes function calls,
	and (on a MIPS machine, which HAS integer multiply and divide)
	spends 50% of the total cycles doing integer multiply and divide.
	I assure you, this is typical of very few programs; this is NOT
	the kind of statistics that any computer architect I know designs
	machines around, etc, etc.  (Of course, I should love this benchmark,
	as it REALLY hurts machines with no integer multiply.)

	At the end of this posting are the slices of prof & pixstats output.

4. PLEASE STOP WASTING TIME WITH THIS BENCHMARK
	(Please, let this be the last stake in its heart :-)

5. ABOUT THE ONLY USEFUL THING I CAN THINK OF TO DO WITH THIS is for somebody
to run this benchmark on many of the machines for which SPEC integer benchmarks
exist, plot the two together, and compute a correlation for them;
or even, pick any one of the SPEC integer benchmarks and do it for that.
(Or pick some other realistic integer benchmark for which well-controlled
results exist.)
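
A minimal sketch of the correlation step, in Python.  The function is an
ordinary Pearson coefficient; the two input lists are PLACEHOLDER values
made up for illustration, not measurements from any machine.  The names
bc_speed and spec_int are hypothetical; real inputs would be one pair of
numbers per machine.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

# PLACEHOLDER numbers, NOT measurements: one entry per machine.
bc_speed = [1.0, 2.0, 3.0, 4.0]   # e.g. 1 / (bc elapsed time), hypothetical
spec_int = [2.0, 4.1, 5.9, 8.0]   # e.g. a SPEC integer ratio, hypothetical

r = pearson(bc_speed, spec_int)
print(f"r = {r:.3f}")
```

A high r on a handful of closely related machines proves little; the
interesting test is whether it holds up across product lines.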

----------
Profile listing generated Tue Dec 25 13:42:35 1990 with:
   prof -pixie dc 

*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *

84303520 cycles

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

  84058950   99.71  99.71    1827369     36 mult (dc.c)
    132423    0.16  99.87       4905     37 div (dc.c)
     31153    0.04  99.90        538     21 nalloc (dc.c)
.....

OH GOOD: it spends 99.7% of its time in one function...
IN FACT, going to the next level of detail, where we see the number of
cycles spent in the statements that consumed the time, we discover
that 83.7% of the instruction cycles are spent IN JUST 4 LINES OF C....:


*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *

procedure (file)                           line bytes     cycles      %  cum %
mult (dc.c)                                1097   100   22754044  26.99  26.99
mult (dc.c)                                1094    96   20317562  24.10  51.09
mult (dc.c)                                1093    68   16755620  19.88  70.97
mult (dc.c)                                1095    36   10771470  12.78  83.74
mult (dc.c)                                1098    40    8383670   9.94  93.69
mult (dc.c)                                1096    16    4787320   5.68  99.37
mult (dc.c)                                1084    80      83600   0.10  99.47
mult (dc.c)                                1102    96      45066   0.05  99.52
mult (dc.c)                                1087    68      41076   0.05  99.57
nalloc (dc.c)                              1974    36      29529   0.04  99.60
div (dc.c)                                  665   144      24070   0.03  99.63
mult (dc.c)                                1101    96      23606   0.03  99.66
div (dc.c)                                  657   124      22139   0.03  99.69
mult (dc.c)                                1104    40      20630   0.02  99.71
......
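
The cumulative column can be rechecked from the raw counts.  The sketch
below (Python) copies the first nine rows of the -heavy listing and the
total from the prof header; nothing else is assumed.

```python
TOTAL = 84303520  # total reported at the top of the prof output

# (line number in dc.c, count) for the first nine rows of the -heavy listing,
# all of them in mult()
heavy = [
    (1097, 22754044),
    (1094, 20317562),
    (1093, 16755620),
    (1095, 10771470),
    (1098, 8383670),
    (1096, 4787320),
    (1084, 83600),
    (1102, 45066),
    (1087, 41076),
]

def cumulative_share(rows, total):
    """Return (line, cumulative percent of total) for each row, in order."""
    out = []
    running = 0
    for line, count in rows:
        running += count
        out.append((line, 100.0 * running / total))
    return out

shares = cumulative_share(heavy, TOTAL)
for n, (line, pct) in enumerate(shares, start=1):
    print(f"top {n} lines (through line {line}): {pct:6.2f}%")
```

The third, fourth and ninth rows come out at about 70.97%, 83.74% and
99.57%, matching the listing.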
------------
Following is an analysis of instruction usage on a MIPS R3000-based
machine:
pixstats dc:
 174126742 (2.065) cycles (6.97s @ 25.0MHz)
  84303520 (1.000) instructions  [# instructions]
      1283 (0.000) calls  [basically: never does function calls]
  28881440 (0.343) loads  [a little high]
   8458964 (0.100) stores
  89823222 (1.065) multiply/divide interlock cycles (12/35 cycles)
		[amazingly high: 50% of the time in this code is doing
		integer multiply/divide.  Real programs do exist
		like this, but this is completely unrepresentative of
		the vast bulk of integer code....]

1.36e+05 cycles per call  ... like I said: hardly ever does function calls
6.57e+04 instructions per call


Instruction concentration:
         1   1.4%
         2   2.8%
         4   5.7%
         8  11.4%
        16  22.7%
        32  45.4%
        64  90.8%
       128  99.6%
       256  99.8%
       512  99.9%
      1024 100.0%
      2048 100.0%
      3697 100.0%

THIS SAYS: in a perfect fully-associative cache, 90.8% of the instruction
cycles would be spent in only 64 words (64 instructions), and 99.9% would
fit into 1024 words.... i.e., it fits into almost any machine's cache...
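
For anyone wanting to build such a table for another program, here is a
sketch in Python of one way to do it: sort the per-instruction execution
counts in descending order and accumulate.  This is an assumption about
how a concentration table can be computed, not a description of what
pixstats does internally, and the counts below are hypothetical.

```python
def concentration(counts, sizes=None):
    """Percent of dynamic instructions covered by the N hottest instruction words.

    counts: per-instruction execution counts (one entry per static instruction).
    sizes:  values of N to report; defaults to powers of two, then the full size.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no instructions were executed")
    if sizes is None:
        sizes = []
        n = 1
        while n < len(ordered):
            sizes.append(n)
            n *= 2
        sizes.append(len(ordered))
    return [(n, 100.0 * sum(ordered[:n]) / total) for n in sizes]

# HYPOTHETICAL program: a 64-instruction loop run 1000 times,
# plus 936 other instructions executed once each.
counts = [1000] * 64 + [1] * 936
table = concentration(counts)
for n, pct in table:
    print(f"{n:10d} {pct:6.1f}%")
```

In the real table above the percentage doubles with each doubling of N
up to 64 words, which is the pattern roughly equal counts would produce,
as in one tight loop.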

opcode distribution: [dynamic]
     div    2395317    2.84%
   multu    1197623    1.42%

A PROGRAM WITH TWICE AS MANY INTEGER DIVIDES AS MULTIPLIES....
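
A quick arithmetic check on these two rows, in Python.  It ASSUMES that
"(12/35 cycles)" in the pixstats header means 12 cycles per multiply and
35 per divide, and that div and multu are the only multiply/divide
opcodes that matter (the listing above is only an excerpt).  Under those
assumptions latency times count gives an upper bound on the interlock
cycles, since some of the latency can overlap with other instructions.

```python
instructions = 84303520
div_count = 2395317
multu_count = 1197623
observed_interlock = 89823222   # from the pixstats header

MULT_LATENCY = 12   # assumption: reading of "(12/35 cycles)"
DIV_LATENCY = 35    # assumption: reading of "(12/35 cycles)"

ratio = div_count / multu_count
div_pct = 100.0 * div_count / instructions
multu_pct = 100.0 * multu_count / instructions
upper_bound = div_count * DIV_LATENCY + multu_count * MULT_LATENCY

print(f"divides per multiply: {ratio:.4f}")
print(f"div   {div_pct:.2f}% of instructions")
print(f"multu {multu_pct:.2f}% of instructions")
print(f"interlock upper bound: {upper_bound}")
print(f"observed / bound:      {observed_interlock / upper_bound:.3f}")
```

So about 4% of the instructions executed account for roughly half of the
cycles.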
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086